[patch 09/12] Immediate Values - Documentation

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
To: Ingo Molnar <mingo@elte.hu>, linux-kernel@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>,
	Rusty Russell <rusty@rustcorp.com.au>,
	Adrian Bunk <bunk@stusta.de>, Andi Kleen <andi@firstfloor.org>,
	Christoph Hellwig <hch@infradead.org>,
	akpm@osdl.org, KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Subject: [patch 09/12] Immediate Values - Documentation
Date: Thu, 24 Sep 2009 09:26:35 -0400	[thread overview]
Message-ID: <20090924133400.355409744@polymtl.ca> (raw)
In-Reply-To: 20090924132626.485545323@polymtl.ca

[-- Attachment #1: immediate-values-documentation.patch --]
[-- Type: text/plain, Size: 9146 bytes --]

Changelog:
- Remove imv_set_early (removed from API).
- Use imv_* instead of immediate_*.
- Remove non-ascii characters.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: Adrian Bunk <bunk@stusta.de>
CC: Andi Kleen <andi@firstfloor.org>
CC: Christoph Hellwig <hch@infradead.org>
CC: mingo@elte.hu
CC: akpm@osdl.org
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 Documentation/immediate.txt |  221 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 221 insertions(+)

Index: linux-2.6-lttng/Documentation/immediate.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/Documentation/immediate.txt	2009-09-24 08:31:49.000000000 -0400
@@ -0,0 +1,221 @@
+		        Using the Immediate Values
+
+			    Mathieu Desnoyers
+
+
+This document introduces Immediate Values and their use.
+
+
+* Purpose of immediate values
+
+An immediate value is used to compile into the kernel variables that sit within
+the instruction stream. They are meant to be rarely updated but read often.
+Using immediate values for these variables will save cache lines.
+
+This infrastructure is specialized in supporting dynamic patching of the values
+in the instruction stream when multiple CPUs are running without disturbing the
+normal system behavior.
+
+Compiling code meant to be rarely enabled at runtime can be done using
+if (unlikely(imv_read(var))) as condition surrounding the code. The
+smallest data type required for the test (an 8 bits char) is preferred, since
+some architectures, such as powerpc, only allow up to 16 bits immediate values.
+
+
+* Usage
+
+In order to use the "immediate" macros, you should include linux/immediate.h.
+
+#include <linux/immediate.h>
+
+DEFINE_IMV(char, this_immediate);
+EXPORT_IMV_SYMBOL(this_immediate);
+
+
+And use, in the body of a function:
+
+Use imv_set(this_immediate) to set the immediate value.
+
+Use imv_read(this_immediate) to read the immediate value.
+
+The immediate mechanism supports inserting multiple instances of the same
+immediate. Immediate values can be put in inline functions, inlined static
+functions, and unrolled loops.
+
+If you have to read the immediate values from a function declared as __init or
+__exit, you should explicitly use _imv_read(), which will fall back on a
+global variable read. Failing to do so will leave a reference to the __init
+section after it is freed (it would generate a modpost warning).
+
+You can choose to set an initial static value to the immediate by using, for
+instance:
+
+DEFINE_IMV(long, myptr) = 10;
+
+
+* Optimization for a given architecture
+
+One can implement optimized immediate values for a given architecture by
+replacing asm-$ARCH/immediate.h.
+
+
+* Performance improvement
+
+
+  * Memory hit for a data-based branch
+
+Here are the results on a 3GHz Pentium 4:
+
+number of tests: 100
+number of branches per test: 100000
+memory hit cycles per iteration (mean): 636.611
+L1 cache hit cycles per iteration (mean): 89.6413
+instruction stream based test, cycles per iteration (mean): 85.3438
+Just getting the pointer from a modulo on a pseudo-random value, doing
+  nothing with it, cycles per iteration (mean): 77.5044
+
+So:
+Base case:                      77.50 cycles
+instruction stream based test:  +7.8394 cycles
+L1 cache hit based test:        +12.1369 cycles
+Memory load based test:         +559.1066 cycles
+
+So let's say we have a ping flood coming at
+(14014 packets transmitted, 14014 received, 0% packet loss, time 1826ms)
+7674 packets per second. If we put 2 tracepoints for irq entry/exit, it
+brings us to 15348 tracepoint sites executed per second.
+
+(15348 exec/s) * (559 cycles/exec) / (3G cycles/s) = 0.0029
+We therefore have a 0.29% slowdown just on this case.
+
+Compared to this, the instruction stream based test will cause a
+slowdown of:
+
+(15348 exec/s) * (7.84 cycles/exec) / (3G cycles/s) = 0.00004
+For a 0.004% slowdown.
+
+If we plan to use this for memory allocation, spinlock, and all sorts of
+very high event rate tracing, we can assume it will execute 10 to 100
+times more sites per second, which brings us to 0.4% slowdown with the
+instruction stream based test compared to 29% slowdown with the memory
+load based test on a system with high memory pressure.
+
+
+
+  * Tracepoint impact under heavy memory load
+
+Running a kernel with my LTTng instrumentation set, in a test that
+generates memory pressure (from userspace) by trashing L1 and L2 caches
+between calls to getppid() (note: syscall_trace is active and calls
+a tracepoint upon syscall entry and syscall exit; tracepoints are disarmed).
+This test is done in user-space, so there are some delays due to IRQs
+coming and to the scheduler. (UP 2.6.22-rc6-mm1 kernel, task with -20
+nice level)
+
+My first set of results: Linear cache trashing, turned out not to be
+very interesting, because it seems like the linearity of the memset on a
+full array is somehow detected and it does not "really" trash the
+caches.
+
+Now the most interesting result: Random walk L1 and L2 trashing
+surrounding a getppid() call.
+
+- Tracepoints compiled out (but syscall_trace execution forced)
+number of tests: 10000
+No memory pressure
+Reading timestamps takes 108.033 cycles
+getppid: 1681.4 cycles
+With memory pressure
+Reading timestamps takes 102.938 cycles
+getppid: 15691.6 cycles
+
+
+- With the immediate values based tracepoints:
+number of tests: 10000
+No memory pressure
+Reading timestamps takes 108.006 cycles
+getppid: 1681.84 cycles
+With memory pressure
+Reading timestamps takes 100.291 cycles
+getppid: 11793 cycles
+
+
+- With global variables based tracepoints:
+number of tests: 10000
+No memory pressure
+Reading timestamps takes 107.999 cycles
+getppid: 1669.06 cycles
+With memory pressure
+Reading timestamps takes 102.839 cycles
+getppid: 12535 cycles
+
+The result is quite interesting in that the kernel is slower without
+tracepoints than with tracepoints. I explain it by the fact that the data
+accessed is not laid out in the same manner in the cache lines when the
+tracepoints are compiled in or out. It seems that it aligns the function's
+data better to compile-in the tracepoints in this case.
+
+But since the interesting comparison is between the immediate values and
+global variables based tracepoints, and because they share the same memory
+layout, except for the movl being replaced by a movz, we see that the
+global variable based tracepoints (2 tracepoints) adds 742 cycles to each system
+call (syscall entry and exit are traced and memory locations for both
+global variables lie on the same cache line).
+
+
+- Test redone with less iterations, but with error estimates
+
+10 runs of 100 iterations each: Tests done on a 3GHz P4. Here I run getppid with
+syscall trace inactive, comparing the case with memory pressure and without
+memory pressure. (sorry, my system is not setup to execute syscall_trace this
+time, but it will make the point anyway).
+
+No memory pressure
+Reading timestamps:     150.92 cycles,     std dev.    1.01 cycles
+getppid:               1462.09 cycles,     std dev.   18.87 cycles
+
+With memory pressure
+Reading timestamps:     578.22 cycles,     std dev.  269.51 cycles
+getppid:              17113.33 cycles,     std dev. 1655.92 cycles
+
+
+Now for memory read timing: (10 runs, branches per test: 100000)
+Memory read based branch:
+                       644.09 cycles,      std dev.   11.39 cycles
+L1 cache hit based branch:
+                        88.16 cycles,      std dev.    1.35 cycles
+
+
+So, now that we have the raw results, let's calculate:
+
+Memory read:
+644.09 +/- 11.39 - 88.16 +/- 1.35 = 555.93 +/- 11.46 cycles
+
+Getppid without memory pressure:
+1462.09 +/- 18.87 - 150.92 +/- 1.01 = 1311.17 +/- 18.90 cycles
+
+Getppid with memory pressure:
+17113.33 +/- 1655.92 - 578.22 +/- 269.51 = 16535.11 +/- 1677.71 cycles
+
+Therefore, if we add 2 tracepoints not based on immediate values to the getppid
+code, which would add 2 memory reads, we would add
+2 * 555.93 +/- 12.74 = 1111.86 +/- 25.48 cycles
+
+Therefore,
+
+1111.86 +/- 25.48 / 16535.11 +/- 1677.71 = 0.0672
+ relative error: sqrt(((25.48/1111.86)^2)+((1677.71/16535.11)^2))
+                     = 0.1040
+ absolute error: 0.1040 * 0.0672 = 0.0070
+
+Therefore: 0.0672 +/- 0.0070 * 100% = 6.72 +/- 0.70 %
+
+We can therefore affirm that adding 2 tracepoints to getppid, on a system with
+high memory pressure, would have a performance hit of at least 6.0% on the
+system call time, all within the uncertainty limits of these tests. The same
+applies to other kernel code paths. The smaller those code paths are, the
+highest the impact ratio will be.
+
+Therefore, not only is it interesting to use the immediate values to dynamically
+activate dormant code such as the tracepoints, but I think it should also be
+considered as a replacement for many of the "read-mostly" static variables.

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

next prev parent reply	other threads:[~2009-09-24 14:43 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-09-24 13:26 [patch 00/12] Immediate Values Mathieu Desnoyers
2009-09-24 13:26 ` [patch 01/12] x86: text_poke_early non static Mathieu Desnoyers
2009-09-24 13:26 ` [patch 02/12] Immediate Values - Architecture Independent Code Mathieu Desnoyers
2009-09-25  4:20   ` Andrew Morton
2009-09-27 23:23     ` Mathieu Desnoyers
2009-09-28  1:23     ` Andi Kleen
2009-09-28 17:46       ` Andrew Morton
2009-09-28 18:03         ` Arjan van de Ven
2009-09-28 18:40           ` Mathieu Desnoyers
2009-09-28 19:54           ` Andi Kleen
2009-09-28 20:37             ` Arjan van de Ven
2009-09-28 21:32               ` H. Peter Anvin
2009-09-28 22:05                 ` Mathieu Desnoyers
2009-09-28 20:11         ` Andi Kleen
2009-09-28 21:16           ` Andrew Morton
2009-09-28 22:01             ` Mathieu Desnoyers
2009-09-24 13:26 ` [patch 03/12] Immediate Values - Kconfig menu in EMBEDDED Mathieu Desnoyers
2009-09-24 13:26 ` [patch 04/12] Immediate Values - x86 Optimization Mathieu Desnoyers
2009-09-24 13:26 ` [patch 05/12] Add text_poke and sync_core to powerpc Mathieu Desnoyers
2009-09-24 13:26 ` [patch 06/12] Immediate Values - Powerpc Optimization Mathieu Desnoyers
2009-09-24 13:26 ` [patch 07/12] Sparc create asm.h Mathieu Desnoyers
2009-09-24 21:10   ` David Miller
2009-09-24 13:26 ` [patch 08/12] sparc64: Optimized immediate value implementation Mathieu Desnoyers
2009-09-24 13:26 ` Mathieu Desnoyers [this message]
2009-09-24 13:26 ` [patch 10/12] Immediate Values Support init Mathieu Desnoyers
2009-09-24 15:33   ` [patch 10.1/12] Immediate values fixes for modules Mathieu Desnoyers
2009-09-24 15:35   ` [patch 10.2/12] Fix Immediate Values x86_64 support old gcc Mathieu Desnoyers
2009-09-24 13:26 ` [patch 11/12] Scheduler Profiling - Use Immediate Values Mathieu Desnoyers
2009-09-24 13:26 ` [patch 12/12] Tracepoints - " Mathieu Desnoyers
2009-09-24 14:51   ` Peter Zijlstra
2009-09-24 15:03     ` Mathieu Desnoyers
2009-09-24 15:06       ` Peter Zijlstra
2009-09-24 16:01         ` [RFC patch] Immediate Values - x86 Optimization NMI and MCE support Mathieu Desnoyers
2009-09-24 21:59           ` Masami Hiramatsu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090924133400.355409744@polymtl.ca \
    --to=mathieu.desnoyers@polymtl.ca \
    --cc=akpm@osdl.org \
    --cc=andi@firstfloor.org \
    --cc=bunk@stusta.de \
    --cc=hch@infradead.org \
    --cc=kosaki.motohiro@jp.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=rusty@rustcorp.com.au \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.