[patch 19/26] Immediate Values - Documentation

From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
To: akpm@linux-foundation.org, Ingo Molnar <mingo@elte.hu>,
	linux-kernel@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>,
	Rusty Russell <rusty@rustcorp.com.au>
Subject: [patch 19/26] Immediate Values - Documentation
Date: Thu, 24 Jan 2008 15:27:25 -0500
Message-ID: <20080124203340.502082115@polymtl.ca>
In-Reply-To: 20080124202706.250598537@polymtl.ca

Changelog:
- Remove imv_set_early (removed from API).
- Use imv_* instead of immediate_*.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Rusty Russell <rusty@rustcorp.com.au>
---
 Documentation/immediate.txt |  221 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 221 insertions(+)

Index: linux-2.6-lttng/Documentation/immediate.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-lttng/Documentation/immediate.txt	2007-11-03 20:28:58.000000000 -0400
@@ -0,0 +1,221 @@
+		        Using the Immediate Values
+
+			    Mathieu Desnoyers
+
+
+This document introduces Immediate Values and their use.
+
+
+* Purpose of immediate values
+
+An immediate value is a kernel variable whose value is compiled directly into
+the instruction stream. Such variables are meant to be read often but rarely
+updated. Using immediate values for them saves data cache lines.
+
+This infrastructure specializes in dynamically patching these values in the
+instruction stream while multiple CPUs are running, without disturbing normal
+system behavior.
+
+Code meant to be rarely enabled at runtime can be compiled in by using
+if (unlikely(imv_read(var))) as the condition surrounding it. The
+smallest data type sufficient for the test (an 8-bit char) is preferred, since
+some architectures, such as powerpc, only allow immediates of up to 16 bits.
+
+
+* Usage
+
+To use the "immediate" macros, include linux/immediate.h:
+
+#include <linux/immediate.h>
+
+DEFINE_IMV(char, this_immediate);
+EXPORT_IMV_SYMBOL(this_immediate);
+
+
+Then, in the body of a function:
+
+Use imv_set(this_immediate, value) to set the immediate value.
+
+Use imv_read(this_immediate) to read the immediate value.
+
+The immediate mechanism supports inserting multiple instances of the same
+immediate. Immediate values can be put in inline functions, inlined static
+functions, and unrolled loops.
+
+If you have to read the immediate values from a function declared as __init or
+__exit, you should explicitly use _imv_read(), which will fall back on a
+global variable read. Failing to do so will leave a reference to the __init
+section after it is freed (it would generate a modpost warning).
+
+You can give the immediate an initial static value by using, for
+instance:
+
+DEFINE_IMV(long, myptr) = 10;
+
+
+* Optimization for a given architecture
+
+One can implement optimized immediate values for a given architecture by
+replacing asm-$ARCH/immediate.h.
+
+
+* Performance improvement
+
+
+  * Memory hit for a data-based branch
+
+Here are the results on a 3GHz Pentium 4:
+
+number of tests: 100
+number of branches per test: 100000
+memory hit cycles per iteration (mean): 636.611
+L1 cache hit cycles per iteration (mean): 89.6413
+instruction stream based test, cycles per iteration (mean): 85.3438
+Just getting the pointer from a modulo on a pseudo-random value, doing
+  nothing with it, cycles per iteration (mean): 77.5044
+
+So:
+Base case:                      77.50 cycles
+instruction stream based test:  +7.8394 cycles
+L1 cache hit based test:        +12.1369 cycles
+Memory load based test:         +559.1066 cycles
+
+Suppose a ping flood arrives at 7674 packets per second
+(14014 packets transmitted, 14014 received, 0% packet loss, time 1826ms).
+If we put 2 markers for irq entry/exit, this brings us to 15348 marker
+sites executed per second.
+
+(15348 exec/s) * (559 cycles/exec) / (3G cycles/s) = 0.0029
+We therefore have a 0.29% slowdown just on this case.
+
+Compared to this, the instruction stream based test will cause a
+slowdown of:
+
+(15348 exec/s) * (7.84 cycles/exec) / (3G cycles/s) = 0.00004
+For a 0.004% slowdown.
+
+If we plan to use this for memory allocation, spinlocks, and all sorts of
+very high event rate tracing, we can assume it will execute 10 to 100
+times more sites per second. At 100 times, this brings us to a 0.4% slowdown
+with the instruction stream based test compared to a 29% slowdown with the
+memory load based test on a system with high memory pressure.
+
+
+
+  * Markers impact under heavy memory load
+
+This test runs a kernel with my LTTng instrumentation set and generates
+memory pressure (from userspace) by thrashing the L1 and L2 caches
+between calls to getppid() (note: syscall_trace is active and calls
+a marker upon syscall entry and syscall exit; markers are disarmed).
+The test is done in user-space, so there are some delays due to incoming
+IRQs and to the scheduler. (UP 2.6.22-rc6-mm1 kernel, task with -20
+nice level)
+
+My first set of results, with linear cache thrashing, turned out not to be
+very interesting, because it seems like the linearity of the memset on a
+full array is somehow detected and it does not "really" thrash the
+caches.
+
+Now the most interesting result: Random walk L1 and L2 thrashing
+surrounding a getppid() call.
+
+- Markers compiled out (but syscall_trace execution forced)
+number of tests: 10000
+No memory pressure
+Reading timestamps takes 108.033 cycles
+getppid: 1681.4 cycles
+With memory pressure
+Reading timestamps takes 102.938 cycles
+getppid: 15691.6 cycles
+
+
+- With the immediate values based markers:
+number of tests: 10000
+No memory pressure
+Reading timestamps takes 108.006 cycles
+getppid: 1681.84 cycles
+With memory pressure
+Reading timestamps takes 100.291 cycles
+getppid: 11793 cycles
+
+
+- With global variables based markers:
+number of tests: 10000
+No memory pressure
+Reading timestamps takes 107.999 cycles
+getppid: 1669.06 cycles
+With memory pressure
+Reading timestamps takes 102.839 cycles
+getppid: 12535 cycles
+
+The result is quite interesting in that the kernel is slower without
+markers than with markers. I attribute this to the data accessed not
+being laid out in the same manner in the cache lines when the markers
+are compiled in or out. Compiling in the markers seems to align the
+function's data better in this case.
+
+But the interesting comparison is between the immediate values and
+global variables based markers. Because they share the same memory
+layout, except for the movl being replaced by a movz, we see that the
+global variable based markers (2 markers) add 742 cycles to each system
+call (syscall entry and exit are traced and memory locations for both
+global variables lie on the same cache line).
+
+
+- Test redone with fewer iterations, but with error estimates
+
+10 runs of 100 iterations each: Tests done on a 3GHz P4. Here I run getppid with
+syscall trace inactive, comparing the case with memory pressure and without
+memory pressure. (sorry, my system is not set up to execute syscall_trace this
+time, but it will make the point anyway).
+
+No memory pressure
+Reading timestamps:     150.92 cycles,     std dev.    1.01 cycles
+getppid:               1462.09 cycles,     std dev.   18.87 cycles
+
+With memory pressure
+Reading timestamps:     578.22 cycles,     std dev.  269.51 cycles
+getppid:              17113.33 cycles,     std dev. 1655.92 cycles
+
+
+Now for memory read timing: (10 runs, branches per test: 100000)
+Memory read based branch:
+                       644.09 cycles,      std dev.   11.39 cycles
+L1 cache hit based branch:
+                        88.16 cycles,      std dev.    1.35 cycles
+
+
+So, now that we have the raw results, let's calculate:
+
+Memory read:
+644.09±11.39 - 88.16±1.35 = 555.93±11.46 cycles
+
+Getppid without memory pressure:
+1462.09±18.87 - 150.92±1.01 = 1311.17±18.90 cycles
+
+Getppid with memory pressure:
+17113.33±1655.92 - 578.22±269.51 = 16535.11±1677.71 cycles
+
+Therefore, if we add 2 markers not based on immediate values to the getppid
+code, which would add 2 memory reads, we would add
+2 * 555.93±11.46 = 1111.86±22.92 cycles
+
+Therefore,
+
+1111.86±22.92 / 16535.11±1677.71 = 0.0672
+ relative error: sqrt(((22.92/1111.86)^2)+((1677.71/16535.11)^2))
+                     = 0.1035
+ absolute error: 0.1035 * 0.0672 = 0.0070
+
+Therefore: 0.0672±0.0070 * 100% = 6.72±0.70 %
+
+We can therefore affirm that adding 2 markers to getppid, on a system with high
+memory pressure, would have a performance hit of at least 6.0% on the system
+call time, all within the uncertainty limits of these tests. The same applies to
+other kernel code paths. The smaller those code paths are, the higher the
+impact ratio will be.
+
+Therefore, not only are immediate values interesting for dynamically
+activating dormant code such as the markers, but I think they should also be
+considered as a replacement for many of the "read-mostly" static variables.
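
As an illustration of the "Usage" section of the document above, here is a
minimal userspace sketch of the rarely-enabled-branch pattern. It is not the
kernel implementation: the macros below are stand-ins written for this
example, assuming a plain global variable fallback (the optimized versions
patch the instruction stream, which cannot be reproduced here). The names
trace_enabled, events_traced, handle_event and run_events are hypothetical,
and the two-argument form of imv_set() is an assumption of this sketch. It
relies on the GCC/Clang extensions __typeof__ and __builtin_expect.

```c
/*
 * Userspace stand-ins for the macros of linux/immediate.h (assumption:
 * they behave like a global variable fallback; nothing is patched).
 */
#define DEFINE_IMV(type, name)	__typeof__(type) name##__imv
#define _imv_read(name)		(name##__imv)
#define imv_read(name)		_imv_read(name)
#define imv_set(name, value)	((void)(name##__imv = (value)))
#define unlikely(x)		__builtin_expect(!!(x), 0)

/* Hypothetical flag: read often, updated rarely. */
DEFINE_IMV(char, trace_enabled);

/* Hypothetical counter, only touched when the dormant code is enabled. */
static unsigned long events_traced;

/* Hypothetical hot path: the rarely enabled code sits behind the test. */
static int handle_event(int payload)
{
	if (unlikely(imv_read(trace_enabled)))
		events_traced++;	/* dormant code */
	return payload * 2;		/* normal work */
}

/* Run the hot path n times and return how many events were traced. */
static unsigned long run_events(int n)
{
	int i;

	events_traced = 0;
	for (i = 0; i < n; i++)
		handle_event(i);
	return events_traced;
}
```

With the flag at its initial value of 0, run_events() traces nothing; after
imv_set(trace_enabled, 1), every call to handle_event() takes the dormant
branch.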

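The arithmetic in the "Performance improvement" section above can be
reproduced with a few helpers. This is only a sketch: the function names are
made up for this example, the measurements are assumed to be independent (so
their uncertainties add in quadrature), and the square root uses Newton's
method simply to avoid depending on libm.

```c
#include <assert.h>

/* Fraction of CPU time lost: (sites/s) * (cycles/site) / (cycles/s). */
static double slowdown_fraction(double sites_per_s, double cycles_per_site,
				double cpu_hz)
{
	assert(cpu_hz > 0.0);
	return sites_per_s * cycles_per_site / cpu_hz;
}

/* Square root by Newton's method (hypothetical helper for this sketch). */
static double newton_sqrt(double x)
{
	double guess = x > 1.0 ? x : 1.0;
	int i;

	assert(x >= 0.0);
	for (i = 0; i < 100; i++)
		guess = 0.5 * (guess + x / guess);
	return guess;
}

/* Uncertainty of a sum or difference of two independent measurements. */
static double err_add(double err_a, double err_b)
{
	return newton_sqrt(err_a * err_a + err_b * err_b);
}

/* Relative uncertainty of the quotient a/b of independent measurements. */
static double rel_err_div(double a, double err_a, double b, double err_b)
{
	double ra = err_a / a;
	double rb = err_b / b;

	return newton_sqrt(ra * ra + rb * rb);
}
```

For instance, slowdown_fraction(15348, 559, 3e9) gives roughly 0.0029, the
0.29% figure for the memory load based test, and err_add(1655.92, 269.51)
gives roughly 1677.71, the uncertainty on getppid with memory pressure.
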
-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
