* [patch 0/3] Allow inlined spinlocks again V3
From: Heiko Carstens @ 2009-08-12 18:39 UTC
  To: Andrew Morton
  Cc: Linus Torvalds, Peter Zijlstra, Ingo Molnar, linux-arch,
	Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, Horst Hartmann,
	Christian Ehrhardt

Hi Andrew,

since there were no objections to this patch set, would you mind picking
it up for -mm? The patches are against Linus' current git head but also
apply on top of the latest mmotm patch set.

===

This patch set makes it possible to inline spinlocks again.

The rationale is that function calls are expensive, at least on s390.

Server kernels are usually compiled with !CONFIG_PREEMPT, so a simple
spin_lock is just a compare-and-swap loop; the extra overhead of a
function call is significant.
With inlined spinlocks, overall CPU usage on s390 drops by 1%-5%.
These numbers were taken with network benchmarks, but I expect any
workload that frequently calls into the kernel and grabs a few locks
to perform better.

The implementation is straightforward: move the function bodies of the
locking functions into static inline functions in a header file.
Depending on CONFIG_SPINLOCK_INLINE, either out-of-line or inlined
locking functions are then generated.
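
In outline, the pattern looks like this (a condensed sketch of the
patches that follow, not the complete code):

	/* include/linux/spinlock_api_smp.h */
	static inline void __spin_lock(spinlock_t *lock)
	{
		preempt_disable();
		spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
		LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock);
	}

	#ifdef CONFIG_SPINLOCK_INLINE
	#define _spin_lock(lock)	__spin_lock(lock)
	#endif

	/* kernel/spinlock.c */
	#ifndef CONFIG_SPINLOCK_INLINE
	void __lockfunc _spin_lock(spinlock_t *lock)
	{
		__spin_lock(lock);
	}
	EXPORT_SYMBOL(_spin_lock);
	#endif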

The patches should be self-explanatory.

V2: rewritten from scratch - now also with readable code

V3: removed the macro that generated the out-of-line spinlock variants,
    since it would break ctags; requested by Arnd Bergmann.


* [patch 1/3] spinlock: move spinlock function bodies to header file
From: Heiko Carstens @ 2009-08-12 18:39 UTC
  To: Andrew Morton
  Cc: Linus Torvalds, Peter Zijlstra, Ingo Molnar, linux-arch,
	Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, Horst Hartmann,
	Christian Ehrhardt

[-- Attachment #1: 01_spinlock_move.diff --]
[-- Type: text/plain, Size: 17156 bytes --]

From: Heiko Carstens <heiko.carstens@de.ibm.com>

Move the spinlock function bodies to a header file by creating a
static inline version of each variant, and use the inline versions
in the out-of-line code. This shouldn't make any difference, except
that the spinlock code can now also be used to generate inlined
spinlock code.

Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
---
 include/linux/spinlock.h         |   18 +-
 include/linux/spinlock_api_smp.h |  263 +++++++++++++++++++++++++++++++++++++++
 kernel/spinlock.c                |  174 ++++---------------------
 3 files changed, 300 insertions(+), 155 deletions(-)

Index: linux-2.6/include/linux/spinlock_api_smp.h
===================================================================
--- linux-2.6.orig/include/linux/spinlock_api_smp.h
+++ linux-2.6/include/linux/spinlock_api_smp.h
@@ -60,4 +60,267 @@ void __lockfunc _read_unlock_irqrestore(
 void __lockfunc _write_unlock_irqrestore(rwlock_t *lock, unsigned long flags)
 							__releases(lock);
 
+static inline int __spin_trylock(spinlock_t *lock)
+{
+	preempt_disable();
+	if (_raw_spin_trylock(lock)) {
+		spin_acquire(&lock->dep_map, 0, 1, _RET_IP_);
+		return 1;
+	}
+	preempt_enable();
+	return 0;
+}
+
+static inline int __read_trylock(rwlock_t *lock)
+{
+	preempt_disable();
+	if (_raw_read_trylock(lock)) {
+		rwlock_acquire_read(&lock->dep_map, 0, 1, _RET_IP_);
+		return 1;
+	}
+	preempt_enable();
+	return 0;
+}
+
+static inline int __write_trylock(rwlock_t *lock)
+{
+	preempt_disable();
+	if (_raw_write_trylock(lock)) {
+		rwlock_acquire(&lock->dep_map, 0, 1, _RET_IP_);
+		return 1;
+	}
+	preempt_enable();
+	return 0;
+}
+
+/*
+ * If lockdep is enabled then we use the non-preemption spin-ops
+ * even on CONFIG_PREEMPT, because lockdep assumes that interrupts are
+ * not re-enabled during lock-acquire (which the preempt-spin-ops do):
+ */
+#if !defined(CONFIG_GENERIC_LOCKBREAK) || defined(CONFIG_DEBUG_LOCK_ALLOC)
+
+static inline void __read_lock(rwlock_t *lock)
+{
+	preempt_disable();
+	rwlock_acquire_read(&lock->dep_map, 0, 0, _RET_IP_);
+	LOCK_CONTENDED(lock, _raw_read_trylock, _raw_read_lock);
+}
+
+static inline unsigned long __spin_lock_irqsave(spinlock_t *lock)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	preempt_disable();
+	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
+	/*
+	 * On lockdep we dont want the hand-coded irq-enable of
+	 * _raw_spin_lock_flags() code, because lockdep assumes
+	 * that interrupts are not re-enabled during lock-acquire:
+	 */
+#ifdef CONFIG_LOCKDEP
+	LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock);
+#else
+	_raw_spin_lock_flags(lock, &flags);
+#endif
+	return flags;
+}
+
+static inline void __spin_lock_irq(spinlock_t *lock)
+{
+	local_irq_disable();
+	preempt_disable();
+	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
+	LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock);
+}
+
+static inline void __spin_lock_bh(spinlock_t *lock)
+{
+	local_bh_disable();
+	preempt_disable();
+	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
+	LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock);
+}
+
+static inline unsigned long __read_lock_irqsave(rwlock_t *lock)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	preempt_disable();
+	rwlock_acquire_read(&lock->dep_map, 0, 0, _RET_IP_);
+	LOCK_CONTENDED_FLAGS(lock, _raw_read_trylock, _raw_read_lock,
+			     _raw_read_lock_flags, &flags);
+	return flags;
+}
+
+static inline void __read_lock_irq(rwlock_t *lock)
+{
+	local_irq_disable();
+	preempt_disable();
+	rwlock_acquire_read(&lock->dep_map, 0, 0, _RET_IP_);
+	LOCK_CONTENDED(lock, _raw_read_trylock, _raw_read_lock);
+}
+
+static inline void __read_lock_bh(rwlock_t *lock)
+{
+	local_bh_disable();
+	preempt_disable();
+	rwlock_acquire_read(&lock->dep_map, 0, 0, _RET_IP_);
+	LOCK_CONTENDED(lock, _raw_read_trylock, _raw_read_lock);
+}
+
+static inline unsigned long __write_lock_irqsave(rwlock_t *lock)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	preempt_disable();
+	rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_);
+	LOCK_CONTENDED_FLAGS(lock, _raw_write_trylock, _raw_write_lock,
+			     _raw_write_lock_flags, &flags);
+	return flags;
+}
+
+static inline void __write_lock_irq(rwlock_t *lock)
+{
+	local_irq_disable();
+	preempt_disable();
+	rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_);
+	LOCK_CONTENDED(lock, _raw_write_trylock, _raw_write_lock);
+}
+
+static inline void __write_lock_bh(rwlock_t *lock)
+{
+	local_bh_disable();
+	preempt_disable();
+	rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_);
+	LOCK_CONTENDED(lock, _raw_write_trylock, _raw_write_lock);
+}
+
+static inline void __spin_lock(spinlock_t *lock)
+{
+	preempt_disable();
+	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
+	LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock);
+}
+
+static inline void __write_lock(rwlock_t *lock)
+{
+	preempt_disable();
+	rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_);
+	LOCK_CONTENDED(lock, _raw_write_trylock, _raw_write_lock);
+}
+
+#endif /* CONFIG_PREEMPT */
+
+static inline void __spin_unlock(spinlock_t *lock)
+{
+	spin_release(&lock->dep_map, 1, _RET_IP_);
+	_raw_spin_unlock(lock);
+	preempt_enable();
+}
+
+static inline void __write_unlock(rwlock_t *lock)
+{
+	rwlock_release(&lock->dep_map, 1, _RET_IP_);
+	_raw_write_unlock(lock);
+	preempt_enable();
+}
+
+static inline void __read_unlock(rwlock_t *lock)
+{
+	rwlock_release(&lock->dep_map, 1, _RET_IP_);
+	_raw_read_unlock(lock);
+	preempt_enable();
+}
+
+static inline void __spin_unlock_irqrestore(spinlock_t *lock,
+					    unsigned long flags)
+{
+	spin_release(&lock->dep_map, 1, _RET_IP_);
+	_raw_spin_unlock(lock);
+	local_irq_restore(flags);
+	preempt_enable();
+}
+
+static inline void __spin_unlock_irq(spinlock_t *lock)
+{
+	spin_release(&lock->dep_map, 1, _RET_IP_);
+	_raw_spin_unlock(lock);
+	local_irq_enable();
+	preempt_enable();
+}
+
+static inline void __spin_unlock_bh(spinlock_t *lock)
+{
+	spin_release(&lock->dep_map, 1, _RET_IP_);
+	_raw_spin_unlock(lock);
+	preempt_enable_no_resched();
+	local_bh_enable_ip((unsigned long)__builtin_return_address(0));
+}
+
+static inline void __read_unlock_irqrestore(rwlock_t *lock, unsigned long flags)
+{
+	rwlock_release(&lock->dep_map, 1, _RET_IP_);
+	_raw_read_unlock(lock);
+	local_irq_restore(flags);
+	preempt_enable();
+}
+
+static inline void __read_unlock_irq(rwlock_t *lock)
+{
+	rwlock_release(&lock->dep_map, 1, _RET_IP_);
+	_raw_read_unlock(lock);
+	local_irq_enable();
+	preempt_enable();
+}
+
+static inline void __read_unlock_bh(rwlock_t *lock)
+{
+	rwlock_release(&lock->dep_map, 1, _RET_IP_);
+	_raw_read_unlock(lock);
+	preempt_enable_no_resched();
+	local_bh_enable_ip((unsigned long)__builtin_return_address(0));
+}
+
+static inline void __write_unlock_irqrestore(rwlock_t *lock,
+					     unsigned long flags)
+{
+	rwlock_release(&lock->dep_map, 1, _RET_IP_);
+	_raw_write_unlock(lock);
+	local_irq_restore(flags);
+	preempt_enable();
+}
+
+static inline void __write_unlock_irq(rwlock_t *lock)
+{
+	rwlock_release(&lock->dep_map, 1, _RET_IP_);
+	_raw_write_unlock(lock);
+	local_irq_enable();
+	preempt_enable();
+}
+
+static inline void __write_unlock_bh(rwlock_t *lock)
+{
+	rwlock_release(&lock->dep_map, 1, _RET_IP_);
+	_raw_write_unlock(lock);
+	preempt_enable_no_resched();
+	local_bh_enable_ip((unsigned long)__builtin_return_address(0));
+}
+
+static inline int __spin_trylock_bh(spinlock_t *lock)
+{
+	local_bh_disable();
+	preempt_disable();
+	if (_raw_spin_trylock(lock)) {
+		spin_acquire(&lock->dep_map, 0, 1, _RET_IP_);
+		return 1;
+	}
+	preempt_enable_no_resched();
+	local_bh_enable_ip((unsigned long)__builtin_return_address(0));
+	return 0;
+}
+
 #endif /* __LINUX_SPINLOCK_API_SMP_H */
Index: linux-2.6/include/linux/spinlock.h
===================================================================
--- linux-2.6.orig/include/linux/spinlock.h
+++ linux-2.6/include/linux/spinlock.h
@@ -143,15 +143,6 @@ static inline void smp_mb__after_lock(vo
  */
 #define spin_unlock_wait(lock)	__raw_spin_unlock_wait(&(lock)->raw_lock)
 
-/*
- * Pull the _spin_*()/_read_*()/_write_*() functions/declarations:
- */
-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
-# include <linux/spinlock_api_smp.h>
-#else
-# include <linux/spinlock_api_up.h>
-#endif
-
 #ifdef CONFIG_DEBUG_SPINLOCK
  extern void _raw_spin_lock(spinlock_t *lock);
 #define _raw_spin_lock_flags(lock, flags) _raw_spin_lock(lock)
@@ -380,4 +371,13 @@ extern int _atomic_dec_and_lock(atomic_t
  */
 #define spin_can_lock(lock)	(!spin_is_locked(lock))
 
+/*
+ * Pull the _spin_*()/_read_*()/_write_*() functions/declarations:
+ */
+#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
+# include <linux/spinlock_api_smp.h>
+#else
+# include <linux/spinlock_api_up.h>
+#endif
+
 #endif /* __LINUX_SPINLOCK_H */
Index: linux-2.6/kernel/spinlock.c
===================================================================
--- linux-2.6.orig/kernel/spinlock.c
+++ linux-2.6/kernel/spinlock.c
@@ -23,40 +23,19 @@
 
 int __lockfunc _spin_trylock(spinlock_t *lock)
 {
-	preempt_disable();
-	if (_raw_spin_trylock(lock)) {
-		spin_acquire(&lock->dep_map, 0, 1, _RET_IP_);
-		return 1;
-	}
-	
-	preempt_enable();
-	return 0;
+	return __spin_trylock(lock);
 }
 EXPORT_SYMBOL(_spin_trylock);
 
 int __lockfunc _read_trylock(rwlock_t *lock)
 {
-	preempt_disable();
-	if (_raw_read_trylock(lock)) {
-		rwlock_acquire_read(&lock->dep_map, 0, 1, _RET_IP_);
-		return 1;
-	}
-
-	preempt_enable();
-	return 0;
+	return __read_trylock(lock);
 }
 EXPORT_SYMBOL(_read_trylock);
 
 int __lockfunc _write_trylock(rwlock_t *lock)
 {
-	preempt_disable();
-	if (_raw_write_trylock(lock)) {
-		rwlock_acquire(&lock->dep_map, 0, 1, _RET_IP_);
-		return 1;
-	}
-
-	preempt_enable();
-	return 0;
+	return __write_trylock(lock);
 }
 EXPORT_SYMBOL(_write_trylock);
 
@@ -69,129 +48,74 @@ EXPORT_SYMBOL(_write_trylock);
 
 void __lockfunc _read_lock(rwlock_t *lock)
 {
-	preempt_disable();
-	rwlock_acquire_read(&lock->dep_map, 0, 0, _RET_IP_);
-	LOCK_CONTENDED(lock, _raw_read_trylock, _raw_read_lock);
+	__read_lock(lock);
 }
 EXPORT_SYMBOL(_read_lock);
 
 unsigned long __lockfunc _spin_lock_irqsave(spinlock_t *lock)
 {
-	unsigned long flags;
-
-	local_irq_save(flags);
-	preempt_disable();
-	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
-	/*
-	 * On lockdep we dont want the hand-coded irq-enable of
-	 * _raw_spin_lock_flags() code, because lockdep assumes
-	 * that interrupts are not re-enabled during lock-acquire:
-	 */
-#ifdef CONFIG_LOCKDEP
-	LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock);
-#else
-	_raw_spin_lock_flags(lock, &flags);
-#endif
-	return flags;
+	return __spin_lock_irqsave(lock);
 }
 EXPORT_SYMBOL(_spin_lock_irqsave);
 
 void __lockfunc _spin_lock_irq(spinlock_t *lock)
 {
-	local_irq_disable();
-	preempt_disable();
-	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
-	LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock);
+	__spin_lock_irq(lock);
 }
 EXPORT_SYMBOL(_spin_lock_irq);
 
 void __lockfunc _spin_lock_bh(spinlock_t *lock)
 {
-	local_bh_disable();
-	preempt_disable();
-	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
-	LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock);
+	__spin_lock_bh(lock);
 }
 EXPORT_SYMBOL(_spin_lock_bh);
 
 unsigned long __lockfunc _read_lock_irqsave(rwlock_t *lock)
 {
-	unsigned long flags;
-
-	local_irq_save(flags);
-	preempt_disable();
-	rwlock_acquire_read(&lock->dep_map, 0, 0, _RET_IP_);
-	LOCK_CONTENDED_FLAGS(lock, _raw_read_trylock, _raw_read_lock,
-			     _raw_read_lock_flags, &flags);
-	return flags;
+	return __read_lock_irqsave(lock);
 }
 EXPORT_SYMBOL(_read_lock_irqsave);
 
 void __lockfunc _read_lock_irq(rwlock_t *lock)
 {
-	local_irq_disable();
-	preempt_disable();
-	rwlock_acquire_read(&lock->dep_map, 0, 0, _RET_IP_);
-	LOCK_CONTENDED(lock, _raw_read_trylock, _raw_read_lock);
+	__read_lock_irq(lock);
 }
 EXPORT_SYMBOL(_read_lock_irq);
 
 void __lockfunc _read_lock_bh(rwlock_t *lock)
 {
-	local_bh_disable();
-	preempt_disable();
-	rwlock_acquire_read(&lock->dep_map, 0, 0, _RET_IP_);
-	LOCK_CONTENDED(lock, _raw_read_trylock, _raw_read_lock);
+	__read_lock_bh(lock);
 }
 EXPORT_SYMBOL(_read_lock_bh);
 
 unsigned long __lockfunc _write_lock_irqsave(rwlock_t *lock)
 {
-	unsigned long flags;
-
-	local_irq_save(flags);
-	preempt_disable();
-	rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_);
-	LOCK_CONTENDED_FLAGS(lock, _raw_write_trylock, _raw_write_lock,
-			     _raw_write_lock_flags, &flags);
-	return flags;
+	return __write_lock_irqsave(lock);
 }
 EXPORT_SYMBOL(_write_lock_irqsave);
 
 void __lockfunc _write_lock_irq(rwlock_t *lock)
 {
-	local_irq_disable();
-	preempt_disable();
-	rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_);
-	LOCK_CONTENDED(lock, _raw_write_trylock, _raw_write_lock);
+	__write_lock_irq(lock);
 }
 EXPORT_SYMBOL(_write_lock_irq);
 
 void __lockfunc _write_lock_bh(rwlock_t *lock)
 {
-	local_bh_disable();
-	preempt_disable();
-	rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_);
-	LOCK_CONTENDED(lock, _raw_write_trylock, _raw_write_lock);
+	__write_lock_bh(lock);
 }
 EXPORT_SYMBOL(_write_lock_bh);
 
 void __lockfunc _spin_lock(spinlock_t *lock)
 {
-	preempt_disable();
-	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
-	LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock);
+	__spin_lock(lock);
 }
-
 EXPORT_SYMBOL(_spin_lock);
 
 void __lockfunc _write_lock(rwlock_t *lock)
 {
-	preempt_disable();
-	rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_);
-	LOCK_CONTENDED(lock, _raw_write_trylock, _raw_write_lock);
+	__write_lock(lock);
 }
-
 EXPORT_SYMBOL(_write_lock);
 
 #else /* CONFIG_PREEMPT: */
@@ -320,121 +244,79 @@ EXPORT_SYMBOL(_spin_lock_nest_lock);
 
 void __lockfunc _spin_unlock(spinlock_t *lock)
 {
-	spin_release(&lock->dep_map, 1, _RET_IP_);
-	_raw_spin_unlock(lock);
-	preempt_enable();
+	__spin_unlock(lock);
 }
 EXPORT_SYMBOL(_spin_unlock);
 
 void __lockfunc _write_unlock(rwlock_t *lock)
 {
-	rwlock_release(&lock->dep_map, 1, _RET_IP_);
-	_raw_write_unlock(lock);
-	preempt_enable();
+	__write_unlock(lock);
 }
 EXPORT_SYMBOL(_write_unlock);
 
 void __lockfunc _read_unlock(rwlock_t *lock)
 {
-	rwlock_release(&lock->dep_map, 1, _RET_IP_);
-	_raw_read_unlock(lock);
-	preempt_enable();
+	__read_unlock(lock);
 }
 EXPORT_SYMBOL(_read_unlock);
 
 void __lockfunc _spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags)
 {
-	spin_release(&lock->dep_map, 1, _RET_IP_);
-	_raw_spin_unlock(lock);
-	local_irq_restore(flags);
-	preempt_enable();
+	__spin_unlock_irqrestore(lock, flags);
 }
 EXPORT_SYMBOL(_spin_unlock_irqrestore);
 
 void __lockfunc _spin_unlock_irq(spinlock_t *lock)
 {
-	spin_release(&lock->dep_map, 1, _RET_IP_);
-	_raw_spin_unlock(lock);
-	local_irq_enable();
-	preempt_enable();
+	__spin_unlock_irq(lock);
 }
 EXPORT_SYMBOL(_spin_unlock_irq);
 
 void __lockfunc _spin_unlock_bh(spinlock_t *lock)
 {
-	spin_release(&lock->dep_map, 1, _RET_IP_);
-	_raw_spin_unlock(lock);
-	preempt_enable_no_resched();
-	local_bh_enable_ip((unsigned long)__builtin_return_address(0));
+	__spin_unlock_bh(lock);
 }
 EXPORT_SYMBOL(_spin_unlock_bh);
 
 void __lockfunc _read_unlock_irqrestore(rwlock_t *lock, unsigned long flags)
 {
-	rwlock_release(&lock->dep_map, 1, _RET_IP_);
-	_raw_read_unlock(lock);
-	local_irq_restore(flags);
-	preempt_enable();
+	__read_unlock_irqrestore(lock, flags);
 }
 EXPORT_SYMBOL(_read_unlock_irqrestore);
 
 void __lockfunc _read_unlock_irq(rwlock_t *lock)
 {
-	rwlock_release(&lock->dep_map, 1, _RET_IP_);
-	_raw_read_unlock(lock);
-	local_irq_enable();
-	preempt_enable();
+	__read_unlock_irq(lock);
 }
 EXPORT_SYMBOL(_read_unlock_irq);
 
 void __lockfunc _read_unlock_bh(rwlock_t *lock)
 {
-	rwlock_release(&lock->dep_map, 1, _RET_IP_);
-	_raw_read_unlock(lock);
-	preempt_enable_no_resched();
-	local_bh_enable_ip((unsigned long)__builtin_return_address(0));
+	__read_unlock_bh(lock);
 }
 EXPORT_SYMBOL(_read_unlock_bh);
 
 void __lockfunc _write_unlock_irqrestore(rwlock_t *lock, unsigned long flags)
 {
-	rwlock_release(&lock->dep_map, 1, _RET_IP_);
-	_raw_write_unlock(lock);
-	local_irq_restore(flags);
-	preempt_enable();
+	__write_unlock_irqrestore(lock, flags);
 }
 EXPORT_SYMBOL(_write_unlock_irqrestore);
 
 void __lockfunc _write_unlock_irq(rwlock_t *lock)
 {
-	rwlock_release(&lock->dep_map, 1, _RET_IP_);
-	_raw_write_unlock(lock);
-	local_irq_enable();
-	preempt_enable();
+	__write_unlock_irq(lock);
 }
 EXPORT_SYMBOL(_write_unlock_irq);
 
 void __lockfunc _write_unlock_bh(rwlock_t *lock)
 {
-	rwlock_release(&lock->dep_map, 1, _RET_IP_);
-	_raw_write_unlock(lock);
-	preempt_enable_no_resched();
-	local_bh_enable_ip((unsigned long)__builtin_return_address(0));
+	__write_unlock_bh(lock);
 }
 EXPORT_SYMBOL(_write_unlock_bh);
 
 int __lockfunc _spin_trylock_bh(spinlock_t *lock)
 {
-	local_bh_disable();
-	preempt_disable();
-	if (_raw_spin_trylock(lock)) {
-		spin_acquire(&lock->dep_map, 0, 1, _RET_IP_);
-		return 1;
-	}
-
-	preempt_enable_no_resched();
-	local_bh_enable_ip((unsigned long)__builtin_return_address(0));
-	return 0;
+	return __spin_trylock_bh(lock);
 }
 EXPORT_SYMBOL(_spin_trylock_bh);
 

-- 


* [patch 2/3] spinlock: allow inlined spinlocks
From: Heiko Carstens @ 2009-08-12 18:39 UTC
  To: Andrew Morton
  Cc: Linus Torvalds, Peter Zijlstra, Ingo Molnar, linux-arch,
	Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, Horst Hartmann,
	Christian Ehrhardt

[-- Attachment #1: 02_spinlock_inline.diff --]
[-- Type: text/plain, Size: 4678 bytes --]

From: Heiko Carstens <heiko.carstens@de.ibm.com>

Add a new config option, SPINLOCK_INLINE, plus some defines that
depend on it, in order to generate inlined spinlock code instead of
out-of-line code.
Avoiding function calls for spinlocks gives 1%-5% lower CPU usage in
network benchmarks on s390.

Architectures must select HAVE_SPINLOCK_INLINE_SUPPORT to enable this
config option.

Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
---
 include/linux/spinlock_api_smp.h |   35 +++++++++++++++++++++++++++++++++++
 kernel/spinlock.c                |    4 ++++
 lib/Kconfig.debug                |   14 ++++++++++++++
 3 files changed, 53 insertions(+)

Index: linux-2.6/include/linux/spinlock_api_smp.h
===================================================================
--- linux-2.6.orig/include/linux/spinlock_api_smp.h
+++ linux-2.6/include/linux/spinlock_api_smp.h
@@ -19,6 +19,8 @@ int in_lock_functions(unsigned long addr
 
 #define assert_spin_locked(x)	BUG_ON(!spin_is_locked(x))
 
+#ifndef CONFIG_SPINLOCK_INLINE
+
 void __lockfunc _spin_lock(spinlock_t *lock)		__acquires(lock);
 void __lockfunc _spin_lock_nested(spinlock_t *lock, int subclass)
 							__acquires(lock);
@@ -60,6 +62,39 @@ void __lockfunc _read_unlock_irqrestore(
 void __lockfunc _write_unlock_irqrestore(rwlock_t *lock, unsigned long flags)
 							__releases(lock);
 
+#else /* CONFIG_SPINLOCK_INLINE */
+
+#define _spin_trylock(lock)	  __spin_trylock(lock)
+#define _read_trylock(lock)	  __read_trylock(lock)
+#define _write_trylock(lock)	  __write_trylock(lock)
+#define _read_lock(lock)	  __read_lock(lock)
+#define _spin_lock_irqsave(lock)  __spin_lock_irqsave(lock)
+#define _spin_lock_irq(lock)	  __spin_lock_irq(lock)
+#define _spin_lock_bh(lock)	  __spin_lock_bh(lock)
+#define _read_lock_irqsave(lock)  __read_lock_irqsave(lock)
+#define _read_lock_irq(lock)	  __read_lock_irq(lock)
+#define _read_lock_bh(lock)	  __read_lock_bh(lock)
+#define _write_lock_irqsave(lock) __write_lock_irqsave(lock)
+#define _write_lock_irq(lock)	  __write_lock_irq(lock)
+#define _write_lock_bh(lock)	  __write_lock_bh(lock)
+#define _spin_lock(lock)	  __spin_lock(lock)
+#define _write_lock(lock)	  __write_lock(lock)
+#define _spin_unlock(lock)	  __spin_unlock(lock)
+#define _write_unlock(lock)	  __write_unlock(lock)
+#define _read_unlock(lock)	  __read_unlock(lock)
+#define _spin_unlock_irq(lock)	  __spin_unlock_irq(lock)
+#define _spin_unlock_bh(lock)	  __spin_unlock_bh(lock)
+#define _read_unlock_irq(lock)	  __read_unlock_irq(lock)
+#define _read_unlock_bh(lock)	  __read_unlock_bh(lock)
+#define _write_unlock_irq(lock)	  __write_unlock_irq(lock)
+#define _write_unlock_bh(lock)	  __write_unlock_bh(lock)
+#define _spin_trylock_bh(lock)	  __spin_trylock_bh(lock)
+#define _spin_unlock_irqrestore(lock, flags)  __spin_unlock_irqrestore(lock, flags)
+#define _read_unlock_irqrestore(lock, flags)  __read_unlock_irqrestore(lock, flags)
+#define _write_unlock_irqrestore(lock, flags) __write_unlock_irqrestore(lock, flags)
+
+#endif /* CONFIG_SPINLOCK_INLINE */
+
 static inline int __spin_trylock(spinlock_t *lock)
 {
 	preempt_disable();
Index: linux-2.6/lib/Kconfig.debug
===================================================================
--- linux-2.6.orig/lib/Kconfig.debug
+++ linux-2.6/lib/Kconfig.debug
@@ -879,6 +879,20 @@ config SYSCTL_SYSCALL_CHECK
 	  to properly maintain and use. This enables checks that help
 	  you to keep things correct.
 
+config HAVE_SPINLOCK_INLINE_SUPPORT
+	bool
+
+config SPINLOCK_INLINE
+	bool "Inline spinlock code"
+	depends on HAVE_SPINLOCK_INLINE_SUPPORT
+	depends on !DEBUG_SPINLOCK
+	depends on SMP && !PREEMPT
+	help
+	  Select this option if you want to have inlined spinlock code instead
+	  of an out of line implementation.
+	  This will generate a larger kernel image. On some architectures this
+	  increases performance.
+
 source mm/Kconfig.debug
 source kernel/trace/Kconfig
 
Index: linux-2.6/kernel/spinlock.c
===================================================================
--- linux-2.6.orig/kernel/spinlock.c
+++ linux-2.6/kernel/spinlock.c
@@ -21,6 +21,8 @@
 #include <linux/debug_locks.h>
 #include <linux/module.h>
 
+#ifndef CONFIG_SPINLOCK_INLINE
+
 int __lockfunc _spin_trylock(spinlock_t *lock)
 {
 	return __spin_trylock(lock);
@@ -320,6 +322,8 @@ int __lockfunc _spin_trylock_bh(spinlock
 }
 EXPORT_SYMBOL(_spin_trylock_bh);
 
+#endif /* CONFIG_SPINLOCK_INLINE */
+
 notrace int in_lock_functions(unsigned long addr)
 {
 	/* Linker adds these: start and end of __lockfunc functions */

-- 


* [patch 3/3] spinlock: allow inlined spinlocks on s390
From: Heiko Carstens @ 2009-08-12 18:39 UTC
  To: Andrew Morton
  Cc: Linus Torvalds, Peter Zijlstra, Ingo Molnar, linux-arch,
	Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, Horst Hartmann,
	Christian Ehrhardt

[-- Attachment #1: 03_spinlock_s390.diff --]
[-- Type: text/plain, Size: 680 bytes --]

From: Heiko Carstens <heiko.carstens@de.ibm.com>

Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
---
 arch/s390/Kconfig |    1 +
 1 file changed, 1 insertion(+)

Index: linux-2.6/arch/s390/Kconfig
===================================================================
--- linux-2.6.orig/arch/s390/Kconfig
+++ linux-2.6/arch/s390/Kconfig
@@ -93,6 +93,7 @@ config S390
 	select HAVE_KRETPROBES
 	select HAVE_KVM if 64BIT
 	select HAVE_ARCH_TRACEHOOK
+	select HAVE_SPINLOCK_INLINE_SUPPORT
 	select INIT_ALL_POSSIBLE
 	select HAVE_PERF_COUNTERS
 	select GENERIC_ATOMIC64 if !64BIT

-- 


* Re: [patch 0/3] Allow inlined spinlocks again V3
From: Linus Torvalds @ 2009-08-13 18:11 UTC
  To: Heiko Carstens
  Cc: Andrew Morton, Peter Zijlstra, Ingo Molnar, linux-arch,
	Martin Schwidefsky, Arnd Bergmann, Horst Hartmann,
	Christian Ehrhardt



On Wed, 12 Aug 2009, Heiko Carstens wrote:
> 
> This patch set makes it possible to inline spinlocks again.

I'm not convinced this is the right approach.

The thing is, CONFIG_SPINLOCK_INLINE doesn't look like the right thing to 
do. On x86, for example, function calls are usually quite cheap, and it 
may well make sense to keep 'spin_lock()' out-of-line, especially once you 
start doing things like playing with the softirq counts or interrupts.

At the same time, the special case of spin *unlock* is generally just a 
single (small) instruction on x86, so that one almost certainly is worth 
inlining - at least as long as it's not one of the softirq ones.
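
(For illustration, using the helpers from patch 1: in a !CONFIG_PREEMPT,
!CONFIG_DEBUG_LOCK_ALLOC build, everything except the raw unlock compiles
away, e.g.

	static inline void __spin_unlock(spinlock_t *lock)
	{
		spin_release(&lock->dep_map, 1, _RET_IP_); /* no-op without lockdep */
		_raw_spin_unlock(lock);    /* one small instruction on x86 */
		preempt_enable();          /* compiles away with !CONFIG_PREEMPT */
	}

so inlining the plain unlock costs almost nothing in code size.)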

In other words, I think your hammer is too large. If you want to inline 
the lock functions, you should _not_ make this a single "everything or 
nothing" thing.

> Server kernels are usually compiled with !CONFIG_PREEMPT, so a simple
> spin_lock is just a compare-and-swap loop; the extra overhead of a
> function call is significant.
> With inlined spinlocks, overall CPU usage on s390 drops by 1%-5%.
> These numbers were taken with network benchmarks, but I expect any
> workload that frequently calls into the kernel and grabs a few locks
> to perform better.

Have you tried to determine which kinds of spinlocks you care most about? 
With networking, I guess the softirq-safe ones (and perhaps the irq ones) 
are pretty common, but maybe even there it's really just a couple of 
cases. 

Now, there is one advantage to inlining that I really like - and that I 
personally have missed from the days when we used to always do it: 
profiling. Inlining the spinlocks tends to do _wonders_ for showing 
exactly which spinlock ends up being the contention cases for certain 
loads.

So there are advantages outside of the "size vs speed" kind of things, but 
at the same time I do suspect that the disadvantages mean that we really 
should make this be something where an architecture can specify things at 
a finer granularity. Perhaps by using per-symbol defines (ie have the 
<asm/spinlock.h> file do things like

	#define __spin_lock_is_small
	#define __spin_unlock_is_small

kind of things for the particular sequences that are worth inlining.
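
The generic code could then pick the implementation per symbol, along
these lines (a sketch of the idea only; the *_is_small defines are
hypothetical):

	#ifdef __spin_unlock_is_small
	#define _spin_unlock(lock)	__spin_unlock(lock)
	#else
	void __lockfunc _spin_unlock(spinlock_t *lock)	__releases(lock);
	#endif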

		Linus


* Re: [patch 0/3] Allow inlined spinlocks again V3
From: Ingo Molnar @ 2009-08-13 18:34 UTC
  To: Linus Torvalds
  Cc: Heiko Carstens, Andrew Morton, Peter Zijlstra, linux-arch,
	Martin Schwidefsky, Arnd Bergmann, Horst Hartmann,
	Christian Ehrhardt


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Now, there is one advantage to inlining that I really like - and 
> that I personally have missed from the days when we used to 
> always do it: profiling. Inlining the spinlocks tends to do 
> _wonders_ for showing exactly which spinlock ends up being the 
> contention cases for certain loads. [...]

We've got a new tool to determine that kind of information though: 
perf + call-graph profiling.

This is how it works and how it can be used. Say we want to profile 
'bw_pipe' from lmbench. The traditional way is to record a flat 
profile:

 aldebaran:~/l> perf record -f ./bw_pipe
 Pipe bandwidth: 4607.33 MB/sec
 [ perf record: Captured and wrote 2.503 MB perf.data (~109368 samples) ]

and then to view the flat profile:

 aldebaran:~/l> perf report
 # Samples: 109331
 #
 # Overhead  Command             Shared Object  Symbol
 # ........  .......  ........................  ......
 #
    39.61%  bw_pipe  [kernel]                  [k] copy_user_generic_string
    19.32%  bw_pipe  ./bw_pipe                 [.] enough_duration
     6.21%  bw_pipe  [kernel]                  [k] mutex_spin_on_owner
     3.54%  bw_pipe  ./bw_pipe                 [.] two_op
     3.54%  bw_pipe  ./bw_pipe                 [.] one_op
     2.79%  bw_pipe  [kernel]                  [k] get_page_from_freelist
     2.16%  bw_pipe  [kernel]                  [k] ia32_sysenter_target
     1.91%  bw_pipe  [kernel]                  [k] pipe_read
     1.83%  bw_pipe  [kernel]                  [k] pipe_write
     1.63%  bw_pipe  [kernel]                  [k] free_hot_cold_page
     1.21%  bw_pipe  [kernel]                  [k] put_page
     1.19%  bw_pipe  [kernel]                  [k] __alloc_pages_nodemask
     0.99%  bw_pipe  [kernel]                  [k] anon_pipe_buf_release
     0.74%  bw_pipe  [kernel]                  [k] get_pageblock_flags_group

Annoyingly enough, mutex_spin_on_owner() is in the profile but we 
have no idea what the parent function(s) are. (Plus there's no way 
we could reasonably inline mutex functions.) So we are stuck with 
guessing which of the many other functions above (or some other 
function which does not even get listed!) has that mutex spinning 
overhead.

But if you do the recording via call-graph traces:

 aldebaran:~/l> perf record -g -f ./bw_pipe
 Pipe bandwidth: 4432.75 MB/sec
 [ perf record: Captured and wrote 7.550 MB perf.data (~329886 samples) ]

(note the -g/--graph option) and then view the profile via perf 
report, we get:

 aldebaran:~/l> perf report -S _spin_lock

 # symbol: _spin_lock
 # Samples: 724
 #
 # Overhead  Command             Shared Object
 # ........  .......  ........................
 #
    98.34%  lat_tcp  [kernel]                
                |          
                |--31.74%-- ipt_do_table
                |          |          
                |          |--26.99%-- ipt_local_in_hook
                |          |          nf_iterate
                |          |          nf_hook_slow
                |          |          ip_local_deliver
                |          |          ip_rcv_finish
                |          |          ip_rcv
                |          |          netif_receive_skb
                |          |          process_backlog
                |          |          net_rx_action
                |          |          __do_softirq
                |          |          call_softirq
                |          |          do_softirq
                |          |          _local_bh_enable_ip
                |          |          local_bh_enable
                |          |          dev_queue_xmit
                |          |          ip_finish_output2
                |          |          ip_finish_output
                |          |          ip_output
                |          |          ip_local_out
                |          |          ip_queue_xmit
                |          |          tcp_transmit_skb
                |          |          tcp_write_xmit
                |          |          __tcp_push_pending_frames
                |          |          tcp_push
                |          |          tcp_sendmsg
                |          |          __sock_sendmsg
                |          |          sock_aio_write
                |          |          do_sync_write
                |          |          vfs_write
                |          |          sys_write
                |          |          sysenter_dispatch
                |          |          0xf7f50430
                |          |          0x5
                |          |          
                |          |--24.34%-- ipt_local_in_hook
                |          |          nf_iterate
                |          |          nf_hook_slow
                |          |          ip_local_deliver
                |          |          ip_rcv_finish
                |          |          ip_rcv
                |          |          netif_receive_skb
                |          |          process_backlog
                |          |          net_rx_action
                |          |          __do_softirq
                |          |          call_softirq
                |          |          do_softirq
                |          |          _local_bh_enable_ip
                |          |          local_bh_enable
                |          |          dev_queue_xmit
                |          |          ip_finish_output2
                |          |          ip_finish_output
                |          |          ip_output
                |          |          ip_local_out
                |          |          ip_queue_xmit
                |          |          tcp_transmit_skb
                |          |          tcp_write_xmit
                |          |          __tcp_push_pending_frames
                |          |          tcp_push
                |          |          tcp_sendmsg
                |          |          __sock_sendmsg
                |          |          sock_aio_write
                |          |          do_sync_write
                |          |          vfs_write
                |          |          sys_write
                |          |          sysenter_dispatch
                |          |          0xf7f50430
                |          |          0x5
                |          |          
                |          |--16.81%-- ipt_pre_routing_hook
                |          |          nf_iterate
                |          |          nf_hook_slow
                |          |          ip_rcv
                |          |          netif_receive_skb
                |          |          process_backlog
                |          |          net_rx_action
                |          |          __do_softirq
                |          |          call_softirq
                |          |          do_softirq
                |          |          _local_bh_enable_ip
                |          |          local_bh_enable
                |          |          dev_queue_xmit
                |          |          ip_finish_output2
                |          |          ip_finish_output
                |          |          ip_output
                |          |          ip_local_out
                |          |          ip_queue_xmit
                |          |          tcp_transmit_skb
                |          |          tcp_write_xmit
                |          |          __tcp_push_pending_frames
                |          |          tcp_push
                |          |          tcp_sendmsg
                |          |          __sock_sendmsg
                |          |          sock_aio_write
                |          |          do_sync_write
                |          |          vfs_write
                |          |          sys_write
                |          |          sysenter_dispatch
                |          |          0xf7f50430
                |          |          0x5
                |          |          
                |          |--11.95%-- ipt_post_routing_hook
                |          |          nf_iterate
                |          |          nf_hook_slow
                |          |          ip_output
                |          |          ip_local_out
                |          |          ip_queue_xmit
                |          |          tcp_transmit_skb
                |          |          tcp_write_xmit
                |          |          __tcp_push_pending_frames
                |          |          tcp_push
                |          |          tcp_sendmsg
                |          |          __sock_sendmsg
                |          |          sock_aio_write
                |          |          do_sync_write
                |          |          vfs_write
                |          |          sys_write
                |          |          sysenter_dispatch
                |          |          0xf7f50430
                |          |          0x5
                |          |          
                |          |--10.18%-- ipt_local_out_hook
                |          |          nf_iterate
                |          |          nf_hook_slow
                |          |          __ip_local_out
                |          |          ip_local_out
                |          |          ip_queue_xmit
                |          |          tcp_transmit_skb
                |          |          tcp_write_xmit
                |          |          __tcp_push_pending_frames
                |          |          tcp_push
                |          |          tcp_sendmsg
                |          |          __sock_sendmsg
                |          |          sock_aio_write
                |          |          do_sync_write
                |          |          vfs_write
                |          |          sys_write
                |          |          sysenter_dispatch
                |          |          0xf7f50430
                |          |          0x5
                |          |          
                |           --9.73%-- ipt_local_hook
                |                     nf_iterate
                |                     nf_hook_slow
                |                     __ip_local_out
                |                     ip_local_out
                |                     ip_queue_xmit
                |                     tcp_transmit_skb
                |                     tcp_write_xmit
                |                     __tcp_push_pending_frames
                |                     tcp_push
                |                     tcp_sendmsg
                |                     __sock_sendmsg
                |                     sock_aio_write
                |                     do_sync_write
                |                     vfs_write
                |                     sys_write
                |                     sysenter_dispatch
                |                     0xf7f50430
                |                     0x5
                |          
                |--31.46%-- __mod_timer
                |          mod_timer
                |          sk_reset_timer
                |          |          
                |          |--55.36%-- tcp_send_delayed_ack
                |          |          __tcp_ack_snd_check
                |          |          tcp_rcv_established
                |          |          tcp_v4_do_rcv
                |          |          tcp_prequeue_process
                |          |          tcp_recvmsg
                |          |          sock_common_recvmsg
                |          |          __sock_recvmsg
                |          |          sock_aio_read
                |          |          do_sync_read
                |          |          vfs_read
                |          |          sys_read
                |          |          sysenter_dispatch
                |          |          0xf7f50430
                |          |          0x5
                |          |          
                |           --44.64%-- tcp_v4_rcv
                |                     ip_local_deliver_finish
                |                     ip_local_deliver
                |                     ip_rcv_finish
                |                     ip_rcv
                |                     netif_receive_skb
                |                     process_backlog
                |                     net_rx_action
                |                     __do_softirq
                |                     call_softirq
                |                     do_softirq
                |                     _local_bh_enable_ip
                |                     local_bh_enable
                |                     dev_queue_xmit
                |                     ip_finish_output2
                |                     ip_finish_output
                |                     ip_output
                |                     ip_local_out
                |                     ip_queue_xmit
                |                     tcp_transmit_skb
                |                     tcp_write_xmit
                |                     __tcp_push_pending_frames
                |                     tcp_push
                |                     tcp_sendmsg
                |                     __sock_sendmsg
                |                     sock_aio_write
                |                     do_sync_write
                |                     vfs_write
                |                     sys_write
                |                     sysenter_dispatch
                |                     0xf7f50430
                |                     0x5
                |          
                |--21.21%-- task_rq_lock
                |          try_to_wake_up
                |          default_wake_function
                |          |          
                |          |--99.34%-- autoremove_wake_function
                |          |          __wake_up_common
                |          |          __wake_up
                |          |          tcp_v4_rcv
                |          |          ip_local_deliver_finish
                |          |          ip_local_deliver
                |          |          ip_rcv_finish
                |          |          ip_rcv
                |          |          netif_receive_skb
                |          |          process_backlog
                |          |          net_rx_action
                |          |          __do_softirq
                |          |          call_softirq
                |          |          do_softirq
                |          |          _local_bh_enable_ip
                |          |          local_bh_enable
                |          |          dev_queue_xmit
                |          |          ip_finish_output2
                |          |          ip_finish_output
                |          |          ip_output
                |          |          ip_local_out
                |          |          ip_queue_xmit
                |          |          tcp_transmit_skb
                |          |          tcp_write_xmit
                |          |          __tcp_push_pending_frames
                |          |          tcp_push
                |          |          tcp_sendmsg
                |          |          __sock_sendmsg
                |          |          sock_aio_write
                |          |          do_sync_write
                |          |          vfs_write
                |          |          sys_write
                |          |          sysenter_dispatch
                |          |          0xf7f50430
                |          |          0x5
                |          |          
                |           --0.66%-- pollwake
                |                     __wake_up_common
                |                     __wake_up
                |                     perf_counter_wakeup
                |                     perf_pending_counter
                |                     perf_counter_do_pending
                |                     smp_perf_pending_interrupt
                |                     perf_pending_interrupt
                |                     0xf7f50430
                |                     0x5
                |          
                |--5.06%-- tcp_v4_rcv
                |          ip_local_deliver_finish
                |          ip_local_deliver
                |          ip_rcv_finish
                |          ip_rcv
                |          netif_receive_skb
                |          process_backlog
                |          net_rx_action
                |          __do_softirq
                |          call_softirq
                |          do_softirq
                |          _local_bh_enable_ip
                |          local_bh_enable
                |          dev_queue_xmit
                |          ip_finish_output2
                |          ip_finish_output
                |          ip_output
                |          ip_local_out
                |          ip_queue_xmit
                |          tcp_transmit_skb
                |          tcp_write_xmit
                |          __tcp_push_pending_frames
                |          tcp_push
                |          tcp_sendmsg
                |          __sock_sendmsg
                |          sock_aio_write
                |          do_sync_write
                |          vfs_write
                |          sys_write
                |          sysenter_dispatch
                |          0xf7f50430
                |          0x5
                |          
                |--4.07%-- __perf_counter_sched_out
                |          perf_counter_task_sched_out
                |          schedule
                |          schedule_timeout
                |          sk_wait_data
                |          tcp_recvmsg
                |          sock_common_recvmsg
                |          __sock_recvmsg
                |          sock_aio_read
                |          do_sync_read
                |          vfs_read
                |          sys_read
                |          sysenter_dispatch
                |          0xf7f50430
                |          0x5
                |          
                |--1.40%-- ipt_local_hook
                |          nf_iterate
                |          nf_hook_slow
                |          __ip_local_out
                |          ip_local_out
                |          ip_queue_xmit
                |          tcp_transmit_skb
                |          tcp_write_xmit
                |          __tcp_push_pending_frames
                |          tcp_push
                |          tcp_sendmsg
                |          __sock_sendmsg
                |          sock_aio_write
                |          do_sync_write
                |          vfs_write
                |          sys_write
                |          sysenter_dispatch
                |          0xf7f50430
                |          0x5
                |          
                |--1.26%-- mod_timer
                |          sk_reset_timer
                |          |          
                |          |--66.67%-- tcp_send_delayed_ack
                |          |          __tcp_ack_snd_check
                |          |          tcp_rcv_established
                |          |          tcp_v4_do_rcv
                |          |          tcp_prequeue_process
                |          |          tcp_recvmsg
                |          |          sock_common_recvmsg
                |          |          __sock_recvmsg
                |          |          sock_aio_read
                |          |          do_sync_read
                |          |          vfs_read
                |          |          sys_read
                |          |          sysenter_dispatch
                |          |          0xf7f50430
                |          |          0x5
                |          |          
                |           --33.33%-- tcp_v4_rcv
                |                     ip_local_deliver_finish
                |                     ip_local_deliver
                |                     ip_rcv_finish
                |                     ip_rcv
                |                     netif_receive_skb
                |                     process_backlog
                |                     net_rx_action
                |                     __do_softirq
                |                     call_softirq
                |                     do_softirq
                |                     _local_bh_enable_ip
                |                     local_bh_enable
                |                     dev_queue_xmit
                |                     ip_finish_output2
                |                     ip_finish_output
                |                     ip_output
                |                     ip_local_out
                |                     ip_queue_xmit
                |                     tcp_transmit_skb
                |                     tcp_write_xmit
                |                     __tcp_push_pending_frames
                |                     tcp_push
                |                     tcp_sendmsg
                |                     __sock_sendmsg
                |                     sock_aio_write
                |                     do_sync_write
                |                     vfs_write
                |                     sys_write
                |                     sysenter_dispatch
                |                     0xf7f50430
                |                     0x5
                |          
                |--0.70%-- ipt_local_in_hook
                |          nf_iterate
                |          nf_hook_slow
                |          ip_local_deliver
                |          ip_rcv_finish
                |          ip_rcv
                |          netif_receive_skb
                |          process_backlog
                |          net_rx_action
                |          __do_softirq
                |          call_softirq
                |          do_softirq
                |          _local_bh_enable_ip
                |          local_bh_enable
                |          dev_queue_xmit
                |          ip_finish_output2
                |          ip_finish_output
                |          ip_output
                |          ip_local_out
                |          ip_queue_xmit
                |          tcp_transmit_skb
                |          tcp_write_xmit
                |          __tcp_push_pending_frames
                |          tcp_push
                |          tcp_sendmsg
                |          __sock_sendmsg
                |          sock_aio_write
                |          do_sync_write
                |          vfs_write
                |          sys_write
                |          sysenter_dispatch
                |          0xf7f50430
                |          0x5
                |          
                |--0.56%-- perf_counter_task_sched_out
                |          schedule
                |          schedule_timeout
                |          sk_wait_data
                |          tcp_recvmsg
                |          sock_common_recvmsg
                |          __sock_recvmsg
                |          sock_aio_read
                |          do_sync_read
                |          vfs_read
                |          sys_read
                |          sysenter_dispatch
                |          0xf7f50430
                |          0x5
                 --2.53%-- [...]

     1.66%     perf  [kernel]                
                |          
                |--25.00%-- journal_stop
                |          __ext3_journal_stop
                |          ext3_writeback_write_end
                |          generic_perform_write
                |          generic_file_buffered_write
                |          __generic_file_aio_write_nolock
                |          generic_file_aio_write
                |          ext3_file_write
                |          do_sync_write
                |          vfs_write
                |          sys_write
                |          system_call_fastpath
                |          __write_nocancel
                |          
                |--16.67%-- __perf_counter_sched_out
                |          perf_counter_task_sched_out
                |          schedule
                |          schedule_hrtimeout_range
                |          poll_schedule_timeout
                |          do_poll
                |          do_sys_poll
                |          sys_poll
                |          system_call_fastpath
                |          __GI___poll
                |          
                |--8.33%-- dup_mmap
                |          dup_mm
                |          copy_process
                |          do_fork
                |          sys_clone
                |          stub_clone
                |          fork
                |          __cmd_record
                |          
                |--8.33%-- hrtimer_interrupt
                |          smp_apic_timer_interrupt
                |          apic_timer_interrupt
                |          __write_nocancel
                |          
                |--8.33%-- journal_start
                |          ext3_journal_start_sb
                |          ext3_write_begin
                |          generic_perform_write
                |          generic_file_buffered_write
                |          __generic_file_aio_write_nolock
                |          generic_file_aio_write
                |          ext3_file_write
                |          do_sync_write
                |          vfs_write
                |          sys_write
                |          system_call_fastpath
                |          __write_nocancel
                |          
                |--8.33%-- create_empty_buffers
                |          __block_prepare_write
                |          block_write_begin
                |          ext3_write_begin
                |          generic_perform_write
                |          generic_file_buffered_write
                |          __generic_file_aio_write_nolock
                |          generic_file_aio_write
                |          ext3_file_write
                |          do_sync_write
                |          vfs_write
                |          sys_write
                |          system_call_fastpath
                |          __write_nocancel
                |          
                |--8.33%-- perf_counter_task_sched_out
                |          schedule
                |          schedule_hrtimeout_range
                |          poll_schedule_timeout
                |          do_poll
                |          do_sys_poll
                |          sys_poll
                |          system_call_fastpath
                |          __GI___poll
                |          
                |--8.33%-- ext3_alloc_branch
                |          ext3_get_blocks_handle
                |          ext3_get_block
                |          __block_prepare_write
                |          block_write_begin
                |          ext3_write_begin
                |          generic_perform_write
                |          generic_file_buffered_write
                |          __generic_file_aio_write_nolock
                |          generic_file_aio_write
                |          ext3_file_write
                |          do_sync_write
                |          vfs_write
                |          sys_write
                |          system_call_fastpath
                |          __write_nocancel
                |          
                 --8.33%-- start_this_handle
                           journal_start
                           ext3_journal_start_sb
                           ext3_write_begin
                           generic_perform_write
                           generic_file_buffered_write
                           __generic_file_aio_write_nolock
                           generic_file_aio_write
                           ext3_file_write
                           do_sync_write
                           vfs_write
                           sys_write
                           system_call_fastpath
                           __write_nocancel

The graph above (which is the sub-graph of the full graph, limited 
via the -S/--symbols option to all callchains that intersect with 
the _spin_lock symbol) details all the various code-paths that make 
use of _spin_lock(), and the weights (and relative weights) of 
those code-paths.

Note that this works for just about any locking primitive, and note 
how it's recursive. If we inlined spin_lock we'd only know this 
__mod_timer overhead:

                |--31.46%-- __mod_timer

but we wouldn't know the second level (which is interesting too):

                |          |--55.36%-- tcp_send_delayed_ack
                |           --44.64%-- tcp_v4_rcv

All in all, I think we should decouple the issue of instrumentation 
and profiling from the issue of whether to inline or not. We should 
inline where it makes things faster. Our profiling tools are able 
to deal with it, independently of whether we inline or not.
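
For completeness, here is the shape such a "best of both worlds" setup
can take: keep the function body as a static inline in a header, and
let the out-of-line symbol (which is what shows up in profiles whenever
an architecture does not inline) just wrap it. This is only a minimal
sketch with 2.6.31-era names, not the literal patch:

	/* include/linux/spinlock_api_smp.h: shared function body */
	static inline void __spin_lock(spinlock_t *lock)
	{
		preempt_disable();
		spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
		LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock);
	}

	/* kernel/spinlock.c: out-of-line variant reuses the inline body */
	void __lockfunc _spin_lock(spinlock_t *lock)
	{
		__spin_lock(lock);
	}
	EXPORT_SYMBOL(_spin_lock);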

	Ingo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 0/3] Allow inlined spinlocks again V3
  2009-08-13 18:34   ` Ingo Molnar
@ 2009-08-13 18:43     ` Ingo Molnar
  0 siblings, 0 replies; 34+ messages in thread
From: Ingo Molnar @ 2009-08-13 18:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Heiko Carstens, Andrew Morton, Peter Zijlstra, linux-arch,
	Martin Schwidefsky, Arnd Bergmann, Horst Hartmann,
	Christian Ehrhardt


* Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > Now, there is one advantage to inlining that I really like - and 
> > that I personally have missed from the days when we used to 
> > always do it: profiling. Inlining the spinlocks tends to do 
> > _wonders_ for showing exactly which spinlock ends up being the 
> > contention cases for certain loads. [...]
> 
> We've got a new tool to determine that kind of information though: 
> perf + call-graph profiling.
> 
> This is how it works and how it can be used. Say we want to profile 
> 'bw_tcp' from lmbench. The traditional way is to record a flat 
> profile:
> 
>  aldebaran:~/l> perf record -f ./bw_pipe
>  Pipe bandwidth: 4607.33 MB/sec
>  [ perf record: Captured and wrote 2.503 MB perf.data (~109368 samples) ]
> 
> and then to view the flat profile:
> 
>  aldebaran:~/l> perf report
>  # Samples: 109331
>  #
>  # Overhead  Command             Shared Object  Symbol
>  # ........  .......  ........................  ......
>  #
>     39.61%  bw_pipe  [kernel]                  [k] copy_user_generic_string
>     19.32%  bw_pipe  ./bw_pipe                 [.] enough_duration
>      6.21%  bw_pipe  [kernel]                  [k] mutex_spin_on_owner
>      3.54%  bw_pipe  ./bw_pipe                 [.] two_op
>      3.54%  bw_pipe  ./bw_pipe                 [.] one_op
>      2.79%  bw_pipe  [kernel]                  [k] get_page_from_freelist
>      2.16%  bw_pipe  [kernel]                  [k] ia32_sysenter_target
>      1.91%  bw_pipe  [kernel]                  [k] pipe_read
>      1.83%  bw_pipe  [kernel]                  [k] pipe_write
>      1.63%  bw_pipe  [kernel]                  [k] free_hot_cold_page
>      1.21%  bw_pipe  [kernel]                  [k] put_page
>      1.19%  bw_pipe  [kernel]                  [k] __alloc_pages_nodemask
>      0.99%  bw_pipe  [kernel]                  [k] anon_pipe_buf_release
>      0.74%  bw_pipe  [kernel]                  [k] get_pageblock_flags_group
> 
> Annoyingly enough, mutex_spin_on_owner() is in the profile but we 
> have no idea what the parent function(s) are. (Plus there's no way 
> we could reasonably inline mutex functions.) So we are stuck with 
> guessing which of the many other functions above (or some other 
> function which does not even get listed!) has that mutex spinning 
> overhead.
> 
> But if you do the recording via call-graph traces:
> 
>  aldebaran:~/l> perf record -g -f ./bw_pipe
>  Pipe bandwidth: 4432.75 MB/sec
>  [ perf record: Captured and wrote 7.550 MB perf.data (~329886 samples) ]
> 
> (note the -g/--graph option) and then view the profile via perf 
> report, we get:

Sorry - the mail got trimmed due to a cut&paste mishap. Here's the 
missing bw_pipe profile:

 # Samples: 81007
 #
 # Overhead  Command             Shared Object  Symbol
 # ........  .......  ........................  ......
 #

     7.60%  bw_pipe  [kernel]                  [k] mutex_spin_on_owner
                |          
                |--99.97%-- __mutex_lock_common
                |          __mutex_lock_slowpath
                |          mutex_lock
                |          |          
                |          |--98.46%-- pipe_write
                |          |          do_sync_write
                |          |          vfs_write
                |          |          sys_write
                |          |          sysenter_dispatch
                |          |          0xf7f4f430
                |          |          0x600000007
                |          |          
                |           --1.54%-- pipe_lock
                |                     pipe_wait
                |                     pipe_write
                |                     do_sync_write
                |                     vfs_write
                |                     sys_write
                |                     sysenter_dispatch
                |                     0xf7f4f430
                |                     0x600000007
                 --0.03%-- [...]

It shows that the mutex overhead is distributed between these two 
functions:

                |          |--98.46%-- pipe_write
                |           --1.54%-- pipe_lock


I also wanted to show a more complex example: how spin-lock 
overhead gets distributed in the networking stack, in a
"lat_tcp localhost" workload:

> 
>  aldebaran:~/l> perf report -S _spin_lock
> 
>  # symbol: _spin_lock
>  # Samples: 724
>  #
>  # Overhead  Command             Shared Object
>  # ........  .......  ........................
>  #
>     98.34%  lat_tcp  [kernel]                
>                 |          
>                 |--31.74%-- ipt_do_table
>                 |          |          
>                 |          |--26.99%-- ipt_local_in_hook
>                 |          |          nf_iterate
>                 |          |          nf_hook_slow
>                 |          |          ip_local_deliver
>                 |          |          ip_rcv_finish
>                 |          |          ip_rcv
>                 |          |          netif_receive_skb
>                 |          |          process_backlog
>                 |          |          net_rx_action
>                 |          |          __do_softirq
>                 |          |          call_softirq
>                 |          |          do_softirq
>                 |          |          _local_bh_enable_ip
>                 |          |          local_bh_enable
>                 |          |          dev_queue_xmit
>                 |          |          ip_finish_output2
>                 |          |          ip_finish_output
>                 |          |          ip_output
>                 |          |          ip_local_out
>                 |          |          ip_queue_xmit
>                 |          |          tcp_transmit_skb
>                 |          |          tcp_write_xmit
>                 |          |          __tcp_push_pending_frames
>                 |          |          tcp_push
>                 |          |          tcp_sendmsg
>                 |          |          __sock_sendmsg
>                 |          |          sock_aio_write
>                 |          |          do_sync_write
>                 |          |          vfs_write
>                 |          |          sys_write
>                 |          |          sysenter_dispatch
>                 |          |          0xf7f50430
>                 |          |          0x5
>                 |          |          
>                 |          |--24.34%-- ipt_local_in_hook
>                 |          |          nf_iterate
>                 |          |          nf_hook_slow
>                 |          |          ip_local_deliver
>                 |          |          ip_rcv_finish
>                 |          |          ip_rcv
>                 |          |          netif_receive_skb
>                 |          |          process_backlog
>                 |          |          net_rx_action
>                 |          |          __do_softirq
>                 |          |          call_softirq
>                 |          |          do_softirq
>                 |          |          _local_bh_enable_ip
>                 |          |          local_bh_enable
>                 |          |          dev_queue_xmit
>                 |          |          ip_finish_output2
>                 |          |          ip_finish_output
>                 |          |          ip_output
>                 |          |          ip_local_out
>                 |          |          ip_queue_xmit
>                 |          |          tcp_transmit_skb
>                 |          |          tcp_write_xmit
>                 |          |          __tcp_push_pending_frames
>                 |          |          tcp_push
>                 |          |          tcp_sendmsg
>                 |          |          __sock_sendmsg
>                 |          |          sock_aio_write
>                 |          |          do_sync_write
>                 |          |          vfs_write
>                 |          |          sys_write
>                 |          |          sysenter_dispatch
>                 |          |          0xf7f50430
>                 |          |          0x5
>                 |          |          
>                 |          |--16.81%-- ipt_pre_routing_hook
>                 |          |          nf_iterate
>                 |          |          nf_hook_slow
>                 |          |          ip_rcv
>                 |          |          netif_receive_skb
>                 |          |          process_backlog
>                 |          |          net_rx_action
>                 |          |          __do_softirq
>                 |          |          call_softirq
>                 |          |          do_softirq
>                 |          |          _local_bh_enable_ip
>                 |          |          local_bh_enable
>                 |          |          dev_queue_xmit
>                 |          |          ip_finish_output2
>                 |          |          ip_finish_output
>                 |          |          ip_output
>                 |          |          ip_local_out
>                 |          |          ip_queue_xmit
>                 |          |          tcp_transmit_skb
>                 |          |          tcp_write_xmit
>                 |          |          __tcp_push_pending_frames
>                 |          |          tcp_push
>                 |          |          tcp_sendmsg
>                 |          |          __sock_sendmsg
>                 |          |          sock_aio_write
>                 |          |          do_sync_write
>                 |          |          vfs_write
>                 |          |          sys_write
>                 |          |          sysenter_dispatch
>                 |          |          0xf7f50430
>                 |          |          0x5
>                 |          |          
>                 |          |--11.95%-- ipt_post_routing_hook
>                 |          |          nf_iterate
>                 |          |          nf_hook_slow
>                 |          |          ip_output
>                 |          |          ip_local_out
>                 |          |          ip_queue_xmit
>                 |          |          tcp_transmit_skb
>                 |          |          tcp_write_xmit
>                 |          |          __tcp_push_pending_frames
>                 |          |          tcp_push
>                 |          |          tcp_sendmsg
>                 |          |          __sock_sendmsg
>                 |          |          sock_aio_write
>                 |          |          do_sync_write
>                 |          |          vfs_write
>                 |          |          sys_write
>                 |          |          sysenter_dispatch
>                 |          |          0xf7f50430
>                 |          |          0x5
>                 |          |          
>                 |          |--10.18%-- ipt_local_out_hook
>                 |          |          nf_iterate
>                 |          |          nf_hook_slow
>                 |          |          __ip_local_out
>                 |          |          ip_local_out
>                 |          |          ip_queue_xmit
>                 |          |          tcp_transmit_skb
>                 |          |          tcp_write_xmit
>                 |          |          __tcp_push_pending_frames
>                 |          |          tcp_push
>                 |          |          tcp_sendmsg
>                 |          |          __sock_sendmsg
>                 |          |          sock_aio_write
>                 |          |          do_sync_write
>                 |          |          vfs_write
>                 |          |          sys_write
>                 |          |          sysenter_dispatch
>                 |          |          0xf7f50430
>                 |          |          0x5
>                 |          |          
>                 |           --9.73%-- ipt_local_hook
>                 |                     nf_iterate
>                 |                     nf_hook_slow
>                 |                     __ip_local_out
>                 |                     ip_local_out
>                 |                     ip_queue_xmit
>                 |                     tcp_transmit_skb
>                 |                     tcp_write_xmit
>                 |                     __tcp_push_pending_frames
>                 |                     tcp_push
>                 |                     tcp_sendmsg
>                 |                     __sock_sendmsg
>                 |                     sock_aio_write
>                 |                     do_sync_write
>                 |                     vfs_write
>                 |                     sys_write
>                 |                     sysenter_dispatch
>                 |                     0xf7f50430
>                 |                     0x5
>                 |          
>                 |--31.46%-- __mod_timer
>                 |          mod_timer
>                 |          sk_reset_timer
>                 |          |          
>                 |          |--55.36%-- tcp_send_delayed_ack
>                 |          |          __tcp_ack_snd_check
>                 |          |          tcp_rcv_established
>                 |          |          tcp_v4_do_rcv
>                 |          |          tcp_prequeue_process
>                 |          |          tcp_recvmsg
>                 |          |          sock_common_recvmsg
>                 |          |          __sock_recvmsg
>                 |          |          sock_aio_read
>                 |          |          do_sync_read
>                 |          |          vfs_read
>                 |          |          sys_read
>                 |          |          sysenter_dispatch
>                 |          |          0xf7f50430
>                 |          |          0x5
>                 |          |          
>                 |           --44.64%-- tcp_v4_rcv
>                 |                     ip_local_deliver_finish
>                 |                     ip_local_deliver
>                 |                     ip_rcv_finish
>                 |                     ip_rcv
>                 |                     netif_receive_skb
>                 |                     process_backlog
>                 |                     net_rx_action
>                 |                     __do_softirq
>                 |                     call_softirq
>                 |                     do_softirq
>                 |                     _local_bh_enable_ip
>                 |                     local_bh_enable
>                 |                     dev_queue_xmit
>                 |                     ip_finish_output2
>                 |                     ip_finish_output
>                 |                     ip_output
>                 |                     ip_local_out
>                 |                     ip_queue_xmit
>                 |                     tcp_transmit_skb
>                 |                     tcp_write_xmit
>                 |                     __tcp_push_pending_frames
>                 |                     tcp_push
>                 |                     tcp_sendmsg
>                 |                     __sock_sendmsg
>                 |                     sock_aio_write
>                 |                     do_sync_write
>                 |                     vfs_write
>                 |                     sys_write
>                 |                     sysenter_dispatch
>                 |                     0xf7f50430
>                 |                     0x5
>                 |          
>                 |--21.21%-- task_rq_lock
>                 |          try_to_wake_up
>                 |          default_wake_function
>                 |          |          
>                 |          |--99.34%-- autoremove_wake_function
>                 |          |          __wake_up_common
>                 |          |          __wake_up
>                 |          |          tcp_v4_rcv
>                 |          |          ip_local_deliver_finish
>                 |          |          ip_local_deliver
>                 |          |          ip_rcv_finish
>                 |          |          ip_rcv
>                 |          |          netif_receive_skb
>                 |          |          process_backlog
>                 |          |          net_rx_action
>                 |          |          __do_softirq
>                 |          |          call_softirq
>                 |          |          do_softirq
>                 |          |          _local_bh_enable_ip
>                 |          |          local_bh_enable
>                 |          |          dev_queue_xmit
>                 |          |          ip_finish_output2
>                 |          |          ip_finish_output
>                 |          |          ip_output
>                 |          |          ip_local_out
>                 |          |          ip_queue_xmit
>                 |          |          tcp_transmit_skb
>                 |          |          tcp_write_xmit
>                 |          |          __tcp_push_pending_frames
>                 |          |          tcp_push
>                 |          |          tcp_sendmsg
>                 |          |          __sock_sendmsg
>                 |          |          sock_aio_write
>                 |          |          do_sync_write
>                 |          |          vfs_write
>                 |          |          sys_write
>                 |          |          sysenter_dispatch
>                 |          |          0xf7f50430
>                 |          |          0x5
>                 |          |          
>                 |           --0.66%-- pollwake
>                 |                     __wake_up_common
>                 |                     __wake_up
>                 |                     perf_counter_wakeup
>                 |                     perf_pending_counter
>                 |                     perf_counter_do_pending
>                 |                     smp_perf_pending_interrupt
>                 |                     perf_pending_interrupt
>                 |                     0xf7f50430
>                 |                     0x5
>                 |          
>                 |--5.06%-- tcp_v4_rcv
>                 |          ip_local_deliver_finish
>                 |          ip_local_deliver
>                 |          ip_rcv_finish
>                 |          ip_rcv
>                 |          netif_receive_skb
>                 |          process_backlog
>                 |          net_rx_action
>                 |          __do_softirq
>                 |          call_softirq
>                 |          do_softirq
>                 |          _local_bh_enable_ip
>                 |          local_bh_enable
>                 |          dev_queue_xmit
>                 |          ip_finish_output2
>                 |          ip_finish_output
>                 |          ip_output
>                 |          ip_local_out
>                 |          ip_queue_xmit
>                 |          tcp_transmit_skb
>                 |          tcp_write_xmit
>                 |          __tcp_push_pending_frames
>                 |          tcp_push
>                 |          tcp_sendmsg
>                 |          __sock_sendmsg
>                 |          sock_aio_write
>                 |          do_sync_write
>                 |          vfs_write
>                 |          sys_write
>                 |          sysenter_dispatch
>                 |          0xf7f50430
>                 |          0x5
>                 |          
>                 |--4.07%-- __perf_counter_sched_out
>                 |          perf_counter_task_sched_out
>                 |          schedule
>                 |          schedule_timeout
>                 |          sk_wait_data
>                 |          tcp_recvmsg
>                 |          sock_common_recvmsg
>                 |          __sock_recvmsg
>                 |          sock_aio_read
>                 |          do_sync_read
>                 |          vfs_read
>                 |          sys_read
>                 |          sysenter_dispatch
>                 |          0xf7f50430
>                 |          0x5
>                 |          
>                 |--1.40%-- ipt_local_hook
>                 |          nf_iterate
>                 |          nf_hook_slow
>                 |          __ip_local_out
>                 |          ip_local_out
>                 |          ip_queue_xmit
>                 |          tcp_transmit_skb
>                 |          tcp_write_xmit
>                 |          __tcp_push_pending_frames
>                 |          tcp_push
>                 |          tcp_sendmsg
>                 |          __sock_sendmsg
>                 |          sock_aio_write
>                 |          do_sync_write
>                 |          vfs_write
>                 |          sys_write
>                 |          sysenter_dispatch
>                 |          0xf7f50430
>                 |          0x5
>                 |          
>                 |--1.26%-- mod_timer
>                 |          sk_reset_timer
>                 |          |          
>                 |          |--66.67%-- tcp_send_delayed_ack
>                 |          |          __tcp_ack_snd_check
>                 |          |          tcp_rcv_established
>                 |          |          tcp_v4_do_rcv
>                 |          |          tcp_prequeue_process
>                 |          |          tcp_recvmsg
>                 |          |          sock_common_recvmsg
>                 |          |          __sock_recvmsg
>                 |          |          sock_aio_read
>                 |          |          do_sync_read
>                 |          |          vfs_read
>                 |          |          sys_read
>                 |          |          sysenter_dispatch
>                 |          |          0xf7f50430
>                 |          |          0x5
>                 |          |          
>                 |           --33.33%-- tcp_v4_rcv
>                 |                     ip_local_deliver_finish
>                 |                     ip_local_deliver
>                 |                     ip_rcv_finish
>                 |                     ip_rcv
>                 |                     netif_receive_skb
>                 |                     process_backlog
>                 |                     net_rx_action
>                 |                     __do_softirq
>                 |                     call_softirq
>                 |                     do_softirq
>                 |                     _local_bh_enable_ip
>                 |                     local_bh_enable
>                 |                     dev_queue_xmit
>                 |                     ip_finish_output2
>                 |                     ip_finish_output
>                 |                     ip_output
>                 |                     ip_local_out
>                 |                     ip_queue_xmit
>                 |                     tcp_transmit_skb
>                 |                     tcp_write_xmit
>                 |                     __tcp_push_pending_frames
>                 |                     tcp_push
>                 |                     tcp_sendmsg
>                 |                     __sock_sendmsg
>                 |                     sock_aio_write
>                 |                     do_sync_write
>                 |                     vfs_write
>                 |                     sys_write
>                 |                     sysenter_dispatch
>                 |                     0xf7f50430
>                 |                     0x5
>                 |          
>                 |--0.70%-- ipt_local_in_hook
>                 |          nf_iterate
>                 |          nf_hook_slow
>                 |          ip_local_deliver
>                 |          ip_rcv_finish
>                 |          ip_rcv
>                 |          netif_receive_skb
>                 |          process_backlog
>                 |          net_rx_action
>                 |          __do_softirq
>                 |          call_softirq
>                 |          do_softirq
>                 |          _local_bh_enable_ip
>                 |          local_bh_enable
>                 |          dev_queue_xmit
>                 |          ip_finish_output2
>                 |          ip_finish_output
>                 |          ip_output
>                 |          ip_local_out
>                 |          ip_queue_xmit
>                 |          tcp_transmit_skb
>                 |          tcp_write_xmit
>                 |          __tcp_push_pending_frames
>                 |          tcp_push
>                 |          tcp_sendmsg
>                 |          __sock_sendmsg
>                 |          sock_aio_write
>                 |          do_sync_write
>                 |          vfs_write
>                 |          sys_write
>                 |          sysenter_dispatch
>                 |          0xf7f50430
>                 |          0x5
>                 |          
>                 |--0.56%-- perf_counter_task_sched_out
>                 |          schedule
>                 |          schedule_timeout
>                 |          sk_wait_data
>                 |          tcp_recvmsg
>                 |          sock_common_recvmsg
>                 |          __sock_recvmsg
>                 |          sock_aio_read
>                 |          do_sync_read
>                 |          vfs_read
>                 |          sys_read
>                 |          sysenter_dispatch
>                 |          0xf7f50430
>                 |          0x5
>                  --2.53%-- [...]
> 
>      1.66%     perf  [kernel]                
>                 |          
>                 |--25.00%-- journal_stop
>                 |          __ext3_journal_stop
>                 |          ext3_writeback_write_end
>                 |          generic_perform_write
>                 |          generic_file_buffered_write
>                 |          __generic_file_aio_write_nolock
>                 |          generic_file_aio_write
>                 |          ext3_file_write
>                 |          do_sync_write
>                 |          vfs_write
>                 |          sys_write
>                 |          system_call_fastpath
>                 |          __write_nocancel
>                 |          
>                 |--16.67%-- __perf_counter_sched_out
>                 |          perf_counter_task_sched_out
>                 |          schedule
>                 |          schedule_hrtimeout_range
>                 |          poll_schedule_timeout
>                 |          do_poll
>                 |          do_sys_poll
>                 |          sys_poll
>                 |          system_call_fastpath
>                 |          __GI___poll
>                 |          
>                 |--8.33%-- dup_mmap
>                 |          dup_mm
>                 |          copy_process
>                 |          do_fork
>                 |          sys_clone
>                 |          stub_clone
>                 |          fork
>                 |          __cmd_record
>                 |          
>                 |--8.33%-- hrtimer_interrupt
>                 |          smp_apic_timer_interrupt
>                 |          apic_timer_interrupt
>                 |          __write_nocancel
>                 |          
>                 |--8.33%-- journal_start
>                 |          ext3_journal_start_sb
>                 |          ext3_write_begin
>                 |          generic_perform_write
>                 |          generic_file_buffered_write
>                 |          __generic_file_aio_write_nolock
>                 |          generic_file_aio_write
>                 |          ext3_file_write
>                 |          do_sync_write
>                 |          vfs_write
>                 |          sys_write
>                 |          system_call_fastpath
>                 |          __write_nocancel
>                 |          
>                 |--8.33%-- create_empty_buffers
>                 |          __block_prepare_write
>                 |          block_write_begin
>                 |          ext3_write_begin
>                 |          generic_perform_write
>                 |          generic_file_buffered_write
>                 |          __generic_file_aio_write_nolock
>                 |          generic_file_aio_write
>                 |          ext3_file_write
>                 |          do_sync_write
>                 |          vfs_write
>                 |          sys_write
>                 |          system_call_fastpath
>                 |          __write_nocancel
>                 |          
>                 |--8.33%-- perf_counter_task_sched_out
>                 |          schedule
>                 |          schedule_hrtimeout_range
>                 |          poll_schedule_timeout
>                 |          do_poll
>                 |          do_sys_poll
>                 |          sys_poll
>                 |          system_call_fastpath
>                 |          __GI___poll
>                 |          
>                 |--8.33%-- ext3_alloc_branch
>                 |          ext3_get_blocks_handle
>                 |          ext3_get_block
>                 |          __block_prepare_write
>                 |          block_write_begin
>                 |          ext3_write_begin
>                 |          generic_perform_write
>                 |          generic_file_buffered_write
>                 |          __generic_file_aio_write_nolock
>                 |          generic_file_aio_write
>                 |          ext3_file_write
>                 |          do_sync_write
>                 |          vfs_write
>                 |          sys_write
>                 |          system_call_fastpath
>                 |          __write_nocancel
>                 |          
>                  --8.33%-- start_this_handle
>                            journal_start
>                            ext3_journal_start_sb
>                            ext3_write_begin
>                            generic_perform_write
>                            generic_file_buffered_write
>                            __generic_file_aio_write_nolock
>                            generic_file_aio_write
>                            ext3_file_write
>                            do_sync_write
>                            vfs_write
>                            sys_write
>                            system_call_fastpath
>                            __write_nocancel
> 
> The graph above (which is the sub-graph of the full graph, limited 
> via the -S/--symbols option to all callchains that intersect with 
> the _spin_lock symbol) details all the various code-paths that make 
> use of _spin_lock(), and the weights (and relative weights) of 
> those code-paths.
> 
> Note that this works for just about any locking primitive, and note 
> how it's recursive. If we inlined spin_lock we'd only know this 
> __mod_timer overhead:
> 
>                 |--31.46%-- __mod_timer
> 
> but we wouldn't know the second level (which is interesting too):
> 
>                 |          |--55.36%-- tcp_send_delayed_ack
>                 |           --44.64%-- tcp_v4_rcv
> 
> All in all, I think we should decouple the issue of instrumentation 
> and profiling from the issue of whether to inline or not. We should 
> inline where it makes things faster. Our profiling tools are able 
> to deal with it, independently of whether we inline or not.
> 
> 	Ingo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 0/3] Allow inlined spinlocks again V3
  2009-08-13 18:11 ` [patch 0/3] Allow inlined spinlocks again V3 Linus Torvalds
  2009-08-13 18:34   ` Ingo Molnar
@ 2009-08-14 12:34   ` Heiko Carstens
  2009-08-14 16:04     ` Linus Torvalds
  1 sibling, 1 reply; 34+ messages in thread
From: Heiko Carstens @ 2009-08-14 12:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Peter Zijlstra, Ingo Molnar, linux-arch,
	Martin Schwidefsky, Arnd Bergmann, Horst Hartmann,
	Christian Ehrhardt

On Thu, Aug 13, 2009 at 11:11:22AM -0700, Linus Torvalds wrote:
> On Wed, 12 Aug 2009, Heiko Carstens wrote:
> > If one considers that server kernels are usually compiled with
> > !CONFIG_PREEMPT a simple spin_lock is just a compare and swap loop.
> > The extra overhead for a function call is significant.
> > With inlined spinlocks overall cpu usage gets reduced by 1%-5% on s390.
> > These numbers were taken with some network benchmarks. However I expect
> > any workload that calls frequently into the kernel and which grabs a few
> > locks to perform better.
> 
> Have you tried to determine which kinds of spinlocks you care most about? 
> With networking, I guess the softirq-safe ones (and perhaps the irq ones) 
> are pretty common, but maybe even there it's really just a couple of 
> cases.

It's all spinlock* variants that are frequently called (see also below).
 
> [...]
> So there are advantages outside of the "size vs speed" kind of things, but 
> at the same time I do suspect that the disadvantages mean that we really 
> should make this be something where an architecture can specify things at 
> a finer granularity. Perhaps by using per-symbol defines (ie have the 
> <asm/spinlock.h> file do things like
> 
> 	#define __spin_lock_is_small
> 	#define __spin_unlock_is_small
> 
> kind of things for the particular sequences that are worth inlining.

Ok, that makes sense. I will post updated patches.
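
Something along those lines should work. Just to sketch the shape
(the exact macro names and the final selection logic may well end up
looking different in the actual patches):

	/* <asm/spinlock.h>, on an architecture where the lock is small: */
	#define __spin_lock_is_small

	/* generic code, per locking function: */
	#ifdef __spin_lock_is_small
	#define _spin_lock(lock) __spin_lock(lock)	/* use the inline body */
	#else
	void __lockfunc _spin_lock(spinlock_t *lock);	/* keep it out of line */
	#endif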

FWIW:
All of this started when we compared a 2.6.27-based distro kernel
to a 2.6.16-based one. It turned out that 2.6.27 burns a lot more cpu
cycles than 2.6.16 does for the same workloads.

As an example, we took a simple one-connection ping-pong network load
(the client sends a packet to the server and waits until the packet
from the server returns, and then the game starts again). It uses more
cpu cycles on 2.6.27.

Using ftrace we figured out that on 2.6.27 it takes more than 500 function
calls on the client for a single roundtrip, while on 2.6.16 it took only
~230 calls.
Changing s390 from its own NOHZ code to the generic version did cost
quite a few cycles. The idle entry and exit paths are much more expensive
now. That's why you've seen Martin posting NOHZ/timer-related patches lately.

But also quite a few functions that were inlined in 2.6.16 are out-of-line
in 2.6.27. Reverting those changes did improve performance as well.

Since 60 out of the ~500 function calls are spin_[un]lock* calls, it
seemed worthwhile to inline them - which also improved things again.
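
To illustrate why those calls are worth inlining: with !CONFIG_PREEMPT
the uncontended fast path is tiny, so the call/return overhead is of the
same order as the lock operation itself. Schematically (placeholder
names, not the literal s390 source):

	/* schematic !CONFIG_PREEMPT fast path */
	static inline void __raw_spin_lock(raw_spinlock_t *lp)
	{
		/* one compare-and-swap in the uncontended case ... */
		if (likely(compare_and_swap(&lp->lock, 0, LOCK_VALUE) == 0))
			return;
		/* ... and only the contended path stays out of line */
		_raw_spin_lock_wait(lp);
	}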

Here are the function traces for 2.6.27 and 2.6.16:
(the timestamp output has been modified)

For 2.6.27:

# tracer: ftrace
#
#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
#              | |      |          |         |
            perl-2581  [00] 28991519113889978 [+  125]: SyS_write <-sysc_noemu
            perl-2581  [00] 28991519113890103 [+  156]: fget_light <-SyS_write
            perl-2581  [00] 28991519113890259 [+  125]: vfs_write <-SyS_write
            perl-2581  [00] 28991519113890384 [+  156]: rw_verify_area <-vfs_write
            perl-2581  [00] 28991519113890540 [+  125]: security_file_permission <-rw_verify_area
            perl-2581  [00] 28991519113890665 [+   93]: apparmor_file_permission <-security_file_permission
            perl-2581  [00] 28991519113890759 [+  156]: aa_file_permission <-apparmor_file_permission
            perl-2581  [00] 28991519113890915 [+  125]: do_sync_write <-vfs_write
            perl-2581  [00] 28991519113891040 [+  125]: sock_aio_write <-do_sync_write
            perl-2581  [00] 28991519113891165 [+  125]: security_socket_sendmsg <-sock_aio_write
            perl-2581  [00] 28991519113891290 [+  125]: apparmor_socket_sendmsg <-security_socket_sendmsg
            perl-2581  [00] 28991519113891415 [+  125]: aa_revalidate_sk <-apparmor_socket_sendmsg
            perl-2581  [00] 28991519113891540 [+  125]: tcp_sendmsg <-sock_aio_write
            perl-2581  [00] 28991519113891665 [+  125]: lock_sock_nested <-tcp_sendmsg
            perl-2581  [00] 28991519113891790 [+  125]: _spin_lock_bh <-lock_sock_nested
            perl-2581  [00] 28991519113891915 [+  125]: local_bh_disable <-_spin_lock_bh
            perl-2581  [00] 28991519113892040 [+  187]: local_bh_enable <-lock_sock_nested
            perl-2581  [00] 28991519113892228 [+  125]: tcp_current_mss <-tcp_sendmsg
            perl-2581  [00] 28991519113892353 [+  156]: tcp_established_options <-tcp_current_mss
            perl-2581  [00] 28991519113892509 [+  125]: sk_stream_alloc_skb <-tcp_sendmsg
            perl-2581  [00] 28991519113892634 [+  125]: __alloc_skb <-sk_stream_alloc_skb
            perl-2581  [00] 28991519113892759 [+  156]: kmem_cache_alloc <-__alloc_skb
            perl-2581  [00] 28991519113892915 [+  281]: __kmalloc <-__alloc_skb
            perl-2581  [00] 28991519113893196 [+  125]: skb_put <-tcp_sendmsg
            perl-2581  [00] 28991519113893321 [+  125]: copy_from_user_mvcos_check <-tcp_sendmsg
            perl-2581  [00] 28991519113893446 [+  156]: copy_from_user_std <-copy_from_user_mvcos_check
            perl-2581  [00] 28991519113893603 [+  125]: __tcp_push_pending_frames <-tcp_sendmsg
            perl-2581  [00] 28991519113893728 [+  125]: tcp_init_tso_segs <-__tcp_push_pending_frames
            perl-2581  [00] 28991519113893853 [+  125]: tcp_set_skb_tso_segs <-tcp_init_tso_segs
            perl-2581  [00] 28991519113893978 [+  156]: tcp_transmit_skb <-__tcp_push_pending_frames
            perl-2581  [00] 28991519113894134 [+   93]: skb_clone <-tcp_transmit_skb
            perl-2581  [00] 28991519113894228 [+  125]: __skb_clone <-skb_clone
            perl-2581  [00] 28991519113894353 [+  156]: __copy_skb_header <-__skb_clone
            perl-2581  [00] 28991519113894509 [+  125]: tcp_established_options <-tcp_transmit_skb
            perl-2581  [00] 28991519113894634 [+  125]: skb_push <-tcp_transmit_skb
            perl-2581  [00] 28991519113894759 [+  156]: __tcp_select_window <-tcp_transmit_skb
            perl-2581  [00] 28991519113894915 [+  125]: tcp_options_write <-tcp_transmit_skb
            perl-2581  [00] 28991519113895040 [+  156]: tcp_v4_send_check <-tcp_transmit_skb
            perl-2581  [00] 28991519113895196 [+  125]: ip_queue_xmit <-tcp_transmit_skb
            perl-2581  [00] 28991519113895321 [+  156]: __sk_dst_check <-ip_queue_xmit
            perl-2581  [00] 28991519113895478 [+  125]: skb_push <-ip_queue_xmit
            perl-2581  [00] 28991519113895603 [+   93]: ip_local_out <-ip_queue_xmit
            perl-2581  [00] 28991519113895696 [+  156]: __ip_local_out <-ip_local_out
            perl-2581  [00] 28991519113895853 [+  125]: ip_output <-ip_local_out
            perl-2581  [00] 28991519113895978 [+  156]: ip_finish_output <-ip_output
            perl-2581  [00] 28991519113896134 [+  125]: skb_push <-ip_finish_output
            perl-2581  [00] 28991519113896259 [+  125]: dev_queue_xmit <-ip_finish_output
            perl-2581  [00] 28991519113896384 [+  125]: local_bh_disable <-dev_queue_xmit
            perl-2581  [00] 28991519113896509 [+  187]: _spin_lock <-dev_queue_xmit
            perl-2581  [00] 28991519113896696 [+  125]: pfifo_fast_enqueue <-dev_queue_xmit
            perl-2581  [00] 28991519113896821 [+  125]: __qdisc_run <-dev_queue_xmit
            perl-2581  [00] 28991519113896946 [+  156]: pfifo_fast_dequeue <-__qdisc_run
            perl-2581  [00] 28991519113897103 [+  125]: _spin_lock <-__qdisc_run
            perl-2581  [00] 28991519113897228 [+  187]: dev_hard_start_xmit <-__qdisc_run
            perl-2581  [00] 28991519113897415 [+  156]: qeth_l3_hard_start_xmit <-dev_hard_start_xmit
            perl-2581  [00] 28991519113897571 [+  125]: qeth_get_cast_type <-qeth_l3_hard_start_xmit
            perl-2581  [00] 28991519113897696 [+  156]: qeth_get_priority_queue <-qeth_l3_hard_start_xmit
            perl-2581  [00] 28991519113897853 [+  125]: skb_realloc_headroom <-qeth_l3_hard_start_xmit
            perl-2581  [00] 28991519113897978 [+  125]: pskb_copy <-skb_realloc_headroom
            perl-2581  [00] 28991519113898103 [+   93]: __alloc_skb <-pskb_copy
            perl-2581  [00] 28991519113898196 [+  187]: kmem_cache_alloc <-__alloc_skb
            perl-2581  [00] 28991519113898384 [+  218]: __kmalloc <-__alloc_skb
            perl-2581  [00] 28991519113898603 [+  156]: skb_put <-pskb_copy
            perl-2581  [00] 28991519113898759 [+   93]: copy_skb_header <-pskb_copy
            perl-2581  [00] 28991519113898853 [+  218]: __copy_skb_header <-copy_skb_header
            perl-2581  [00] 28991519113899071 [+  125]: skb_pull <-qeth_l3_hard_start_xmit
            perl-2581  [00] 28991519113899196 [+  125]: skb_push <-qeth_l3_hard_start_xmit
            perl-2581  [00] 28991519113899321 [+  156]: qeth_l3_fill_header <-qeth_l3_hard_start_xmit
            perl-2581  [00] 28991519113899478 [+  125]: qeth_get_elements_no <-qeth_l3_hard_start_xmit
            perl-2581  [00] 28991519113899603 [+  156]: qeth_do_send_packet <-qeth_l3_hard_start_xmit
            perl-2581  [00] 28991519113899759 [+  125]: skb_queue_tail <-qeth_do_send_packet
            perl-2581  [00] 28991519113899884 [+  125]: _spin_lock_irqsave <-skb_queue_tail
            perl-2581  [00] 28991519113900009 [+  156]: _spin_unlock_irqrestore <-skb_queue_tail
            perl-2581  [00] 28991519113900165 [+  187]: qeth_flush_buffers <-qeth_do_send_packet
            perl-2581  [00] 28991519113900353 [+  156]: do_QDIO <-qeth_flush_buffers
            perl-2581  [00] 28991519113900509 [+  156]: qdio_perf_stat_inc <-do_QDIO
            perl-2581  [00] 28991519113900665 [+  125]: qdio_kick_outbound_q <-do_QDIO
            perl-2581  [00] 28991519113900790 [+  812]: qdio_perf_stat_inc <-qdio_kick_outbound_q
            perl-2581  [00] 28991519113901603 [+  218]: __tasklet_schedule <-do_QDIO
            perl-2581  [00] 28991519113901821 [+  156]: dev_kfree_skb_any <-qeth_l3_hard_start_xmit
            perl-2581  [00] 28991519113901978 [+  125]: kfree_skb <-dev_kfree_skb_any
            perl-2581  [00] 28991519113902103 [+   93]: __kfree_skb <-kfree_skb
            perl-2581  [00] 28991519113902196 [+  125]: skb_release_all <-__kfree_skb
            perl-2581  [00] 28991519113902321 [+  156]: dst_release <-skb_release_all
            perl-2581  [00] 28991519113902478 [+   93]: sock_wfree <-skb_release_all
            perl-2581  [00] 28991519113902571 [+  187]: skb_release_data <-skb_release_all
            perl-2581  [00] 28991519113902759 [+  125]: __netif_schedule <-qeth_l3_hard_start_xmit
            perl-2581  [00] 28991519113902884 [+  156]: raise_softirq_irqoff <-__netif_schedule
            perl-2581  [00] 28991519113903040 [+  125]: _spin_lock <-__qdisc_run
            perl-2581  [00] 28991519113903165 [+  187]: local_bh_enable <-dev_queue_xmit
            perl-2581  [00] 28991519113903353 [+  156]: do_softirq <-local_bh_enable
            perl-2581  [00] 28991519113903509 [+   93]: __do_softirq <-do_softirq
            perl-2581  [00] 28991519113903603 [+   93]: account_system_vtime <-__do_softirq
            perl-2581  [00] 28991519113903696 [+  125]: account_system_time <-account_system_vtime
            perl-2581  [00] 28991519113903821 [+   93]: acct_update_integrals <-account_system_time
            perl-2581  [00] 28991519113903915 [+  187]: jiffies_to_timeval <-acct_update_integrals
            perl-2581  [00] 28991519113904103 [+  156]: net_tx_action <-__do_softirq
            perl-2581  [00] 28991519113904259 [+  125]: _spin_trylock <-net_tx_action
            perl-2581  [00] 28991519113904384 [+  125]: __qdisc_run <-net_tx_action
            perl-2581  [00] 28991519113904509 [+  156]: pfifo_fast_dequeue <-__qdisc_run
            perl-2581  [00] 28991519113904665 [+  187]: tasklet_action <-__do_softirq
            perl-2581  [00] 28991519113904853 [+   93]: qdio_outbound_processing <-tasklet_action
            perl-2581  [00] 28991519113904946 [+  187]: qdio_perf_stat_inc <-qdio_outbound_processing
            perl-2581  [00] 28991519113905134 [+  156]: mod_timer <-qdio_outbound_processing
            perl-2581  [00] 28991519113905290 [+  125]: __timer_stats_timer_set_start_info <-mod_timer
            perl-2581  [00] 28991519113905415 [+  125]: __mod_timer <-mod_timer
            perl-2581  [00] 28991519113905540 [+  125]: __timer_stats_timer_set_start_info <-__mod_timer
            perl-2581  [00] 28991519113905665 [+   93]: lock_timer_base <-__mod_timer
            perl-2581  [00] 28991519113905759 [+  156]: _spin_lock_irqsave <-lock_timer_base
            perl-2581  [00] 28991519113905915 [+  125]: internal_add_timer <-__mod_timer
            perl-2581  [00] 28991519113906040 [+  125]: _spin_unlock_irqrestore <-__mod_timer
            perl-2581  [00] 28991519113906165 [+  156]: qdio_perf_stat_inc <-qdio_outbound_processing
            perl-2581  [00] 28991519113906321 [+  125]: account_system_vtime <-__do_softirq
            perl-2581  [00] 28991519113906446 [+   62]: account_system_time <-account_system_vtime
            perl-2581  [00] 28991519113906509 [+  125]: acct_update_integrals <-account_system_time
            perl-2581  [00] 28991519113906634 [+  125]: jiffies_to_timeval <-acct_update_integrals
            perl-2581  [00] 28991519113906759 [+  156]: _local_bh_enable <-__do_softirq
            perl-2581  [00] 28991519113906915 [+  125]: tcp_event_new_data_sent <-__tcp_push_pending_frames
            perl-2581  [00] 28991519113907040 [+  125]: sk_reset_timer <-tcp_event_new_data_sent
            perl-2581  [00] 28991519113907165 [+   93]: mod_timer <-sk_reset_timer
            perl-2581  [00] 28991519113907259 [+  187]: __timer_stats_timer_set_start_info <-mod_timer
            perl-2581  [00] 28991519113907446 [+  156]: release_sock <-tcp_sendmsg
            perl-2581  [00] 28991519113907603 [+  125]: _spin_lock_bh <-release_sock
            perl-2581  [00] 28991519113907728 [+  125]: local_bh_disable <-_spin_lock_bh
            perl-2581  [00] 28991519113907853 [+  125]: _spin_unlock_bh <-release_sock
            perl-2581  [00] 28991519113907978 [+  187]: local_bh_enable_ip <-_spin_unlock_bh
            perl-2581  [00] 28991519113908165 [+   93]: dnotify_parent <-vfs_write
            perl-2581  [00] 28991519113908259 [+  156]: _spin_lock <-dnotify_parent
            perl-2581  [00] 28991519113908415 [+  156]: inotify_dentry_parent_queue_event <-vfs_write
            perl-2581  [00] 28991519113908571 [+  875]: inotify_inode_queue_event <-vfs_write
            perl-2581  [00] 28991519113909446 [+   93]: SyS_read <-sysc_noemu
            perl-2581  [00] 28991519113909540 [+  156]: fget_light <-SyS_read
            perl-2581  [00] 28991519113909696 [+  125]: vfs_read <-SyS_read
            perl-2581  [00] 28991519113909821 [+  125]: rw_verify_area <-vfs_read
            perl-2581  [00] 28991519113909946 [+  125]: security_file_permission <-rw_verify_area
            perl-2581  [00] 28991519113910071 [+  125]: apparmor_file_permission <-security_file_permission
            perl-2581  [00] 28991519113910196 [+  125]: aa_file_permission <-apparmor_file_permission
            perl-2581  [00] 28991519113910321 [+  156]: do_sync_read <-vfs_read
            perl-2581  [00] 28991519113910478 [+  125]: sock_aio_read <-do_sync_read
            perl-2581  [00] 28991519113910603 [+  125]: security_socket_recvmsg <-sock_aio_read
            perl-2581  [00] 28991519113910728 [+   93]: apparmor_socket_recvmsg <-security_socket_recvmsg
            perl-2581  [00] 28991519113910821 [+  156]: aa_revalidate_sk <-apparmor_socket_recvmsg
            perl-2581  [00] 28991519113910978 [+  125]: sock_common_recvmsg <-sock_aio_read
            perl-2581  [00] 28991519113911103 [+  125]: tcp_recvmsg <-sock_common_recvmsg
            perl-2581  [00] 28991519113911228 [+  125]: lock_sock_nested <-tcp_recvmsg
            perl-2581  [00] 28991519113911353 [+  125]: _spin_lock_bh <-lock_sock_nested
            perl-2581  [00] 28991519113911478 [+  125]: local_bh_disable <-_spin_lock_bh
            perl-2581  [00] 28991519113911603 [+  250]: local_bh_enable <-lock_sock_nested
            perl-2581  [00] 28991519113911853 [+  125]: tcp_cleanup_rbuf <-tcp_recvmsg
            perl-2581  [00] 28991519113911978 [+  125]: sk_wait_data <-tcp_recvmsg
            perl-2581  [00] 28991519113912103 [+  125]: prepare_to_wait <-sk_wait_data
            perl-2581  [00] 28991519113912228 [+  125]: _spin_lock_irqsave <-prepare_to_wait
            perl-2581  [00] 28991519113912353 [+  125]: _spin_unlock_irqrestore <-prepare_to_wait
            perl-2581  [00] 28991519113912478 [+  125]: release_sock <-sk_wait_data
            perl-2581  [00] 28991519113912603 [+  125]: _spin_lock_bh <-release_sock
            perl-2581  [00] 28991519113912728 [+  125]: local_bh_disable <-_spin_lock_bh
            perl-2581  [00] 28991519113912853 [+  125]: _spin_unlock_bh <-release_sock
            perl-2581  [00] 28991519113912978 [+  187]: local_bh_enable_ip <-_spin_unlock_bh
            perl-2581  [00] 28991519113913165 [+  125]: schedule_timeout <-sk_wait_data
            perl-2581  [00] 28991519113913290 [+  125]: schedule <-schedule_timeout
            perl-2581  [00] 28991519113913415 [+  125]: _spin_lock <-schedule
            perl-2581  [00] 28991519113913540 [+   93]: sched_clock_cpu <-schedule
            perl-2581  [00] 28991519113913634 [+  125]: sched_clock <-sched_clock_cpu
            perl-2581  [00] 28991519113913759 [+   93]: deactivate_task <-schedule
            perl-2581  [00] 28991519113913853 [+  125]: dequeue_task <-deactivate_task
            perl-2581  [00] 28991519113913978 [+   93]: dequeue_task_fair <-dequeue_task
            perl-2581  [00] 28991519113914071 [+   93]: dequeue_entity <-dequeue_task_fair
            perl-2581  [00] 28991519113914165 [+  156]: update_curr <-dequeue_entity
            perl-2581  [00] 28991519113914321 [+   93]: hrtick_start_fair <-dequeue_task_fair
            perl-2581  [00] 28991519113914415 [+  125]: put_prev_task_fair <-schedule
            perl-2581  [00] 28991519113914540 [+  125]: pick_next_task_fair <-schedule
            perl-2581  [00] 28991519113914665 [+   93]: pick_next_task_rt <-schedule
            perl-2581  [00] 28991519113914759 [+   93]: pick_next_task_fair <-schedule
            perl-2581  [00] 28991519113914853 [+  125]: pick_next_task_idle <-schedule
            perl-2581  [00] 28991519113914978 [+  218]: _spin_trylock <-tracing_record_cmdline
          <idle>-0     [00] 28991519113915196 [+  125]: finish_task_switch <-schedule
          <idle>-0     [00] 28991519113915321 [+   62]: account_vtime <-finish_task_switch
          <idle>-0     [00] 28991519113915384 [+   93]: account_user_time <-account_vtime
          <idle>-0     [00] 28991519113915478 [+   93]: acct_update_integrals <-account_user_time
          <idle>-0     [00] 28991519113915571 [+  125]: jiffies_to_timeval <-acct_update_integrals
          <idle>-0     [00] 28991519113915696 [+   93]: account_system_time <-account_vtime
          <idle>-0     [00] 28991519113915790 [+   93]: acct_update_integrals <-account_system_time
          <idle>-0     [00] 28991519113915884 [+  187]: jiffies_to_timeval <-acct_update_integrals
          <idle>-0     [00] 28991519113916071 [+  125]: tick_nohz_stop_sched_tick <-cpu_idle
          <idle>-0     [00] 28991519113916196 [+   93]: ktime_get <-tick_nohz_stop_sched_tick
          <idle>-0     [00] 28991519113916290 [+   93]: ktime_get_ts <-ktime_get
          <idle>-0     [00] 28991519113916384 [+   93]: getnstimeofday <-ktime_get_ts
          <idle>-0     [00] 28991519113916478 [+  125]: read_tod_clock <-getnstimeofday
          <idle>-0     [00] 28991519113916603 [+  156]: set_normalized_timespec <-ktime_get_ts
          <idle>-0     [00] 28991519113916759 [+   93]: get_next_timer_interrupt <-tick_nohz_stop_sched_tick
          <idle>-0     [00] 28991519113916853 [+  125]: _spin_lock <-get_next_timer_interrupt
          <idle>-0     [00] 28991519113916978 [+   93]: hrtimer_get_next_event <-get_next_timer_interrupt
          <idle>-0     [00] 28991519113917071 [+  125]: _spin_lock_irqsave <-hrtimer_get_next_event
          <idle>-0     [00] 28991519113917196 [+   93]: _spin_unlock_irqrestore <-hrtimer_get_next_event
          <idle>-0     [00] 28991519113917290 [+   93]: rcu_needs_cpu <-tick_nohz_stop_sched_tick
          <idle>-0     [00] 28991519113917384 [+  156]: rcu_pending <-rcu_needs_cpu
          <idle>-0     [00] 28991519113917540 [+   93]: select_nohz_load_balancer <-tick_nohz_stop_sched_tick
          <idle>-0     [00] 28991519113917634 [+   93]: hweight64 <-select_nohz_load_balancer
          <idle>-0     [00] 28991519113917728 [+  156]: hweight64 <-select_nohz_load_balancer
          <idle>-0     [00] 28991519113917884 [+   62]: hrtimer_start <-tick_nohz_stop_sched_tick
          <idle>-0     [00] 28991519113917946 [+   93]: lock_hrtimer_base <-hrtimer_start
          <idle>-0     [00] 28991519113918040 [+  125]: _spin_lock_irqsave <-lock_hrtimer_base
          <idle>-0     [00] 28991519113918165 [+   93]: __remove_hrtimer <-hrtimer_start
          <idle>-0     [00] 28991519113918259 [+  125]: rb_next <-__remove_hrtimer
          <idle>-0     [00] 28991519113918384 [+  125]: hrtimer_force_reprogram <-__remove_hrtimer
          <idle>-0     [00] 28991519113918509 [+   93]: tick_program_event <-hrtimer_force_reprogram
          <idle>-0     [00] 28991519113918603 [+   93]: tick_dev_program_event <-tick_program_event
          <idle>-0     [00] 28991519113918696 [+   93]: ktime_get <-tick_dev_program_event
          <idle>-0     [00] 28991519113918790 [+   93]: ktime_get_ts <-ktime_get
          <idle>-0     [00] 28991519113918884 [+   62]: getnstimeofday <-ktime_get_ts
          <idle>-0     [00] 28991519113918946 [+  125]: read_tod_clock <-getnstimeofday
          <idle>-0     [00] 28991519113919071 [+  125]: set_normalized_timespec <-ktime_get_ts
          <idle>-0     [00] 28991519113919196 [+   93]: clockevents_program_event <-tick_dev_program_event
          <idle>-0     [00] 28991519113919290 [+   93]: s390_next_event <-clockevents_program_event
          <idle>-0     [00] 28991519113919384 [+  156]: rb_erase <-__remove_hrtimer
          <idle>-0     [00] 28991519113919540 [+   93]: __timer_stats_hrtimer_set_start_info <-hrtimer_start
          <idle>-0     [00] 28991519113919634 [+   93]: enqueue_hrtimer <-hrtimer_start
          <idle>-0     [00] 28991519113919728 [+  125]: hrtimer_reprogram <-enqueue_hrtimer
          <idle>-0     [00] 28991519113919853 [+   62]: tick_program_event <-hrtimer_reprogram
          <idle>-0     [00] 28991519113919915 [+   93]: tick_dev_program_event <-tick_program_event
          <idle>-0     [00] 28991519113920009 [+   93]: ktime_get <-tick_dev_program_event
          <idle>-0     [00] 28991519113920103 [+   93]: ktime_get_ts <-ktime_get
          <idle>-0     [00] 28991519113920196 [+   62]: getnstimeofday <-ktime_get_ts
          <idle>-0     [00] 28991519113920259 [+  125]: read_tod_clock <-getnstimeofday
          <idle>-0     [00] 28991519113920384 [+  125]: set_normalized_timespec <-ktime_get_ts
          <idle>-0     [00] 28991519113920509 [+   62]: clockevents_program_event <-tick_dev_program_event
          <idle>-0     [00] 28991519113920571 [+  125]: s390_next_event <-clockevents_program_event
          <idle>-0     [00] 28991519113920696 [+  125]: rb_insert_color <-enqueue_hrtimer
          <idle>-0     [00] 28991519113920821 [+  156]: _spin_unlock_irqrestore <-hrtimer_start
          <idle>-0     [00] 28991519113920978 [+  125]: _spin_lock <-cpu_idle
          <idle>-0     [00] 28991519113921103 [+229750]: vtime_stop_cpu_timer <-cpu_idle
          <idle>-0     [00] 28991519114150853 [+   93]: do_IRQ <-io_return
          <idle>-0     [00] 28991519114150946 [+   62]: irq_enter <-do_IRQ
          <idle>-0     [00] 28991519114151009 [+  125]: idle_cpu <-irq_enter
          <idle>-0     [00] 28991519114151134 [+   93]: tick_nohz_stop_idle <-irq_enter
          <idle>-0     [00] 28991519114151228 [+   62]: ktime_get <-tick_nohz_stop_idle
          <idle>-0     [00] 28991519114151290 [+  125]: ktime_get_ts <-ktime_get
          <idle>-0     [00] 28991519114151415 [+   62]: getnstimeofday <-ktime_get_ts
          <idle>-0     [00] 28991519114151478 [+   93]: read_tod_clock <-getnstimeofday
          <idle>-0     [00] 28991519114151571 [+  125]: set_normalized_timespec <-ktime_get_ts
          <idle>-0     [00] 28991519114151696 [+   93]: account_system_vtime <-irq_enter
          <idle>-0     [00] 28991519114151790 [+   93]: account_system_time <-account_system_vtime
          <idle>-0     [00] 28991519114151884 [+  125]: acct_update_integrals <-account_system_time
          <idle>-0     [00] 28991519114152009 [+   93]: idle_cpu <-irq_enter
          <idle>-0     [00] 28991519114152103 [+   62]: tick_nohz_update_jiffies <-irq_enter
          <idle>-0     [00] 28991519114152165 [+   93]: ktime_get <-tick_nohz_update_jiffies
          <idle>-0     [00] 28991519114152259 [+   93]: ktime_get_ts <-ktime_get
          <idle>-0     [00] 28991519114152353 [+   62]: getnstimeofday <-ktime_get_ts
          <idle>-0     [00] 28991519114152415 [+   93]: read_tod_clock <-getnstimeofday
          <idle>-0     [00] 28991519114152509 [+  125]: set_normalized_timespec <-ktime_get_ts
          <idle>-0     [00] 28991519114152634 [+  125]: tick_do_update_jiffies64 <-tick_nohz_update_jiffies
          <idle>-0     [00] 28991519114152759 [+   62]: s390_idle_leave <-do_IRQ
          <idle>-0     [00] 28991519114152821 [+  125]: vtime_start_cpu_timer <-s390_idle_leave
          <idle>-0     [00] 28991519114152946 [+   93]: _spin_lock <-s390_idle_leave
          <idle>-0     [00] 28991519114153040 [+  250]: do_adapter_IO <-do_IRQ
          <idle>-0     [00] 28991519114153290 [+   93]: tiqdio_thinint_handler <-do_adapter_IO
          <idle>-0     [00] 28991519114153384 [+  156]: qdio_perf_stat_inc <-tiqdio_thinint_handler
          <idle>-0     [00] 28991519114153540 [+  250]: __tasklet_hi_schedule <-tiqdio_thinint_handler
          <idle>-0     [00] 28991519114153790 [+   93]: irq_exit <-do_IRQ
          <idle>-0     [00] 28991519114153884 [+   93]: account_system_vtime <-irq_exit
          <idle>-0     [00] 28991519114153978 [+   93]: account_system_time <-account_system_vtime
          <idle>-0     [00] 28991519114154071 [+   93]: acct_update_integrals <-account_system_time
          <idle>-0     [00] 28991519114154165 [+   93]: do_softirq <-irq_exit
          <idle>-0     [00] 28991519114154259 [+   93]: __do_softirq <-do_softirq
          <idle>-0     [00] 28991519114154353 [+   62]: account_system_vtime <-__do_softirq
          <idle>-0     [00] 28991519114154415 [+  125]: account_system_time <-account_system_vtime
          <idle>-0     [00] 28991519114154540 [+   93]: acct_update_integrals <-account_system_time
          <idle>-0     [00] 28991519114154634 [+   93]: tasklet_hi_action <-__do_softirq
          <idle>-0     [00] 28991519114154728 [+   93]: tiqdio_tasklet_fn <-tasklet_hi_action
          <idle>-0     [00] 28991519114154821 [+  218]: qdio_perf_stat_inc <-tiqdio_tasklet_fn
          <idle>-0     [00] 28991519114155040 [+  218]: __tasklet_schedule <-tiqdio_tasklet_fn
          <idle>-0     [00] 28991519114155259 [+  125]: tasklet_action <-__do_softirq
          <idle>-0     [00] 28991519114155384 [+   93]: tiqdio_inbound_processing <-tasklet_action
          <idle>-0     [00] 28991519114155478 [+   93]: qdio_perf_stat_inc <-tiqdio_inbound_processing
          <idle>-0     [00] 28991519114155571 [+  125]: qdio_sync_after_thinint <-tiqdio_inbound_processing
          <idle>-0     [00] 28991519114155696 [+   93]: qdio_siga_sync <-qdio_sync_after_thinint
          <idle>-0     [00] 28991519114155790 [+  125]: qdio_check_outbound_after_thinint <-tiqdio_inbound_processing
          <idle>-0     [00] 28991519114155915 [+  125]: __tasklet_schedule <-qdio_check_outbound_after_thinint
          <idle>-0     [00] 28991519114156040 [+  218]: qdio_inbound_q_moved <-tiqdio_inbound_processing
          <idle>-0     [00] 28991519114156259 [+   93]: qdio_kick_inbound_handler <-tiqdio_inbound_processing
          <idle>-0     [00] 28991519114156353 [+   93]: qdio_perf_stat_inc <-qdio_kick_inbound_handler
          <idle>-0     [00] 28991519114156446 [+  156]: qeth_l3_qdio_input_handler <-qdio_kick_inbound_handler
          <idle>-0     [00] 28991519114156603 [+  218]: qeth_core_get_next_skb <-qeth_l3_qdio_input_handler
          <idle>-0     [00] 28991519114156821 [+   62]: dev_alloc_skb <-qeth_core_get_next_skb
          <idle>-0     [00] 28991519114156884 [+  125]: __alloc_skb <-dev_alloc_skb
          <idle>-0     [00] 28991519114157009 [+   93]: kmem_cache_alloc <-__alloc_skb
          <idle>-0     [00] 28991519114157103 [+  187]: __kmalloc <-__alloc_skb
          <idle>-0     [00] 28991519114157290 [+  187]: skb_put <-qeth_core_get_next_skb
          <idle>-0     [00] 28991519114157478 [+   93]: eth_header <-qeth_l3_qdio_input_handler
          <idle>-0     [00] 28991519114157571 [+   93]: skb_push <-eth_header
          <idle>-0     [00] 28991519114157665 [+   93]: eth_type_trans <-qeth_l3_qdio_input_handler
          <idle>-0     [00] 28991519114157759 [+  125]: skb_pull <-eth_type_trans
          <idle>-0     [00] 28991519114157884 [+   93]: netif_rx <-qeth_l3_qdio_input_handler
          <idle>-0     [00] 28991519114157978 [+  125]: __napi_schedule <-netif_rx
          <idle>-0     [00] 28991519114158103 [+  125]: qeth_core_get_next_skb <-qeth_l3_qdio_input_handler
          <idle>-0     [00] 28991519114158228 [+  125]: qeth_queue_input_buffer <-qeth_l3_qdio_input_handler
          <idle>-0     [00] 28991519114158353 [+   62]: qdio_siga_sync_q <-tiqdio_inbound_processing
          <idle>-0     [00] 28991519114158415 [+  125]: qdio_siga_sync <-qdio_siga_sync_q
          <idle>-0     [00] 28991519114158540 [+   93]: get_buf_state <-tiqdio_inbound_processing
          <idle>-0     [00] 28991519114158634 [+   93]: qdio_stop_polling <-tiqdio_inbound_processing
          <idle>-0     [00] 28991519114158728 [+  125]: qdio_perf_stat_inc <-qdio_stop_polling
          <idle>-0     [00] 28991519114158853 [+   93]: qdio_siga_sync_q <-tiqdio_inbound_processing
          <idle>-0     [00] 28991519114158946 [+   93]: qdio_siga_sync <-qdio_siga_sync_q
          <idle>-0     [00] 28991519114159040 [+  156]: get_buf_state <-tiqdio_inbound_processing
          <idle>-0     [00] 28991519114159196 [+   93]: net_rx_action <-__do_softirq
          <idle>-0     [00] 28991519114159290 [+   93]: process_backlog <-net_rx_action
          <idle>-0     [00] 28991519114159384 [+  156]: netif_receive_skb <-process_backlog
          <idle>-0     [00] 28991519114159540 [+  125]: ip_rcv <-netif_receive_skb
          <idle>-0     [00] 28991519114159665 [+  125]: ip_rcv_finish <-ip_rcv
          <idle>-0     [00] 28991519114159790 [+  156]: ip_route_input <-ip_rcv_finish
          <idle>-0     [00] 28991519114159946 [+   62]: ip_local_deliver <-ip_rcv_finish
          <idle>-0     [00] 28991519114160009 [+   93]: ip_local_deliver_finish <-ip_local_deliver
          <idle>-0     [00] 28991519114160103 [+  125]: raw_local_deliver <-ip_local_deliver_finish
          <idle>-0     [00] 28991519114160228 [+   93]: tcp_v4_rcv <-ip_local_deliver_finish
          <idle>-0     [00] 28991519114160321 [+  125]: __inet_lookup_established <-tcp_v4_rcv
          <idle>-0     [00] 28991519114160446 [+  125]: _read_lock <-__inet_lookup_established
          <idle>-0     [00] 28991519114160571 [+   93]: sk_filter <-tcp_v4_rcv
          <idle>-0     [00] 28991519114160665 [+   93]: security_sock_rcv_skb <-sk_filter
          <idle>-0     [00] 28991519114160759 [+   62]: cap_socket_sock_rcv_skb <-security_sock_rcv_skb
          <idle>-0     [00] 28991519114160821 [+  125]: local_bh_disable <-sk_filter
          <idle>-0     [00] 28991519114160946 [+   93]: local_bh_enable <-sk_filter
          <idle>-0     [00] 28991519114161040 [+   93]: _spin_lock <-tcp_v4_rcv
          <idle>-0     [00] 28991519114161134 [+   93]: __wake_up <-tcp_v4_rcv
          <idle>-0     [00] 28991519114161228 [+   93]: _spin_lock_irqsave <-__wake_up
          <idle>-0     [00] 28991519114161321 [+   93]: __wake_up_common <-__wake_up
          <idle>-0     [00] 28991519114161415 [+   93]: autoremove_wake_function <-__wake_up_common
          <idle>-0     [00] 28991519114161509 [+   62]: default_wake_function <-autoremove_wake_function
          <idle>-0     [00] 28991519114161571 [+  125]: try_to_wake_up <-default_wake_function
          <idle>-0     [00] 28991519114161696 [+   62]: task_rq_lock <-try_to_wake_up
          <idle>-0     [00] 28991519114161759 [+  125]: _spin_lock <-task_rq_lock
          <idle>-0     [00] 28991519114161884 [+  125]: select_task_rq_fair <-try_to_wake_up
          <idle>-0     [00] 28991519114162009 [+   93]: sched_clock_cpu <-try_to_wake_up
          <idle>-0     [00] 28991519114162103 [+   93]: sched_clock <-sched_clock_cpu
          <idle>-0     [00] 28991519114162196 [+   93]: activate_task <-try_to_wake_up
          <idle>-0     [00] 28991519114162290 [+   93]: enqueue_task <-activate_task
          <idle>-0     [00] 28991519114162384 [+   93]: enqueue_task_fair <-enqueue_task
          <idle>-0     [00] 28991519114162478 [+   93]: enqueue_entity <-enqueue_task_fair
          <idle>-0     [00] 28991519114162571 [+   93]: update_curr <-enqueue_entity
          <idle>-0     [00] 28991519114162665 [+   93]: place_entity <-enqueue_entity
          <idle>-0     [00] 28991519114162759 [+   93]: account_scheduler_latency <-enqueue_entity
          <idle>-0     [00] 28991519114162853 [+   93]: __enqueue_entity <-enqueue_entity
          <idle>-0     [00] 28991519114162946 [+   93]: rb_insert_color <-__enqueue_entity
          <idle>-0     [00] 28991519114163040 [+  125]: hrtick_start_fair <-enqueue_task_fair
          <idle>-0     [00] 28991519114163165 [+   93]: check_preempt_curr_idle <-try_to_wake_up
          <idle>-0     [00] 28991519114163259 [+   62]: resched_task <-check_preempt_curr_idle
          <idle>-0     [00] 28991519114163321 [+  125]: _spin_unlock_irqrestore <-try_to_wake_up
          <idle>-0     [00] 28991519114163446 [+   93]: _spin_unlock_irqrestore <-__wake_up
          <idle>-0     [00] 28991519114163540 [+   93]: sk_reset_timer <-tcp_v4_rcv
          <idle>-0     [00] 28991519114163634 [+  125]: mod_timer <-sk_reset_timer
          <idle>-0     [00] 28991519114163759 [+  125]: __timer_stats_timer_set_start_info <-mod_timer
          <idle>-0     [00] 28991519114163884 [+  125]: __mod_timer <-mod_timer
          <idle>-0     [00] 28991519114164009 [+  125]: __timer_stats_timer_set_start_info <-__mod_timer
          <idle>-0     [00] 28991519114164134 [+  125]: lock_timer_base <-__mod_timer
          <idle>-0     [00] 28991519114164259 [+  125]: _spin_lock_irqsave <-lock_timer_base
          <idle>-0     [00] 28991519114164384 [+  125]: internal_add_timer <-__mod_timer
          <idle>-0     [00] 28991519114164509 [+  281]: _spin_unlock_irqrestore <-__mod_timer
          <idle>-0     [00] 28991519114164790 [+  156]: tasklet_action <-__do_softirq
          <idle>-0     [00] 28991519114164946 [+  125]: qdio_outbound_processing <-tasklet_action
          <idle>-0     [00] 28991519114165071 [+  250]: qdio_perf_stat_inc <-qdio_outbound_processing
          <idle>-0     [00] 28991519114165321 [+  187]: qeth_qdio_output_handler <-qdio_outbound_processing
          <idle>-0     [00] 28991519114165509 [+  125]: qeth_check_qdio_errors <-qeth_qdio_output_handler
          <idle>-0     [00] 28991519114165634 [+  125]: qeth_clear_output_buffer <-qeth_qdio_output_handler
          <idle>-0     [00] 28991519114165759 [+   93]: skb_dequeue <-qeth_clear_output_buffer
          <idle>-0     [00] 28991519114165853 [+  125]: _spin_lock_irqsave <-skb_dequeue
          <idle>-0     [00] 28991519114165978 [+  156]: _spin_unlock_irqrestore <-skb_dequeue
          <idle>-0     [00] 28991519114166134 [+  156]: dev_kfree_skb_any <-qeth_clear_output_buffer
          <idle>-0     [00] 28991519114166290 [+  125]: kfree_skb <-dev_kfree_skb_any
          <idle>-0     [00] 28991519114166415 [+  125]: __kfree_skb <-kfree_skb
          <idle>-0     [00] 28991519114166540 [+  125]: skb_release_all <-__kfree_skb
          <idle>-0     [00] 28991519114166665 [+  125]: dst_release <-skb_release_all
          <idle>-0     [00] 28991519114166790 [+  125]: skb_release_data <-skb_release_all
          <idle>-0     [00] 28991519114166915 [+  218]: kfree <-skb_release_data
          <idle>-0     [00] 28991519114167134 [+  156]: kmem_cache_free <-__kfree_skb
          <idle>-0     [00] 28991519114167290 [+  125]: skb_dequeue <-qeth_clear_output_buffer
          <idle>-0     [00] 28991519114167415 [+  125]: _spin_lock_irqsave <-skb_dequeue
          <idle>-0     [00] 28991519114167540 [+  125]: _spin_unlock_irqrestore <-skb_dequeue
          <idle>-0     [00] 28991519114167665 [+  312]: qeth_eddp_buf_release_contexts <-qeth_clear_output_buffer
          <idle>-0     [00] 28991519114167978 [+  156]: qeth_switch_to_nonpacking_if_needed <-qeth_qdio_output_handler
          <idle>-0     [00] 28991519114168134 [+  125]: __netif_schedule <-qeth_qdio_output_handler
          <idle>-0     [00] 28991519114168259 [+  156]: raise_softirq_irqoff <-__netif_schedule
          <idle>-0     [00] 28991519114168415 [+  125]: del_timer <-qdio_outbound_processing
          <idle>-0     [00] 28991519114168540 [+   93]: lock_timer_base <-del_timer
          <idle>-0     [00] 28991519114168634 [+  156]: _spin_lock_irqsave <-lock_timer_base
          <idle>-0     [00] 28991519114168790 [+  218]: _spin_unlock_irqrestore <-del_timer
          <idle>-0     [00] 28991519114169009 [+  187]: net_tx_action <-__do_softirq
          <idle>-0     [00] 28991519114169196 [+  125]: _spin_trylock <-net_tx_action
          <idle>-0     [00] 28991519114169321 [+  125]: __qdisc_run <-net_tx_action
          <idle>-0     [00] 28991519114169446 [+  156]: pfifo_fast_dequeue <-__qdisc_run
          <idle>-0     [00] 28991519114169603 [+   93]: account_system_vtime <-__do_softirq
          <idle>-0     [00] 28991519114169696 [+  125]: account_system_time <-account_system_vtime
          <idle>-0     [00] 28991519114169821 [+   93]: acct_update_integrals <-account_system_time
          <idle>-0     [00] 28991519114169915 [+  125]: _local_bh_enable <-__do_softirq
          <idle>-0     [00] 28991519114170040 [+  187]: idle_cpu <-irq_exit
          <idle>-0     [00] 28991519114170228 [+  125]: tick_nohz_restart_sched_tick <-cpu_idle
          <idle>-0     [00] 28991519114170353 [+   93]: tick_nohz_stop_idle <-tick_nohz_restart_sched_tick
          <idle>-0     [00] 28991519114170446 [+   93]: select_nohz_load_balancer <-tick_nohz_restart_sched_tick
          <idle>-0     [00] 28991519114170540 [+   93]: ktime_get <-tick_nohz_restart_sched_tick
          <idle>-0     [00] 28991519114170634 [+  125]: ktime_get_ts <-ktime_get
          <idle>-0     [00] 28991519114170759 [+   62]: getnstimeofday <-ktime_get_ts
          <idle>-0     [00] 28991519114170821 [+  156]: read_tod_clock <-getnstimeofday
          <idle>-0     [00] 28991519114170978 [+   93]: set_normalized_timespec <-ktime_get_ts
          <idle>-0     [00] 28991519114171071 [+  125]: tick_do_update_jiffies64 <-tick_nohz_restart_sched_tick
          <idle>-0     [00] 28991519114171196 [+   93]: hrtimer_cancel <-tick_nohz_restart_sched_tick
          <idle>-0     [00] 28991519114171290 [+   93]: hrtimer_try_to_cancel <-hrtimer_cancel
          <idle>-0     [00] 28991519114171384 [+   62]: lock_hrtimer_base <-hrtimer_try_to_cancel
          <idle>-0     [00] 28991519114171446 [+  125]: _spin_lock_irqsave <-lock_hrtimer_base
          <idle>-0     [00] 28991519114171571 [+   93]: __remove_hrtimer <-hrtimer_try_to_cancel
          <idle>-0     [00] 28991519114171665 [+   93]: rb_next <-__remove_hrtimer
          <idle>-0     [00] 28991519114171759 [+   93]: hrtimer_force_reprogram <-__remove_hrtimer
          <idle>-0     [00] 28991519114171853 [+   93]: tick_program_event <-hrtimer_force_reprogram
          <idle>-0     [00] 28991519114171946 [+   93]: tick_dev_program_event <-tick_program_event
          <idle>-0     [00] 28991519114172040 [+  125]: ktime_get <-tick_dev_program_event
          <idle>-0     [00] 28991519114172165 [+   62]: ktime_get_ts <-ktime_get
          <idle>-0     [00] 28991519114172228 [+   93]: getnstimeofday <-ktime_get_ts
          <idle>-0     [00] 28991519114172321 [+   93]: read_tod_clock <-getnstimeofday
          <idle>-0     [00] 28991519114172415 [+  156]: set_normalized_timespec <-ktime_get_ts
          <idle>-0     [00] 28991519114172571 [+   93]: clockevents_program_event <-tick_dev_program_event
          <idle>-0     [00] 28991519114172665 [+  125]: s390_next_event <-clockevents_program_event
          <idle>-0     [00] 28991519114172790 [+  125]: rb_erase <-__remove_hrtimer
          <idle>-0     [00] 28991519114172915 [+   93]: _spin_unlock_irqrestore <-hrtimer_try_to_cancel
          <idle>-0     [00] 28991519114173009 [+   93]: hrtimer_forward <-tick_nohz_restart_sched_tick
          <idle>-0     [00] 28991519114173103 [+  125]: hrtimer_start <-tick_nohz_restart_sched_tick
          <idle>-0     [00] 28991519114173228 [+   62]: lock_hrtimer_base <-hrtimer_start
          <idle>-0     [00] 28991519114173290 [+  125]: _spin_lock_irqsave <-lock_hrtimer_base
          <idle>-0     [00] 28991519114173415 [+   93]: __timer_stats_hrtimer_set_start_info <-hrtimer_start
          <idle>-0     [00] 28991519114173509 [+   93]: enqueue_hrtimer <-hrtimer_start
          <idle>-0     [00] 28991519114173603 [+   93]: hrtimer_reprogram <-enqueue_hrtimer
          <idle>-0     [00] 28991519114173696 [+   93]: tick_program_event <-hrtimer_reprogram
          <idle>-0     [00] 28991519114173790 [+   93]: tick_dev_program_event <-tick_program_event
          <idle>-0     [00] 28991519114173884 [+   62]: ktime_get <-tick_dev_program_event
          <idle>-0     [00] 28991519114173946 [+   93]: ktime_get_ts <-ktime_get
          <idle>-0     [00] 28991519114174040 [+   93]: getnstimeofday <-ktime_get_ts
          <idle>-0     [00] 28991519114174134 [+   93]: read_tod_clock <-getnstimeofday
          <idle>-0     [00] 28991519114174228 [+   93]: set_normalized_timespec <-ktime_get_ts
          <idle>-0     [00] 28991519114174321 [+  125]: clockevents_program_event <-tick_dev_program_event
          <idle>-0     [00] 28991519114174446 [+  125]: s390_next_event <-clockevents_program_event
          <idle>-0     [00] 28991519114174571 [+   93]: rb_insert_color <-enqueue_hrtimer
          <idle>-0     [00] 28991519114174665 [+  156]: _spin_unlock_irqrestore <-hrtimer_start
          <idle>-0     [00] 28991519114174821 [+  125]: schedule <-cpu_idle
          <idle>-0     [00] 28991519114174946 [+   93]: _spin_lock <-schedule
          <idle>-0     [00] 28991519114175040 [+  125]: sched_clock_cpu <-schedule
          <idle>-0     [00] 28991519114175165 [+  125]: sched_clock <-sched_clock_cpu
          <idle>-0     [00] 28991519114175290 [+   93]: put_prev_task_idle <-schedule
          <idle>-0     [00] 28991519114175384 [+   93]: pick_next_task_fair <-schedule
          <idle>-0     [00] 28991519114175478 [+   93]: set_next_entity <-pick_next_task_fair
          <idle>-0     [00] 28991519114175571 [+   93]: __dequeue_entity <-set_next_entity
          <idle>-0     [00] 28991519114175665 [+   93]: rb_next <-__dequeue_entity
          <idle>-0     [00] 28991519114175759 [+  125]: rb_erase <-__dequeue_entity
          <idle>-0     [00] 28991519114175884 [+  187]: hrtick_start_fair <-pick_next_task_fair
          <idle>-0     [00] 28991519114176071 [+  187]: _spin_trylock <-tracing_record_cmdline
            perl-2581  [00] 28991519114176259 [+  125]: finish_task_switch <-schedule
            perl-2581  [00] 28991519114176384 [+   93]: account_vtime <-finish_task_switch
            perl-2581  [00] 28991519114176478 [+   93]: account_user_time <-account_vtime
            perl-2581  [00] 28991519114176571 [+   93]: acct_update_integrals <-account_user_time
            perl-2581  [00] 28991519114176665 [+  125]: account_system_time <-account_vtime
            perl-2581  [00] 28991519114176790 [+  156]: acct_update_integrals <-account_system_time
            perl-2581  [00] 28991519114176946 [+  125]: lock_sock_nested <-sk_wait_data
            perl-2581  [00] 28991519114177071 [+  125]: _spin_lock_bh <-lock_sock_nested
            perl-2581  [00] 28991519114177196 [+  125]: local_bh_disable <-_spin_lock_bh
            perl-2581  [00] 28991519114177321 [+  187]: local_bh_enable <-lock_sock_nested
            perl-2581  [00] 28991519114177509 [+  156]: finish_wait <-sk_wait_data
            perl-2581  [00] 28991519114177665 [+  125]: tcp_prequeue_process <-tcp_recvmsg
            perl-2581  [00] 28991519114177790 [+  156]: local_bh_disable <-tcp_prequeue_process
            perl-2581  [00] 28991519114177946 [+  125]: tcp_v4_do_rcv <-tcp_prequeue_process
            perl-2581  [00] 28991519114178071 [+  125]: tcp_rcv_established <-tcp_v4_do_rcv
            perl-2581  [00] 28991519114178196 [+  156]: local_bh_enable <-tcp_rcv_established
            perl-2581  [00] 28991519114178353 [+  125]: skb_copy_datagram_iovec <-tcp_rcv_established
            perl-2581  [00] 28991519114178478 [+  125]: memcpy_toiovec <-skb_copy_datagram_iovec
            perl-2581  [00] 28991519114178603 [+  125]: copy_to_user_mvcos_check <-memcpy_toiovec
            perl-2581  [00] 28991519114178728 [+  187]: copy_to_user_std <-copy_to_user_mvcos_check
            perl-2581  [00] 28991519114178915 [+   93]: tcp_rcv_space_adjust <-tcp_rcv_established
            perl-2581  [00] 28991519114179009 [+  156]: local_bh_disable <-tcp_rcv_established
            perl-2581  [00] 28991519114179165 [+  125]: get_seconds <-tcp_rcv_established
            perl-2581  [00] 28991519114179290 [+  156]: tcp_event_data_recv <-tcp_rcv_established
            perl-2581  [00] 28991519114179446 [+  187]: tcp_ack <-tcp_rcv_established
            perl-2581  [00] 28991519114179634 [+  156]: __kfree_skb <-tcp_ack
            perl-2581  [00] 28991519114179790 [+   93]: skb_release_all <-__kfree_skb
            perl-2581  [00] 28991519114179884 [+  125]: dst_release <-skb_release_all
            perl-2581  [00] 28991519114180009 [+  156]: skb_release_data <-skb_release_all
            perl-2581  [00] 28991519114180165 [+  187]: kfree <-skb_release_data
            perl-2581  [00] 28991519114180353 [+  187]: kmem_cache_free <-__kfree_skb
            perl-2581  [00] 28991519114180540 [+  125]: tcp_rtt_estimator <-tcp_ack
            perl-2581  [00] 28991519114180665 [+  156]: bictcp_acked <-tcp_ack
            perl-2581  [00] 28991519114180821 [+   93]: bictcp_cong_avoid <-tcp_ack
            perl-2581  [00] 28991519114180915 [+  156]: tcp_is_cwnd_limited <-bictcp_cong_avoid
            perl-2581  [00] 28991519114181071 [+  125]: tcp_current_mss <-tcp_rcv_established
            perl-2581  [00] 28991519114181196 [+  125]: tcp_established_options <-tcp_current_mss
            perl-2581  [00] 28991519114181321 [+  125]: __tcp_push_pending_frames <-tcp_rcv_established
            perl-2581  [00] 28991519114181446 [+  125]: __tcp_ack_snd_check <-tcp_rcv_established
            perl-2581  [00] 28991519114181571 [+  125]: tcp_send_delayed_ack <-__tcp_ack_snd_check
            perl-2581  [00] 28991519114181696 [+  125]: sk_reset_timer <-tcp_send_delayed_ack
            perl-2581  [00] 28991519114181821 [+  125]: mod_timer <-sk_reset_timer
            perl-2581  [00] 28991519114181946 [+  125]: __timer_stats_timer_set_start_info <-mod_timer
            perl-2581  [00] 28991519114182071 [+   93]: __mod_timer <-mod_timer
            perl-2581  [00] 28991519114182165 [+  156]: __timer_stats_timer_set_start_info <-__mod_timer
            perl-2581  [00] 28991519114182321 [+   93]: lock_timer_base <-__mod_timer
            perl-2581  [00] 28991519114182415 [+  156]: _spin_lock_irqsave <-lock_timer_base
            perl-2581  [00] 28991519114182571 [+   93]: internal_add_timer <-__mod_timer
            perl-2581  [00] 28991519114182665 [+  156]: _spin_unlock_irqrestore <-__mod_timer
            perl-2581  [00] 28991519114182821 [+   93]: __kfree_skb <-tcp_rcv_established
            perl-2581  [00] 28991519114182915 [+  125]: skb_release_all <-__kfree_skb
            perl-2581  [00] 28991519114183040 [+  125]: dst_release <-skb_release_all
            perl-2581  [00] 28991519114183165 [+  125]: skb_release_data <-skb_release_all
            perl-2581  [00] 28991519114183290 [+  187]: kfree <-skb_release_data
            perl-2581  [00] 28991519114183478 [+  187]: kmem_cache_free <-__kfree_skb
            perl-2581  [00] 28991519114183665 [+  187]: local_bh_enable <-tcp_prequeue_process
            perl-2581  [00] 28991519114183853 [+  125]: tcp_cleanup_rbuf <-tcp_recvmsg
            perl-2581  [00] 28991519114183978 [+  156]: __tcp_select_window <-tcp_cleanup_rbuf
            perl-2581  [00] 28991519114184134 [+  125]: release_sock <-tcp_recvmsg
            perl-2581  [00] 28991519114184259 [+   93]: _spin_lock_bh <-release_sock
            perl-2581  [00] 28991519114184353 [+  125]: local_bh_disable <-_spin_lock_bh
            perl-2581  [00] 28991519114184478 [+  156]: _spin_unlock_bh <-release_sock
            perl-2581  [00] 28991519114184634 [+  218]: local_bh_enable_ip <-_spin_unlock_bh
            perl-2581  [00] 28991519114184853 [+  156]: dnotify_parent <-vfs_read
            perl-2581  [00] 28991519114185009 [+  125]: _spin_lock <-dnotify_parent
            perl-2581  [00] 28991519114185134 [+  125]: inotify_dentry_parent_queue_event <-vfs_read
            perl-2581  [00] 28991519114185259 [+ 1687]: inotify_inode_queue_event <-vfs_read

And this is for 2.6.16:

# tracer: ftrace
#
#           TASK-PID   CPU#    TIMESTAMP  FUNCTION
#              | |      |          |         |
           <...>-2804  [00] 28994586831213199 [+   93]: sys_write <-sysc_noemu
           <...>-2804  [00] 28994586831213292 [+  218]: fget_light <-sys_write
           <...>-2804  [00] 28994586831213511 [+  156]: apparmor_file_permission <-sys_write
           <...>-2804  [00] 28994586831213667 [+  125]: sock_aio_write <-sys_write
           <...>-2804  [00] 28994586831213792 [+  125]: dummy_socket_sendmsg <-sock_aio_write
           <...>-2804  [00] 28994586831213917 [+  125]: inet_sendmsg <-sock_aio_write
           <...>-2804  [00] 28994586831214042 [+   93]: tcp_sendmsg <-inet_sendmsg
           <...>-2804  [00] 28994586831214136 [+  125]: lock_sock <-tcp_sendmsg
           <...>-2804  [00] 28994586831214261 [+  125]: _spin_lock_bh <-lock_sock
           <...>-2804  [00] 28994586831214386 [+  125]: _spin_unlock_bh <-lock_sock
           <...>-2804  [00] 28994586831214511 [+  187]: local_bh_enable <-_spin_unlock_bh
           <...>-2804  [00] 28994586831214699 [+  156]: tcp_current_mss <-tcp_sendmsg
           <...>-2804  [00] 28994586831214855 [+  125]: __alloc_skb <-tcp_sendmsg
           <...>-2804  [00] 28994586831214980 [+  156]: kmem_cache_alloc <-__alloc_skb
           <...>-2804  [00] 28994586831215136 [+  218]: __kmalloc <-__alloc_skb
           <...>-2804  [00] 28994586831215355 [+  187]: memset <-__alloc_skb
           <...>-2804  [00] 28994586831215542 [+  125]: copy_from_user_mvcos_check <-tcp_sendmsg
           <...>-2804  [00] 28994586831215667 [+  218]: copy_from_user_std <-copy_from_user_mvcos_check
           <...>-2804  [00] 28994586831215886 [+  125]: __tcp_push_pending_frames <-tcp_sendmsg
           <...>-2804  [00] 28994586831216011 [+  125]: tcp_transmit_skb <-__tcp_push_pending_frames
           <...>-2804  [00] 28994586831216136 [+  156]: skb_clone <-tcp_transmit_skb
           <...>-2804  [00] 28994586831216292 [+  250]: memcpy <-skb_clone
           <...>-2804  [00] 28994586831216542 [+  156]: tcp_v4_send_check <-tcp_transmit_skb
           <...>-2804  [00] 28994586831216699 [+  187]: ip_queue_xmit <-tcp_transmit_skb
           <...>-2804  [00] 28994586831216886 [+  187]: ip_output <-ip_queue_xmit
           <...>-2804  [00] 28994586831217074 [+  125]: dev_queue_xmit <-ip_output
           <...>-2804  [00] 28994586831217199 [+  125]: _spin_lock <-dev_queue_xmit
           <...>-2804  [00] 28994586831217324 [+  125]: pfifo_fast_enqueue <-dev_queue_xmit
           <...>-2804  [00] 28994586831217449 [+  125]: __qdisc_run <-dev_queue_xmit
           <...>-2804  [00] 28994586831217574 [+  156]: pfifo_fast_dequeue <-__qdisc_run
           <...>-2804  [00] 28994586831217730 [+  125]: _spin_trylock <-__qdisc_run
           <...>-2804  [00] 28994586831217855 [+  156]: dev_hard_start_xmit <-__qdisc_run
           <...>-2804  [00] 28994586831218011 [+  250]: qeth_hard_start_xmit <-dev_hard_start_xmit
           <...>-2804  [00] 28994586831218261 [+  125]: qeth_fill_header <-qeth_hard_start_xmit
           <...>-2804  [00] 28994586831218386 [+  156]: memset <-qeth_fill_header
           <...>-2804  [00] 28994586831218542 [+  187]: memset <-qeth_fill_header
           <...>-2804  [00] 28994586831218730 [+  125]: skb_queue_tail <-qeth_hard_start_xmit
           <...>-2804  [00] 28994586831218855 [+  125]: _spin_lock_irqsave <-skb_queue_tail
           <...>-2804  [00] 28994586831218980 [+  156]: _spin_unlock_irqrestore <-skb_queue_tail
           <...>-2804  [00] 28994586831219136 [+  218]: qeth_flush_buffers <-qeth_hard_start_xmit
           <...>-2804  [00] 28994586831219355 [+  875]: do_QDIO <-qeth_flush_buffers
           <...>-2804  [00] 28994586831220230 [+   93]: mod_timer <-do_QDIO
           <...>-2804  [00] 28994586831220324 [+  187]: _spin_lock_irqsave <-mod_timer
           <...>-2804  [00] 28994586831220511 [+  187]: _spin_unlock_irqrestore <-mod_timer
           <...>-2804  [00] 28994586831220699 [+  125]: raise_softirq_irqoff <-qeth_hard_start_xmit
           <...>-2804  [00] 28994586831220824 [+  156]: _spin_lock <-__qdisc_run
           <...>-2804  [00] 28994586831220980 [+  125]: pfifo_fast_dequeue <-__qdisc_run
           <...>-2804  [00] 28994586831221105 [+  156]: local_bh_enable <-dev_queue_xmit
           <...>-2804  [00] 28994586831221261 [+  156]: do_softirq <-local_bh_enable
           <...>-2804  [00] 28994586831221417 [+   93]: account_system_vtime <-do_softirq
           <...>-2804  [00] 28994586831221511 [+   93]: account_system_time <-account_system_vtime
           <...>-2804  [00] 28994586831221605 [+  125]: acct_update_integrals <-account_system_time
           <...>-2804  [00] 28994586831221730 [+  156]: __do_softirq <-do_softirq
           <...>-2804  [00] 28994586831221886 [+  156]: net_tx_action <-__do_softirq
           <...>-2804  [00] 28994586831222042 [+  125]: _spin_trylock <-net_tx_action
           <...>-2804  [00] 28994586831222167 [+  125]: __qdisc_run <-net_tx_action
           <...>-2804  [00] 28994586831222292 [+  156]: pfifo_fast_dequeue <-__qdisc_run
           <...>-2804  [00] 28994586831222449 [+   93]: account_system_vtime <-do_softirq
           <...>-2804  [00] 28994586831222542 [+  125]: account_system_time <-account_system_vtime
           <...>-2804  [00] 28994586831222667 [+  156]: acct_update_integrals <-account_system_time
           <...>-2804  [00] 28994586831222824 [+  125]: sk_reset_timer <-__tcp_push_pending_frames
           <...>-2804  [00] 28994586831222949 [+  156]: mod_timer <-sk_reset_timer
           <...>-2804  [00] 28994586831223105 [+  125]: release_sock <-tcp_sendmsg
           <...>-2804  [00] 28994586831223230 [+   93]: _spin_lock_bh <-release_sock
           <...>-2804  [00] 28994586831223324 [+  156]: _spin_unlock_bh <-release_sock
           <...>-2804  [00] 28994586831223480 [+  187]: local_bh_enable <-_spin_unlock_bh
           <...>-2804  [00] 28994586831223667 [+   93]: dnotify_parent <-sys_write
           <...>-2804  [00] 28994586831223761 [+  125]: _spin_lock <-dnotify_parent
           <...>-2804  [00] 28994586831223886 [+  156]: inotify_dentry_parent_queue_event <-sys_write
           <...>-2804  [00] 28994586831224042 [+  812]: inotify_inode_queue_event <-sys_write
           <...>-2804  [00] 28994586831224855 [+   93]: sys_read <-sysc_noemu
           <...>-2804  [00] 28994586831224949 [+  187]: fget_light <-sys_read
           <...>-2804  [00] 28994586831225136 [+  125]: apparmor_file_permission <-sys_read
           <...>-2804  [00] 28994586831225261 [+  125]: sock_aio_read <-sys_read
           <...>-2804  [00] 28994586831225386 [+  156]: dummy_socket_recvmsg <-sock_aio_read
           <...>-2804  [00] 28994586831225542 [+   93]: sock_common_recvmsg <-sock_aio_read
           <...>-2804  [00] 28994586831225636 [+  125]: tcp_recvmsg <-sock_common_recvmsg
           <...>-2804  [00] 28994586831225761 [+  125]: lock_sock <-tcp_recvmsg
           <...>-2804  [00] 28994586831225886 [+  125]: _spin_lock_bh <-lock_sock
           <...>-2804  [00] 28994586831226011 [+  125]: _spin_unlock_bh <-lock_sock
           <...>-2804  [00] 28994586831226136 [+  250]: local_bh_enable <-_spin_unlock_bh
           <...>-2804  [00] 28994586831226386 [+  125]: sk_wait_data <-tcp_recvmsg
           <...>-2804  [00] 28994586831226511 [+  125]: prepare_to_wait <-sk_wait_data
           <...>-2804  [00] 28994586831226636 [+  156]: _spin_lock_irqsave <-prepare_to_wait
           <...>-2804  [00] 28994586831226792 [+  156]: _spin_unlock_irqrestore <-prepare_to_wait
           <...>-2804  [00] 28994586831226949 [+   93]: _spin_lock_bh <-sk_wait_data
           <...>-2804  [00] 28994586831227042 [+  125]: _spin_unlock_bh <-sk_wait_data
           <...>-2804  [00] 28994586831227167 [+  187]: local_bh_enable <-_spin_unlock_bh
           <...>-2804  [00] 28994586831227355 [+  125]: schedule_timeout <-sk_wait_data
           <...>-2804  [00] 28994586831227480 [+   93]: schedule <-schedule_timeout
           <...>-2804  [00] 28994586831227574 [+  156]: profile_hit <-schedule
           <...>-2804  [00] 28994586831227730 [+  156]: sched_clock <-schedule
           <...>-2804  [00] 28994586831227886 [+  281]: _spin_lock_irq <-schedule
          <idle>-0     [00] 28994586831228167 [+   93]: account_vtime <-schedule
          <idle>-0     [00] 28994586831228261 [+   93]: account_user_time <-account_vtime
          <idle>-0     [00] 28994586831228355 [+   93]: account_system_time <-account_vtime
          <idle>-0     [00] 28994586831228449 [+  218]: acct_update_integrals <-account_system_time
          <idle>-0     [00] 28994586831228667 [+   93]: notifier_call_chain <-cpu_idle
          <idle>-0     [00] 28994586831228761 [+   93]: nohz_idle_notify <-notifier_call_chain
          <idle>-0     [00] 28994586831228855 [+  156]: rcu_needs_cpu <-nohz_idle_notify
          <idle>-0     [00] 28994586831229011 [+   62]: next_timer_interrupt <-nohz_idle_notify
          <idle>-0     [00] 28994586831229074 [+   93]: hrtimer_get_next_event <-next_timer_interrupt
          <idle>-0     [00] 28994586831229167 [+  125]: _spin_lock_irqsave <-hrtimer_get_next_event
          <idle>-0     [00] 28994586831229292 [+   93]: _spin_unlock_irqrestore <-hrtimer_get_next_event
          <idle>-0     [00] 28994586831229386 [+   93]: _spin_lock_irqsave <-hrtimer_get_next_event
          <idle>-0     [00] 28994586831229480 [+   93]: _spin_unlock_irqrestore <-hrtimer_get_next_event
          <idle>-0     [00] 28994586831229574 [+  125]: ktime_get <-hrtimer_get_next_event
          <idle>-0     [00] 28994586831229699 [+   93]: getnstimeofday <-ktime_get
          <idle>-0     [00] 28994586831229792 [+  156]: do_gettimeofday <-getnstimeofday
          <idle>-0     [00] 28994586831229949 [+  156]: set_normalized_timespec <-ktime_get
          <idle>-0     [00] 28994586831230105 [+  125]: ns_to_timespec <-next_timer_interrupt
          <idle>-0     [00] 28994586831230230 [+  156]: _spin_lock <-next_timer_interrupt
          <idle>-0     [00] 28994586831230386 [+212406]: vtimer_idle_notify <-notifier_call_chain
          <idle>-0     [00] 28994586831442792 [+   93]: do_IRQ <-io_return
          <idle>-0     [00] 28994586831442886 [+   62]: account_system_vtime <-do_IRQ
          <idle>-0     [00] 28994586831442949 [+  125]: account_system_time <-account_system_vtime
          <idle>-0     [00] 28994586831443074 [+  250]: acct_update_integrals <-account_system_time
          <idle>-0     [00] 28994586831443324 [+   93]: do_monitor_call <-sysc_return
          <idle>-0     [00] 28994586831443417 [+   62]: notifier_call_chain <-do_monitor_call
          <idle>-0     [00] 28994586831443480 [+  156]: nohz_idle_notify <-notifier_call_chain
          <idle>-0     [00] 28994586831443636 [+  125]: vtimer_idle_notify <-notifier_call_chain
          <idle>-0     [00] 28994586831443761 [+  125]: do_adapter_IO <-do_IRQ
          <idle>-0     [00] 28994586831443886 [+  500]: tiqdio_thinint_handler <-do_adapter_IO
          <idle>-0     [00] 28994586831444386 [+  156]: qeth_qdio_output_handler <-tiqdio_thinint_handler
          <idle>-0     [00] 28994586831444542 [+   62]: skb_dequeue <-qeth_qdio_output_handler
          <idle>-0     [00] 28994586831444605 [+  125]: _spin_lock_irqsave <-skb_dequeue
          <idle>-0     [00] 28994586831444730 [+  125]: _spin_unlock_irqrestore <-skb_dequeue
          <idle>-0     [00] 28994586831444855 [+   93]: raise_softirq_irqoff <-qeth_qdio_output_handler
          <idle>-0     [00] 28994586831444949 [+   93]: skb_dequeue <-qeth_qdio_output_handler
          <idle>-0     [00] 28994586831445042 [+   93]: _spin_lock_irqsave <-skb_dequeue
          <idle>-0     [00] 28994586831445136 [+   93]: _spin_unlock_irqrestore <-skb_dequeue
          <idle>-0     [00] 28994586831445230 [+  250]: qeth_eddp_buf_release_contexts <-qeth_qdio_output_handler
          <idle>-0     [00] 28994586831445480 [+  125]: raise_softirq_irqoff <-qeth_qdio_output_handler
          <idle>-0     [00] 28994586831445605 [+   93]: del_timer <-tiqdio_thinint_handler
          <idle>-0     [00] 28994586831445699 [+  125]: _spin_lock_irqsave <-del_timer
          <idle>-0     [00] 28994586831445824 [+  218]: _spin_unlock_irqrestore <-del_timer
          <idle>-0     [00] 28994586831446042 [+  250]: qeth_qdio_input_handler <-tiqdio_thinint_handler
          <idle>-0     [00] 28994586831446292 [+   62]: __alloc_skb <-qeth_qdio_input_handler
          <idle>-0     [00] 28994586831446355 [+  125]: kmem_cache_alloc <-__alloc_skb
          <idle>-0     [00] 28994586831446480 [+  125]: __kmalloc <-__alloc_skb
          <idle>-0     [00] 28994586831446605 [+  156]: memset <-__alloc_skb
          <idle>-0     [00] 28994586831446761 [+  187]: memcpy <-qeth_qdio_input_handler
          <idle>-0     [00] 28994586831446949 [+  500]: netif_rx <-qeth_qdio_input_handler
          <idle>-0     [00] 28994586831447449 [+   93]: irq_exit <-do_IRQ
          <idle>-0     [00] 28994586831447542 [+   93]: account_system_vtime <-irq_exit
          <idle>-0     [00] 28994586831447636 [+   93]: account_system_time <-account_system_vtime
          <idle>-0     [00] 28994586831447730 [+   93]: acct_update_integrals <-account_system_time
          <idle>-0     [00] 28994586831447824 [+   93]: do_softirq <-irq_exit
          <idle>-0     [00] 28994586831447917 [+   62]: account_system_vtime <-do_softirq
          <idle>-0     [00] 28994586831447980 [+   93]: account_system_time <-account_system_vtime
          <idle>-0     [00] 28994586831448074 [+  125]: acct_update_integrals <-account_system_time
          <idle>-0     [00] 28994586831448199 [+   93]: __do_softirq <-do_softirq
          <idle>-0     [00] 28994586831448292 [+   93]: net_tx_action <-__do_softirq
          <idle>-0     [00] 28994586831448386 [+  125]: __kfree_skb <-net_tx_action
          <idle>-0     [00] 28994586831448511 [+  156]: sock_wfree <-__kfree_skb
          <idle>-0     [00] 28994586831448667 [+   93]: _spin_trylock <-net_tx_action
          <idle>-0     [00] 28994586831448761 [+   62]: __qdisc_run <-net_tx_action
          <idle>-0     [00] 28994586831448824 [+  156]: pfifo_fast_dequeue <-__qdisc_run
          <idle>-0     [00] 28994586831448980 [+   93]: net_rx_action <-__do_softirq
          <idle>-0     [00] 28994586831449074 [+  156]: process_backlog <-net_rx_action
          <idle>-0     [00] 28994586831449230 [+  156]: ip_rcv <-process_backlog
          <idle>-0     [00] 28994586831449386 [+  125]: ip_route_input <-ip_rcv
          <idle>-0     [00] 28994586831449511 [+  125]: ip_local_deliver <-ip_rcv
          <idle>-0     [00] 28994586831449636 [+  125]: tcp_v4_rcv <-ip_local_deliver
          <idle>-0     [00] 28994586831449761 [+  125]: _read_lock <-tcp_v4_rcv
          <idle>-0     [00] 28994586831449886 [+   93]: dummy_socket_sock_rcv_skb <-tcp_v4_rcv
          <idle>-0     [00] 28994586831449980 [+   93]: _spin_lock <-tcp_v4_rcv
          <idle>-0     [00] 28994586831450074 [+   93]: __wake_up <-tcp_v4_rcv
          <idle>-0     [00] 28994586831450167 [+   93]: _spin_lock_irqsave <-__wake_up
          <idle>-0     [00] 28994586831450261 [+   93]: autoremove_wake_function <-__wake_up
          <idle>-0     [00] 28994586831450355 [+   93]: default_wake_function <-autoremove_wake_function
          <idle>-0     [00] 28994586831450449 [+  125]: _spin_lock <-default_wake_function
          <idle>-0     [00] 28994586831450574 [+  187]: sched_clock <-default_wake_function
          <idle>-0     [00] 28994586831450761 [+  125]: resched_task <-default_wake_function
          <idle>-0     [00] 28994586831450886 [+   93]: _spin_unlock_irqrestore <-default_wake_function
          <idle>-0     [00] 28994586831450980 [+   93]: _spin_unlock_irqrestore <-__wake_up
          <idle>-0     [00] 28994586831451074 [+   93]: sk_reset_timer <-tcp_v4_rcv
          <idle>-0     [00] 28994586831451167 [+   93]: mod_timer <-sk_reset_timer
          <idle>-0     [00] 28994586831451261 [+  125]: _spin_lock_irqsave <-mod_timer
          <idle>-0     [00] 28994586831451386 [+  156]: _spin_unlock_irqrestore <-mod_timer
          <idle>-0     [00] 28994586831451542 [+   93]: account_system_vtime <-do_softirq
          <idle>-0     [00] 28994586831451636 [+   93]: account_system_time <-account_system_vtime
          <idle>-0     [00] 28994586831451730 [+  156]: acct_update_integrals <-account_system_time
          <idle>-0     [00] 28994586831451886 [+  125]: schedule <-cpu_idle
          <idle>-0     [00] 28994586831452011 [+   93]: profile_hit <-schedule
          <idle>-0     [00] 28994586831452105 [+  125]: sched_clock <-schedule
          <idle>-0     [00] 28994586831452230 [+  343]: _spin_lock_irq <-schedule
           <...>-2804  [00] 28994586831452574 [+   93]: account_vtime <-schedule
           <...>-2804  [00] 28994586831452667 [+   62]: account_user_time <-account_vtime
           <...>-2804  [00] 28994586831452730 [+  125]: account_system_time <-account_vtime
           <...>-2804  [00] 28994586831452855 [+  125]: acct_update_integrals <-account_system_time
           <...>-2804  [00] 28994586831452980 [+   93]: _spin_lock_bh <-sk_wait_data
           <...>-2804  [00] 28994586831453074 [+   93]: _spin_unlock_bh <-sk_wait_data
           <...>-2804  [00] 28994586831453167 [+   93]: local_bh_enable <-_spin_unlock_bh
           <...>-2804  [00] 28994586831453261 [+  125]: finish_wait <-sk_wait_data
           <...>-2804  [00] 28994586831453386 [+   93]: tcp_v4_do_rcv <-tcp_recvmsg
           <...>-2804  [00] 28994586831453480 [+   93]: tcp_rcv_established <-tcp_v4_do_rcv
           <...>-2804  [00] 28994586831453574 [+   93]: local_bh_enable <-tcp_rcv_established
           <...>-2804  [00] 28994586831453667 [+   93]: skb_copy_datagram_iovec <-tcp_rcv_established
           <...>-2804  [00] 28994586831453761 [+   93]: memcpy_toiovec <-skb_copy_datagram_iovec
           <...>-2804  [00] 28994586831453855 [+   62]: copy_to_user_mvcos_check <-memcpy_toiovec
           <...>-2804  [00] 28994586831453917 [+  187]: copy_to_user_std <-copy_to_user_mvcos_check
           <...>-2804  [00] 28994586831454105 [+  125]: tcp_ack <-tcp_rcv_established
           <...>-2804  [00] 28994586831454230 [+  125]: __kfree_skb <-tcp_ack
           <...>-2804  [00] 28994586831454355 [+  125]: kfree <-__kfree_skb
           <...>-2804  [00] 28994586831454480 [+  187]: kmem_cache_free <-__kfree_skb
           <...>-2804  [00] 28994586831454667 [+   93]: tcp_reno_cong_avoid <-tcp_ack
           <...>-2804  [00] 28994586831454761 [+   93]: tcp_current_mss <-tcp_rcv_established
           <...>-2804  [00] 28994586831454855 [+  156]: __tcp_push_pending_frames <-tcp_rcv_established
           <...>-2804  [00] 28994586831455011 [+   62]: tcp_send_delayed_ack <-tcp_rcv_established
           <...>-2804  [00] 28994586831455074 [+   93]: sk_reset_timer <-tcp_send_delayed_ack
           <...>-2804  [00] 28994586831455167 [+   93]: mod_timer <-sk_reset_timer
           <...>-2804  [00] 28994586831455261 [+   93]: _spin_lock_irqsave <-mod_timer
           <...>-2804  [00] 28994586831455355 [+   93]: _spin_unlock_irqrestore <-mod_timer
           <...>-2804  [00] 28994586831455449 [+  125]: __kfree_skb <-tcp_rcv_established
           <...>-2804  [00] 28994586831455574 [+   93]: kfree <-__kfree_skb
           <...>-2804  [00] 28994586831455667 [+  125]: kmem_cache_free <-__kfree_skb
           <...>-2804  [00] 28994586831455792 [+  156]: local_bh_enable <-tcp_recvmsg
           <...>-2804  [00] 28994586831455949 [+   93]: __tcp_select_window <-tcp_recvmsg
           <...>-2804  [00] 28994586831456042 [+   93]: release_sock <-tcp_recvmsg
           <...>-2804  [00] 28994586831456136 [+   93]: _spin_lock_bh <-release_sock
           <...>-2804  [00] 28994586831456230 [+   93]: _spin_unlock_bh <-release_sock
           <...>-2804  [00] 28994586831456324 [+  156]: local_bh_enable <-_spin_unlock_bh
           <...>-2804  [00] 28994586831456480 [+   93]: dnotify_parent <-sys_read
           <...>-2804  [00] 28994586831456574 [+  125]: _spin_lock <-dnotify_parent
           <...>-2804  [00] 28994586831456699 [+   62]: inotify_dentry_parent_queue_event <-sys_read
           <...>-2804  [00] 28994586831456761 [+ 1437]: inotify_inode_queue_event <-sys_read

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [patch 2/3] spinlock: allow inlined spinlocks
  2009-08-14 12:58 [patch 0/3] Allow inlined spinlocks again V4 Heiko Carstens
@ 2009-08-14 12:58 ` Heiko Carstens
  2009-08-16 17:57   ` Heiko Carstens
  0 siblings, 1 reply; 34+ messages in thread
From: Heiko Carstens @ 2009-08-14 12:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Peter Zijlstra, Ingo Molnar, linux-arch,
	Martin Schwidefsky, Heiko Carstens, Arnd Bergmann, Horst Hartmann,
	Christian Ehrhardt, Nick Piggin

[-- Attachment #1: 02_spinlock_ifdef.diff --]
[-- Type: text/plain, Size: 11770 bytes --]

From: Heiko Carstens <heiko.carstens@de.ibm.com>

This allows an architecture to specify, per lock variant, whether the
locking code should be kept out-of-line or inlined.

If an architecture wants out-of-line locking code, no change is
needed. To force inlining of e.g. spin_lock(), the line

#define __spin_lock_is_small

needs to be added to arch/<whatever>/include/asm/spinlock.h
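
For example, an architecture where the uncontended lock fast path is a
short compare-and-swap sequence might mark its hottest variants small.
A hypothetical sketch (which variants are actually worth inlining is a
per-architecture decision and should be driven by measurements):

	/* arch/<whatever>/include/asm/spinlock.h */
	/* hypothetical: inline only the most frequently used variants */
	#define __spin_lock_is_small
	#define __spin_unlock_is_small
	#define __spin_lock_irqsave_is_small
	#define __spin_unlock_irqrestore_is_small

With these defines in place the corresponding _spin_* wrappers expand
to the static inline __spin_* functions at each call site, while all
other variants keep their out-of-line definitions in kernel/spinlock.c.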

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
---
 include/linux/spinlock_api_smp.h |  146 ++++++++++++++++++++++++++++++++++++++-
 kernel/spinlock.c                |   56 ++++++++++++++
 2 files changed, 199 insertions(+), 3 deletions(-)

Index: linux-2.6/include/linux/spinlock_api_smp.h
===================================================================
--- linux-2.6.orig/include/linux/spinlock_api_smp.h
+++ linux-2.6/include/linux/spinlock_api_smp.h
@@ -19,46 +19,186 @@ int in_lock_functions(unsigned long addr
 
 #define assert_spin_locked(x)	BUG_ON(!spin_is_locked(x))
 
-void __lockfunc _spin_lock(spinlock_t *lock)		__acquires(lock);
 void __lockfunc _spin_lock_nested(spinlock_t *lock, int subclass)
 							__acquires(lock);
 void __lockfunc _spin_lock_nest_lock(spinlock_t *lock, struct lockdep_map *map)
 							__acquires(lock);
+unsigned long __lockfunc _spin_lock_irqsave_nested(spinlock_t *lock, int subclass)
+							__acquires(lock);
+
+#ifdef __spin_lock_is_small
+#define _spin_lock(lock) __spin_lock(lock)
+#else
+void __lockfunc _spin_lock(spinlock_t *lock)		__acquires(lock);
+#endif
+
+#ifdef __read_lock_is_small
+#define _read_lock(lock) __read_lock(lock)
+#else
 void __lockfunc _read_lock(rwlock_t *lock)		__acquires(lock);
+#endif
+
+#ifdef __write_lock_is_small
+#define _write_lock(lock) __write_lock(lock)
+#else
 void __lockfunc _write_lock(rwlock_t *lock)		__acquires(lock);
+#endif
+
+#ifdef __spin_lock_bh_is_small
+#define _spin_lock_bh(lock) __spin_lock_bh(lock)
+#else
 void __lockfunc _spin_lock_bh(spinlock_t *lock)		__acquires(lock);
+#endif
+
+#ifdef __read_lock_bh_is_small
+#define _read_lock_bh(lock) __read_lock_bh(lock)
+#else
 void __lockfunc _read_lock_bh(rwlock_t *lock)		__acquires(lock);
+#endif
+
+#ifdef __write_lock_bh_is_small
+#define _write_lock_bh(lock) __write_lock_bh(lock)
+#else
 void __lockfunc _write_lock_bh(rwlock_t *lock)		__acquires(lock);
+#endif
+
+#ifdef __spin_lock_irq_is_small
+#define _spin_lock_irq(lock) __spin_lock_irq(lock)
+#else
 void __lockfunc _spin_lock_irq(spinlock_t *lock)	__acquires(lock);
+#endif
+
+#ifdef __read_lock_irq_is_small
+#define _read_lock_irq(lock) __read_lock_irq(lock)
+#else
 void __lockfunc _read_lock_irq(rwlock_t *lock)		__acquires(lock);
+#endif
+
+#ifdef __write_lock_irq_is_small
+#define _write_lock_irq(lock) __write_lock_irq(lock)
+#else
 void __lockfunc _write_lock_irq(rwlock_t *lock)		__acquires(lock);
+#endif
+
+#ifdef __spin_lock_irqsave_is_small
+#define _spin_lock_irqsave(lock) __spin_lock_irqsave(lock)
+#else
 unsigned long __lockfunc _spin_lock_irqsave(spinlock_t *lock)
 							__acquires(lock);
-unsigned long __lockfunc _spin_lock_irqsave_nested(spinlock_t *lock, int subclass)
-							__acquires(lock);
+#endif
+
+#ifdef __read_lock_irqsave_is_small
+#define _read_lock_irqsave(lock) __read_lock_irqsave(lock)
+#else
 unsigned long __lockfunc _read_lock_irqsave(rwlock_t *lock)
 							__acquires(lock);
+#endif
+
+#ifdef __write_lock_irqsave_is_small
+#define _write_lock_irqsave(lock) __write_lock_irqsave(lock)
+#else
 unsigned long __lockfunc _write_lock_irqsave(rwlock_t *lock)
 							__acquires(lock);
+#endif
+
+#ifdef __spin_trylock_is_small
+#define _spin_trylock(lock) __spin_trylock(lock)
+#else
 int __lockfunc _spin_trylock(spinlock_t *lock);
+#endif
+
+#ifdef __read_trylock_is_small
+#define _read_trylock(lock) __read_trylock(lock)
+#else
 int __lockfunc _read_trylock(rwlock_t *lock);
+#endif
+
+#ifdef __write_trylock_is_small
+#define _write_trylock(lock) __write_trylock(lock)
+#else
 int __lockfunc _write_trylock(rwlock_t *lock);
+#endif
+
+#ifdef __spin_trylock_bh_is_small
+#define _spin_trylock_bh(lock) __spin_trylock_bh(lock)
+#else
 int __lockfunc _spin_trylock_bh(spinlock_t *lock);
+#endif
+
+#ifdef __spin_unlock_is_small
+#define _spin_unlock(lock) __spin_unlock(lock)
+#else
 void __lockfunc _spin_unlock(spinlock_t *lock)		__releases(lock);
+#endif
+
+#ifdef __read_unlock_is_small
+#define _read_unlock(lock) __read_unlock(lock)
+#else
 void __lockfunc _read_unlock(rwlock_t *lock)		__releases(lock);
+#endif
+
+#ifdef __write_unlock_is_small
+#define _write_unlock(lock) __write_unlock(lock)
+#else
 void __lockfunc _write_unlock(rwlock_t *lock)		__releases(lock);
+#endif
+
+#ifdef __spin_unlock_bh_is_small
+#define _spin_unlock_bh(lock) __spin_unlock_bh(lock)
+#else
 void __lockfunc _spin_unlock_bh(spinlock_t *lock)	__releases(lock);
+#endif
+
+#ifdef __read_unlock_bh_is_small
+#define _read_unlock_bh(lock) __read_unlock_bh(lock)
+#else
 void __lockfunc _read_unlock_bh(rwlock_t *lock)		__releases(lock);
+#endif
+
+#ifdef __write_unlock_bh_is_small
+#define _write_unlock_bh(lock) __write_unlock_bh(lock)
+#else
 void __lockfunc _write_unlock_bh(rwlock_t *lock)	__releases(lock);
+#endif
+
+#ifdef __spin_unlock_irq_is_small
+#define _spin_unlock_irq(lock) __spin_unlock_irq(lock)
+#else
 void __lockfunc _spin_unlock_irq(spinlock_t *lock)	__releases(lock);
+#endif
+
+#ifdef __read_unlock_irq_is_small
+#define _read_unlock_irq(lock) __read_unlock_irq(lock)
+#else
 void __lockfunc _read_unlock_irq(rwlock_t *lock)	__releases(lock);
+#endif
+
+#ifdef __write_unlock_irq_is_small
+#define _write_unlock_irq(lock) __write_unlock_irq(lock)
+#else
 void __lockfunc _write_unlock_irq(rwlock_t *lock)	__releases(lock);
+#endif
+
+#ifdef __spin_unlock_irqrestore_is_small
+#define _spin_unlock_irqrestore(lock, flags) __spin_unlock_irqrestore(lock, flags)
+#else
 void __lockfunc _spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags)
 							__releases(lock);
+#endif
+
+#ifdef __read_unlock_irqrestore_is_small
+#define _read_unlock_irqrestore(lock, flags) __read_unlock_irqrestore(lock, flags)
+#else
 void __lockfunc _read_unlock_irqrestore(rwlock_t *lock, unsigned long flags)
 							__releases(lock);
+#endif
+
+#ifdef __write_unlock_irqrestore_is_small
+#define _write_unlock_irqrestore(lock, flags) __write_unlock_irqrestore(lock, flags)
+#else
 void __lockfunc _write_unlock_irqrestore(rwlock_t *lock, unsigned long flags)
 							__releases(lock);
+#endif
 
 static inline int __spin_trylock(spinlock_t *lock)
 {
Index: linux-2.6/kernel/spinlock.c
===================================================================
--- linux-2.6.orig/kernel/spinlock.c
+++ linux-2.6/kernel/spinlock.c
@@ -21,23 +21,29 @@
 #include <linux/debug_locks.h>
 #include <linux/module.h>
 
+#ifndef __spin_trylock_is_small
 int __lockfunc _spin_trylock(spinlock_t *lock)
 {
 	return __spin_trylock(lock);
 }
 EXPORT_SYMBOL(_spin_trylock);
+#endif
 
+#ifndef __read_trylock_is_small
 int __lockfunc _read_trylock(rwlock_t *lock)
 {
 	return __read_trylock(lock);
 }
 EXPORT_SYMBOL(_read_trylock);
+#endif
 
+#ifndef __write_trylock_is_small
 int __lockfunc _write_trylock(rwlock_t *lock)
 {
 	return __write_trylock(lock);
 }
 EXPORT_SYMBOL(_write_trylock);
+#endif
 
 /*
  * If lockdep is enabled then we use the non-preemption spin-ops
@@ -46,77 +52,101 @@ EXPORT_SYMBOL(_write_trylock);
  */
 #if !defined(CONFIG_GENERIC_LOCKBREAK) || defined(CONFIG_DEBUG_LOCK_ALLOC)
 
+#ifndef __read_lock_is_small
 void __lockfunc _read_lock(rwlock_t *lock)
 {
 	__read_lock(lock);
 }
 EXPORT_SYMBOL(_read_lock);
+#endif
 
+#ifndef __spin_lock_irqsave_is_small
 unsigned long __lockfunc _spin_lock_irqsave(spinlock_t *lock)
 {
 	return __spin_lock_irqsave(lock);
 }
 EXPORT_SYMBOL(_spin_lock_irqsave);
+#endif
 
+#ifndef __spin_lock_irq_is_small
 void __lockfunc _spin_lock_irq(spinlock_t *lock)
 {
 	__spin_lock_irq(lock);
 }
 EXPORT_SYMBOL(_spin_lock_irq);
+#endif
 
+#ifndef __spin_lock_bh_is_small
 void __lockfunc _spin_lock_bh(spinlock_t *lock)
 {
 	__spin_lock_bh(lock);
 }
 EXPORT_SYMBOL(_spin_lock_bh);
+#endif
 
+#ifndef __read_lock_irqsave_is_small
 unsigned long __lockfunc _read_lock_irqsave(rwlock_t *lock)
 {
 	return __read_lock_irqsave(lock);
 }
 EXPORT_SYMBOL(_read_lock_irqsave);
+#endif
 
+#ifndef __read_lock_irq_is_small
 void __lockfunc _read_lock_irq(rwlock_t *lock)
 {
 	__read_lock_irq(lock);
 }
 EXPORT_SYMBOL(_read_lock_irq);
+#endif
 
+#ifndef __read_lock_bh_is_small
 void __lockfunc _read_lock_bh(rwlock_t *lock)
 {
 	__read_lock_bh(lock);
 }
 EXPORT_SYMBOL(_read_lock_bh);
+#endif
 
+#ifndef __write_lock_irqsave_is_small
 unsigned long __lockfunc _write_lock_irqsave(rwlock_t *lock)
 {
 	return __write_lock_irqsave(lock);
 }
 EXPORT_SYMBOL(_write_lock_irqsave);
+#endif
 
+#ifndef __write_lock_irq_is_small
 void __lockfunc _write_lock_irq(rwlock_t *lock)
 {
 	__write_lock_irq(lock);
 }
 EXPORT_SYMBOL(_write_lock_irq);
+#endif
 
+#ifndef __write_lock_bh_is_small
 void __lockfunc _write_lock_bh(rwlock_t *lock)
 {
 	__write_lock_bh(lock);
 }
 EXPORT_SYMBOL(_write_lock_bh);
+#endif
 
+#ifndef __spin_lock_is_small
 void __lockfunc _spin_lock(spinlock_t *lock)
 {
 	__spin_lock(lock);
 }
 EXPORT_SYMBOL(_spin_lock);
+#endif
 
+#ifndef __write_lock_is_small
 void __lockfunc _write_lock(rwlock_t *lock)
 {
 	__write_lock(lock);
 }
 EXPORT_SYMBOL(_write_lock);
+#endif
 
 #else /* CONFIG_PREEMPT: */
 
@@ -242,83 +272,109 @@ EXPORT_SYMBOL(_spin_lock_nest_lock);
 
 #endif
 
+#ifndef __spin_unlock_is_small
 void __lockfunc _spin_unlock(spinlock_t *lock)
 {
 	__spin_unlock(lock);
 }
 EXPORT_SYMBOL(_spin_unlock);
+#endif
 
+#ifndef __write_unlock_is_small
 void __lockfunc _write_unlock(rwlock_t *lock)
 {
 	__write_unlock(lock);
 }
 EXPORT_SYMBOL(_write_unlock);
+#endif
 
+#ifndef __read_unlock_is_small
 void __lockfunc _read_unlock(rwlock_t *lock)
 {
 	__read_unlock(lock);
 }
 EXPORT_SYMBOL(_read_unlock);
+#endif
 
+#ifndef __spin_unlock_irqrestore_is_small
 void __lockfunc _spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags)
 {
 	__spin_unlock_irqrestore(lock, flags);
 }
 EXPORT_SYMBOL(_spin_unlock_irqrestore);
+#endif
 
+#ifndef __spin_unlock_irq_is_small
 void __lockfunc _spin_unlock_irq(spinlock_t *lock)
 {
 	__spin_unlock_irq(lock);
 }
 EXPORT_SYMBOL(_spin_unlock_irq);
+#endif
 
+#ifndef __spin_unlock_bh_is_small
 void __lockfunc _spin_unlock_bh(spinlock_t *lock)
 {
 	__spin_unlock_bh(lock);
 }
 EXPORT_SYMBOL(_spin_unlock_bh);
+#endif
 
+#ifndef __read_unlock_irqrestore_is_small
 void __lockfunc _read_unlock_irqrestore(rwlock_t *lock, unsigned long flags)
 {
 	__read_unlock_irqrestore(lock, flags);
 }
 EXPORT_SYMBOL(_read_unlock_irqrestore);
+#endif
 
+#ifndef __read_unlock_irq_is_small
 void __lockfunc _read_unlock_irq(rwlock_t *lock)
 {
 	__read_unlock_irq(lock);
 }
 EXPORT_SYMBOL(_read_unlock_irq);
+#endif
 
+#ifndef __read_unlock_bh_is_small
 void __lockfunc _read_unlock_bh(rwlock_t *lock)
 {
 	__read_unlock_bh(lock);
 }
 EXPORT_SYMBOL(_read_unlock_bh);
+#endif
 
+#ifndef __write_unlock_irqrestore_is_small
 void __lockfunc _write_unlock_irqrestore(rwlock_t *lock, unsigned long flags)
 {
 	__write_unlock_irqrestore(lock, flags);
 }
 EXPORT_SYMBOL(_write_unlock_irqrestore);
+#endif
 
+#ifndef __write_unlock_irq_is_small
 void __lockfunc _write_unlock_irq(rwlock_t *lock)
 {
 	__write_unlock_irq(lock);
 }
 EXPORT_SYMBOL(_write_unlock_irq);
+#endif
 
+#ifndef __write_unlock_bh_is_small
 void __lockfunc _write_unlock_bh(rwlock_t *lock)
 {
 	__write_unlock_bh(lock);
 }
 EXPORT_SYMBOL(_write_unlock_bh);
+#endif
 
+#ifndef __spin_trylock_bh_is_small
 int __lockfunc _spin_trylock_bh(spinlock_t *lock)
 {
 	return __spin_trylock_bh(lock);
 }
 EXPORT_SYMBOL(_spin_trylock_bh);
+#endif
 
 notrace int in_lock_functions(unsigned long addr)
 {

-- 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 0/3] Allow inlined spinlocks again V3
  2009-08-14 12:34   ` Heiko Carstens
@ 2009-08-14 16:04     ` Linus Torvalds
  2009-08-14 17:13       ` Heiko Carstens
  0 siblings, 1 reply; 34+ messages in thread
From: Linus Torvalds @ 2009-08-14 16:04 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Andrew Morton, Peter Zijlstra, Ingo Molnar, linux-arch,
	Martin Schwidefsky, Arnd Bergmann, Horst Hartmann,
	Christian Ehrhardt, David S. Miller



On Fri, 14 Aug 2009, Heiko Carstens wrote:
> 
> All of this started when we compared a 2.6.27 based distro kernel
> to a 2.6.16 based one. It turned out that 2.6.27 burns a lot more cpu
> cycles than 2.6.16 does for the same workloads.
> 
> As an example: we took a simple one connection ping pong network load.
> (client sends packet to server and waits until packet from server
>  returns and game starts again). It uses more cpu cycles on 2.6.27.
> 
> Using ftrace we figured out that on 2.6.27 it takes more than 500 function
> calls on the client for a single roundtrip while on 2.6.16 it took only
> ~230 calls.

Hmm. But the spinlock part of this seems to not really have changed.

Removing everything but the actual callchain info and then doing a "diff" 
between your two roundtrips, they look very different, and basically 
none of the difference seems to be due to spinlocks.

It seems to be due to a number of different things, but the bulk of the 
new costs seem to be in networking (and to some degree scheduling). There 
are smaller differences elsewhere, but the networking code _really_ blows 
up.

I don't know how much of this comes from "real" kernel changes, and how 
much of it is less inlining, but spinlocks don't seem to be the issue. 
Here's a quick walk-through (some of these get repeated a couple of times 
in your traces)

 * Apparmor going through the generic security code and rw_verify_area:

	apparmor_file_permission <-sys_write

   becomes

	security_file_permission <-rw_verify_area
	apparmor_file_permission <-security_file_permission
	aa_file_permission <-apparmor_file_permission

 * the vfs write itself being deeper

	sock_aio_write <-sys_write

   becomes

	vfs_write <-SyS_write
	rw_verify_area <-vfs_write
	do_sync_write <-vfs_write
	sock_aio_write <-do_sync_write

 * the socket code does totally new apparmor security things that were not 
   in 2.6.16 at all:

	security_socket_sendmsg <-sock_aio_write
	apparmor_socket_sendmsg <-security_socket_sendmsg
	aa_revalidate_sk <-apparmor_socket_sendmsg

 * TCP doing more:

	__alloc_skb <-tcp_sendmsg
	..
	..
	ip_output <-ip_queue_xmit
	dev_queue_xmit <-ip_output

   becomes

	tcp_established_options <-tcp_current_mss
	sk_stream_alloc_skb <-tcp_sendmsg
	__alloc_skb <-sk_stream_alloc_skb
	..
	tcp_init_tso_segs <-__tcp_push_pending_frames
	tcp_set_skb_tso_segs <-tcp_init_tso_segs
	..
	__skb_clone <-skb_clone
	__copy_skb_header <-__skb_clone
	tcp_established_options <-tcp_transmit_skb
	skb_push <-tcp_transmit_skb
	__tcp_select_window <-tcp_transmit_skb
	tcp_options_write <-tcp_transmit_skb
	..
	__sk_dst_check <-ip_queue_xmit
	skb_push <-ip_queue_xmit
	ip_local_out <-ip_queue_xmit
	__ip_local_out <-ip_local_out
	ip_output <-ip_local_out
	ip_finish_output <-ip_output
	skb_push <-ip_finish_output
	dev_queue_xmit <-ip_finish_output
	local_bh_disable <-dev_queue_xmit

 * More TCP changes at tcp_ack time:

	kfree <-__kfree_skb
	kmem_cache_free <-__kfree_skb
	tcp_reno_cong_avoid <-tcp_ack
	tcp_current_mss <-tcp_rcv_established
	__tcp_push_pending_frames <-tcp_rcv_established
	tcp_send_delayed_ack <-tcp_rcv_established

   becomes

	skb_release_all <-__kfree_skb
	dst_release <-skb_release_all
	skb_release_data <-skb_release_all
	kfree <-skb_release_data
	kmem_cache_free <-__kfree_skb
	tcp_rtt_estimator <-tcp_ack
	bictcp_acked <-tcp_ack
	bictcp_cong_avoid <-tcp_ack
	tcp_is_cwnd_limited <-bictcp_cong_avoid
	tcp_current_mss <-tcp_rcv_established
	tcp_established_options <-tcp_current_mss
	__tcp_push_pending_frames <-tcp_rcv_established
	__tcp_ack_snd_check <-tcp_rcv_established
	tcp_send_delayed_ack <-__tcp_ack_snd_check

 * The network driver doing more:

	qeth_hard_start_xmit <-dev_hard_start_xmit
	qeth_fill_header <-qeth_hard_start_xmit
	memset <-qeth_fill_header
	memset <-qeth_fill_header
	skb_queue_tail <-qeth_hard_start_xmit
	..
	mod_timer <-do_QDIO
	_spin_lock_irqsave <-mod_timer
	_spin_unlock_irqrestore <-mod_timer
	raise_softirq_irqoff <-qeth_hard_start_xmit
	..

   becomes

	qeth_l3_hard_start_xmit <-dev_hard_start_xmit
	qeth_get_cast_type <-qeth_l3_hard_start_xmit
	qeth_get_priority_queue <-qeth_l3_hard_start_xmit
	skb_realloc_headroom <-qeth_l3_hard_start_xmit
	pskb_copy <-skb_realloc_headroom
	__alloc_skb <-pskb_copy
	kmem_cache_alloc <-__alloc_skb
	__kmalloc <-__alloc_skb
	skb_put <-pskb_copy
	copy_skb_header <-pskb_copy
	__copy_skb_header <-copy_skb_header
	skb_pull <-qeth_l3_hard_start_xmit
	skb_push <-qeth_l3_hard_start_xmit
	qeth_l3_fill_header <-qeth_l3_hard_start_xmit
	qeth_get_elements_no <-qeth_l3_hard_start_xmit
	qeth_do_send_packet <-qeth_l3_hard_start_xmit
	skb_queue_tail <-qeth_do_send_packet
	..
	qdio_perf_stat_inc <-do_QDIO
	qdio_kick_outbound_q <-do_QDIO
	qdio_perf_stat_inc <-qdio_kick_outbound_q
	__tasklet_schedule <-do_QDIO
	dev_kfree_skb_any <-qeth_l3_hard_start_xmit
	kfree_skb <-dev_kfree_skb_any
	__kfree_skb <-kfree_skb
	skb_release_all <-__kfree_skb
	dst_release <-skb_release_all
	sock_wfree <-skb_release_all
	skb_release_data <-skb_release_all
	__netif_schedule <-qeth_l3_hard_start_xmit
	raise_softirq_irqoff <-__netif_schedule
	..
	tasklet_action <-__do_softirq
	qdio_outbound_processing <-tasklet_action
	qdio_perf_stat_inc <-qdio_outbound_processing
	mod_timer <-qdio_outbound_processing

 * the timer code doing more:

	__timer_stats_timer_set_start_info <-mod_timer
	__mod_timer <-mod_timer
	__timer_stats_timer_set_start_info <-__mod_timer
	lock_timer_base <-__mod_timer
	_spin_lock_irqsave <-lock_timer_base
	internal_add_timer <-__mod_timer
	_spin_unlock_irqrestore <-__mod_timer
	qdio_perf_stat_inc <-qdio_outbound_processing
	account_system_vtime <-__do_softirq

 * the scheduler code doing way deeper chains:

	profile_hit <-schedule
	sched_clock <-schedule
	_spin_lock_irq <-schedule
	account_vtime <-schedule
	account_user_time <-account_vtime

   becomes

	_spin_lock <-schedule
	sched_clock_cpu <-schedule
	sched_clock <-sched_clock_cpu
	deactivate_task <-schedule
	dequeue_task <-deactivate_task
	dequeue_task_fair <-dequeue_task
	dequeue_entity <-dequeue_task_fair
	update_curr <-dequeue_entity
	hrtick_start_fair <-dequeue_task_fair
	put_prev_task_fair <-schedule
	pick_next_task_fair <-schedule
	pick_next_task_rt <-schedule
	pick_next_task_fair <-schedule
	pick_next_task_idle <-schedule
	_spin_trylock <-tracing_record_cmdline
	finish_task_switch <-schedule
	account_vtime <-finish_task_switch
	account_user_time <-account_vtime
	acct_update_integrals <-account_user_time
	jiffies_to_timeval <-acct_update_integrals

Hmm. Not pretty.

		Linus

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 0/3] Allow inlined spinlocks again V3
  2009-08-14 16:04     ` Linus Torvalds
@ 2009-08-14 17:13       ` Heiko Carstens
  2009-08-14 18:08         ` Linus Torvalds
  0 siblings, 1 reply; 34+ messages in thread
From: Heiko Carstens @ 2009-08-14 17:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Peter Zijlstra, Ingo Molnar, linux-arch,
	Martin Schwidefsky, Arnd Bergmann, Horst Hartmann,
	Christian Ehrhardt, David S. Miller

On Fri, Aug 14, 2009 at 09:04:49AM -0700, Linus Torvalds wrote:
> On Fri, 14 Aug 2009, Heiko Carstens wrote:
> > 
> > All of this started when we compared a 2.6.27 based distro kernel
> > to a 2.6.16 based one. It turned out that 2.6.27 burns a lot more cpu
> > cycles than 2.6.16 does for the same workloads.
> > 
> > As an example: we took a simple one connection ping pong network load.
> > (client sends packet to server and waits until packet from server
> >  returns and game starts again). It uses more cpu cycles on 2.6.27.
> > 
> > Using ftrace we figured out that on 2.6.27 it takes more than 500 function
> > calls on the client for a single roundtrip while on 2.6.16 it took only
> > ~230 calls.
> 
> Hmm. But the spinlock part of this seems to not really have changed.
> 
> Removing everything but the actual callchain info and then doing a "diff" 
> between your two roundtrips, they look very different, and basically 
> none of the difference seems to be due to spinlocks.
> 
> It seems to be due to a number of different things, but the bulk of the 
> new costs seem to be in networking (and to some degree scheduling). There 
> are smaller differences elsewhere, but the networking code _really_ blows 
> up.
> 
> I don't know how much of this comes from "real" kernel changes, and how 
> much of it is less inlining, but spinlocks don't seem to be the issue. 
> Here's a quick walk-through (some of these get repeated a couple of times 
> in your traces)

That's true. I just wanted to give some background on how all of this started.
Spinlocks are equally good or bad for s390 on both kernel versions.

However quite a few of the additional function calls in networking code
come from uninlining:

c2aa270a [NET]: uninline skb_push, de-bloats a lot
6be8ac2f [NET]: uninline skb_pull, de-bloats a lot
419ae74e [NET]: uninline skb_trim, de-bloats

Christian Ehrhardt identified these and more, but hasn't posted patches
yet. Reverting these patches does increase performance.

That's when I started wondering what we would gain if we inlined
spinlock code again.

>  * the timer code doing more:
> 
> 	__timer_stats_timer_set_start_info <-mod_timer
> 	__mod_timer <-mod_timer
> 	__timer_stats_timer_set_start_info <-__mod_timer
> 	lock_timer_base <-__mod_timer
> 	_spin_lock_irqsave <-lock_timer_base
> 	internal_add_timer <-__mod_timer
> 	_spin_unlock_irqrestore <-__mod_timer
> 	qdio_perf_stat_inc <-qdio_outbound_processing
> 	account_system_vtime <-__do_softirq

This should be better now:

507e1231 timer stats: Optimize by adding quick check to avoid function calls

We also addressed our device driver with several patches.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 0/3] Allow inlined spinlocks again V3
  2009-08-14 17:13       ` Heiko Carstens
@ 2009-08-14 18:08         ` Linus Torvalds
  2009-08-14 20:19           ` David Miller
  0 siblings, 1 reply; 34+ messages in thread
From: Linus Torvalds @ 2009-08-14 18:08 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Andrew Morton, Peter Zijlstra, Ingo Molnar, linux-arch,
	Martin Schwidefsky, Arnd Bergmann, Horst Hartmann,
	Christian Ehrhardt, David S. Miller



On Fri, 14 Aug 2009, Heiko Carstens wrote:
> 
> However quite a few of the additional function calls in networking code
> come from uninlining:
> 
> c2aa270a [NET]: uninline skb_push, de-bloats a lot
> 6be8ac2f [NET]: uninline skb_pull, de-bloats a lot
> 419ae74e [NET]: uninline skb_trim, de-bloats

Hmm. Those functions are big only because of the stupid debugging, which 
is almost certainly not worth it any more. I doubt people have seen enough 
skb under/over-flows in the last few years to merit the code.

I suspect we should remove the silly skb_under/over_panic functions, and 
do

 - skb_put:
	-       if (unlikely(skb->tail > skb->end))
	-               skb_over_panic(skb, len, __builtin_return_address(0));
	+       BUG_ON(skb->tail > skb->end);

 - skb_push:
	-       if (unlikely(skb->data<skb->head))
	-               skb_under_panic(skb, len, __builtin_return_address(0));
	+       BUG_ON(skb->data < skb->head);

at which point it might be worthwhile to inline them again, because the 
footprint of a BUG_ON() is really fairly small.

If this is a big deal on s390, it might be worth it.

			Linus

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 0/3] Allow inlined spinlocks again V3
  2009-08-14 18:08         ` Linus Torvalds
@ 2009-08-14 20:19           ` David Miller
  2009-08-14 20:45             ` Linus Torvalds
  0 siblings, 1 reply; 34+ messages in thread
From: David Miller @ 2009-08-14 20:19 UTC (permalink / raw)
  To: torvalds
  Cc: heiko.carstens, akpm, a.p.zijlstra, mingo, linux-arch,
	schwidefsky, arnd, horsth, ehrhardt

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Fri, 14 Aug 2009 11:08:16 -0700 (PDT)

> On Fri, 14 Aug 2009, Heiko Carstens wrote:
>> 
>> However quite a few of the additional function calls in networking code
>> come from uninlining:
>> 
>> c2aa270a [NET]: uninline skb_push, de-bloats a lot
>> 6be8ac2f [NET]: uninline skb_pull, de-bloats a lot
>> 419ae74e [NET]: uninline skb_trim, de-bloats
> 
> Hmm. Those functions are big only because of the stupid debugging, which 
> is almost certainly not worth it any more. I doubt people have seen enough 
> skb under/over-flows in the last few years to merit the code.

We have found a few bugs in the past year or two because of
the check, but indeed not "a lot".

> I suspect we should remove the silly skb_under/over_panic functions, and 
> do
> 
>  - skb_put:
> 	-       if (unlikely(skb->tail > skb->end))
> 	-               skb_over_panic(skb, len, __builtin_return_address(0));
> 	+       BUG_ON(skb->tail > skb->end);
> 
>  - skb_push:
> 	-       if (unlikely(skb->data<skb->head))
> 	-               skb_under_panic(skb, len, __builtin_return_address(0));
> 	+       BUG_ON(skb->data < skb->head);
> 
> at which point it might be worthwhile to inline them again, because the 
> footprint of a BUG_ON() is really fairly small.

Maybe.  Although most people turn on verbose BUGs, and that
expands to as much code as what's there now in these SKB
inlines, if not more.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 0/3] Allow inlined spinlocks again V3
  2009-08-14 20:19           ` David Miller
@ 2009-08-14 20:45             ` Linus Torvalds
  2009-08-14 21:10               ` Linus Torvalds
  0 siblings, 1 reply; 34+ messages in thread
From: Linus Torvalds @ 2009-08-14 20:45 UTC (permalink / raw)
  To: David Miller
  Cc: heiko.carstens, akpm, a.p.zijlstra, mingo, linux-arch,
	schwidefsky, arnd, horsth, ehrhardt



On Fri, 14 Aug 2009, David Miller wrote:
> 
> Maybe.  Although most people turn on verbose BUGs, and that
> expands to as much code as what's there now in these SKB
> inlines, if not more.

But not inline in the code, though. So yeah, it has a memory footprint, 
but shouldn't have a cache footprint.

			Linus

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 0/3] Allow inlined spinlocks again V3
  2009-08-14 20:45             ` Linus Torvalds
@ 2009-08-14 21:10               ` Linus Torvalds
  2009-08-14 22:23                 ` David Miller
  2009-08-16 18:27                 ` Heiko Carstens
  0 siblings, 2 replies; 34+ messages in thread
From: Linus Torvalds @ 2009-08-14 21:10 UTC (permalink / raw)
  To: David Miller
  Cc: heiko.carstens, akpm, a.p.zijlstra, mingo, linux-arch,
	schwidefsky, arnd, horsth, ehrhardt



On Fri, 14 Aug 2009, Linus Torvalds wrote:
> 
> But not inline in the code, though. So yeah, it has a memory footprint, 
> but shouldn't have a cache footprint.

An example of this: stack usage. This is the code with BUG_ON():

	skb_push:
		push   %rbp
		mov    %esi,%eax
		add    %esi,0x68(%rdi)
		neg    %rax
		mov    %rsp,%rbp
		add    0xd8(%rdi),%rax
		mov    %rax,0xd8(%rdi)
		cmp    0xd0(%rdi),%rax
		jae    2f
		ud2a   
	1:	jmp    1b
	2:	leaveq 
		retq   

and note how in the code, we're just jumping over something like four
bytes ("ud2" plus that silly endless loop-jump just to make gcc happy
are both 2 bytes on x86[-64]). 

Here's the same code with that horrid skb_under_panic():

	skb_push:
		push   %rbp
		mov    %esi,%eax
		mov    %rsp,%rbp
		neg    %rax
		push   %rbx
		mov    %rdi,%rbx
		sub    $0x8,%rsp
		add    0xd8(%rdi),%rax
		add    %esi,0x68(%rdi)
		mov    %rax,0xd8(%rdi)
		cmp    0xd0(%rdi),%rax
		jae    1f
		mov    0x8(%rbp),%rdx
		callq  skb_under_panic
	1:	mov    0xd8(%rbx),%rax
		pop    %r11
		pop    %rbx
		leaveq 
		retq   

and notice how it creates a stack frame due to the call (even when the
call is never done!) and the dead space is now 11 bytes. 

(Total size of functions: 41 bytes vs 64 bytes)

So things like that do make a difference.  It's surprisingly expensive
to add a function call to what otherwise could be a leaf function.  And
BUG_ON() avoids that. 

Of course, even at just 41 bytes, it's not _entirely_ obvious that 
inlining is the right thing.  Of those 41 bytes, four are just function 
overhead, and the "call" is five bytes in the caller and a few bytes of 
spills and argument setups, but inlining is still probably going to change 
each call-site from ~10 bytes to ~35 bytes. 
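
In other words, per call site, and reusing the numbers above (roughly):

	out-of-line:  5 bytes (callq) + ~5 bytes of spills/argument setup
	              ~= 10 bytes
	inlined:     41 bytes (whole body) - ~6 bytes of prologue/epilogue
	              ~= 35 bytes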

		Linus

---
 include/linux/skbuff.h |    4 ---
 net/core/skbuff.c      |   51 +----------------------------------------------
 2 files changed, 2 insertions(+), 53 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f2c69a2..2af92b4 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -483,10 +483,6 @@ extern int	       skb_cow_data(struct sk_buff *skb, int tailbits,
 extern int	       skb_pad(struct sk_buff *skb, int pad);
 #define dev_kfree_skb(a)	consume_skb(a)
 #define dev_consume_skb(a)	kfree_skb_clean(a)
-extern void	      skb_over_panic(struct sk_buff *skb, int len,
-				     void *here);
-extern void	      skb_under_panic(struct sk_buff *skb, int len,
-				      void *here);
 
 extern int skb_append_datato_frags(struct sock *sk, struct sk_buff *skb,
 			int getfrag(void *from, char *to, int offset,
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 9e0597d..ae63473 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -103,51 +103,6 @@ static struct pipe_buf_operations sock_pipe_buf_ops = {
 	.get = sock_pipe_buf_get,
 };
 
-/*
- *	Keep out-of-line to prevent kernel bloat.
- *	__builtin_return_address is not used because it is not always
- *	reliable.
- */
-
-/**
- *	skb_over_panic	- 	private function
- *	@skb: buffer
- *	@sz: size
- *	@here: address
- *
- *	Out of line support code for skb_put(). Not user callable.
- */
-void skb_over_panic(struct sk_buff *skb, int sz, void *here)
-{
-	printk(KERN_EMERG "skb_over_panic: text:%p len:%d put:%d head:%p "
-			  "data:%p tail:%#lx end:%#lx dev:%s\n",
-	       here, skb->len, sz, skb->head, skb->data,
-	       (unsigned long)skb->tail, (unsigned long)skb->end,
-	       skb->dev ? skb->dev->name : "<NULL>");
-	BUG();
-}
-EXPORT_SYMBOL(skb_over_panic);
-
-/**
- *	skb_under_panic	- 	private function
- *	@skb: buffer
- *	@sz: size
- *	@here: address
- *
- *	Out of line support code for skb_push(). Not user callable.
- */
-
-void skb_under_panic(struct sk_buff *skb, int sz, void *here)
-{
-	printk(KERN_EMERG "skb_under_panic: text:%p len:%d put:%d head:%p "
-			  "data:%p tail:%#lx end:%#lx dev:%s\n",
-	       here, skb->len, sz, skb->head, skb->data,
-	       (unsigned long)skb->tail, (unsigned long)skb->end,
-	       skb->dev ? skb->dev->name : "<NULL>");
-	BUG();
-}
-EXPORT_SYMBOL(skb_under_panic);
-
 /* 	Allocate a new skbuff. We do this ourselves so we can fill in a few
  *	'private' fields and also do memory statistics to find all the
  *	[BEEP] leaks.
@@ -1014,8 +969,7 @@ unsigned char *skb_put(struct sk_buff *skb, unsigned int len)
 	SKB_LINEAR_ASSERT(skb);
 	skb->tail += len;
 	skb->len  += len;
-	if (unlikely(skb->tail > skb->end))
-		skb_over_panic(skb, len, __builtin_return_address(0));
+	BUG_ON(skb->tail > skb->end);
 	return tmp;
 }
 EXPORT_SYMBOL(skb_put);
@@ -1033,8 +987,7 @@ unsigned char *skb_push(struct sk_buff *skb, unsigned int len)
 {
 	skb->data -= len;
 	skb->len  += len;
-	if (unlikely(skb->data<skb->head))
-		skb_under_panic(skb, len, __builtin_return_address(0));
+	BUG_ON(skb->data < skb->head);
 	return skb->data;
 }
 EXPORT_SYMBOL(skb_push);

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [patch 0/3] Allow inlined spinlocks again V3
  2009-08-14 21:10               ` Linus Torvalds
@ 2009-08-14 22:23                 ` David Miller
  2009-08-16 18:27                 ` Heiko Carstens
  1 sibling, 0 replies; 34+ messages in thread
From: David Miller @ 2009-08-14 22:23 UTC (permalink / raw)
  To: torvalds
  Cc: heiko.carstens, akpm, a.p.zijlstra, mingo, linux-arch,
	schwidefsky, arnd, horsth, ehrhardt

From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Fri, 14 Aug 2009 14:10:45 -0700 (PDT)

> On Fri, 14 Aug 2009, Linus Torvalds wrote:
>> 
>> But not inline in the code, though. So yeah, it has a memory footprint, 
>> but shouldn't have a cache footprint.
> 
> An example of this: stack usage. This is the code with BUG_ON():
 ...
> and note how in the code, we're just jumping over something like four
> bytes ("ud2" plus that silly endless loop-jump just to make gcc happy
> are both 2 bytes on x86[-64]). 
> 
> Here's the same code with that horrid skb_under_panic():

Yes, on x86 it's better even with verbose debugging, because
of the inlining of the debug message pointer etc.

I'll have to do something more intelligent than I currently do on
sparc64, and I had always been meaning to do that. :-)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 2/3] spinlock: allow inlined spinlocks
  2009-08-14 12:58 ` [patch 2/3] spinlock: allow inlined spinlocks Heiko Carstens
@ 2009-08-16 17:57   ` Heiko Carstens
  2009-08-16 18:06     ` Ingo Molnar
  2009-08-16 18:22     ` Linus Torvalds
  0 siblings, 2 replies; 34+ messages in thread
From: Heiko Carstens @ 2009-08-16 17:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, Peter Zijlstra, Ingo Molnar, linux-arch,
	Martin Schwidefsky, Arnd Bergmann, Horst Hartmann,
	Christian Ehrhardt, Nick Piggin

Hi Linus, Andrew,

do you have any objections to the approach below?
Just wondering, since I didn't receive any comments.

On Fri, Aug 14, 2009 at 02:58:03PM +0200, Heiko Carstens wrote:
> From: Heiko Carstens <heiko.carstens@de.ibm.com>
> 
> This allows an architecture to specify per lock variant if the
> locking code should be kept out-of-line or inlined.
> 
> If an architecure wants out-of-line locking code no change is
> needed. To force inlining of e.g. spin_lock() the line
> 
> #define __spin_lock_is_small
> 
> needs to be added to arch/<whatever>/include/asm/spinlock.h
> 
> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
> ---
>  include/linux/spinlock_api_smp.h |  146 ++++++++++++++++++++++++++++++++++++++-
>  kernel/spinlock.c                |   56 ++++++++++++++
>  2 files changed, 199 insertions(+), 3 deletions(-)
> 
> Index: linux-2.6/include/linux/spinlock_api_smp.h
> ===================================================================
> --- linux-2.6.orig/include/linux/spinlock_api_smp.h
> +++ linux-2.6/include/linux/spinlock_api_smp.h
> @@ -19,46 +19,186 @@ int in_lock_functions(unsigned long addr
> 
>  #define assert_spin_locked(x)	BUG_ON(!spin_is_locked(x))
> 
> -void __lockfunc _spin_lock(spinlock_t *lock)		__acquires(lock);
>  void __lockfunc _spin_lock_nested(spinlock_t *lock, int subclass)
>  							__acquires(lock);
>  void __lockfunc _spin_lock_nest_lock(spinlock_t *lock, struct lockdep_map *map)
>  							__acquires(lock);
> +unsigned long __lockfunc _spin_lock_irqsave_nested(spinlock_t *lock, int subclass)
> +							__acquires(lock);
> +
> +#ifdef __spin_lock_is_small
> +#define _spin_lock(lock) __spin_lock(lock)
> +#else
> +void __lockfunc _spin_lock(spinlock_t *lock)		__acquires(lock);
> +#endif

...

> Index: linux-2.6/kernel/spinlock.c
> ===================================================================
> --- linux-2.6.orig/kernel/spinlock.c
> +++ linux-2.6/kernel/spinlock.c

...

> +#ifndef __spin_lock_is_small
>  void __lockfunc _spin_lock(spinlock_t *lock)
>  {
>  	__spin_lock(lock);
>  }
>  EXPORT_SYMBOL(_spin_lock);
> +#endif

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 2/3] spinlock: allow inlined spinlocks
  2009-08-16 17:57   ` Heiko Carstens
@ 2009-08-16 18:06     ` Ingo Molnar
  2009-08-16 18:43       ` Linus Torvalds
  2009-08-16 18:44       ` Heiko Carstens
  2009-08-16 18:22     ` Linus Torvalds
  1 sibling, 2 replies; 34+ messages in thread
From: Ingo Molnar @ 2009-08-16 18:06 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Andrew Morton, Linus Torvalds, Peter Zijlstra, linux-arch,
	Martin Schwidefsky, Arnd Bergmann, Horst Hartmann,
	Christian Ehrhardt, Nick Piggin


* Heiko Carstens <heiko.carstens@de.ibm.com> wrote:

> > #define __spin_lock_is_small

> > +unsigned long __lockfunc _spin_lock_irqsave_nested(spinlock_t *lock, int subclass)
> > +							__acquires(lock);
> > +
> > +#ifdef __spin_lock_is_small
> > +#define _spin_lock(lock) __spin_lock(lock)
> > +#else
> > +void __lockfunc _spin_lock(spinlock_t *lock)		__acquires(lock);
> > +#endif

Dunno - i'm somewhat wary of introducing a 2^28 variability here. 
(although the number of real variations is much lower - but still).

What's the current situation on s390, precisely which of the 28 lock 
functions are a win to be inlined and which ones are a loss? Do you 
have a list/table perhaps?

	Ingo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 2/3] spinlock: allow inlined spinlocks
  2009-08-16 17:57   ` Heiko Carstens
  2009-08-16 18:06     ` Ingo Molnar
@ 2009-08-16 18:22     ` Linus Torvalds
  2009-08-17 15:46       ` Heiko Carstens
  1 sibling, 1 reply; 34+ messages in thread
From: Linus Torvalds @ 2009-08-16 18:22 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Andrew Morton, Peter Zijlstra, Ingo Molnar, linux-arch,
	Martin Schwidefsky, Arnd Bergmann, Horst Hartmann,
	Christian Ehrhardt, Nick Piggin



On Sun, 16 Aug 2009, Heiko Carstens wrote:
> 
> do you have any objections to the approach below?
> Just wondering, since I didn't receive any comments.

Other things going on, but here goes:

> > +#ifdef __spin_lock_is_small
> > +#define _spin_lock(lock) __spin_lock(lock)
> > +#else
> > +void __lockfunc _spin_lock(spinlock_t *lock)		__acquires(lock);
> > +#endif

The above repeats many times, and I found that irritating. I think you 
should be able to get rid of a _lot_ of that by doing it like this instead

	extern void __lockfunc _spin_lock(spinlock_t *lock) __acquires(lock);
	... plain declarations for all cases, no #ifdef's ...

and then at the end you just do

	#ifdef __spin_lock_is_small
	  #define _spin_lock(lock) __spin_lock(lock)         
	#endif
	..

and at least make that repeating section be composed of simpler and 
smaller entries.

Now, I actually have a disgusting way to avoid the #ifdef's entirely, but 
I'm not sure it's worth it. You'd do the same unconditional function 
declarations, but then we can also do some unconditional CPP expansions:

	#define __define2_lock(name,extent) name##extent
	#define __define_lock(name,extent) __define2_lock(name,extent)  
	#define define_lock(name) __define_lock(__##name, __##name##_is_small)

	// for each lock type
	#define __spin_lock__spin_lock_is_small _spin_lock  
	#define _spin_lock define_lock(spin_lock)

and what happens is that if '__spin_lock_is_small' is undefined, you end 
up having '_spin_lock()' expand to '__spin_lock__spin_lock_is_small()' and 
then back to _spin_lock, but if it's #defined (to empty), it gets defined 
to __spin_lock() instead.

No ifdefs, just two unconditional #define lines for each lock type.
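
Spelling the expansion out for _spin_lock - this is just a sketch of
what cpp does with the two #define lines above:

	/* Case 1: __spin_lock_is_small is not defined */
	_spin_lock
	-> define_lock(spin_lock)
	-> __define_lock(__spin_lock, __spin_lock_is_small)
	-> __define2_lock(__spin_lock, __spin_lock_is_small)
	-> __spin_lock__spin_lock_is_small
	-> _spin_lock	/* the macro currently being expanded is not
			   re-expanded, so this token ends up naming the
			   out-of-line function declared above */

	/* Case 2: "#define __spin_lock_is_small" (empty) is present */
	_spin_lock
	-> define_lock(spin_lock)
	-> __define_lock(__spin_lock, __spin_lock_is_small)
	-> __define2_lock(__spin_lock, )	/* arg expands to empty
						   here, before the ## in
						   __define2_lock */
	-> __spin_lock				/* the inline version */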

Ugly? Yes. Less ugly than having variations of that #ifdef repeated many 
times? I dunno.

			Linus

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 0/3] Allow inlined spinlocks again V3
  2009-08-14 21:10               ` Linus Torvalds
  2009-08-14 22:23                 ` David Miller
@ 2009-08-16 18:27                 ` Heiko Carstens
  2009-08-16 18:45                   ` Linus Torvalds
  2009-08-17 21:26                   ` [numbers] Re: [patch] more skb ops inlining Ingo Molnar
  1 sibling, 2 replies; 34+ messages in thread
From: Heiko Carstens @ 2009-08-16 18:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Miller, akpm, a.p.zijlstra, mingo, linux-arch, schwidefsky,
	arnd, horsth, ehrhardt

On Fri, Aug 14, 2009 at 02:10:45PM -0700, Linus Torvalds wrote:
> On Fri, 14 Aug 2009, Linus Torvalds wrote:
> > 
> > But not inline in the code, though. So yeah, it has a memory footprint, 
> > but shouldn't have a cache footprint.
> 
> An example of this: stack usage. This is the code with BUG_ON():
> 
> 	skb_push:
> 		push   %rbp
> 		mov    %esi,%eax
> 		add    %esi,0x68(%rdi)
> 		neg    %rax
> 		mov    %rsp,%rbp
> 		add    0xd8(%rdi),%rax
> 		mov    %rax,0xd8(%rdi)
> 		cmp    0xd0(%rdi),%rax
> 		jae    2f
> 		ud2a   
> 	1:	jmp    1b
> 	2:	leaveq 
> 		retq   

An allyesconfig (CONFIG_FTRACE disabled) on x86_64 with gcc 4.3.2 results in
the following vmlinux sizes:

574152755 2009-08-16 20:05 vmlinux-orig
574862000 2009-08-16 20:13 vmlinux-skb-inline

The skb-inline variant was created by applying your patch and, on top of it,
the patch below. Size increased by 692KB (0.12%).
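
In numbers, that is:

	574862000 - 574152755 = 709245 bytes  (~692KB)
	709245 / 574152755   ~= 0.12%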

Dunno.. is that acceptable?

---
 include/linux/skbuff.h |   35 +++++++++++++++++++++++++++++++++--
 net/core/skbuff.c      |   38 --------------------------------------
 2 files changed, 33 insertions(+), 40 deletions(-)

Index: linux-2.6/include/linux/skbuff.h
===================================================================
--- linux-2.6.orig/include/linux/skbuff.h
+++ linux-2.6/include/linux/skbuff.h
@@ -1116,7 +1116,6 @@ static inline void skb_set_tail_pointer(
 /*
  *	Add data to an sk_buff
  */
-extern unsigned char *skb_put(struct sk_buff *skb, unsigned int len);
 static inline unsigned char *__skb_put(struct sk_buff *skb, unsigned int len)
 {
 	unsigned char *tmp = skb_tail_pointer(skb);
@@ -1126,7 +1125,23 @@ static inline unsigned char *__skb_put(s
 	return tmp;
 }
 
-extern unsigned char *skb_push(struct sk_buff *skb, unsigned int len);
+/**
+ *	skb_put - add data to a buffer
+ *	@skb: buffer to use
+ *	@len: amount of data to add
+ *
+ *	This function extends the used data area of the buffer. If this would
+ *	exceed the total buffer size the kernel will panic. A pointer to the
+ *	first byte of the extra data is returned.
+ */
+static inline unsigned char *skb_put(struct sk_buff *skb, unsigned int len)
+{
+	unsigned char *tmp = __skb_put(skb, len);
+
+	BUG_ON(skb->tail > skb->end);
+	return tmp;
+}
+
 static inline unsigned char *__skb_push(struct sk_buff *skb, unsigned int len)
 {
 	skb->data -= len;
@@ -1134,6 +1149,22 @@ static inline unsigned char *__skb_push(
 	return skb->data;
 }
 
+/**
+ *	skb_push - add data to the start of a buffer
+ *	@skb: buffer to use
+ *	@len: amount of data to add
+ *
+ *	This function extends the used data area of the buffer at the buffer
+ *	start. If this would exceed the total buffer headroom the kernel will
+ *	panic. A pointer to the first byte of the extra data is returned.
+ */
+static inline unsigned char *skb_push(struct sk_buff *skb, unsigned int len)
+{
+	__skb_push(skb, len);
+	BUG_ON(skb->data < skb->head);
+	return skb->data;
+}
+
 extern unsigned char *skb_pull(struct sk_buff *skb, unsigned int len);
 static inline unsigned char *__skb_pull(struct sk_buff *skb, unsigned int len)
 {
Index: linux-2.6/net/core/skbuff.c
===================================================================
--- linux-2.6.orig/net/core/skbuff.c
+++ linux-2.6/net/core/skbuff.c
@@ -955,44 +955,6 @@ free_skb:
 EXPORT_SYMBOL(skb_pad);
 
 /**
- *	skb_put - add data to a buffer
- *	@skb: buffer to use
- *	@len: amount of data to add
- *
- *	This function extends the used data area of the buffer. If this would
- *	exceed the total buffer size the kernel will panic. A pointer to the
- *	first byte of the extra data is returned.
- */
-unsigned char *skb_put(struct sk_buff *skb, unsigned int len)
-{
-	unsigned char *tmp = skb_tail_pointer(skb);
-	SKB_LINEAR_ASSERT(skb);
-	skb->tail += len;
-	skb->len  += len;
-	BUG_ON(skb->tail > skb->end);
-	return tmp;
-}
-EXPORT_SYMBOL(skb_put);
-
-/**
- *	skb_push - add data to the start of a buffer
- *	@skb: buffer to use
- *	@len: amount of data to add
- *
- *	This function extends the used data area of the buffer at the buffer
- *	start. If this would exceed the total buffer headroom the kernel will
- *	panic. A pointer to the first byte of the extra data is returned.
- */
-unsigned char *skb_push(struct sk_buff *skb, unsigned int len)
-{
-	skb->data -= len;
-	skb->len  += len;
-	BUG_ON(skb->data < skb->head);
-	return skb->data;
-}
-EXPORT_SYMBOL(skb_push);
-
-/**
  *	skb_pull - remove data from the start of a buffer
  *	@skb: buffer to use
  *	@len: amount of data to remove

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 2/3] spinlock: allow inlined spinlocks
  2009-08-16 18:06     ` Ingo Molnar
@ 2009-08-16 18:43       ` Linus Torvalds
  2009-08-16 20:24         ` Ingo Molnar
  2009-08-16 18:44       ` Heiko Carstens
  1 sibling, 1 reply; 34+ messages in thread
From: Linus Torvalds @ 2009-08-16 18:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Heiko Carstens, Andrew Morton, Peter Zijlstra, linux-arch,
	Martin Schwidefsky, Arnd Bergmann, Horst Hartmann,
	Christian Ehrhardt, Nick Piggin



On Sun, 16 Aug 2009, Ingo Molnar wrote:
> 
> What's the current situation on s390, precisely which of the 28 lock 
> functions are a win to be inlined and which ones are a loss? Do you 
> have a list/table perhaps?

Let's look at x86 instead. 

The one I can _guarantee_ is worth inlining is "spin_unlock()", since it 
just generates a single "incb %m" or whatever. No loops, no conditionals, 
no nuthing. It's not even a locked instruction. Right now we literally 
generate this function for it:

	0xffffffff81420d74 <_spin_unlock+0>:	push   %rbp
	0xffffffff81420d75 <_spin_unlock+1>:	mov    %rsp,%rbp
	0xffffffff81420d78 <_spin_unlock+4>:	incb   (%rdi)
	0xffffffff81420d7a <_spin_unlock+6>:	leaveq 
	0xffffffff81420d7b <_spin_unlock+7>:	retq   

iow, the actual "bulk" of that function is a single two-byte instruction. 
And for that we generate a whole 5-byte "call" instruction, along with all 
the costs of fixed register scheduling and stupid spilling etc.

read_unlock and write_unlock are similar, and are

	lock incl (%rdi)	// 3 bytes

and

	lock addl $0x1000000,(%rdi)	// 7 bytes

respectively. At 7 bytes, write_unlock() is still likely to be smaller 
inlined (because you avoid the register games).

Other cases on x86 that would be smaller in-lined:

 - _spin_unlock_irq: 3 bytes

	incb   (%rdi)
	sti

 - _spin_unlock_irqrestore: 4 bytes

	incb   (%rdi)
	push   %rsi
	popfq

 - _read_unlock_irq/_read_unlock_irqrestore (4 and 5 bytes respectively):

	lock incl (%rdi)
	sti / push+popfq

but not, for example, any of the locking functions, nor any of the "_bh" 
versions (because local_bh_enable ends up pretty complicated, unlike 
local_bh_disable). Nor even perhaps

 - _write_unlock_irqrestore: (9 bytes)

	lock addl $0x1000000,(%rdi)
	push   %rsi
	popfq

which is starting to get to the point where a call _may_ be smaller 
(largely due to that big constant).

And '_spin_lock()' is already too big to inline:

	mov	$0x100,%eax
	lock xadd %ax,(%rdi)
	cmp	%ah,%al
	je	2f
	pause
	mov	(%rdi),%al
	je	1b

which is 20 bytes or so, and that's the simplest of the locking cases. So 
you really do have a mix of "want to inline" and "do not want to inline".
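
Summarizing the x86 numbers above as a rough table:

	function                 inlined size   worth inlining?
	spin_unlock                  2 bytes    yes
	read_unlock                  3 bytes    yes
	spin_unlock_irq              3 bytes    yes
	spin_unlock_irqrestore       4 bytes    yes
	read_unlock_irq              4 bytes    yes
	read_unlock_irqrestore       5 bytes    yes
	write_unlock                 7 bytes    likely
	write_unlock_irqrestore      9 bytes    borderline
	spin_lock                  ~20 bytes    no
	*_bh variants                   -       no (local_bh_enable is
	                                         too complicated)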

			Linus

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 2/3] spinlock: allow inlined spinlocks
  2009-08-16 18:06     ` Ingo Molnar
  2009-08-16 18:43       ` Linus Torvalds
@ 2009-08-16 18:44       ` Heiko Carstens
  2009-08-16 20:48         ` Ingo Molnar
  1 sibling, 1 reply; 34+ messages in thread
From: Heiko Carstens @ 2009-08-16 18:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Linus Torvalds, Peter Zijlstra, linux-arch,
	Martin Schwidefsky, Arnd Bergmann, Horst Hartmann,
	Christian Ehrhardt, Nick Piggin

On Sun, Aug 16, 2009 at 08:06:31PM +0200, Ingo Molnar wrote:
> * Heiko Carstens <heiko.carstens@de.ibm.com> wrote:
> > > #define __spin_lock_is_small
> 
> > > +unsigned long __lockfunc _spin_lock_irqsave_nested(spinlock_t *lock, int subclass)
> > > +							__acquires(lock);
> > > +
> > > +#ifdef __spin_lock_is_small
> > > +#define _spin_lock(lock) __spin_lock(lock)
> > > +#else
> > > +void __lockfunc _spin_lock(spinlock_t *lock)		__acquires(lock);
> > > +#endif
> 
> Dunno - i'm somewhat wary of introducing a 2^28 variability here. 
> (although the number of real variations is much lower - but still).
> 
> What's the current situation on s390, precisely which of the 28 lock 
> functions are a win to be inlined and which ones are a loss? Do you 
> have a list/table perhaps?

No list unfortunately. However, the variants we really care about are
only the spin_locks. The *try_lock variants are also not that important.
So we end up with eight ifdefs (spin_lock/bh/irq/irq_save + unlock).
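
For s390 that would presumably boil down to adding something like these
eight lines (using the names the patch already defines) to
arch/s390/include/asm/spinlock.h:

	#define __spin_lock_is_small
	#define __spin_lock_bh_is_small
	#define __spin_lock_irq_is_small
	#define __spin_lock_irqsave_is_small
	#define __spin_unlock_is_small
	#define __spin_unlock_bh_is_small
	#define __spin_unlock_irq_is_small
	#define __spin_unlock_irqrestore_is_small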

I'll change the patches according to Linus' comments and send them
again.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 0/3] Allow inlined spinlocks again V3
  2009-08-16 18:27                 ` Heiko Carstens
@ 2009-08-16 18:45                   ` Linus Torvalds
  2009-08-16 20:36                     ` Ingo Molnar
  2009-08-17 21:26                   ` [numbers] Re: [patch] more skb ops inlining Ingo Molnar
  1 sibling, 1 reply; 34+ messages in thread
From: Linus Torvalds @ 2009-08-16 18:45 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: David Miller, akpm, a.p.zijlstra, mingo, linux-arch, schwidefsky,
	arnd, horsth, ehrhardt



On Sun, 16 Aug 2009, Heiko Carstens wrote:
> 
> An allyesconfig (CONFIG_FTRACE disabled) on x86_64 with gcc 4.3.2 results in
> the following vmlinux sizes:
> 
> 574152755 2009-08-16 20:05 vmlinux-orig
> 574862000 2009-08-16 20:13 vmlinux-skb-inline

It's more relevant to look at just code size. BUG_ON() has more of an 
impact on data sizes, because it does something like 10-12 bytes of 
exception data for each bug-on. But that exception data is out-of-line and 
not interesting.

If the code size blows up a lot, that's worse than if the data size 
expands a bit.

			Linus

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 2/3] spinlock: allow inlined spinlocks
  2009-08-16 18:43       ` Linus Torvalds
@ 2009-08-16 20:24         ` Ingo Molnar
  2009-08-16 21:07           ` Linus Torvalds
  0 siblings, 1 reply; 34+ messages in thread
From: Ingo Molnar @ 2009-08-16 20:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Heiko Carstens, Andrew Morton, Peter Zijlstra, linux-arch,
	Martin Schwidefsky, Arnd Bergmann, Horst Hartmann,
	Christian Ehrhardt, Nick Piggin


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sun, 16 Aug 2009, Ingo Molnar wrote:
> > 
> > What's the current situation on s390, precisely which of the 28 lock 
> > functions are a win to be inlined and which ones are a loss? Do you 
> > have a list/table perhaps?
> 
> Let's look at x86 instead. 
> 
> The one I can _guarantee_ is worth inlining is "spin_unlock()", 
> since it just generates a single "incb %m" or whatever. [...]

We do that already for that reason:

/*
 * We inline the unlock functions in the nondebug case:
 */
#if defined(CONFIG_DEBUG_SPINLOCK) || defined(CONFIG_PREEMPT) || \
        !defined(CONFIG_SMP)
# define spin_unlock(lock)              _spin_unlock(lock)
# define read_unlock(lock)              _read_unlock(lock)
# define write_unlock(lock)             _write_unlock(lock)
# define spin_unlock_irq(lock)          _spin_unlock_irq(lock)
# define read_unlock_irq(lock)          _read_unlock_irq(lock)
# define write_unlock_irq(lock)         _write_unlock_irq(lock)

this works fine here:

 ffffffff8103d359 <finish_task_switch>:
 ...
 ffffffff8103d393:	fe 03                	incb   (%rbx)
 ffffffff8103d395:	fb                   	sti    

that's an inlined spin_unlock_irq() in finish_task_switch().

We dont do this if DEBUG_SPINLOCK is defined (which bloats the 
function) or if CONFIG_PREEMPT is defined.

Maybe the latter case could be revised although generally it's not 
worth inlining functions that also touch task-struct / threadinfo.

	Ingo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 0/3] Allow inlined spinlocks again V3
  2009-08-16 18:45                   ` Linus Torvalds
@ 2009-08-16 20:36                     ` Ingo Molnar
  2009-08-17 10:26                       ` Heiko Carstens
  0 siblings, 1 reply; 34+ messages in thread
From: Ingo Molnar @ 2009-08-16 20:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Heiko Carstens, David Miller, akpm, a.p.zijlstra, linux-arch,
	schwidefsky, arnd, horsth, ehrhardt


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sun, 16 Aug 2009, Heiko Carstens wrote:
> > 
> > An allyesconfig (CONFIG_FTRACE disabled) on x86_64 with gcc 
> > 4.3.2 results in the following vmlinux sizes:
> > 
> > 574152755 2009-08-16 20:05 vmlinux-orig
> > 574862000 2009-08-16 20:13 vmlinux-skb-inline
> 
> It's more relevant to look at just code size. BUG_ON() has more of 
> an impact on data sizes, because it does something like 10-12 
> bytes of exception data for each bug-on. But that exception data 
> is out-of-line and not interesting.
> 
> If the code size blows up a lot, that's worse than if the data 
> size expands a bit.

Also, allyesconfig isnt really relevant to inlining cost in general 
as it enables everything. The defconfigs are better (at least on 
x86) because they tend to be some distro-alike config - i.e. they 
are closer to what most Linux users will run in the end.

	Ingo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 2/3] spinlock: allow inlined spinlocks
  2009-08-16 18:44       ` Heiko Carstens
@ 2009-08-16 20:48         ` Ingo Molnar
  2009-08-16 21:33           ` Heiko Carstens
  0 siblings, 1 reply; 34+ messages in thread
From: Ingo Molnar @ 2009-08-16 20:48 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Andrew Morton, Linus Torvalds, Peter Zijlstra, linux-arch,
	Martin Schwidefsky, Arnd Bergmann, Horst Hartmann,
	Christian Ehrhardt, Nick Piggin


* Heiko Carstens <heiko.carstens@de.ibm.com> wrote:

> On Sun, Aug 16, 2009 at 08:06:31PM +0200, Ingo Molnar wrote:
> > * Heiko Carstens <heiko.carstens@de.ibm.com> wrote:
> > > > #define __spin_lock_is_small
> > 
> > > > +unsigned long __lockfunc _spin_lock_irqsave_nested(spinlock_t *lock, int subclass)
> > > > +							__acquires(lock);
> > > > +
> > > > +#ifdef __spin_lock_is_small
> > > > +#define _spin_lock(lock) __spin_lock(lock)
> > > > +#else
> > > > +void __lockfunc _spin_lock(spinlock_t *lock)		__acquires(lock);
> > > > +#endif
> > 
> > Dunno - i'm somewhat wary of introducing a 2^28 variability here. 
> > (although the number of real variations is much lower - but still).
> > 
> > What's the current situation on s390, precisely which of the 28 lock 
> > functions are a win to be inlined and which ones are a loss? Do you 
> > have a list/table perhaps?
> 
> No list unfortunately. [...]

Well, if you dont know the functions you want to inline, how will 
you make intelligent use of this facility then in s390?

Btw., i just noticed that s390 has CONFIG_PREEMPT turned on by 
default in its defconfig. Have you made your measurements with 
CONFIG_PREEMPT? If yes then the current inlining rules in spinlock.h 
will turn all the locking APIs into functions.

Could you perhaps re-try your measurements with CONFIG_PREEMPT 
turned off? I suspect you'll see different behavior.

	Ingo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 2/3] spinlock: allow inlined spinlocks
  2009-08-16 20:24         ` Ingo Molnar
@ 2009-08-16 21:07           ` Linus Torvalds
  2009-08-16 21:18             ` Ingo Molnar
  0 siblings, 1 reply; 34+ messages in thread
From: Linus Torvalds @ 2009-08-16 21:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Heiko Carstens, Andrew Morton, Peter Zijlstra, linux-arch,
	Martin Schwidefsky, Arnd Bergmann, Horst Hartmann,
	Christian Ehrhardt, Nick Piggin



On Sun, 16 Aug 2009, Ingo Molnar wrote:
> 
> We do that already for that reason:

Ahh, we do. I was misled by the fact that we still generate the 
out-of-line ones, even though we don't use them. Which is fine, of course.

It doesn't do the irqrestore versions, but that's probably not a big deal.

			Linus

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 2/3] spinlock: allow inlined spinlocks
  2009-08-16 21:07           ` Linus Torvalds
@ 2009-08-16 21:18             ` Ingo Molnar
  0 siblings, 0 replies; 34+ messages in thread
From: Ingo Molnar @ 2009-08-16 21:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Heiko Carstens, Andrew Morton, Peter Zijlstra, linux-arch,
	Martin Schwidefsky, Arnd Bergmann, Horst Hartmann,
	Christian Ehrhardt, Nick Piggin


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sun, 16 Aug 2009, Ingo Molnar wrote:
> > 
> > We do that already for that reason:
> 
> Ahh, we do. I was misled by the fact that we still generate the 
> out-of-line ones, even though we don't use them. Which is fine, of 
> course.
> 
> It doesn't do the irqrestore versions, but that's probably not a 
> big deal.

i remember having done gradual builds and turning the inlining of 
various locking functions on/off to see how it impacts the text 
size, so we were quite close to the optimal setup ... then. (two 
years ago?)

But ... the fact that you didnt notice it shows that the inlining 
decisions in spinlock.h are less than transparent and not really 
cleanly done.

Even if we dont change a thing on the x86 side the s390 observations 
are obviously valid, and Heiko's generalization looks more 
maintainable in terms of tweakability, as it's per-function and 
per-arch granular.

OTOH, it needs some work: the CONFIG_PREEMPT and 
CONFIG_DEBUG_SPINLOCKS dependency should be expressed too - and i 
think CONFIG_PREEMPT is probably set on s390.

And at that point we've got 2^3 * 2^28 * 20 
(configs*functions*archs) variants, and i feel a bit uneasy about 
that kind of complexity (even if the typical cases will be much 
simpler than that). We wont really get it fully right, we'll just 
drown in that.

Dammit, why cannot the compiler get this right? Working around it in 
the kernel source makes things so ugly :-(

It's not rocket science to do a good inliner but IMO GCC messed 
their design up on day one by separating the assembler from the C 
compiler, hence their inliner cannot really know about how large 
instructions are. Rather stupid IMHO.

	Ingo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 2/3] spinlock: allow inlined spinlocks
  2009-08-16 20:48         ` Ingo Molnar
@ 2009-08-16 21:33           ` Heiko Carstens
  2009-08-16 21:36             ` Ingo Molnar
  0 siblings, 1 reply; 34+ messages in thread
From: Heiko Carstens @ 2009-08-16 21:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Linus Torvalds, Peter Zijlstra, linux-arch,
	Martin Schwidefsky, Arnd Bergmann, Horst Hartmann,
	Christian Ehrhardt, Nick Piggin

On Sun, Aug 16, 2009 at 10:48:31PM +0200, Ingo Molnar wrote:
> * Heiko Carstens <heiko.carstens@de.ibm.com> wrote:
> > On Sun, Aug 16, 2009 at 08:06:31PM +0200, Ingo Molnar wrote:
> > > * Heiko Carstens <heiko.carstens@de.ibm.com> wrote:
> > > > > #define __spin_lock_is_small
> > > 
> > > > > +unsigned long __lockfunc _spin_lock_irqsave_nested(spinlock_t *lock, int subclass)
> > > > > +							__acquires(lock);
> > > > > +
> > > > > +#ifdef __spin_lock_is_small
> > > > > +#define _spin_lock(lock) __spin_lock(lock)
> > > > > +#else
> > > > > +void __lockfunc _spin_lock(spinlock_t *lock)		__acquires(lock);
> > > > > +#endif
> > > 
> > > Dunno - i'm somewhat wary of introducing a 2^28 variability here. 
> > > (although the number of real variations is much lower - but still).
> > > 
> > > What's the current situation on s390, precisely which of the 28 lock 
> > > functions are a win to be inlined and which ones are a loss? Do you 
> > > have a list/table perhaps?
> > 
> > No list unfortunately. [...]
> 
> Well, if you dont know the functions you want to inline, how will
> you make intelligent use of this facility then in s390?

What I tried to say: in general we want to have all locking functions inlined.
This is because all our measurements show that if anything gets inlined,
performance improves. This is even true for large functions.

> Btw., i just noticed that s390 has CONFIG_PREEMPT turned on by 
> default in its defconfig. Have you made your measurements with 
> CONFIG_PREEMPT? If yes then the current inlining rules in spinlock.h 
> will turn all the locking APIs into functions.

All measurements were done with CONFIG_PREEMPT off.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 2/3] spinlock: allow inlined spinlocks
  2009-08-16 21:33           ` Heiko Carstens
@ 2009-08-16 21:36             ` Ingo Molnar
  0 siblings, 0 replies; 34+ messages in thread
From: Ingo Molnar @ 2009-08-16 21:36 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Andrew Morton, Linus Torvalds, Peter Zijlstra, linux-arch,
	Martin Schwidefsky, Arnd Bergmann, Horst Hartmann,
	Christian Ehrhardt, Nick Piggin


* Heiko Carstens <heiko.carstens@de.ibm.com> wrote:

> On Sun, Aug 16, 2009 at 10:48:31PM +0200, Ingo Molnar wrote:
> > * Heiko Carstens <heiko.carstens@de.ibm.com> wrote:
> > > On Sun, Aug 16, 2009 at 08:06:31PM +0200, Ingo Molnar wrote:
> > > > * Heiko Carstens <heiko.carstens@de.ibm.com> wrote:
> > > > > > #define __spin_lock_is_small
> > > > 
> > > > > > +unsigned long __lockfunc _spin_lock_irqsave_nested(spinlock_t *lock, int subclass)
> > > > > > +							__acquires(lock);
> > > > > > +
> > > > > > +#ifdef __spin_lock_is_small
> > > > > > +#define _spin_lock(lock) __spin_lock(lock)
> > > > > > +#else
> > > > > > +void __lockfunc _spin_lock(spinlock_t *lock)		__acquires(lock);
> > > > > > +#endif
> > > > 
> > > > Dunno - i'm somewhat wary of introducing a 2^28 variability here. 
> > > > (although the number of real variations is much lower - but still).
> > > > 
> > > > What's the current situation on s390, precisely which of the 28 lock 
> > > > functions are a win to be inlined and which ones are a loss? Do you 
> > > > have a list/table perhaps?
> > > 
> > > No list unfortunately. [...]
> > 
> > Well, if you dont know the functions you want to inline, how 
> > will you make intelligent use of this facility then in s390?
> 
> What I tried to say: in general we want to have all locking 
> functions inlined. This is because all our measurements show that if 
> anything gets inlined, performance improves. This is even true for 
> large functions.
> 
> > Btw., i just noticed that s390 has CONFIG_PREEMPT turned on by 
> > default in its defconfig. Have you made your measurements with 
> > CONFIG_PREEMPT? If yes then the current inlining rules in 
> > spinlock.h will turn all the locking APIs into functions.
> 
> All measurements were done with CONFIG_PREEMPT off.

ok.

( Btw., regardless of this patch-set, i suspect you _really_ want 
  compiler help for that - one with global scope, which can inline 
  even larger functions, even if they are declared global in the 
  kernel. )

	Ingo

* Re: [patch 0/3] Allow inlined spinlocks again V3
  2009-08-16 20:36                     ` Ingo Molnar
@ 2009-08-17 10:26                       ` Heiko Carstens
  0 siblings, 0 replies; 34+ messages in thread
From: Heiko Carstens @ 2009-08-17 10:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, David Miller, akpm, a.p.zijlstra, linux-arch,
	schwidefsky, arnd, horsth, ehrhardt

On Sun, Aug 16, 2009 at 10:36:50PM +0200, Ingo Molnar wrote:
> > > An allyesconfig (CONFIG_FTRACE disabled) on x86_64 with gcc 
> > > 4.3.2 results in the following vmlinux sizes:
> > > 
> > > 574152755 2009-08-16 20:05 vmlinux-orig
> > > 574862000 2009-08-16 20:13 vmlinux-skb-inline
> > 
> > It's more relevant to look at just code size. BUG_ON() has more of 
> > an impact on data sizes, because it does something like 10-12 
> > bytes of exception data for each bug-on. But that exception data 
> > is out-of-line and not interesting.
> > 
> > If the code size blows up a lot, that's worse than if the data 
> > size expands a bit.
> 
> Also, allyesconfig isn't really relevant to inlining cost in general 
> as it enables everything. The defconfigs are better (at least on 
> x86) because they tend to be some distro-like config - i.e. they 
> are closer to what most Linux users will run in the end.

Ok, I'm still using allyesconfig, but these numbers should still be
good enough:

For the unpatched kernel we have

ffffffff81009000 T _stext
ffffffff83614130 T _etext
size: 39891248

ffffffff8460e250 R __start___bug_table
ffffffff8462e9ec R __stop___bug_table
size: 133020

And for the patched kernel (inlined functions) we get

ffffffff81009000 T _stext
ffffffff83627250 T _etext
size: 39969360

ffffffff84621080 R __start___bug_table
ffffffff84649c4c R __stop___bug_table
size: 166860

We have CONFIG_GENERIC_BUG, CONFIG_GENERIC_BUG_RELATIVE_POINTERS and
CONFIG_DEBUG_BUGVERBOSE all enabled. So the size of each bug table entry
is 10 bytes.
We end up with 3384 additional bug table entries caused by inlining.
Text size increased by 78112 bytes, which averages out to an extra 23 bytes
per inlined call site.
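
Spelled out, the arithmetic is:

(166860 - 133020) / 10 bytes = 3384 extra bug table entries
39969360 - 39891248          = 78112 extra text bytes
78112 / 3384                 ~ 23 text bytes per inlined call site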

* Re: [patch 2/3] spinlock: allow inlined spinlocks
  2009-08-16 18:22     ` Linus Torvalds
@ 2009-08-17 15:46       ` Heiko Carstens
  0 siblings, 0 replies; 34+ messages in thread
From: Heiko Carstens @ 2009-08-17 15:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Peter Zijlstra, Ingo Molnar, linux-arch,
	Martin Schwidefsky, Arnd Bergmann, Horst Hartmann,
	Christian Ehrhardt, Nick Piggin

On Sun, Aug 16, 2009 at 11:22:45AM -0700, Linus Torvalds wrote:
> Other things going on, but here goes:
> > > +#ifdef __spin_lock_is_small
> > > +#define _spin_lock(lock) __spin_lock(lock)
> > > +#else
> > > +void __lockfunc _spin_lock(spinlock_t *lock)		__acquires(lock);
> > > +#endif
> 
> The above repeats many times, and I found that irritating. I think you 
> should be able to get rid of a _lot_ of that by doing it like this instead
> 
> 	extern void __lockfunc _spin_lock(spinlock_t *lock) __acquires(lock);
> 	... plain declarations for all cases, no #ifdef's ...
> 
> and then at the end you just do
> 
> 	#ifdef __spin_lock_is_small
> 	  #define _spin_lock(lock) __spin_lock(lock)         
> 	#endif
> 	..
> 
> and at least make that repeating section be composed of simpler and 
> smaller entries.

Well, I decided to go for the ifdef variant, because it makes it easy to
ifdef all functions in kernel/spinlock.c:

With

#ifdef __spin_lock_is_small
#define _spin_lock(lock) __spin_lock(lock)         
#endif

We can do this:

+#ifndef _spin_lock
 void __lockfunc _spin_lock(spinlock_t *lock)
 {
 	__spin_lock(lock);
 }
 EXPORT_SYMBOL(_spin_lock);
+#endif

In addition this allows us to overrule architecture settings by adding ifdefs
in one common place, instead of requiring each architecture to get it right:

+#ifndef CONFIG_DEBUG_SPINLOCK
+#ifndef CONFIG_GENERIC_LOCKBREAK
+
+#ifdef __spin_lock_is_small
+#define _spin_lock(lock) __spin_lock(lock)
+#endif
+ [...]

This should also address Ingo's concerns.
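
To make the resulting interaction explicit, the whole chain for a single
function looks like this (a sketch; the pieces are exactly the ones from
the patch below):

/* 1) arch header (e.g. asm/spinlock.h) opts in: */
#define __spin_lock_is_small

/* 2) spinlock_api_smp.h maps the API function to the inline version: */
#ifdef __spin_lock_is_small
#define _spin_lock(lock) __spin_lock(lock)
#endif

/* 3) kernel/spinlock.c then skips the out-of-line copy: */
#ifndef _spin_lock
void __lockfunc _spin_lock(spinlock_t *lock)
{
	__spin_lock(lock);
}
EXPORT_SYMBOL(_spin_lock);
#endif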

Does this look ok now?

---
 include/linux/spinlock_api_smp.h |  119 +++++++++++++++++++++++++++++++++++++++
 kernel/spinlock.c                |   56 ++++++++++++++++++
 2 files changed, 175 insertions(+)

Index: linux-2.6/kernel/spinlock.c
===================================================================
--- linux-2.6.orig/kernel/spinlock.c
+++ linux-2.6/kernel/spinlock.c
@@ -21,23 +21,29 @@
 #include <linux/debug_locks.h>
 #include <linux/module.h>
 
+#ifndef _spin_trylock
 int __lockfunc _spin_trylock(spinlock_t *lock)
 {
 	return __spin_trylock(lock);
 }
 EXPORT_SYMBOL(_spin_trylock);
+#endif
 
+#ifndef _read_trylock
 int __lockfunc _read_trylock(rwlock_t *lock)
 {
 	return __read_trylock(lock);
 }
 EXPORT_SYMBOL(_read_trylock);
+#endif
 
+#ifndef _write_trylock
 int __lockfunc _write_trylock(rwlock_t *lock)
 {
 	return __write_trylock(lock);
 }
 EXPORT_SYMBOL(_write_trylock);
+#endif
 
 /*
  * If lockdep is enabled then we use the non-preemption spin-ops
@@ -46,77 +52,101 @@ EXPORT_SYMBOL(_write_trylock);
  */
 #if !defined(CONFIG_GENERIC_LOCKBREAK) || defined(CONFIG_DEBUG_LOCK_ALLOC)
 
+#ifndef _read_lock
 void __lockfunc _read_lock(rwlock_t *lock)
 {
 	__read_lock(lock);
 }
 EXPORT_SYMBOL(_read_lock);
+#endif
 
+#ifndef _spin_lock_irqsave
 unsigned long __lockfunc _spin_lock_irqsave(spinlock_t *lock)
 {
 	return __spin_lock_irqsave(lock);
 }
 EXPORT_SYMBOL(_spin_lock_irqsave);
+#endif
 
+#ifndef _spin_lock_irq
 void __lockfunc _spin_lock_irq(spinlock_t *lock)
 {
 	__spin_lock_irq(lock);
 }
 EXPORT_SYMBOL(_spin_lock_irq);
+#endif
 
+#ifndef _spin_lock_bh
 void __lockfunc _spin_lock_bh(spinlock_t *lock)
 {
 	__spin_lock_bh(lock);
 }
 EXPORT_SYMBOL(_spin_lock_bh);
+#endif
 
+#ifndef _read_lock_irqsave
 unsigned long __lockfunc _read_lock_irqsave(rwlock_t *lock)
 {
 	return __read_lock_irqsave(lock);
 }
 EXPORT_SYMBOL(_read_lock_irqsave);
+#endif
 
+#ifndef _read_lock_irq
 void __lockfunc _read_lock_irq(rwlock_t *lock)
 {
 	__read_lock_irq(lock);
 }
 EXPORT_SYMBOL(_read_lock_irq);
+#endif
 
+#ifndef _read_lock_bh
 void __lockfunc _read_lock_bh(rwlock_t *lock)
 {
 	__read_lock_bh(lock);
 }
 EXPORT_SYMBOL(_read_lock_bh);
+#endif
 
+#ifndef _write_lock_irqsave
 unsigned long __lockfunc _write_lock_irqsave(rwlock_t *lock)
 {
 	return __write_lock_irqsave(lock);
 }
 EXPORT_SYMBOL(_write_lock_irqsave);
+#endif
 
+#ifndef _write_lock_irq
 void __lockfunc _write_lock_irq(rwlock_t *lock)
 {
 	__write_lock_irq(lock);
 }
 EXPORT_SYMBOL(_write_lock_irq);
+#endif
 
+#ifndef _write_lock_bh
 void __lockfunc _write_lock_bh(rwlock_t *lock)
 {
 	__write_lock_bh(lock);
 }
 EXPORT_SYMBOL(_write_lock_bh);
+#endif
 
+#ifndef _spin_lock
 void __lockfunc _spin_lock(spinlock_t *lock)
 {
 	__spin_lock(lock);
 }
 EXPORT_SYMBOL(_spin_lock);
+#endif
 
+#ifndef _write_lock
 void __lockfunc _write_lock(rwlock_t *lock)
 {
 	__write_lock(lock);
 }
 EXPORT_SYMBOL(_write_lock);
+#endif
 
 #else /* CONFIG_PREEMPT: */
 
@@ -242,83 +272,109 @@ EXPORT_SYMBOL(_spin_lock_nest_lock);
 
 #endif
 
+#ifndef _spin_unlock
 void __lockfunc _spin_unlock(spinlock_t *lock)
 {
 	__spin_unlock(lock);
 }
 EXPORT_SYMBOL(_spin_unlock);
+#endif
 
+#ifndef _write_unlock
 void __lockfunc _write_unlock(rwlock_t *lock)
 {
 	__write_unlock(lock);
 }
 EXPORT_SYMBOL(_write_unlock);
+#endif
 
+#ifndef _read_unlock
 void __lockfunc _read_unlock(rwlock_t *lock)
 {
 	__read_unlock(lock);
 }
 EXPORT_SYMBOL(_read_unlock);
+#endif
 
+#ifndef _spin_unlock_irqrestore
 void __lockfunc _spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags)
 {
 	__spin_unlock_irqrestore(lock, flags);
 }
 EXPORT_SYMBOL(_spin_unlock_irqrestore);
+#endif
 
+#ifndef _spin_unlock_irq
 void __lockfunc _spin_unlock_irq(spinlock_t *lock)
 {
 	__spin_unlock_irq(lock);
 }
 EXPORT_SYMBOL(_spin_unlock_irq);
+#endif
 
+#ifndef _spin_unlock_bh
 void __lockfunc _spin_unlock_bh(spinlock_t *lock)
 {
 	__spin_unlock_bh(lock);
 }
 EXPORT_SYMBOL(_spin_unlock_bh);
+#endif
 
+#ifndef _read_unlock_irqrestore
 void __lockfunc _read_unlock_irqrestore(rwlock_t *lock, unsigned long flags)
 {
 	__read_unlock_irqrestore(lock, flags);
 }
 EXPORT_SYMBOL(_read_unlock_irqrestore);
+#endif
 
+#ifndef _read_unlock_irq
 void __lockfunc _read_unlock_irq(rwlock_t *lock)
 {
 	__read_unlock_irq(lock);
 }
 EXPORT_SYMBOL(_read_unlock_irq);
+#endif
 
+#ifndef _read_unlock_bh
 void __lockfunc _read_unlock_bh(rwlock_t *lock)
 {
 	__read_unlock_bh(lock);
 }
 EXPORT_SYMBOL(_read_unlock_bh);
+#endif
 
+#ifndef _write_unlock_irqrestore
 void __lockfunc _write_unlock_irqrestore(rwlock_t *lock, unsigned long flags)
 {
 	__write_unlock_irqrestore(lock, flags);
 }
 EXPORT_SYMBOL(_write_unlock_irqrestore);
+#endif
 
+#ifndef _write_unlock_irq
 void __lockfunc _write_unlock_irq(rwlock_t *lock)
 {
 	__write_unlock_irq(lock);
 }
 EXPORT_SYMBOL(_write_unlock_irq);
+#endif
 
+#ifndef _write_unlock_bh
 void __lockfunc _write_unlock_bh(rwlock_t *lock)
 {
 	__write_unlock_bh(lock);
 }
 EXPORT_SYMBOL(_write_unlock_bh);
+#endif
 
+#ifndef _spin_trylock_bh
 int __lockfunc _spin_trylock_bh(spinlock_t *lock)
 {
 	return __spin_trylock_bh(lock);
 }
 EXPORT_SYMBOL(_spin_trylock_bh);
+#endif
 
 notrace int in_lock_functions(unsigned long addr)
 {
Index: linux-2.6/include/linux/spinlock_api_smp.h
===================================================================
--- linux-2.6.orig/include/linux/spinlock_api_smp.h
+++ linux-2.6/include/linux/spinlock_api_smp.h
@@ -60,6 +60,125 @@ void __lockfunc _read_unlock_irqrestore(
 void __lockfunc _write_unlock_irqrestore(rwlock_t *lock, unsigned long flags)
 							__releases(lock);
 
+#ifndef CONFIG_DEBUG_SPINLOCK
+#ifndef CONFIG_GENERIC_LOCKBREAK
+
+#ifdef __spin_lock_is_small
+#define _spin_lock(lock) __spin_lock(lock)
+#endif
+
+#ifdef __read_lock_is_small
+#define _read_lock(lock) __read_lock(lock)
+#endif
+
+#ifdef __write_lock_is_small
+#define _write_lock(lock) __write_lock(lock)
+#endif
+
+#ifdef __spin_lock_bh_is_small
+#define _spin_lock_bh(lock) __spin_lock_bh(lock)
+#endif
+
+#ifdef __read_lock_bh_is_small
+#define _read_lock_bh(lock) __read_lock_bh(lock)
+#endif
+
+#ifdef __write_lock_bh_is_small
+#define _write_lock_bh(lock) __write_lock_bh(lock)
+#endif
+
+#ifdef __spin_lock_irq_is_small
+#define _spin_lock_irq(lock) __spin_lock_irq(lock)
+#endif
+
+#ifdef __read_lock_irq_is_small
+#define _read_lock_irq(lock) __read_lock_irq(lock)
+#endif
+
+#ifdef __write_lock_irq_is_small
+#define _write_lock_irq(lock) __write_lock_irq(lock)
+#endif
+
+#ifdef __spin_lock_irqsave_is_small
+#define _spin_lock_irqsave(lock) __spin_lock_irqsave(lock)
+#endif
+
+#ifdef __read_lock_irqsave_is_small
+#define _read_lock_irqsave(lock) __read_lock_irqsave(lock)
+#endif
+
+#ifdef __write_lock_irqsave_is_small
+#define _write_lock_irqsave(lock) __write_lock_irqsave(lock)
+#endif
+
+#endif /* !CONFIG_GENERIC_LOCKBREAK */
+
+#ifdef __spin_trylock_is_small
+#define _spin_trylock(lock) __spin_trylock(lock)
+#endif
+
+#ifdef __read_trylock_is_small
+#define _read_trylock(lock) __read_trylock(lock)
+#endif
+
+#ifdef __write_trylock_is_small
+#define _write_trylock(lock) __write_trylock(lock)
+#endif
+
+#ifdef __spin_trylock_bh_is_small
+#define _spin_trylock_bh(lock) __spin_trylock_bh(lock)
+#endif
+
+#ifdef __spin_unlock_is_small
+#define _spin_unlock(lock) __spin_unlock(lock)
+#endif
+
+#ifdef __read_unlock_is_small
+#define _read_unlock(lock) __read_unlock(lock)
+#endif
+
+#ifdef __write_unlock_is_small
+#define _write_unlock(lock) __write_unlock(lock)
+#endif
+
+#ifdef __spin_unlock_bh_is_small
+#define _spin_unlock_bh(lock) __spin_unlock_bh(lock)
+#endif
+
+#ifdef __read_unlock_bh_is_small
+#define _read_unlock_bh(lock) __read_unlock_bh(lock)
+#endif
+
+#ifdef __write_unlock_bh_is_small
+#define _write_unlock_bh(lock) __write_unlock_bh(lock)
+#endif
+
+#ifdef __spin_unlock_irq_is_small
+#define _spin_unlock_irq(lock) __spin_unlock_irq(lock)
+#endif
+
+#ifdef __read_unlock_irq_is_small
+#define _read_unlock_irq(lock) __read_unlock_irq(lock)
+#endif
+
+#ifdef __write_unlock_irq_is_small
+#define _write_unlock_irq(lock) __write_unlock_irq(lock)
+#endif
+
+#ifdef __spin_unlock_irqrestore_is_small
+#define _spin_unlock_irqrestore(lock, flags) __spin_unlock_irqrestore(lock, flags)
+#endif
+
+#ifdef __read_unlock_irqrestore_is_small
+#define _read_unlock_irqrestore(lock, flags) __read_unlock_irqrestore(lock, flags)
+#endif
+
+#ifdef __write_unlock_irqrestore_is_small
+#define _write_unlock_irqrestore(lock, flags) __write_unlock_irqrestore(lock, flags)
+#endif
+
+#endif /* CONFIG_DEBUG_SPINLOCK */
+
 static inline int __spin_trylock(spinlock_t *lock)
 {
 	preempt_disable();

* [numbers] Re: [patch] more skb ops inlining
  2009-08-16 18:27                 ` Heiko Carstens
  2009-08-16 18:45                   ` Linus Torvalds
@ 2009-08-17 21:26                   ` Ingo Molnar
  2009-08-18 11:34                     ` Heiko Carstens
  1 sibling, 1 reply; 34+ messages in thread
From: Ingo Molnar @ 2009-08-17 21:26 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Linus Torvalds, David Miller, akpm, a.p.zijlstra, linux-arch,
	schwidefsky, arnd, horsth, ehrhardt


* Heiko Carstens <heiko.carstens@de.ibm.com> wrote:

> An allyesconfig (CONFIG_FTRACE disabled) on x86_64 with gcc 4.3.2 
> results in the following vmlinux sizes:
> 
> 574152755 2009-08-16 20:05 vmlinux-orig
> 574862000 2009-08-16 20:13 vmlinux-skb-inline
> 
> The skb-inline variant was created by using your patch and on top 
> of it the patch below. Size increased by 692KB (0.12%).
> 
> Dunno.. is that acceptable?

i ported your patch to latest mainline (see it attached below) and 
did x86-64 defconfig vmlinux measurements (with 
CONFIG_OPTIMIZE_INLINING disabled):

   text	   data	    bss	     dec	    hex	filename
8455329	1192304	 989908	10637541	 a250e5	vmlinux.before
8471036	1192304	 989908	10653248	 a28e40	vmlinux.after

that's a +0.18% text size increase - quite significant.

I also did performance measurements, trying to figure out whether 
this increase in size can be measured via any runtime performance 
metrics such as cycles, instructions or cache-misses.

To do that i picked a micro-benchmark, lat_tcp.

( Note that microbenchmarks are biased in favor of inlining patches
  like yours, because they disproportionately favor the speed
  benefit of it while not adequately showing the cache footprint
  disadvantage of increased kernel text size. )

Without your patch i got:

titan:~/l> perf stat --repeat 10 -e task-clock -e cycles -e instructions -e L1-icache-refs -e L1-icache-misses ./lat_tcp localhost
TCP latency using localhost: 17.5198 microseconds
TCP latency using localhost: 17.5060 microseconds
TCP latency using localhost: 17.4684 microseconds
TCP latency using localhost: 17.5416 microseconds
TCP latency using localhost: 17.4881 microseconds
TCP latency using localhost: 17.4832 microseconds
TCP latency using localhost: 17.4830 microseconds
TCP latency using localhost: 17.4723 microseconds
TCP latency using localhost: 17.4733 microseconds
TCP latency using localhost: 17.5255 microseconds

 Performance counter stats for './lat_tcp localhost' (10 runs):

    1494.625043  task-clock-msecs         #      0.564 CPUs    ( +-   0.077% )
     3665868023  cycles                   #   2452.701 M/sec   ( +-   0.070% )
     2163212606  instructions             #      0.590 IPC     ( +-   0.090% )
     2449056216  L1-icache-loads          #   1638.576 M/sec   ( +-   0.056% )
       61081182  L1-icache-load-misses    #     40.867 M/sec   ( +-   0.192% )

    2.649159316  seconds time elapsed   ( +-   0.090% )

With your patch i got:

titan:~/l> perf stat --repeat 10 -e task-clock -e cycles -e instructions -e L1-icache-refs -e L1-icache-misses ./lat_tcp localhost
TCP latency using localhost: 17.6913 microseconds
TCP latency using localhost: 17.7218 microseconds
TCP latency using localhost: 17.6882 microseconds
TCP latency using localhost: 17.7196 microseconds
TCP latency using localhost: 17.6411 microseconds
TCP latency using localhost: 17.7136 microseconds
TCP latency using localhost: 17.6488 microseconds
TCP latency using localhost: 17.6663 microseconds
TCP latency using localhost: 17.7445 microseconds
TCP latency using localhost: 17.6851 microseconds

 Performance counter stats for './lat_tcp localhost' (10 runs):

    1497.479984  task-clock-msecs         #      0.564 CPUs    ( +-   0.106% )
     3690097505  cycles                   #   2464.205 M/sec   ( +-   0.109% )
     2162432402  instructions             #      0.586 IPC     ( +-   0.127% )
     2437974192  L1-icache-loads          #   1628.051 M/sec   ( +-   0.105% )
       63064588  L1-icache-load-misses    #     42.114 M/sec   ( +-   0.278% )

    2.656952976  seconds time elapsed   ( +-   0.110% )

Summarized (for lat_tcp):

|  Before:
|
|    1494.625043  task-clock-msecs         #      0.564 CPUs    ( +-   0.077% )
|     3665868023  cycles                   #   2452.701 M/sec   ( +-   0.070% )
|     2163212606  instructions             #      0.590 IPC     ( +-   0.090% )
|
|  After:
|
|    1497.479984  task-clock-msecs         #      0.564 CPUs    ( +-   0.106% )
|     3690097505  cycles                   #   2464.205 M/sec   ( +-   0.109% )
|     2162432402  instructions             #      0.586 IPC     ( +-   0.127% )

So we've got a 0.2% slowdown, which is provably outside the noise 
level of 0.1%. The instruction count remained almost the same.
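
Spelled out from the counters:

1497.479984 / 1494.625043 ~ 1.0019  ->  +0.19% task-clock
2.656952976 / 2.649159316 ~ 1.0029  ->  +0.29% elapsed time
2162432402 / 2163212606   ~ 0.9996  ->  -0.04% instructions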

The reason for the slowdown can be seen in the L1 icache-fetch 
stats:

     2449056216  L1-icache-loads          #   1638.576 M/sec   ( +-   0.056% )
       61081182  L1-icache-load-misses    #     40.867 M/sec   ( +-   0.192% )

     2437974192  L1-icache-loads          #   1628.051 M/sec   ( +-   0.105% )
       63064588  L1-icache-load-misses    #     42.114 M/sec   ( +-   0.278% )

The icache-misses went up from 40.867 M/sec to 42.114 M/sec, a 
3.05% worsening - well beyond the statistical significance of 
0.28%.

So unless i botched the measurements or some accidental code layout 
differences punish the patch artificially, this patch causes a small 
but measurable performance regression on x86.

The slowdown itself is barely provable with a test repeat count of 
10 iterations on an otherwise totally idle system (not a surprise 
really - such patches were never provable via /usr/bin/time kinds 
of measurements).

The icache-miss stats do support the notion that what is causing 
the slowdown is the critical path falling out of the icache.

( By all means please repeat these measurements - they are very
  easy to do via 'perf stat --repeat 10' - see the examples
  above. )

To build the kernel i used gcc 4.3.2. The CPU i used is friendly to 
inlining patches as well: it's an Extreme Edition Core2 Duo (booted 
to a single core, to stabilize the numbers):

  Intel(R) Core(TM)2 CPU E6800  @ 2.93GHz

With 4MB of L2 cache. I suspect smaller CPUs would feel the 
negative effects of inlining in a more pronounced way.

Thanks,

	Ingo

---
 include/linux/skbuff.h |   37 +++++++++++++++++++++++++++++++++++--
 net/core/skbuff.c      |   40 ----------------------------------------
 2 files changed, 35 insertions(+), 42 deletions(-)

Index: linux/include/linux/skbuff.h
===================================================================
--- linux.orig/include/linux/skbuff.h
+++ linux/include/linux/skbuff.h
@@ -1120,7 +1120,6 @@ static inline void skb_set_tail_pointer(
 /*
  *	Add data to an sk_buff
  */
-extern unsigned char *skb_put(struct sk_buff *skb, unsigned int len);
 static inline unsigned char *__skb_put(struct sk_buff *skb, unsigned int len)
 {
 	unsigned char *tmp = skb_tail_pointer(skb);
@@ -1130,7 +1129,24 @@ static inline unsigned char *__skb_put(s
 	return tmp;
 }
 
-extern unsigned char *skb_push(struct sk_buff *skb, unsigned int len);
+/**
+ *	skb_put - add data to a buffer
+ *	@skb: buffer to use
+ *	@len: amount of data to add
+ *
+ *	This function extends the used data area of the buffer. If this would
+ *	exceed the total buffer size the kernel will panic. A pointer to the
+ *	first byte of the extra data is returned.
+ */
+static inline unsigned char *skb_put(struct sk_buff *skb, unsigned int len)
+{
+	unsigned char *tmp = __skb_put(skb, len);
+
+	if (unlikely(skb->tail > skb->end))
+		skb_over_panic(skb, len, __builtin_return_address(0));
+	return tmp;
+}
+
 static inline unsigned char *__skb_push(struct sk_buff *skb, unsigned int len)
 {
 	skb->data -= len;
@@ -1138,6 +1154,23 @@ static inline unsigned char *__skb_push(
 	return skb->data;
 }
 
+/**
+ *	skb_push - add data to the start of a buffer
+ *	@skb: buffer to use
+ *	@len: amount of data to add
+ *
+ *	This function extends the used data area of the buffer at the buffer
+ *	start. If this would exceed the total buffer headroom the kernel will
+ *	panic. A pointer to the first byte of the extra data is returned.
+ */
+static inline unsigned char *skb_push(struct sk_buff *skb, unsigned int len)
+{
+	__skb_push(skb, len);
+	if (unlikely(skb->data<skb->head))
+		skb_under_panic(skb, len, __builtin_return_address(0));
+	return skb->data;
+}
+
 extern unsigned char *skb_pull(struct sk_buff *skb, unsigned int len);
 static inline unsigned char *__skb_pull(struct sk_buff *skb, unsigned int len)
 {
Index: linux/net/core/skbuff.c
===================================================================
--- linux.orig/net/core/skbuff.c
+++ linux/net/core/skbuff.c
@@ -1000,46 +1000,6 @@ free_skb:
 EXPORT_SYMBOL(skb_pad);
 
 /**
- *	skb_put - add data to a buffer
- *	@skb: buffer to use
- *	@len: amount of data to add
- *
- *	This function extends the used data area of the buffer. If this would
- *	exceed the total buffer size the kernel will panic. A pointer to the
- *	first byte of the extra data is returned.
- */
-unsigned char *skb_put(struct sk_buff *skb, unsigned int len)
-{
-	unsigned char *tmp = skb_tail_pointer(skb);
-	SKB_LINEAR_ASSERT(skb);
-	skb->tail += len;
-	skb->len  += len;
-	if (unlikely(skb->tail > skb->end))
-		skb_over_panic(skb, len, __builtin_return_address(0));
-	return tmp;
-}
-EXPORT_SYMBOL(skb_put);
-
-/**
- *	skb_push - add data to the start of a buffer
- *	@skb: buffer to use
- *	@len: amount of data to add
- *
- *	This function extends the used data area of the buffer at the buffer
- *	start. If this would exceed the total buffer headroom the kernel will
- *	panic. A pointer to the first byte of the extra data is returned.
- */
-unsigned char *skb_push(struct sk_buff *skb, unsigned int len)
-{
-	skb->data -= len;
-	skb->len  += len;
-	if (unlikely(skb->data<skb->head))
-		skb_under_panic(skb, len, __builtin_return_address(0));
-	return skb->data;
-}
-EXPORT_SYMBOL(skb_push);
-
-/**
  *	skb_pull - remove data from the start of a buffer
  *	@skb: buffer to use
  *	@len: amount of data to remove

* Re: [numbers] Re: [patch] more skb ops inlining
  2009-08-17 21:26                   ` [numbers] Re: [patch] more skb ops inlining Ingo Molnar
@ 2009-08-18 11:34                     ` Heiko Carstens
  0 siblings, 0 replies; 34+ messages in thread
From: Heiko Carstens @ 2009-08-18 11:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, David Miller, akpm, a.p.zijlstra, linux-arch,
	schwidefsky, arnd, horsth, ehrhardt

On Mon, Aug 17, 2009 at 11:26:29PM +0200, Ingo Molnar wrote:
> * Heiko Carstens <heiko.carstens@de.ibm.com> wrote:
> 
> > An allyesconfig (CONFIG_FTRACE disabled) on x86_64 with gcc 4.3.2 
> > results in the following vmlinux sizes:
> > 
> > 574152755 2009-08-16 20:05 vmlinux-orig
> > 574862000 2009-08-16 20:13 vmlinux-skb-inline
> > 
> > The skb-inline variant was created by using your patch and on top 
> > of it the patch below. Size increased by 692KB (0.12%).
> > 
> > Dunno.. is that acceptable?
> 
> i ported your patch to latest mainline (see it attached below) and 
> did x86-64 defconfig vmlinux measurements (with 
> CONFIG_OPTIMIZE_INLINING disabled):

> +static inline unsigned char *skb_put(struct sk_buff *skb, unsigned int len)
> +{
> +	unsigned char *tmp = __skb_put(skb, len);
> +
> +	if (unlikely(skb->tail > skb->end))
> +		skb_over_panic(skb, len, __builtin_return_address(0));
> +	return tmp;
> +}

No, the proposed patch replaces skb_over_panic() with BUG_ON() (see below).
With that I get slightly different results. However that's on my Thinkpad
booted with maxcpus=1 and a 32 bit kernel. Kernel is latest git as of today.

Now, this is a completely different machine than yours, with a different
config (so the machine actually boots). But the numbers are strange.

Part of /proc/cpuinfo :

model name      : Intel(R) Core(TM)2 CPU         T7600  @ 2.33GHz
stepping        : 6
cache size      : 4096 KB

Without the patch the result is:

TCP latency using localhost: 0.2576 microseconds
TCP latency using localhost: 0.2571 microseconds
TCP latency using localhost: 0.2579 microseconds
TCP latency using localhost: 0.2573 microseconds
TCP latency using localhost: 0.2574 microseconds
TCP latency using localhost: 0.2583 microseconds
TCP latency using localhost: 0.2572 microseconds
TCP latency using localhost: 0.2582 microseconds
TCP latency using localhost: 0.2583 microseconds
TCP latency using localhost: 0.2584 microseconds

 Performance counter stats for './lat_tcp localhost' (10 runs):

   24578.750193  task-clock-msecs         #      0.994 CPUs    ( +-   0.295% )
    57166815217  cycles                   #   2325.863 M/sec   ( +-   0.296% )
    24994309721  instructions             #      0.437 IPC     ( +-   0.288% )
    53691289530  L1-icache-loads          #   2184.460 M/sec   ( +-   0.304% )
         424027  L1-icache-load-misses    #      0.017 M/sec   ( +-   0.824% )

   24.714691813  seconds time elapsed   ( +-   0.285% )


With the patch I get this:

TCP latency using localhost: 0.2588 microseconds
TCP latency using localhost: 0.2589 microseconds
TCP latency using localhost: 0.2581 microseconds
TCP latency using localhost: 0.2586 microseconds
TCP latency using localhost: 0.2584 microseconds
TCP latency using localhost: 0.2588 microseconds
TCP latency using localhost: 0.2586 microseconds
TCP latency using localhost: 0.2589 microseconds
TCP latency using localhost: 0.2582 microseconds
TCP latency using localhost: 0.2586 microseconds

 Performance counter stats for './lat_tcp localhost' (10 runs):

   24464.554706  task-clock-msecs         #      0.995 CPUs    ( +-   0.063% )
    56903708182  cycles                   #   2325.965 M/sec   ( +-   0.063% )
    24831456567  instructions             #      0.436 IPC     ( +-   0.048% )
    53068943303  L1-icache-loads          #   2169.218 M/sec   ( +-   0.081% )
         418802  L1-icache-load-misses    #      0.017 M/sec   ( +-   0.637% )

   24.596311828  seconds time elapsed   ( +-   0.097% )

So latency increases while all the other numbers decrease...

---
 include/linux/skbuff.h |   39 +++++++++++++++++++---
 net/core/skbuff.c      |   85 -------------------------------------------------
 2 files changed, 33 insertions(+), 91 deletions(-)

Index: linux-2.6/include/linux/skbuff.h
===================================================================
--- linux-2.6.orig/include/linux/skbuff.h
+++ linux-2.6/include/linux/skbuff.h
@@ -483,10 +483,6 @@ extern int	       skb_cow_data(struct sk
 extern int	       skb_pad(struct sk_buff *skb, int pad);
 #define dev_kfree_skb(a)	consume_skb(a)
 #define dev_consume_skb(a)	kfree_skb_clean(a)
-extern void	      skb_over_panic(struct sk_buff *skb, int len,
-				     void *here);
-extern void	      skb_under_panic(struct sk_buff *skb, int len,
-				      void *here);
 
 extern int skb_append_datato_frags(struct sock *sk, struct sk_buff *skb,
 			int getfrag(void *from, char *to, int offset,
@@ -1120,7 +1116,6 @@ static inline void skb_set_tail_pointer(
 /*
  *	Add data to an sk_buff
  */
-extern unsigned char *skb_put(struct sk_buff *skb, unsigned int len);
 static inline unsigned char *__skb_put(struct sk_buff *skb, unsigned int len)
 {
 	unsigned char *tmp = skb_tail_pointer(skb);
@@ -1130,7 +1125,23 @@ static inline unsigned char *__skb_put(s
 	return tmp;
 }
 
-extern unsigned char *skb_push(struct sk_buff *skb, unsigned int len);
+/**
+ *	skb_put - add data to a buffer
+ *	@skb: buffer to use
+ *	@len: amount of data to add
+ *
+ *	This function extends the used data area of the buffer. If this would
+ *	exceed the total buffer size the kernel will panic. A pointer to the
+ *	first byte of the extra data is returned.
+ */
+static inline unsigned char *skb_put(struct sk_buff *skb, unsigned int len)
+{
+	unsigned char *tmp = __skb_put(skb, len);
+
+	BUG_ON(skb->tail > skb->end);
+	return tmp;
+}
+
 static inline unsigned char *__skb_push(struct sk_buff *skb, unsigned int len)
 {
 	skb->data -= len;
@@ -1138,6 +1149,22 @@ static inline unsigned char *__skb_push(
 	return skb->data;
 }
 
+/**
+ *	skb_push - add data to the start of a buffer
+ *	@skb: buffer to use
+ *	@len: amount of data to add
+ *
+ *	This function extends the used data area of the buffer at the buffer
+ *	start. If this would exceed the total buffer headroom the kernel will
+ *	panic. A pointer to the first byte of the extra data is returned.
+ */
+static inline unsigned char *skb_push(struct sk_buff *skb, unsigned int len)
+{
+	__skb_push(skb, len);
+	BUG_ON(skb->data < skb->head);
+	return skb->data;
+}
+
 extern unsigned char *skb_pull(struct sk_buff *skb, unsigned int len);
 static inline unsigned char *__skb_pull(struct sk_buff *skb, unsigned int len)
 {
Index: linux-2.6/net/core/skbuff.c
===================================================================
--- linux-2.6.orig/net/core/skbuff.c
+++ linux-2.6/net/core/skbuff.c
@@ -103,51 +103,6 @@ static struct pipe_buf_operations sock_p
 	.get = sock_pipe_buf_get,
 };
 
-/*
- *	Keep out-of-line to prevent kernel bloat.
- *	__builtin_return_address is not used because it is not always
- *	reliable.
- */
-
-/**
- *	skb_over_panic	- 	private function
- *	@skb: buffer
- *	@sz: size
- *	@here: address
- *
- *	Out of line support code for skb_put(). Not user callable.
- */
-void skb_over_panic(struct sk_buff *skb, int sz, void *here)
-{
-	printk(KERN_EMERG "skb_over_panic: text:%p len:%d put:%d head:%p "
-			  "data:%p tail:%#lx end:%#lx dev:%s\n",
-	       here, skb->len, sz, skb->head, skb->data,
-	       (unsigned long)skb->tail, (unsigned long)skb->end,
-	       skb->dev ? skb->dev->name : "<NULL>");
-	BUG();
-}
-EXPORT_SYMBOL(skb_over_panic);
-
-/**
- *	skb_under_panic	- 	private function
- *	@skb: buffer
- *	@sz: size
- *	@here: address
- *
- *	Out of line support code for skb_push(). Not user callable.
- */
-
-void skb_under_panic(struct sk_buff *skb, int sz, void *here)
-{
-	printk(KERN_EMERG "skb_under_panic: text:%p len:%d put:%d head:%p "
-			  "data:%p tail:%#lx end:%#lx dev:%s\n",
-	       here, skb->len, sz, skb->head, skb->data,
-	       (unsigned long)skb->tail, (unsigned long)skb->end,
-	       skb->dev ? skb->dev->name : "<NULL>");
-	BUG();
-}
-EXPORT_SYMBOL(skb_under_panic);
-
 /* 	Allocate a new skbuff. We do this ourselves so we can fill in a few
  *	'private' fields and also do memory statistics to find all the
  *	[BEEP] leaks.
@@ -1000,46 +955,6 @@ free_skb:
 EXPORT_SYMBOL(skb_pad);
 
 /**
- *	skb_put - add data to a buffer
- *	@skb: buffer to use
- *	@len: amount of data to add
- *
- *	This function extends the used data area of the buffer. If this would
- *	exceed the total buffer size the kernel will panic. A pointer to the
- *	first byte of the extra data is returned.
- */
-unsigned char *skb_put(struct sk_buff *skb, unsigned int len)
-{
-	unsigned char *tmp = skb_tail_pointer(skb);
-	SKB_LINEAR_ASSERT(skb);
-	skb->tail += len;
-	skb->len  += len;
-	if (unlikely(skb->tail > skb->end))
-		skb_over_panic(skb, len, __builtin_return_address(0));
-	return tmp;
-}
-EXPORT_SYMBOL(skb_put);
-
-/**
- *	skb_push - add data to the start of a buffer
- *	@skb: buffer to use
- *	@len: amount of data to add
- *
- *	This function extends the used data area of the buffer at the buffer
- *	start. If this would exceed the total buffer headroom the kernel will
- *	panic. A pointer to the first byte of the extra data is returned.
- */
-unsigned char *skb_push(struct sk_buff *skb, unsigned int len)
-{
-	skb->data -= len;
-	skb->len  += len;
-	if (unlikely(skb->data<skb->head))
-		skb_under_panic(skb, len, __builtin_return_address(0));
-	return skb->data;
-}
-EXPORT_SYMBOL(skb_push);
-
-/**
  *	skb_pull - remove data from the start of a buffer
  *	@skb: buffer to use
  *	@len: amount of data to remove

Thread overview: 34+ messages
2009-08-12 18:39 [patch 0/3] Allow inlined spinlocks again V3 Heiko Carstens
2009-08-12 18:39 ` [patch 1/3] spinlock: move spinlock function bodies to header file Heiko Carstens
2009-08-12 18:39 ` [patch 2/3] spinlock: allow inlined spinlocks Heiko Carstens
2009-08-12 18:39 ` [patch 3/3] spinlock: allow inlined spinlocks on s390 Heiko Carstens
2009-08-13 18:11 ` [patch 0/3] Allow inlined spinlocks again V3 Linus Torvalds
2009-08-13 18:34   ` Ingo Molnar
2009-08-13 18:43     ` Ingo Molnar
2009-08-14 12:34   ` Heiko Carstens
2009-08-14 16:04     ` Linus Torvalds
2009-08-14 17:13       ` Heiko Carstens
2009-08-14 18:08         ` Linus Torvalds
2009-08-14 20:19           ` David Miller
2009-08-14 20:45             ` Linus Torvalds
2009-08-14 21:10               ` Linus Torvalds
2009-08-14 22:23                 ` David Miller
2009-08-16 18:27                 ` Heiko Carstens
2009-08-16 18:45                   ` Linus Torvalds
2009-08-16 20:36                     ` Ingo Molnar
2009-08-17 10:26                       ` Heiko Carstens
2009-08-17 21:26                   ` [numbers] Re: [patch] more skb ops inlining Ingo Molnar
2009-08-18 11:34                     ` Heiko Carstens
  -- strict thread matches above, loose matches on Subject: below --
2009-08-14 12:58 [patch 0/3] Allow inlined spinlocks again V4 Heiko Carstens
2009-08-14 12:58 ` [patch 2/3] spinlock: allow inlined spinlocks Heiko Carstens
2009-08-16 17:57   ` Heiko Carstens
2009-08-16 18:06     ` Ingo Molnar
2009-08-16 18:43       ` Linus Torvalds
2009-08-16 20:24         ` Ingo Molnar
2009-08-16 21:07           ` Linus Torvalds
2009-08-16 21:18             ` Ingo Molnar
2009-08-16 18:44       ` Heiko Carstens
2009-08-16 20:48         ` Ingo Molnar
2009-08-16 21:33           ` Heiko Carstens
2009-08-16 21:36             ` Ingo Molnar
2009-08-16 18:22     ` Linus Torvalds
2009-08-17 15:46       ` Heiko Carstens
