public inbox for linux-kernel@vger.kernel.org
* [PATCH 0/5] lock contention tracking -v3
@ 2007-05-29 12:52 Peter Zijlstra
  2007-05-29 12:52 ` [PATCH 1/5] fix raw_spinlock_t vs lockdep Peter Zijlstra
                   ` (4 more replies)
  0 siblings, 5 replies; 32+ messages in thread
From: Peter Zijlstra @ 2007-05-29 12:52 UTC (permalink / raw)
  To: linux-kernel, Andrew Morton
  Cc: Ingo Molnar, Bill Huey, Jason Baron, Steven Rostedt,
	Christoph Hellwig, Peter Zijlstra


Use the lockdep infrastructure to track lock contention and other lock
statistics.

It tracks lock contention events and records the first four unique call-sites
that encountered contention.

It also measures lock wait-time and hold-time in nanoseconds. The minimum and
maximum times are tracked, as well as a total (which, together with the number
of events, gives the average).

All statistics are kept per lock class, separately for write (exclusive state)
and read (shared state) acquisitions.

The statistics are collected per-CPU, so that collection overhead is minimized
by avoiding cache misses on global data.

This new lock statistics feature is independent of the lock dependency checking
traditionally done by lockdep; it just shares the lock tracking code. It is
also possible to enable both and disable either component at runtime, thereby
avoiding, for instance, the O(n^2) lock chain walks.


Andrew, can we give this a spin in -mm?



^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 1/5] fix raw_spinlock_t vs lockdep
  2007-05-29 12:52 [PATCH 0/5] lock contention tracking -v3 Peter Zijlstra
@ 2007-05-29 12:52 ` Peter Zijlstra
  2007-05-29 12:52 ` [PATCH 2/5] lockdep: sanitise CONFIG_PROVE_LOCKING Peter Zijlstra
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2007-05-29 12:52 UTC (permalink / raw)
  To: linux-kernel, Andrew Morton
  Cc: Ingo Molnar, Bill Huey, Jason Baron, Steven Rostedt,
	Christoph Hellwig, Peter Zijlstra

[-- Attachment #1: raw_spinlock_fix.patch --]
[-- Type: text/plain, Size: 1743 bytes --]

raw_spinlock_t should not use lockdep (and doesn't) since lockdep itself relies
on it.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/spinlock_types.h    |    4 ++--
 include/linux/spinlock_types_up.h |    9 +--------
 2 files changed, 3 insertions(+), 10 deletions(-)

Index: linux-2.6-git/include/linux/spinlock_types_up.h
===================================================================
--- linux-2.6-git.orig/include/linux/spinlock_types_up.h
+++ linux-2.6-git/include/linux/spinlock_types_up.h
@@ -12,14 +12,10 @@
  * Released under the General Public License (GPL).
  */
 
-#if defined(CONFIG_DEBUG_SPINLOCK) || \
-	defined(CONFIG_DEBUG_LOCK_ALLOC)
+#ifdef CONFIG_DEBUG_SPINLOCK
 
 typedef struct {
 	volatile unsigned int slock;
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
-	struct lockdep_map dep_map;
-#endif
 } raw_spinlock_t;
 
 #define __RAW_SPIN_LOCK_UNLOCKED { 1 }
@@ -34,9 +30,6 @@ typedef struct { } raw_spinlock_t;
 
 typedef struct {
 	/* no debug version on UP */
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
-	struct lockdep_map dep_map;
-#endif
 } raw_rwlock_t;
 
 #define __RAW_RW_LOCK_UNLOCKED { }
Index: linux-2.6-git/include/linux/spinlock_types.h
===================================================================
--- linux-2.6-git.orig/include/linux/spinlock_types.h
+++ linux-2.6-git/include/linux/spinlock_types.h
@@ -9,14 +9,14 @@
  * Released under the General Public License (GPL).
  */
 
-#include <linux/lockdep.h>
-
 #if defined(CONFIG_SMP)
 # include <asm/spinlock_types.h>
 #else
 # include <linux/spinlock_types_up.h>
 #endif
 
+#include <linux/lockdep.h>
+
 typedef struct {
 	raw_spinlock_t raw_lock;
 #if defined(CONFIG_PREEMPT) && defined(CONFIG_SMP)

--



* [PATCH 2/5] lockdep: sanitise CONFIG_PROVE_LOCKING
  2007-05-29 12:52 [PATCH 0/5] lock contention tracking -v3 Peter Zijlstra
  2007-05-29 12:52 ` [PATCH 1/5] fix raw_spinlock_t vs lockdep Peter Zijlstra
@ 2007-05-29 12:52 ` Peter Zijlstra
  2007-05-29 13:21   ` Christoph Hellwig
  2007-05-29 12:52 ` [PATCH 3/5] lockstat: core infrastructure Peter Zijlstra
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2007-05-29 12:52 UTC (permalink / raw)
  To: linux-kernel, Andrew Morton
  Cc: Ingo Molnar, Bill Huey, Jason Baron, Steven Rostedt,
	Christoph Hellwig, Peter Zijlstra

[-- Attachment #1: lockdep-prove-locking.patch --]
[-- Type: text/plain, Size: 3494 bytes --]

Ensure that all of the lock dependency tracking code is under
CONFIG_PROVE_LOCKING. This allows us to use the held lock tracking code
for other purposes.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Jason Baron <jbaron@redhat.com>
---
 kernel/lockdep.c  |   13 ++++++++++++-
 kernel/spinlock.c |    4 ++--
 2 files changed, 14 insertions(+), 3 deletions(-)

Index: linux-2.6-git/kernel/lockdep.c
===================================================================
--- linux-2.6-git.orig/kernel/lockdep.c
+++ linux-2.6-git/kernel/lockdep.c
@@ -95,6 +95,7 @@ static int lockdep_initialized;
 unsigned long nr_list_entries;
 static struct lock_list list_entries[MAX_LOCKDEP_ENTRIES];
 
+#ifdef CONFIG_PROVE_LOCKING
 /*
  * Allocate a lockdep entry. (assumes the graph_lock held, returns
  * with NULL on failure)
@@ -111,6 +112,7 @@ static struct lock_list *alloc_list_entr
 	}
 	return list_entries + nr_list_entries++;
 }
+#endif
 
 /*
  * All data structures here are protected by the global debug_lock.
@@ -140,7 +142,9 @@ LIST_HEAD(all_lock_classes);
 static struct list_head classhash_table[CLASSHASH_SIZE];
 
 unsigned long nr_lock_chains;
+#ifdef CONFIG_PROVE_LOCKING
 static struct lock_chain lock_chains[MAX_LOCKDEP_CHAINS];
+#endif
 
 /*
  * We put the lock dependency chains into a hash-table as well, to cache
@@ -486,6 +490,7 @@ static void print_lock_dependencies(stru
 	}
 }
 
+#ifdef CONFIG_PROVE_LOCKING
 /*
  * Add a new dependency to the head of the list:
  */
@@ -545,6 +550,7 @@ print_circular_bug_entry(struct lock_lis
 
 	return 0;
 }
+#endif
 
 static void print_kernel_version(void)
 {
@@ -553,6 +559,7 @@ static void print_kernel_version(void)
 		init_utsname()->version);
 }
 
+#ifdef CONFIG_PROVE_LOCKING
 /*
  * When a circular dependency is detected, print the
  * header first:
@@ -643,6 +650,7 @@ check_noncircular(struct lock_class *sou
 	}
 	return 1;
 }
+#endif
 
 static int very_verbose(struct lock_class *class)
 {
@@ -827,6 +835,7 @@ check_usage(struct task_struct *curr, st
 
 #endif
 
+#ifdef CONFIG_PROVE_LOCKING
 static int
 print_deadlock_bug(struct task_struct *curr, struct held_lock *prev,
 		   struct held_lock *next)
@@ -1091,7 +1100,7 @@ out_bug:
 
 	return 0;
 }
-
+#endif
 
 /*
  * Is this the address of a static object:
@@ -1311,6 +1320,7 @@ out_unlock_set:
 	return class;
 }
 
+#ifdef CONFIG_PROVE_LOCKING
 /*
  * Look up a dependency chain. If the key is not present yet then
  * add it and return 1 - in this case the new dependency chain is
@@ -1385,6 +1395,7 @@ cache_hit:
 
 	return 1;
 }
+#endif
 
 /*
  * We are building curr_chain_key incrementally, so double-check
Index: linux-2.6-git/kernel/spinlock.c
===================================================================
--- linux-2.6-git.orig/kernel/spinlock.c
+++ linux-2.6-git/kernel/spinlock.c
@@ -88,7 +88,7 @@ unsigned long __lockfunc _spin_lock_irqs
 	 * _raw_spin_lock_flags() code, because lockdep assumes
 	 * that interrupts are not re-enabled during lock-acquire:
 	 */
-#ifdef CONFIG_PROVE_LOCKING
+#ifdef CONFIG_LOCKDEP
 	_raw_spin_lock(lock);
 #else
 	_raw_spin_lock_flags(lock, &flags);
@@ -305,7 +305,7 @@ unsigned long __lockfunc _spin_lock_irqs
 	 * _raw_spin_lock_flags() code, because lockdep assumes
 	 * that interrupts are not re-enabled during lock-acquire:
 	 */
-#ifdef CONFIG_PROVE_SPIN_LOCKING
+#ifdef CONFIG_LOCKDEP
 	_raw_spin_lock(lock);
 #else
 	_raw_spin_lock_flags(lock, &flags);

--



* [PATCH 3/5] lockstat: core infrastructure
  2007-05-29 12:52 [PATCH 0/5] lock contention tracking -v3 Peter Zijlstra
  2007-05-29 12:52 ` [PATCH 1/5] fix raw_spinlock_t vs lockdep Peter Zijlstra
  2007-05-29 12:52 ` [PATCH 2/5] lockdep: sanitise CONFIG_PROVE_LOCKING Peter Zijlstra
@ 2007-05-29 12:52 ` Peter Zijlstra
  2007-05-29 20:28   ` Daniel Walker
  2007-05-30  3:43   ` Andrew Morton
  2007-05-29 12:52 ` [PATCH 4/5] lockstat: human readability tweaks Peter Zijlstra
  2007-05-29 12:52 ` [PATCH 5/5] lockstat: hook into spinlock_t, rwlock_t, rwsem and mutex Peter Zijlstra
  4 siblings, 2 replies; 32+ messages in thread
From: Peter Zijlstra @ 2007-05-29 12:52 UTC (permalink / raw)
  To: linux-kernel, Andrew Morton
  Cc: Ingo Molnar, Bill Huey, Jason Baron, Steven Rostedt,
	Christoph Hellwig, Peter Zijlstra

[-- Attachment #1: lockstat-core.patch --]
[-- Type: text/plain, Size: 16749 bytes --]

Introduce the core lock statistics code.

Lock statistics provide lock wait-time and hold-time (as well as the counts of
the corresponding contention and acquisition events). Also, the first few
call-sites that encounter contention are tracked.

Lock wait-time is the time spent waiting on the lock. This provides insight
into the locking scheme: a heavily contended lock indicates a locking scheme
that is too coarse.

Lock hold-time is the duration the lock was held; this provides a reference
for the wait-time numbers, so they can be put into perspective.

  1)
    lock
  2)
    ... do stuff ..
    unlock
  3)

The time between 1 and 2 is the wait-time. The time between 2 and 3 is the
hold-time.

The lockdep held-lock tracking code is reused, because it already collects locks
into meaningful groups (classes), and because it is an existing infrastructure
for lock instrumentation.

Currently lockdep tracks lock acquisition with two hooks:

  lock()
    lock_acquire()
    _lock()

 ... code protected by lock ...

  unlock()
    lock_release()
    _unlock()

We need to extend this with two more hooks in order to measure contention:

  lock_contended() - used to measure contention events
  lock_acquired()  - completion of the contention

These are placed as follows:

  lock()
    lock_acquire()
    if (!_try_lock())
      lock_contended()
      _lock()
      lock_acquired()

 ... do locked stuff ...

  unlock()
    lock_release()
    _unlock()

(Note: the try_lock() 'trick' is used to avoid instrumenting all the
       platform-dependent lock primitive implementations.)

It is also possible to toggle the two lockdep features at runtime using:

  /proc/sys/kernel/prove_locking
  /proc/sys/kernel/lock_stat

(especially turning off the O(n^2) prove_locking functionality can help)
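For example, on a kernel with both features built in, the dependency checking can be switched off at runtime while keeping the statistics (requires root; a configuration fragment, shown for illustration):

```shell
# keep lock statistics, drop the O(n^2) dependency checks
echo 0 > /proc/sys/kernel/prove_locking
echo 1 > /proc/sys/kernel/lock_stat
```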

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Jason Baron <jbaron@redhat.com>
---
 include/linux/lockdep.h |   50 +++++++
 include/linux/sysctl.h  |    2 
 kernel/lockdep.c        |  314 +++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sysctl.c         |   28 ++++
 lib/Kconfig.debug       |   11 +
 5 files changed, 404 insertions(+), 1 deletion(-)

Index: linux-2.6-git/kernel/lockdep.c
===================================================================
--- linux-2.6-git.orig/kernel/lockdep.c
+++ linux-2.6-git/kernel/lockdep.c
@@ -42,6 +42,16 @@
 
 #include "lockdep_internals.h"
 
+#ifdef CONFIG_PROVE_LOCKING
+int prove_locking = 1;
+module_param(prove_locking, int, 0644);
+#endif
+
+#ifdef CONFIG_LOCK_STAT
+int lock_stat = 1;
+module_param(lock_stat, int, 0644);
+#endif
+
 /*
  * lockdep_lock: protects the lockdep graph, the hashes and the
  *               class/list/hash allocators.
@@ -123,6 +133,92 @@ static struct lock_list *alloc_list_entr
 unsigned long nr_lock_classes;
 static struct lock_class lock_classes[MAX_LOCKDEP_KEYS];
 
+#ifdef CONFIG_LOCK_STAT
+static DEFINE_PER_CPU(struct lock_class_stats[MAX_LOCKDEP_KEYS], lock_stats);
+
+static int lock_contention_point(struct lock_class *class, unsigned long ip)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(class->contention_point); i++) {
+		if (class->contention_point[i] == 0) {
+			class->contention_point[i] = ip;
+			break;
+		}
+		if (class->contention_point[i] == ip)
+			break;
+	}
+
+	return i;
+}
+
+static inline void lock_time_add(struct lock_time *src, struct lock_time *dst)
+{
+	dst->min += src->min;
+	dst->max += src->max;
+	dst->total += src->total;
+	dst->nr += src->nr;
+}
+
+static void lock_time_inc(struct lock_time *lt, unsigned long long time)
+{
+	if (time > lt->max)
+		lt->max = time;
+
+	if (time < lt->min || !lt->min)
+		lt->min = time;
+
+	lt->total += time;
+	lt->nr++;
+}
+
+struct lock_class_stats lock_stats(struct lock_class *class)
+{
+	struct lock_class_stats stats;
+	int cpu, i;
+
+	memset(&stats, 0, sizeof(struct lock_class_stats));
+	for_each_possible_cpu(cpu) {
+		struct lock_class_stats *pcs =
+			&per_cpu(lock_stats, cpu)[class - lock_classes];
+
+		for (i = 0; i < ARRAY_SIZE(stats.contention_point); i++)
+			stats.contention_point[i] += pcs->contention_point[i];
+
+		lock_time_add(&pcs->read_waittime, &stats.read_waittime);
+		lock_time_add(&pcs->write_waittime, &stats.write_waittime);
+
+		lock_time_add(&pcs->read_holdtime, &stats.read_holdtime);
+		lock_time_add(&pcs->write_holdtime, &stats.write_holdtime);
+	}
+
+	return stats;
+}
+
+void clear_lock_stats(struct lock_class *class)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct lock_class_stats *cpu_stats =
+			&per_cpu(lock_stats, cpu)[class - lock_classes];
+
+		memset(cpu_stats, 0, sizeof(struct lock_class_stats));
+	}
+	memset(class->contention_point, 0, sizeof(class->contention_point));
+}
+
+static struct lock_class_stats *get_lock_stats(struct lock_class *class)
+{
+	return &get_cpu_var(lock_stats)[class - lock_classes];
+}
+
+static void put_lock_stats(struct lock_class_stats *stats)
+{
+	put_cpu_var(lock_stats);
+}
+#endif
+
 /*
  * We keep a global list of all lock classes. The list only grows,
  * never shrinks. The list is only accessed with the lockdep
@@ -2035,6 +2131,11 @@ static int __lock_acquire(struct lockdep
 	int chain_head = 0;
 	u64 chain_key;
 
+#ifdef CONFIG_PROVE_LOCKING
+	if (!prove_locking)
+		check = 1;
+#endif
+
 	if (unlikely(!debug_locks))
 		return 0;
 
@@ -2085,6 +2186,10 @@ static int __lock_acquire(struct lockdep
 	hlock->read = read;
 	hlock->check = check;
 	hlock->hardirqs_off = hardirqs_off;
+#ifdef CONFIG_LOCK_STAT
+	hlock->waittime_stamp = 0;
+	hlock->holdtime_stamp = sched_clock();
+#endif
 
 	if (check != 2)
 		goto out_calc_hash;
@@ -2132,10 +2237,11 @@ static int __lock_acquire(struct lockdep
 		}
 	}
 #endif
+out_calc_hash:
 	/* mark it as used: */
 	if (!mark_lock(curr, hlock, LOCK_USED))
 		return 0;
-out_calc_hash:
+
 	/*
 	 * Calculate the chain hash: it's the combined has of all the
 	 * lock keys along the dependency chain. We save the hash value
@@ -2183,6 +2289,7 @@ out_calc_hash:
 	chain_key = iterate_chain_key(chain_key, id);
 	curr->curr_chain_key = chain_key;
 
+#ifdef CONFIG_PROVE_LOCKING
 	/*
 	 * Trylock needs to maintain the stack of held locks, but it
 	 * does not add new dependencies, because trylock can be done
@@ -2229,6 +2336,7 @@ out_calc_hash:
 		/* after lookup_chain_cache(): */
 		if (unlikely(!debug_locks))
 			return 0;
+#endif
 
 	curr->lockdep_depth++;
 	check_chain_key(curr);
@@ -2276,6 +2384,35 @@ print_unlock_inbalance_bug(struct task_s
 	return 0;
 }
 
+#ifdef CONFIG_LOCK_STAT
+static int
+print_lock_contention_bug(struct task_struct *curr, struct lockdep_map *lock,
+			   unsigned long ip)
+{
+	if (!debug_locks_off())
+		return 0;
+	if (debug_locks_silent)
+		return 0;
+
+	printk("\n=================================\n");
+	printk(  "[ BUG: bad contention detected! ]\n");
+	printk(  "---------------------------------\n");
+	printk("%s/%d is trying to contend lock (",
+		curr->comm, curr->pid);
+	print_lockdep_cache(lock);
+	printk(") at:\n");
+	print_ip_sym(ip);
+	printk("but there are no locks held!\n");
+	printk("\nother info that might help us debug this:\n");
+	lockdep_print_held_locks(curr);
+
+	printk("\nstack backtrace:\n");
+	dump_stack();
+
+	return 0;
+}
+#endif
+
 /*
  * Common debugging checks for both nested and non-nested unlock:
  */
@@ -2293,6 +2430,32 @@ static int check_unlock(struct task_stru
 	return 1;
 }
 
+#ifdef CONFIG_LOCK_STAT
+static void lock_release_holdtime(struct held_lock *hlock)
+{
+	struct lock_class_stats *stats;
+	unsigned long long holdtime;
+
+	if (!lock_stat)
+		return;
+
+	holdtime = sched_clock() - hlock->holdtime_stamp;
+
+	stats = get_lock_stats(hlock->class);
+
+	if (hlock->read)
+		lock_time_inc(&stats->read_holdtime, holdtime);
+	else
+		lock_time_inc(&stats->write_holdtime, holdtime);
+
+	put_lock_stats(stats);
+}
+#else
+static void lock_release_holdtime(struct held_lock *hlock)
+{
+}
+#endif
+
 /*
  * Remove the lock to the list of currently held locks in a
  * potentially non-nested (out of order) manner. This is a
@@ -2330,6 +2493,8 @@ lock_release_non_nested(struct task_stru
 	return print_unlock_inbalance_bug(curr, lock, ip);
 
 found_it:
+	lock_release_holdtime(hlock);
+
 	/*
 	 * We have the right lock to unlock, 'hlock' points to it.
 	 * Now we remove it from the stack, and add back the other
@@ -2382,6 +2547,8 @@ static int lock_release_nested(struct ta
 
 	curr->curr_chain_key = hlock->prev_chain_key;
 
+	lock_release_holdtime(hlock);
+
 #ifdef CONFIG_DEBUG_LOCKDEP
 	hlock->prev_chain_key = 0;
 	hlock->class = NULL;
@@ -2416,6 +2583,95 @@ __lock_release(struct lockdep_map *lock,
 	check_chain_key(curr);
 }
 
+#ifdef CONFIG_LOCK_STAT
+static void
+__lock_contended(struct lockdep_map *lock, unsigned long ip)
+{
+	struct task_struct *curr = current;
+	struct held_lock *hlock, *prev_hlock;
+	struct lock_class_stats *stats;
+	unsigned int depth;
+	int i, point;
+
+	depth = curr->lockdep_depth;
+	if (DEBUG_LOCKS_WARN_ON(!depth))
+		return;
+
+	prev_hlock = NULL;
+	for (i = depth-1; i >= 0; i--) {
+		hlock = curr->held_locks + i;
+		/*
+		 * We must not cross into another context:
+		 */
+		if (prev_hlock && prev_hlock->irq_context != hlock->irq_context)
+			break;
+		if (hlock->instance == lock)
+			goto found_it;
+		prev_hlock = hlock;
+	}
+	print_lock_contention_bug(curr, lock, ip);
+	return;
+
+found_it:
+	hlock->waittime_stamp = sched_clock();
+
+	point = lock_contention_point(hlock->class, ip);
+
+	stats = get_lock_stats(hlock->class);
+	if (point < ARRAY_SIZE(stats->contention_point))
 +		stats->contention_point[point]++;
+	put_lock_stats(stats);
+}
+
+static void
+__lock_acquired(struct lockdep_map *lock)
+{
+	struct task_struct *curr = current;
+	struct held_lock *hlock, *prev_hlock;
+	struct lock_class_stats *stats;
+	unsigned int depth;
+	unsigned long long now, waittime;
+	int i;
+
+	depth = curr->lockdep_depth;
+	if (DEBUG_LOCKS_WARN_ON(!depth))
+		return;
+
+	prev_hlock = NULL;
+	for (i = depth-1; i >= 0; i--) {
+		hlock = curr->held_locks + i;
+		/*
+		 * We must not cross into another context:
+		 */
+		if (prev_hlock && prev_hlock->irq_context != hlock->irq_context)
+			break;
+		if (hlock->instance == lock)
+			goto found_it;
+		prev_hlock = hlock;
+	}
+	print_lock_contention_bug(curr, lock, _RET_IP_);
+	return;
+
+found_it:
+	if (!hlock->waittime_stamp)
+		return;
+
+	now = sched_clock();
+	waittime = now - hlock->waittime_stamp;
+
+	hlock->holdtime_stamp = now;
+
+	stats = get_lock_stats(hlock->class);
+
+	if (hlock->read)
+		lock_time_inc(&stats->read_waittime, waittime);
+	else
+		lock_time_inc(&stats->write_waittime, waittime);
+
+	put_lock_stats(stats);
+}
+#endif
+
 /*
  * Check whether we follow the irq-flags state precisely:
  */
@@ -2456,6 +2712,14 @@ void lock_acquire(struct lockdep_map *lo
 {
 	unsigned long flags;
 
+#ifdef CONFIG_LOCK_STAT
+	if (unlikely(!lock_stat))
+#endif
+#ifdef CONFIG_PROVE_LOCKING
+		if (unlikely(!prove_locking))
+#endif
+			return;
+
 	if (unlikely(current->lockdep_recursion))
 		return;
 
@@ -2475,6 +2739,14 @@ void lock_release(struct lockdep_map *lo
 {
 	unsigned long flags;
 
+#ifdef CONFIG_LOCK_STAT
+	if (unlikely(!lock_stat))
+#endif
+#ifdef CONFIG_PROVE_LOCKING
+		if (unlikely(!prove_locking))
+#endif
+			return;
+
 	if (unlikely(current->lockdep_recursion))
 		return;
 
@@ -2488,6 +2760,46 @@ void lock_release(struct lockdep_map *lo
 
 EXPORT_SYMBOL_GPL(lock_release);
 
+#ifdef CONFIG_LOCK_STAT
+void lock_contended(struct lockdep_map *lock, unsigned long ip)
+{
+	unsigned long flags;
+
+	if (unlikely(!lock_stat))
+		return;
+
+	if (unlikely(current->lockdep_recursion))
+		return;
+
+	raw_local_irq_save(flags);
+	check_flags(flags);
+	current->lockdep_recursion = 1;
+	__lock_contended(lock, ip);
+	current->lockdep_recursion = 0;
+	raw_local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(lock_contended);
+
+void lock_acquired(struct lockdep_map *lock)
+{
+	unsigned long flags;
+
+	if (unlikely(!lock_stat))
+		return;
+
+	if (unlikely(current->lockdep_recursion))
+		return;
+
+	raw_local_irq_save(flags);
+	check_flags(flags);
+	current->lockdep_recursion = 1;
+	__lock_acquired(lock);
+	current->lockdep_recursion = 0;
+	raw_local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(lock_acquired);
+#endif
+
 /*
  * Used by the testsuite, sanitize the validator state
  * after a simulated failure:
Index: linux-2.6-git/include/linux/lockdep.h
===================================================================
--- linux-2.6-git.orig/include/linux/lockdep.h
+++ linux-2.6-git/include/linux/lockdep.h
@@ -114,8 +114,30 @@ struct lock_class {
 
 	const char			*name;
 	int				name_version;
+
+	unsigned long			contention_point[4];
+};
+
+#ifdef CONFIG_LOCK_STAT
+struct lock_time {
+	unsigned long long		min;
+	unsigned long long		max;
+	unsigned long long		total;
+	unsigned long			nr;
+};
+
+struct lock_class_stats {
+	unsigned long			contention_point[4];
+	struct lock_time		read_waittime;
+	struct lock_time		write_waittime;
+	struct lock_time		read_holdtime;
+	struct lock_time		write_holdtime;
 };
 
+struct lock_class_stats lock_stats(struct lock_class *class);
+void clear_lock_stats(struct lock_class *class);
+#endif
+
 /*
  * Map the lock object (the lock instance) to the lock-class object.
  * This is embedded into specific lock instances:
@@ -165,6 +187,10 @@ struct held_lock {
 	unsigned long			acquire_ip;
 	struct lockdep_map		*instance;
 
+#ifdef CONFIG_LOCK_STAT
+	unsigned long long 		waittime_stamp;
+	unsigned long long		holdtime_stamp;
+#endif
 	/*
 	 * The lock-stack is unified in that the lock chains of interrupt
 	 * contexts nest ontop of process context chains, but we 'separate'
@@ -281,6 +307,30 @@ struct lock_class_key { };
 
 #endif /* !LOCKDEP */
 
+#ifdef CONFIG_LOCK_STAT
+
+extern void lock_contended(struct lockdep_map *lock, unsigned long ip);
+extern void lock_acquired(struct lockdep_map *lock);
+
+#define LOCK_CONTENDED(_lock, try, lock)			\
+do {								\
+	if (!try(_lock)) {					\
+		lock_contended(&(_lock)->dep_map, _RET_IP_);	\
+		lock(_lock);					\
+		lock_acquired(&(_lock)->dep_map);		\
+	}							\
+} while (0)
+
+#else /* CONFIG_LOCK_STAT */
+
+#define lock_contended(l, i)			do { } while (0)
+#define lock_acquired(l)			do { } while (0)
+
+#define LOCK_CONTENDED(_lock, try, lock) \
+	lock(_lock)
+
+#endif /* CONFIG_LOCK_STAT */
+
 #if defined(CONFIG_TRACE_IRQFLAGS) && defined(CONFIG_GENERIC_HARDIRQS)
 extern void early_init_irq_lock_class(void);
 #else
Index: linux-2.6-git/lib/Kconfig.debug
===================================================================
--- linux-2.6-git.orig/lib/Kconfig.debug
+++ linux-2.6-git/lib/Kconfig.debug
@@ -273,6 +273,17 @@ config LOCKDEP
 	select KALLSYMS
 	select KALLSYMS_ALL
 
+config LOCK_STAT
 +	bool "Lock usage statistics"
+	depends on DEBUG_KERNEL && TRACE_IRQFLAGS_SUPPORT && STACKTRACE_SUPPORT && LOCKDEP_SUPPORT
+	select LOCKDEP
+	select DEBUG_SPINLOCK
+	select DEBUG_MUTEXES
+	select DEBUG_LOCK_ALLOC
+	default n
+	help
+	 This feature enables tracking lock contention points
+
 config DEBUG_LOCKDEP
 	bool "Lock dependency engine debugging"
 	depends on DEBUG_KERNEL && LOCKDEP
Index: linux-2.6-git/kernel/sysctl.c
===================================================================
--- linux-2.6-git.orig/kernel/sysctl.c
+++ linux-2.6-git/kernel/sysctl.c
@@ -164,6 +164,14 @@ int sysctl_legacy_va_layout;
 #endif
 
 
+#ifdef CONFIG_PROVE_LOCKING
+extern int prove_locking;
+#endif
+
+#ifdef CONFIG_LOCK_STAT
+extern int lock_stat;
+#endif
+
 /* The default sysctl tables: */
 
 static ctl_table root_table[] = {
@@ -683,6 +691,26 @@ static ctl_table kern_table[] = {
 		.proc_handler	= &proc_dostring,
 		.strategy	= &sysctl_string,
 	},
+#ifdef CONFIG_PROVE_LOCKING
+	{
+		.ctl_name	= KERN_PROVE_LOCKING,
+		.procname	= "prove_locking",
+		.data		= &prove_locking,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+#endif
+#ifdef CONFIG_LOCK_STAT
+	{
+		.ctl_name	= KERN_LOCK_STAT,
+		.procname	= "lock_stat",
+		.data		= &lock_stat,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+#endif
 
 	{ .ctl_name = 0 }
 };
Index: linux-2.6-git/include/linux/sysctl.h
===================================================================
--- linux-2.6-git.orig/include/linux/sysctl.h
+++ linux-2.6-git/include/linux/sysctl.h
@@ -166,6 +166,8 @@ enum
 	KERN_NMI_WATCHDOG=75, /* int: enable/disable nmi watchdog */
 	KERN_PANIC_ON_NMI=76, /* int: whether we will panic on an unrecovered */
 	KERN_POWEROFF_CMD=77,	/* string: poweroff command line */
 +	KERN_PROVE_LOCKING=78, /* int: enable lock dependency checking */
+	KERN_LOCK_STAT=79, /* int: enable lock statistics */
 };
 
 

--



* [PATCH 4/5] lockstat: human readability tweaks
  2007-05-29 12:52 [PATCH 0/5] lock contention tracking -v3 Peter Zijlstra
                   ` (2 preceding siblings ...)
  2007-05-29 12:52 ` [PATCH 3/5] lockstat: core infrastructure Peter Zijlstra
@ 2007-05-29 12:52 ` Peter Zijlstra
  2007-05-29 12:52 ` [PATCH 5/5] lockstat: hook into spinlock_t, rwlock_t, rwsem and mutex Peter Zijlstra
  4 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2007-05-29 12:52 UTC (permalink / raw)
  To: linux-kernel, Andrew Morton
  Cc: Ingo Molnar, Bill Huey, Jason Baron, Steven Rostedt,
	Christoph Hellwig, Peter Zijlstra

[-- Attachment #1: lockstat-output.patch --]
[-- Type: text/plain, Size: 10475 bytes --]

Present all this fancy new lock statistics information:

*warning, _wide_ output ahead*

(output edited for purpose of brevity)

 # cat /proc/lock_stat
lock_stat version 0.1
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
                              class name    contentions   waittime-min   waittime-max waittime-total   acquisitions   holdtime-min   holdtime-max holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------

                         &inode->i_mutex:         14458           6.57      398832.75     2469412.23        6768876           0.34    11398383.65   339410830.89
                         ---------------
                         &inode->i_mutex           4486          [<ffffffff802a08f9>] pipe_wait+0x86/0x8d
                         &inode->i_mutex              0          [<ffffffff802a01e8>] pipe_write_fasync+0x29/0x5d
                         &inode->i_mutex              0          [<ffffffff802a0e18>] pipe_read+0x74/0x3a5
                         &inode->i_mutex              0          [<ffffffff802a1a6a>] do_lookup+0x81/0x1ae

.................................................................................................................................................................

              &inode->i_data.tree_lock-W:           491           0.27          62.47         493.89        2477833           0.39         468.89     1146584.25
              &inode->i_data.tree_lock-R:            65           0.44           4.27          48.78       26288792           0.36         184.62    10197458.24
              --------------------------
                &inode->i_data.tree_lock             46          [<ffffffff80277095>] __do_page_cache_readahead+0x69/0x24f
                &inode->i_data.tree_lock             31          [<ffffffff8026f9fb>] add_to_page_cache+0x31/0xba
                &inode->i_data.tree_lock              0          [<ffffffff802770ee>] __do_page_cache_readahead+0xc2/0x24f
                &inode->i_data.tree_lock              0          [<ffffffff8026f6e4>] find_get_page+0x1a/0x58

.................................................................................................................................................................

                      proc_inum_idr.lock:             0           0.00           0.00           0.00             36           0.00          65.60         148.26
                        proc_subdir_lock:             0           0.00           0.00           0.00        3049859           0.00         106.81     1563212.42
                        shrinker_rwsem-W:             0           0.00           0.00           0.00              5           0.00           1.73           3.68
                        shrinker_rwsem-R:             0           0.00           0.00           0.00            633           2.57         246.57       10909.76

'contentions' and 'acquisitions' are the number of such events measured (since 
the last reset). The waittime- and holdtime- (min, max, total) numbers are 
presented in microseconds.
 
If there are any contention points, the lock class is presented in the block
format (as with i_mutex and tree_lock above); otherwise a single line of
output is presented.

The output is sorted by the absolute number of contentions (read + write);
this presents the worst offenders first, so that:

 # grep : /proc/lock_stat | head

will quickly show who's bad.

The stats can be reset using:

 # echo 0 > /proc/lock_stat

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Jason Baron <jbaron@redhat.com>
---
 kernel/lockdep_proc.c |  266 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 266 insertions(+)

Index: linux-2.6-git/kernel/lockdep_proc.c
===================================================================
--- linux-2.6-git.orig/kernel/lockdep_proc.c
+++ linux-2.6-git/kernel/lockdep_proc.c
@@ -15,6 +15,10 @@
 #include <linux/seq_file.h>
 #include <linux/kallsyms.h>
 #include <linux/debug_locks.h>
+#include <linux/vmalloc.h>
+#include <linux/sort.h>
+#include <asm/uaccess.h>
+#include <asm/div64.h>
 
 #include "lockdep_internals.h"
 
@@ -342,6 +346,262 @@ static const struct file_operations proc
 	.release	= seq_release,
 };
 
+#ifdef CONFIG_LOCK_STAT
+
+struct lock_stat_data {
+	struct lock_class *class;
+	struct lock_class_stats stats;
+};
+
+struct lock_stat_seq {
+	struct lock_stat_data *iter;
+	struct lock_stat_data *iter_end;
+	struct lock_stat_data stats[MAX_LOCKDEP_KEYS];
+};
+
+/*
+ * sort on absolute number of contentions
+ */
+int lock_stat_cmp(const void *l, const void *r)
+{
+	const struct lock_stat_data *dl = l, *dr = r;
+	unsigned long nl, nr;
+
+	nl = dl->stats.read_waittime.nr + dl->stats.write_waittime.nr;
+	nr = dr->stats.read_waittime.nr + dr->stats.write_waittime.nr;
+
+	return nr - nl;
+}
+
+static void seq_line(struct seq_file *m, char c, int offset, int length)
+{
+	int i;
+
+	for (i = 0; i < offset; i++)
+		seq_puts(m, " ");
+	for (i = 0; i < length; i++)
+		seq_printf(m, "%c", c);
+	seq_puts(m, "\n");
+}
+
+static void snprint_time(char *buf, size_t bufsiz, unsigned long long nr)
+{
+	unsigned long rem;
+
+	rem = do_div(nr, 1000);
+	snprintf(buf, bufsiz, "%llu.%02d", nr, ((int)rem+5)/10);
+}
+
+static void seq_time(struct seq_file *m, unsigned long long time)
+{
+	char num[15];
+
+	snprint_time(num, sizeof(num), time);
+	seq_printf(m, " %14s", num);
+}
+
+static void seq_lock_time(struct seq_file *m, struct lock_time *lt)
+{
+	seq_printf(m, "%14lu", lt->nr);
+	seq_time(m, lt->min);
+	seq_time(m, lt->max);
+	seq_time(m, lt->total);
+}
+
+static void seq_stats(struct seq_file *m, struct lock_stat_data *data)
+{
+	char name[39];
+	struct lock_class *class;
+	struct lock_class_stats *stats;
+	int i, namelen;
+
+	class = data->class;
+	stats = &data->stats;
+
+	snprintf(name, 38, "%s", class->name);
+	namelen = strlen(name);
+
+	if (stats->write_holdtime.nr) {
+		if (stats->read_holdtime.nr)
+			seq_printf(m, "%38s-W:", name);
+		else
+			seq_printf(m, "%40s:", name);
+
+		seq_lock_time(m, &stats->write_waittime);
+		seq_puts(m, " ");
+		seq_lock_time(m, &stats->write_holdtime);
+		seq_puts(m, "\n");
+	}
+
+	if (stats->read_holdtime.nr) {
+		seq_printf(m, "%38s-R:", name);
+		seq_lock_time(m, &stats->read_waittime);
+		seq_puts(m, " ");
+		seq_lock_time(m, &stats->read_holdtime);
+		seq_puts(m, "\n");
+	}
+
+	if (stats->read_waittime.nr + stats->write_waittime.nr == 0)
+		return;
+
+	if (stats->read_holdtime.nr)
+		namelen += 2;
+
+	for (i = 0; i < ARRAY_SIZE(class->contention_point); i++) {
+		char sym[KSYM_SYMBOL_LEN];
+		char ip[32];
+
+		if (class->contention_point[i] == 0)
+			break;
+
+		if (!i)
+			seq_line(m, '-', 40-namelen, namelen);
+
+		sprint_symbol(sym, class->contention_point[i]);
+		snprintf(ip, sizeof(ip), "[<%p>]",
+				(void *)class->contention_point[i]);
+		seq_printf(m, "%40s %14lu %29s %s\n", name,
+				stats->contention_point[i],
+				ip, sym);
+	}
+	if (i) {
+		seq_puts(m, "\n");
+		seq_line(m, '.', 0, 40 + 1 + 8 * (14 + 1));
+		seq_puts(m, "\n");
+	}
+}
+
+static void seq_header(struct seq_file *m)
+{
+	seq_printf(m, "lock_stat version 0.1\n");
+	seq_line(m, '-', 0, 40 + 1 + 8 * (14 + 1));
+	seq_printf(m, "%40s %14s %14s %14s %14s %14s %14s %14s %14s\n",
+			"class name",
+			"contentions",
+			"waittime-min",
+			"waittime-max",
+			"waittime-total",
+			"acquisitions",
+			"holdtime-min",
+			"holdtime-max",
+			"holdtime-total");
+	seq_line(m, '-', 0, 40 + 1 + 8 * (14 + 1));
+	seq_printf(m, "\n");
+}
+
+static void *ls_start(struct seq_file *m, loff_t *pos)
+{
+	struct lock_stat_seq *data = m->private;
+
+	if (data->iter == data->stats)
+		seq_header(m);
+
+	return data->iter;
+}
+
+static void *ls_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	struct lock_stat_seq *data = m->private;
+
+	(*pos)++;
+
+	data->iter = v;
+	data->iter++;
+	if (data->iter == data->iter_end)
+		data->iter = NULL;
+
+	return data->iter;
+}
+
+static void ls_stop(struct seq_file *m, void *v)
+{
+}
+
+static int ls_show(struct seq_file *m, void *v)
+{
+	struct lock_stat_seq *data = m->private;
+
+	seq_stats(m, data->iter);
+	return 0;
+}
+
+static struct seq_operations lockstat_ops = {
+	.start	= ls_start,
+	.next	= ls_next,
+	.stop	= ls_stop,
+	.show	= ls_show,
+};
+
+static int lock_stat_open(struct inode *inode, struct file *file)
+{
+	int res;
+	struct lock_class *class;
+	struct lock_stat_seq *data = vmalloc(sizeof(struct lock_stat_seq));
+
+	if (!data)
+		return -ENOMEM;
+
+	res = seq_open(file, &lockstat_ops);
+	if (!res) {
+		struct lock_stat_data *iter = data->stats;
+		struct seq_file *m = file->private_data;
+
+		data->iter = iter;
+		list_for_each_entry(class, &all_lock_classes, lock_entry) {
+			iter->class = class;
+			iter->stats = lock_stats(class);
+			iter++;
+		}
+		data->iter_end = iter;
+
+		sort(data->stats, data->iter_end - data->iter,
+				sizeof(struct lock_stat_data),
+				lock_stat_cmp, NULL);
+
+		m->private = data;
+	} else
+		vfree(data);
+
+	return res;
+}
+
+ssize_t lock_stat_write(struct file *file, const char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	struct lock_class *class;
+	char c;
+
+	if (count) {
+		if (get_user(c, buf))
+			return -EFAULT;
+
+		if (c != '0')
+			return count;
+
+		list_for_each_entry(class, &all_lock_classes, lock_entry)
+			clear_lock_stats(class);
+	}
+	return count;
+}
+
+static int lock_stat_release(struct inode *inode, struct file *file)
+{
+	struct seq_file *seq = file->private_data;
+
+	vfree(seq->private);
+	seq->private = NULL;
+	return seq_release(inode, file);
+}
+
+static const struct file_operations proc_lock_stat_operations = {
+	.open		= lock_stat_open,
+	.write		= lock_stat_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= lock_stat_release,
+};
+#endif /* CONFIG_LOCK_STAT */
+
 static int __init lockdep_proc_init(void)
 {
 	struct proc_dir_entry *entry;
@@ -354,6 +614,12 @@ static int __init lockdep_proc_init(void
 	if (entry)
 		entry->proc_fops = &proc_lockdep_stats_operations;
 
+#ifdef CONFIG_LOCK_STAT
+	entry = create_proc_entry("lock_stat", S_IRUSR, NULL);
+	if (entry)
+		entry->proc_fops = &proc_lock_stat_operations;
+#endif
+
 	return 0;
 }
 

--


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 5/5] lockstat: hook into spinlock_t, rwlock_t, rwsem and mutex
  2007-05-29 12:52 [PATCH 0/5] lock contention tracking -v3 Peter Zijlstra
                   ` (3 preceding siblings ...)
  2007-05-29 12:52 ` [PATCH 4/5] lockstat: human readability tweaks Peter Zijlstra
@ 2007-05-29 12:52 ` Peter Zijlstra
  4 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2007-05-29 12:52 UTC (permalink / raw)
  To: linux-kernel, Andrew Morton
  Cc: Ingo Molnar, Bill Huey, Jason Baron, Steven Rostedt,
	Christoph Hellwig, Peter Zijlstra

[-- Attachment #1: lockstat-hooks.patch --]
[-- Type: text/plain, Size: 6432 bytes --]

Call the new lockstat tracking functions from the various lock primitives.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Jason Baron <jbaron@redhat.com>
---
 kernel/mutex.c    |    8 ++++++++
 kernel/rwsem.c    |    8 ++++----
 kernel/spinlock.c |   28 ++++++++++++++--------------
 3 files changed, 26 insertions(+), 18 deletions(-)

Index: linux-2.6-git/kernel/mutex.c
===================================================================
--- linux-2.6-git.orig/kernel/mutex.c
+++ linux-2.6-git/kernel/mutex.c
@@ -139,6 +139,12 @@ __mutex_lock_common(struct mutex *lock, 
 	list_add_tail(&waiter.list, &lock->wait_list);
 	waiter.task = task;
 
+	old_val = atomic_xchg(&lock->count, -1);
+	if (old_val == 1)
+		goto done;
+
+	lock_contended(&lock->dep_map, _RET_IP_);
+
 	for (;;) {
 		/*
 		 * Lets try to take the lock again - this is needed even if
@@ -174,6 +180,8 @@ __mutex_lock_common(struct mutex *lock, 
 		spin_lock_mutex(&lock->wait_lock, flags);
 	}
 
+	lock_acquired(&lock->dep_map);
+done:
 	/* got the lock - rejoice! */
 	mutex_remove_waiter(lock, &waiter, task_thread_info(task));
 	debug_mutex_set_owner(lock, task_thread_info(task));
Index: linux-2.6-git/kernel/rwsem.c
===================================================================
--- linux-2.6-git.orig/kernel/rwsem.c
+++ linux-2.6-git/kernel/rwsem.c
@@ -20,7 +20,7 @@ void down_read(struct rw_semaphore *sem)
 	might_sleep();
 	rwsem_acquire_read(&sem->dep_map, 0, 0, _RET_IP_);
 
-	__down_read(sem);
+	LOCK_CONTENDED(sem, __down_read_trylock, __down_read);
 }
 
 EXPORT_SYMBOL(down_read);
@@ -47,7 +47,7 @@ void down_write(struct rw_semaphore *sem
 	might_sleep();
 	rwsem_acquire(&sem->dep_map, 0, 0, _RET_IP_);
 
-	__down_write(sem);
+	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
 }
 
 EXPORT_SYMBOL(down_write);
@@ -111,7 +111,7 @@ void down_read_nested(struct rw_semaphor
 	might_sleep();
 	rwsem_acquire_read(&sem->dep_map, subclass, 0, _RET_IP_);
 
-	__down_read(sem);
+	LOCK_CONTENDED(sem, __down_read_trylock, __down_read);
 }
 
 EXPORT_SYMBOL(down_read_nested);
@@ -130,7 +130,7 @@ void down_write_nested(struct rw_semapho
 	might_sleep();
 	rwsem_acquire(&sem->dep_map, subclass, 0, _RET_IP_);
 
-	__down_write_nested(sem, subclass);
+	LOCK_CONTENDED(sem, __down_write_trylock, __down_write);
 }
 
 EXPORT_SYMBOL(down_write_nested);
Index: linux-2.6-git/kernel/spinlock.c
===================================================================
--- linux-2.6-git.orig/kernel/spinlock.c
+++ linux-2.6-git/kernel/spinlock.c
@@ -72,7 +72,7 @@ void __lockfunc _read_lock(rwlock_t *loc
 {
 	preempt_disable();
 	rwlock_acquire_read(&lock->dep_map, 0, 0, _RET_IP_);
-	_raw_read_lock(lock);
+	LOCK_CONTENDED(lock, _raw_read_trylock, _raw_read_lock);
 }
 EXPORT_SYMBOL(_read_lock);
 
@@ -89,7 +89,7 @@ unsigned long __lockfunc _spin_lock_irqs
 	 * that interrupts are not re-enabled during lock-acquire:
 	 */
 #ifdef CONFIG_LOCKDEP
-	_raw_spin_lock(lock);
+	LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock);
 #else
 	_raw_spin_lock_flags(lock, &flags);
 #endif
@@ -102,7 +102,7 @@ void __lockfunc _spin_lock_irq(spinlock_
 	local_irq_disable();
 	preempt_disable();
 	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
-	_raw_spin_lock(lock);
+	LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock);
 }
 EXPORT_SYMBOL(_spin_lock_irq);
 
@@ -111,7 +111,7 @@ void __lockfunc _spin_lock_bh(spinlock_t
 	local_bh_disable();
 	preempt_disable();
 	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
-	_raw_spin_lock(lock);
+	LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock);
 }
 EXPORT_SYMBOL(_spin_lock_bh);
 
@@ -122,7 +122,7 @@ unsigned long __lockfunc _read_lock_irqs
 	local_irq_save(flags);
 	preempt_disable();
 	rwlock_acquire_read(&lock->dep_map, 0, 0, _RET_IP_);
-	_raw_read_lock(lock);
+	LOCK_CONTENDED(lock, _raw_read_trylock, _raw_read_lock);
 	return flags;
 }
 EXPORT_SYMBOL(_read_lock_irqsave);
@@ -132,7 +132,7 @@ void __lockfunc _read_lock_irq(rwlock_t 
 	local_irq_disable();
 	preempt_disable();
 	rwlock_acquire_read(&lock->dep_map, 0, 0, _RET_IP_);
-	_raw_read_lock(lock);
+	LOCK_CONTENDED(lock, _raw_read_trylock, _raw_read_lock);
 }
 EXPORT_SYMBOL(_read_lock_irq);
 
@@ -141,7 +141,7 @@ void __lockfunc _read_lock_bh(rwlock_t *
 	local_bh_disable();
 	preempt_disable();
 	rwlock_acquire_read(&lock->dep_map, 0, 0, _RET_IP_);
-	_raw_read_lock(lock);
+	LOCK_CONTENDED(lock, _raw_read_trylock, _raw_read_lock);
 }
 EXPORT_SYMBOL(_read_lock_bh);
 
@@ -152,7 +152,7 @@ unsigned long __lockfunc _write_lock_irq
 	local_irq_save(flags);
 	preempt_disable();
 	rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_);
-	_raw_write_lock(lock);
+	LOCK_CONTENDED(lock, _raw_write_trylock, _raw_write_lock);
 	return flags;
 }
 EXPORT_SYMBOL(_write_lock_irqsave);
@@ -162,7 +162,7 @@ void __lockfunc _write_lock_irq(rwlock_t
 	local_irq_disable();
 	preempt_disable();
 	rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_);
-	_raw_write_lock(lock);
+	LOCK_CONTENDED(lock, _raw_write_trylock, _raw_write_lock);
 }
 EXPORT_SYMBOL(_write_lock_irq);
 
@@ -171,7 +171,7 @@ void __lockfunc _write_lock_bh(rwlock_t 
 	local_bh_disable();
 	preempt_disable();
 	rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_);
-	_raw_write_lock(lock);
+	LOCK_CONTENDED(lock, _raw_write_trylock, _raw_write_lock);
 }
 EXPORT_SYMBOL(_write_lock_bh);
 
@@ -179,7 +179,7 @@ void __lockfunc _spin_lock(spinlock_t *l
 {
 	preempt_disable();
 	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
-	_raw_spin_lock(lock);
+	LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock);
 }
 
 EXPORT_SYMBOL(_spin_lock);
@@ -188,7 +188,7 @@ void __lockfunc _write_lock(rwlock_t *lo
 {
 	preempt_disable();
 	rwlock_acquire(&lock->dep_map, 0, 0, _RET_IP_);
-	_raw_write_lock(lock);
+	LOCK_CONTENDED(lock, _raw_write_trylock, _raw_write_lock);
 }
 
 EXPORT_SYMBOL(_write_lock);
@@ -289,7 +289,7 @@ void __lockfunc _spin_lock_nested(spinlo
 {
 	preempt_disable();
 	spin_acquire(&lock->dep_map, subclass, 0, _RET_IP_);
-	_raw_spin_lock(lock);
+	LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock);
 }
 
 EXPORT_SYMBOL(_spin_lock_nested);
@@ -306,7 +306,7 @@ unsigned long __lockfunc _spin_lock_irqs
 	 * that interrupts are not re-enabled during lock-acquire:
 	 */
 #ifdef CONFIG_LOCKDEP
-	_raw_spin_lock(lock);
+	LOCK_CONTENDED(lock, _raw_spin_trylock, _raw_spin_lock);
 #else
 	_raw_spin_lock_flags(lock, &flags);
 #endif

--



* Re: [PATCH 2/5] lockdep: sanitise CONFIG_PROVE_LOCKING
  2007-05-29 12:52 ` [PATCH 2/5] lockdep: sanitise CONFIG_PROVE_LOCKING Peter Zijlstra
@ 2007-05-29 13:21   ` Christoph Hellwig
  2007-05-29 14:16     ` Ingo Molnar
  0 siblings, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2007-05-29 13:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Andrew Morton, Ingo Molnar, Bill Huey, Jason Baron,
	Steven Rostedt, Christoph Hellwig

On Tue, May 29, 2007 at 02:52:50PM +0200, Peter Zijlstra wrote:
> Ensure that all of the lock dependency tracking code is under
> CONFIG_PROVE_LOCKING. This allows us to use the held lock tracking code
> for other purposes.

There's an awful lot of ifdefs introduced in this patch; I wonder
whether it might be better to split up lockdep.c at those boundaries.



* Re: [PATCH 2/5] lockdep: sanitise CONFIG_PROVE_LOCKING
  2007-05-29 13:21   ` Christoph Hellwig
@ 2007-05-29 14:16     ` Ingo Molnar
  2007-05-30  3:14       ` Andrew Morton
  0 siblings, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2007-05-29 14:16 UTC (permalink / raw)
  To: Christoph Hellwig, Peter Zijlstra, linux-kernel, Andrew Morton,
	Bill Huey, Jason Baron, Steven Rostedt


* Christoph Hellwig <hch@infradead.org> wrote:

> On Tue, May 29, 2007 at 02:52:50PM +0200, Peter Zijlstra wrote:
> > Ensure that all of the lock dependency tracking code is under
> > CONFIG_PROVE_LOCKING. This allows us to use the held lock tracking code
> > for other purposes.
> 
> There's an awful lot of ifdefs introduced in this patch; I wonder
> whether it might be better to split up lockdep.c at those boundaries.

it adds 6 new #ifdefs. There's 35 #ifdefs in page_alloc.c, 44 in 
sysctl.c and 64 in sched.c. I'd not call it 'an awful lot', although 
certainly it could be reduced. Splitting lockdep.c up would uglify it 
well beyond the impact of the 6 #ifdefs, given the amount of glue 
needed.

	Ingo


* Re: [PATCH 3/5] lockstat: core infrastructure
  2007-05-29 12:52 ` [PATCH 3/5] lockstat: core infrastructure Peter Zijlstra
@ 2007-05-29 20:28   ` Daniel Walker
  2007-05-30 13:03     ` Peter Zijlstra
  2007-05-30 13:24     ` Ingo Molnar
  2007-05-30  3:43   ` Andrew Morton
  1 sibling, 2 replies; 32+ messages in thread
From: Daniel Walker @ 2007-05-29 20:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Andrew Morton, Ingo Molnar, Bill Huey, Jason Baron,
	Steven Rostedt, Christoph Hellwig

On Tue, 2007-05-29 at 14:52 +0200, Peter Zijlstra wrote:
> +       now = sched_clock();
> +       waittime = now - hlock->waittime_stamp;
> + 

It looks like you're using sched_clock() throughout .. It's a little
troubling considering the constraints on the function .. Most
architectures implement a jiffies sched_clock() w/ 1 millisecond or worse
resolution.. I'd imagine a millisecond hold time is pretty rare, even a
millisecond wait time might be fairly rare too ..  There's also no
guarantee that sched_clock timestamps off two different cpu's can be
compared (or at least that's my understanding) ..

Daniel



* Re: [PATCH 2/5] lockdep: sanitise CONFIG_PROVE_LOCKING
  2007-05-29 14:16     ` Ingo Molnar
@ 2007-05-30  3:14       ` Andrew Morton
  0 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2007-05-30  3:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Hellwig, Peter Zijlstra, linux-kernel, Bill Huey,
	Jason Baron, Steven Rostedt

On Tue, 29 May 2007 16:16:17 +0200 Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Christoph Hellwig <hch@infradead.org> wrote:
> 
> > On Tue, May 29, 2007 at 02:52:50PM +0200, Peter Zijlstra wrote:
> > > Ensure that all of the lock dependency tracking code is under
> > > CONFIG_PROVE_LOCKING. This allows us to use the held lock tracking code
> > > for other purposes.
> > 
> > There's an awful lot of ifdefs introduced in this patch; I wonder
> > whether it might be better to split up lockdep.c at those boundaries.
> 
> it adds 6 new #ifdefs. There's 35 #ifdefs in page_alloc.c, 44 in 
> sysctl.c and 64 in sched.c. I'd not call it 'an awful lot', although 
> certainly it could be reduced. Splitting lockdep.c up would uglify it 
> well beyond the impact of the 6 #ifdefs, given the amount of glue 
> needed.
> 

I'm not sure that we need to split lockdep.c, but it's a bit disappointing
that the patch didn't (couldn't?) move the CONFIG_PROVE_LOCKING-only code and
data close together so that it could all fall within a single ifdef (or at
least fewer ifdefs).

(Who came up with the (mis)name CONFIG_PROVE_LOCKING, btw?  Should have
been CONFIG_MIGHT_DISPROVE_LOCKING).



* Re: [PATCH 3/5] lockstat: core infrastructure
  2007-05-29 12:52 ` [PATCH 3/5] lockstat: core infrastructure Peter Zijlstra
  2007-05-29 20:28   ` Daniel Walker
@ 2007-05-30  3:43   ` Andrew Morton
  1 sibling, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2007-05-30  3:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Bill Huey, Jason Baron, Steven Rostedt,
	Christoph Hellwig

On Tue, 29 May 2007 14:52:51 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> Introduce the core lock statistics code.
> 

I must say that an aggregate addition of 27 ifdefs is a bit sad.  And there
is some easy stuff here.

> +#ifdef CONFIG_PROVE_LOCKING
> +int prove_locking = 1;
> +module_param(prove_locking, int, 0644);
> +#endif

#else
#define prove_locking 0
#endif

> +
> +#ifdef CONFIG_LOCK_STAT
> +int lock_stat = 1;
> +module_param(lock_stat, int, 0644);
> +#endif

ditto.

>
> ...
>
> +struct lock_class_stats lock_stats(struct lock_class *class)
> +{
> +	struct lock_class_stats stats;
> +	int cpu, i;
> +
> +	memset(&stats, 0, sizeof(struct lock_class_stats));
> +	for_each_possible_cpu(cpu) {
> +		struct lock_class_stats *pcs =
> +			&per_cpu(lock_stats, cpu)[class - lock_classes];
> +
> +		for (i = 0; i < ARRAY_SIZE(stats.contention_point); i++)
> +			stats.contention_point[i] += pcs->contention_point[i];
> +
> +		lock_time_add(&pcs->read_waittime, &stats.read_waittime);
> +		lock_time_add(&pcs->write_waittime, &stats.write_waittime);
> +
> +		lock_time_add(&pcs->read_holdtime, &stats.read_holdtime);
> +		lock_time_add(&pcs->write_holdtime, &stats.write_holdtime);
> +	}
> +
> +	return stats;
> +}

hm, that isn't trying to be very efficient.

> @@ -2035,6 +2131,11 @@ static int __lock_acquire(struct lockdep
>  	int chain_head = 0;
>  	u64 chain_key;
>  
> +#ifdef CONFIG_PROVE_LOCKING
> +	if (!prove_locking)
> +		check = 1;
> +#endif

Removable

> +#ifdef CONFIG_LOCK_STAT
> +static void lock_release_holdtime(struct held_lock *hlock)
> +{
> +	struct lock_class_stats *stats;
> +	unsigned long long holdtime;
> +
> +	if (!lock_stat)
> +		return;
> +
> +	holdtime = sched_clock() - hlock->holdtime_stamp;
> +
> +	stats = get_lock_stats(hlock->class);
> +
> +	if (hlock->read)
> +		lock_time_inc(&stats->read_holdtime, holdtime);
> +	else
> +		lock_time_inc(&stats->write_holdtime, holdtime);
> +
> +	put_lock_stats(stats);
> +}
> +#else
> +static void lock_release_holdtime(struct held_lock *hlock)

inline

> +{
> +}
> +#endif
> +
> ...
>
> @@ -2456,6 +2712,14 @@ void lock_acquire(struct lockdep_map *lo
>  {
>  	unsigned long flags;
>  
> +#ifdef CONFIG_LOCK_STAT
> +	if (unlikely(!lock_stat))
> +#endif

removable

> +#ifdef CONFIG_PROVE_LOCKING
> +		if (unlikely(!prove_locking))
> +#endif

removable

> @@ -2475,6 +2739,14 @@ void lock_release(struct lockdep_map *lo
>  {
>  	unsigned long flags;
>  
> +#ifdef CONFIG_LOCK_STAT
> +	if (unlikely(!lock_stat))
> +#endif

removable

> +#ifdef CONFIG_PROVE_LOCKING
> +		if (unlikely(!prove_locking))
> +#endif
> +			return;
> +
>  	if (unlikely(current->lockdep_recursion))
>  		return;
>  
>  
> ...
>
> +#ifdef CONFIG_LOCK_STAT
> +
> +extern void lock_contended(struct lockdep_map *lock, unsigned long ip);
> +extern void lock_acquired(struct lockdep_map *lock);
> +
> +#define LOCK_CONTENDED(_lock, try, lock)			\
> +do {								\
> +	if (!try(_lock)) {					\
> +		lock_contended(&(_lock)->dep_map, _RET_IP_);	\
> +		lock(_lock);					\
> +		lock_acquired(&(_lock)->dep_map);		\
> +	}							\
> +} while (0)
> +
> +#else /* CONFIG_LOCK_STAT */
> +
> +#define lock_contended(l, i)			do { } while (0)
> +#define lock_acquired(l)			do { } while (0)

inlines are better.

> +#define LOCK_CONTENDED(_lock, try, lock) \
> +	lock(_lock)
> +
> +#endif /* CONFIG_LOCK_STAT */
> +
>  	},
>
> ...
>
> +#ifdef CONFIG_PROVE_LOCKING
> +	{
> +		.ctl_name	= KERN_PROVE_LOCKING,
> +		.procname	= "prove_locking",
> +		.data		= &prove_locking,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= &proc_dointvec,
> +	},
> +#endif
> +#ifdef CONFIG_LOCK_STAT
> +	{
> +		.ctl_name	= KERN_LOCK_STAT,
> +		.procname	= "lock_stat",
> +		.data		= &lock_stat,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= &proc_dointvec,
> +	},
> +#endif

Please use CTL_UNNUMBERED for new sysctls.

>  	{ .ctl_name = 0 }
>  };
> Index: linux-2.6-git/include/linux/sysctl.h
> ===================================================================
> --- linux-2.6-git.orig/include/linux/sysctl.h
> +++ linux-2.6-git/include/linux/sysctl.h
> @@ -166,6 +166,8 @@ enum
>  	KERN_NMI_WATCHDOG=75, /* int: enable/disable nmi watchdog */
>  	KERN_PANIC_ON_NMI=76, /* int: whether we will panic on an unrecovered */
>  	KERN_POWEROFF_CMD=77,	/* string: poweroff command line */
> +	KERN_PROVE_LOCKING=78, /* int: enable lock dependancy checking */
> +	KERN_LOCK_STAT=79, /* int: enable lock statistics */
>  };

And lose these.

So I'm inclined to ask if you can redo these patches with a view to reducing
the ifdef density via a bit of restructuring.

We could do that as a followon patch I guess.  Nicer not to though.


* Re: [PATCH 3/5] lockstat: core infrastructure
  2007-05-29 20:28   ` Daniel Walker
@ 2007-05-30 13:03     ` Peter Zijlstra
  2007-05-30 13:24     ` Ingo Molnar
  1 sibling, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2007-05-30 13:03 UTC (permalink / raw)
  To: Daniel Walker
  Cc: linux-kernel, Andrew Morton, Ingo Molnar, Bill Huey, Jason Baron,
	Steven Rostedt, Christoph Hellwig

On Tue, 2007-05-29 at 13:28 -0700, Daniel Walker wrote:
> On Tue, 2007-05-29 at 14:52 +0200, Peter Zijlstra wrote:
> > +       now = sched_clock();
> > +       waittime = now - hlock->waittime_stamp;
> > + 
> 
> It looks like you're using sched_clock() throughout .. It's a little
> troubling considering the constraints on the function .. Most
> architectures implement a jiffies sched_clock() w/ 1 millisecond or worse
> resolution.. I'd imagine a millisecond hold time is pretty rare, even a
> millisecond wait time might be fairly rare too ..  There's also no
> guarantee that sched_clock timestamps off two different cpu's can be
> compared (or at least that's my understanding) ..

All valid points; however, calling anything more expensive 2-3 times
per lock acquisition is going to be _very_ painful.

Also, IMHO the contention count vs the acquisition count is the most
interesting number; the times are just a nice bonus (if and when they
work).




* Re: [PATCH 3/5] lockstat: core infrastructure
  2007-05-29 20:28   ` Daniel Walker
  2007-05-30 13:03     ` Peter Zijlstra
@ 2007-05-30 13:24     ` Ingo Molnar
  2007-05-30 13:40       ` Steven Rostedt
  2007-05-30 15:20       ` Daniel Walker
  1 sibling, 2 replies; 32+ messages in thread
From: Ingo Molnar @ 2007-05-30 13:24 UTC (permalink / raw)
  To: Daniel Walker
  Cc: Peter Zijlstra, linux-kernel, Andrew Morton, Bill Huey,
	Jason Baron, Steven Rostedt, Christoph Hellwig


* Daniel Walker <dwalker@mvista.com> wrote:

> [...] Most architectures implement a jiffies sched_clock() w/ 1
> millisecond or worse resolution.. [...]

weird that you count importance by 'number of architectures', while 98% 
of the installed server base is x86 or x86_64, where sched_clock() is 
pretty precise ;-)

	Ingo


* Re: [PATCH 3/5] lockstat: core infrastructure
  2007-05-30 13:24     ` Ingo Molnar
@ 2007-05-30 13:40       ` Steven Rostedt
  2007-05-30 13:49         ` Ingo Molnar
  2007-05-30 15:20       ` Daniel Walker
  1 sibling, 1 reply; 32+ messages in thread
From: Steven Rostedt @ 2007-05-30 13:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Daniel Walker, Peter Zijlstra, linux-kernel, Andrew Morton,
	Bill Huey, Jason Baron, Christoph Hellwig

>
> > [...] Most architectures implement a jiffies sched_clock() w/ 1
> > millisecond or worse resolution.. [...]
>
> weird that you count importance by 'number of architectures', while 98%
> of the installed server base is x86 or x86_64, where sched_clock() is
> pretty precise ;-)
>


I can understand Daniel's POV (from working at TimeSys once upon a time).
He works for MontaVista, which does a lot with embedded systems running on
something other than x86.  So from Daniel's POV, that "number of
architectures" is of importance.  Meanwhile, for those of us doing server
work, where x86 is now becoming the dominant platform, the focus is on those.

As long as the work doesn't "break" an arch, we can argue that sched_clock
is "good enough".  If someone wants better accounting of locks on some
other arch, they can simply change sched_clock to be more precise.

-- Steve



* Re: [PATCH 3/5] lockstat: core infrastructure
  2007-05-30 13:40       ` Steven Rostedt
@ 2007-05-30 13:49         ` Ingo Molnar
  2007-05-30 17:06           ` Daniel Walker
  0 siblings, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2007-05-30 13:49 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Daniel Walker, Peter Zijlstra, linux-kernel, Andrew Morton,
	Bill Huey, Jason Baron, Christoph Hellwig


* Steven Rostedt <rostedt@goodmis.org> wrote:

> [...] We can argue that sched_clock is "good enough".  If someone 
> wants better accounting of locks on some other arch, they can simply 
> change sched_clock to be more precise.

exactly. Imprecise sched_clock() if there's a better fast clock source 
available is a bug in the architecture code. If the only available 
clocksource is 1 msec resolution then there's no solution and nothing to 
talk about - lock statistics will be 1msec granular just as much as 
scheduling.

	Ingo


* Re: [PATCH 3/5] lockstat: core infrastructure
  2007-05-30 13:24     ` Ingo Molnar
  2007-05-30 13:40       ` Steven Rostedt
@ 2007-05-30 15:20       ` Daniel Walker
  1 sibling, 0 replies; 32+ messages in thread
From: Daniel Walker @ 2007-05-30 15:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, linux-kernel, Andrew Morton, Bill Huey,
	Jason Baron, Steven Rostedt, Christoph Hellwig

On Wed, 2007-05-30 at 15:24 +0200, Ingo Molnar wrote:
> * Daniel Walker <dwalker@mvista.com> wrote:
> 
> > [...] Most architectures implement a jiffies sched_clock() w/ 1
> > millisecond or worse resolution.. [...]
> 
> weird that you count importance by 'number of architectures', while 98% 
> of the installed server base is x86 or x86_64, where sched_clock() is 
> pretty precise ;-)

I work with many other architectures, so it's important to me. That
aside, just consider that there are several Linux architectures. When
we write code we should consider those other architectures .. Non-x86
Linux is used pretty frequently. Dare I say that x86 and x86_64
installations combined don't outweigh all the other architecture
installations; that's only a guess, but I wouldn't be surprised if it's
true.

Daniel



* Re: [PATCH 3/5] lockstat: core infrastructure
  2007-05-30 13:49         ` Ingo Molnar
@ 2007-05-30 17:06           ` Daniel Walker
  2007-05-30 17:16             ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: Daniel Walker @ 2007-05-30 17:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, Peter Zijlstra, linux-kernel, Andrew Morton,
	Bill Huey, Jason Baron, Christoph Hellwig

On Wed, 2007-05-30 at 15:49 +0200, Ingo Molnar wrote:
> * Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > [...] We can argue that sched_clock is "good enough".  If someone 
> > wants better accounting of locks on some other arch, they can simply 
> > change sched_clock to be more precise.
> 
> exactly. Imprecise sched_clock() if there's a better fast clock source 
> available is a bug in the architecture code. If the only available 
> clocksource is 1 msec resolution then there's no solution and nothing to 
> talk about - lock statistics will be 1msec granular just as much as 
> scheduling.


I don't agree .. sched_clock() is obsoleted by timekeeping's clocksource
structure.. sched_clock() was a quick way to get low-level time stamps
just for the scheduler. The timekeeping clocksource structure is a more
complete solution.

From the architecture perspective there are two low-level clock hooks to
implement: sched_clock(), and at least one clocksource structure.
Both do essentially the same thing, with timekeeping's clocksource
structure actually being easier to implement because the math is built in.

It's clear to me that architectures will implement clocksource
structures .. However, they will not implement sched_clock() because
there is no benefit associated with it. As you have said, the scheduler
works fine with a jiffies-resolution clock; even a large number of x86
machines use a jiffies sched_clock() ..

So I don't think it's a bug if sched_clock() is lowres, and it shouldn't
be a bug in the future..

Daniel



* Re: [PATCH 3/5] lockstat: core infrastructure
  2007-05-30 17:06           ` Daniel Walker
@ 2007-05-30 17:16             ` Peter Zijlstra
  2007-05-30 17:25               ` Daniel Walker
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2007-05-30 17:16 UTC (permalink / raw)
  To: Daniel Walker
  Cc: Ingo Molnar, Steven Rostedt, linux-kernel, Andrew Morton,
	Bill Huey, Jason Baron, Christoph Hellwig

On Wed, 2007-05-30 at 10:06 -0700, Daniel Walker wrote:
> On Wed, 2007-05-30 at 15:49 +0200, Ingo Molnar wrote:
> > * Steven Rostedt <rostedt@goodmis.org> wrote:
> > 
> > > [...] We can argue that sched_clock is "good enough".  If someone 
> > > wants better accounting of locks on some other arch, they can simply 
> > > change sched_clock to be more precise.
> > 
> > exactly. Imprecise sched_clock() if there's a better fast clock source 
> > available is a bug in the architecture code. If the only available 
> > clocksource is 1 msec resolution then there's no solution and nothing to 
> > talk about - lock statistics will be 1msec granular just as much as 
> > scheduling.
> 
> 
> I don't agree .. sched_clock() is obsoleted by timekeeping's clocksource
> structure.. sched_clock() was a quick way to get low-level time stamps
> just for the scheduler. The timekeeping clocksource structure is a more
> complete solution.
> 
> From the architecture perspective there are two low-level clock hooks to
> implement: sched_clock(), and at least one clocksource structure.
> Both do essentially the same thing, with timekeeping's clocksource
> structure actually being easier to implement because the math is built in.

I think you are mistaken here; the two are similar but not identical.

I see sched_clock() as fast first, accurate second, whereas the
clocksource thing is accurate first, fast second.

There is room for both of them.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/5] lockstat: core infrastructure
  2007-05-30 17:16             ` Peter Zijlstra
@ 2007-05-30 17:25               ` Daniel Walker
  2007-06-01 13:12                 ` Ingo Molnar
  2007-06-01 14:25                 ` Andi Kleen
  0 siblings, 2 replies; 32+ messages in thread
From: Daniel Walker @ 2007-05-30 17:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Steven Rostedt, linux-kernel, Andrew Morton,
	Bill Huey, Jason Baron, Christoph Hellwig

On Wed, 2007-05-30 at 19:16 +0200, Peter Zijlstra wrote:

> > From the architecture perspective there are two low level clock hooks to
> > implement one is sched_clock() , and at least one clocksource structure.
> > Both do essentially the same thing. With timekeepings clocksource
> > structure actually being easier to implement cause the math is built in.
> 
> I think you are mistaken here; the two are similar but not identical.
> 
> I see sched_clock() as fast first, accurate second. Whereas the
> clocksource thing is accurate first, fast second.

This is true .. However, if there is a speed difference it's small.
In the past I've replaced sched_clock() with a clocksource, and there was
no noticeable speed difference .. Just recently I replaced x86's
sched_clock() math with the clocksource math with no noticeable
difference .. At least not from my benchmarks ..

> There is room for both of them.

There is room, but we don't need sched_clock() .. Certainly we shouldn't
force architectures to implement sched_clock() by calling it a "bug" if
it's lowres.

Daniel


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/5] lockstat: core infrastructure
  2007-05-30 17:25               ` Daniel Walker
@ 2007-06-01 13:12                 ` Ingo Molnar
  2007-06-01 15:26                   ` Daniel Walker
  2007-06-01 14:25                 ` Andi Kleen
  1 sibling, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2007-06-01 13:12 UTC (permalink / raw)
  To: Daniel Walker
  Cc: Peter Zijlstra, Steven Rostedt, linux-kernel, Andrew Morton,
	Jason Baron, Thomas Gleixner


* Daniel Walker <dwalker@mvista.com> wrote:

> On Wed, 2007-05-30 at 19:16 +0200, Peter Zijlstra wrote:

> > I think you are mistaken here; the two are similar but not 
> > identical.
> > 
> > I see sched_clock() as fast first, accurate second. Whereas the 
> > clocksource thing is accurate first, fast second.
> 
> This is true .. However, if there is a speed different it's small.

Ugh. Have you ever compared pmtimer (or even hpet) against TSC based 
sched_clock()? What you write is so wrong that it's not even funny. You 
keep repeating this nonsense despite having been told multiple times 
that you are dead wrong.

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/5] lockstat: core infrastructure
  2007-05-30 17:25               ` Daniel Walker
  2007-06-01 13:12                 ` Ingo Molnar
@ 2007-06-01 14:25                 ` Andi Kleen
  1 sibling, 0 replies; 32+ messages in thread
From: Andi Kleen @ 2007-06-01 14:25 UTC (permalink / raw)
  To: Daniel Walker
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, linux-kernel,
	Andrew Morton, Bill Huey, Jason Baron, Christoph Hellwig

Daniel Walker <dwalker@mvista.com> writes:

> On Wed, 2007-05-30 at 19:16 +0200, Peter Zijlstra wrote:
> 
> > > From the architecture perspective there are two low level clock hooks to
> > > implement one is sched_clock() , and at least one clocksource structure.
> > > Both do essentially the same thing. With timekeepings clocksource
> > > structure actually being easier to implement cause the math is built in.
> > 
> > I think you are mistaken here; the two are similar but not identical.
> > 
> > I see sched_clock() as fast first, accurate second. Whereas the
> > clocksource thing is accurate first, fast second.
> 
> This is true .. However, if there is a speed different it's small.

pmtimer (factor 1000+) or HPET (factor 10-100+ depending on CPU) 
accesses are much slower than TSC or jiffies read. Talking to the
southbridge is slow.

-Andi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/5] lockstat: core infrastructure
  2007-06-01 13:12                 ` Ingo Molnar
@ 2007-06-01 15:26                   ` Daniel Walker
  2007-06-01 15:52                     ` Peter Zijlstra
  2007-06-01 18:19                     ` Ingo Molnar
  0 siblings, 2 replies; 32+ messages in thread
From: Daniel Walker @ 2007-06-01 15:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Steven Rostedt, linux-kernel, Andrew Morton,
	Jason Baron, Thomas Gleixner

On Fri, 2007-06-01 at 15:12 +0200, Ingo Molnar wrote:
> * Daniel Walker <dwalker@mvista.com> wrote:
> 
> > On Wed, 2007-05-30 at 19:16 +0200, Peter Zijlstra wrote:
> 
> > > I think you are mistaken here; the two are similar but not 
> > > identical.
> > > 
> > > I see sched_clock() as fast first, accurate second. Whereas the 
> > > clocksource thing is accurate first, fast second.
> > 
> > This is true .. However, if there is a speed different it's small.
> 
> Ugh. Have you ever compared pmtimer (or even hpet) against TSC based 
> sched_clock()? What you write is so wrong that it's not even funny. You 
> keep repeating this nonsense despite having been told multiple times 
> that you are dead wrong.

Yes I have, and you're right, there is a difference, and a big
difference .. Above I was referring only to the TSC clocksource, since
that's an apples-to-apples comparison .. I would never compare the TSC
to the acpi_pm, that's no contest ..

The acpi_pm as sched_clock() with hackbench was about 25% slower, the
PIT was approximately 10x slower. (I did this months ago.)

Daniel


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/5] lockstat: core infrastructure
  2007-06-01 15:26                   ` Daniel Walker
@ 2007-06-01 15:52                     ` Peter Zijlstra
  2007-06-01 16:11                       ` Daniel Walker
  2007-06-01 18:19                     ` Ingo Molnar
  1 sibling, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2007-06-01 15:52 UTC (permalink / raw)
  To: Daniel Walker
  Cc: Ingo Molnar, Steven Rostedt, linux-kernel, Andrew Morton,
	Jason Baron, Thomas Gleixner

On Fri, 2007-06-01 at 08:26 -0700, Daniel Walker wrote:
> On Fri, 2007-06-01 at 15:12 +0200, Ingo Molnar wrote:
> > * Daniel Walker <dwalker@mvista.com> wrote:
> > 
> > > On Wed, 2007-05-30 at 19:16 +0200, Peter Zijlstra wrote:
> > 
> > > > I think you are mistaken here; the two are similar but not 
> > > > identical.
> > > > 
> > > > I see sched_clock() as fast first, accurate second. Whereas the 
> > > > clocksource thing is accurate first, fast second.
> > > 
> > > This is true .. However, if there is a speed different it's small.
> > 
> > Ugh. Have you ever compared pmtimer (or even hpet) against TSC based 
> > sched_clock()? What you write is so wrong that it's not even funny. You 
> > keep repeating this nonsense despite having been told multiple times 
> > that you are dead wrong.
> 
> Yes I have, and your right there is a difference, and a big
> difference .. Above I was referring only to the TSC clocksource, since
> that's an apples to apples comparison .. I would never compare the TSC
> to the acpi_pm, that's no contest ..
> 
> The acpi_pm as sched_clock() with hackbench was about %25 slower, the
> pit was 10x slower approximately. (I did this months ago.)

The whole issue is that you don't have any control over what clocksource
you'll end up with. If it so happens that pmtimer gets selected, your
whole box will crawl if it's used liberally, like the patch under
discussion does.

So, having two interfaces, one fast and one accurate is the right answer
IMHO.

And I agree, that if the arch has a fast clock but doesn't use it for
sched_clock() that would be a shortcoming of that arch.



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/5] lockstat: core infrastructure
  2007-06-01 15:52                     ` Peter Zijlstra
@ 2007-06-01 16:11                       ` Daniel Walker
  2007-06-01 18:30                         ` Ingo Molnar
  2007-06-01 18:43                         ` Peter Zijlstra
  0 siblings, 2 replies; 32+ messages in thread
From: Daniel Walker @ 2007-06-01 16:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Steven Rostedt, linux-kernel, Andrew Morton,
	Jason Baron, Thomas Gleixner

On Fri, 2007-06-01 at 17:52 +0200, Peter Zijlstra wrote:
> On Fri, 2007-06-01 at 08:26 -0700, Daniel Walker wrote:
> > On Fri, 2007-06-01 at 15:12 +0200, Ingo Molnar wrote:
> > > * Daniel Walker <dwalker@mvista.com> wrote:
> > > 
> > > > On Wed, 2007-05-30 at 19:16 +0200, Peter Zijlstra wrote:
> > > 
> > > > > I think you are mistaken here; the two are similar but not 
> > > > > identical.
> > > > > 
> > > > > I see sched_clock() as fast first, accurate second. Whereas the 
> > > > > clocksource thing is accurate first, fast second.
> > > > 
> > > > This is true .. However, if there is a speed different it's small.
> > > 
> > > Ugh. Have you ever compared pmtimer (or even hpet) against TSC based 
> > > sched_clock()? What you write is so wrong that it's not even funny. You 
> > > keep repeating this nonsense despite having been told multiple times 
> > > that you are dead wrong.
> > 
> > Yes I have, and your right there is a difference, and a big
> > difference .. Above I was referring only to the TSC clocksource, since
> > that's an apples to apples comparison .. I would never compare the TSC
> > to the acpi_pm, that's no contest ..
> > 
> > The acpi_pm as sched_clock() with hackbench was about %25 slower, the
> > pit was 10x slower approximately. (I did this months ago.)
> 
> The whole issue is that you don't have any control over what clocksource
> you'll end up with. If it so happens that pmtimer gets selected your
> whole box will crawl if its used liberaly, like the patch under
> discussion does.

You can have control over it, which I think is the whole point of this
discussion ..

> So, having two interfaces, one fast and one accurate is the right answer
> IMHO.

In the case of lockstat you have two cases: fast and functional, and
non-functional .. Right now your patch has no slow-and-functional state.

The non-functional state is even the majority from my perspective.

> And I agree, that if the arch has a fast clock but doesn't use it for
> sched_clock() that would be a shortcoming of that arch.

As I said before, there is no reason why an architecture should be
forced to implement sched_clock() .. Is there some specific reason why
you think it should be mandatory?

Daniel


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/5] lockstat: core infrastructure
  2007-06-01 15:26                   ` Daniel Walker
  2007-06-01 15:52                     ` Peter Zijlstra
@ 2007-06-01 18:19                     ` Ingo Molnar
  2007-06-01 19:30                       ` Daniel Walker
  1 sibling, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2007-06-01 18:19 UTC (permalink / raw)
  To: Daniel Walker
  Cc: Peter Zijlstra, Steven Rostedt, linux-kernel, Andrew Morton,
	Jason Baron, Thomas Gleixner


* Daniel Walker <dwalker@mvista.com> wrote:

> > > > I see sched_clock() as fast first, accurate second. Whereas the 
> > > > clocksource thing is accurate first, fast second.
> > > 
> > > This is true .. However, if there is a speed different it's small.
> > 
> > Ugh. Have you ever compared pmtimer (or even hpet) against TSC based 
> > sched_clock()? What you write is so wrong that it's not even funny. 
> > You keep repeating this nonsense despite having been told multiple 
> > times that you are dead wrong.
> 
> Yes I have, and your right there is a difference, and a big difference 
> .. Above I was referring only to the TSC clocksource, since that's an 
> apples to apples comparison .. I would never compare the TSC to the 
> acpi_pm, that's no contest ..

You still don't get it, I think: in real life we end up using the TSC in 
sched_clock() _much more often_ than we end up using the TSC for the
clocksource! So your flawed suggestion does not fix anything; it in fact 
introduces a really bad regression: instead of using the TSC (or 
jiffies) we'd end up using the pmtimer or hpet for every lock operation 
when lockstat is enabled, bringing the box to a screeching halt in 
essence.

so what you suggest has a far worse effect on the _majority_ of systems 
that are even interested in running lockstat than the case you 
mentioned, where some seldom-used arch which is lazy about sched_clock() 
falls back to jiffies granularity. It's not a big deal: the stats will 
have the same granularity. (the op counts in lockstat will still be 
quite useful)

sched_clock() is a 'fast but occasionally inaccurate clock', while the 
GTOD clocksource is an accurate clock (but very often slow).

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/5] lockstat: core infrastructure
  2007-06-01 16:11                       ` Daniel Walker
@ 2007-06-01 18:30                         ` Ingo Molnar
  2007-06-01 19:25                           ` Matt Mackall
  2007-06-01 19:30                           ` Daniel Walker
  2007-06-01 18:43                         ` Peter Zijlstra
  1 sibling, 2 replies; 32+ messages in thread
From: Ingo Molnar @ 2007-06-01 18:30 UTC (permalink / raw)
  To: Daniel Walker
  Cc: Peter Zijlstra, Steven Rostedt, linux-kernel, Andrew Morton,
	Jason Baron, Thomas Gleixner


* Daniel Walker <dwalker@mvista.com> wrote:

> > So, having two interfaces, one fast and one accurate is the right 
> > answer IMHO.
> 
> In the case of lockstat you have two cases fast and functional, and 
> non-functional .. Right now your patch has no slow and functional 
> state.

let me explain it to you:

1) there is absolutely no problem here to begin with. If a rare 
architecture is lazy enough to not bother implementing a finegrained 
sched_clock() then it certainly does not care about the granularity of 
lockstat fields either. If it does, it can improve scheduling and get 
more finegrained lockstat by implementing a proper sched_clock() 
function - all for the same price! ;-)

2) the 'solution' you suggested for this non-problem is _far worse_ than 
the granularity non-problem, on the _majority_ of server systems today! 
Think about it! Your suggestion would make lockstat _totally unusable_. 
Not "slow and functional" like you claim but "dead-slow and unusable".

in light of all this it is puzzling to me how you can still call Peter's 
code "non-functional" with a straight face. I have just tried lockstat 
with jiffies granular sched_clock() and it was still fully functional. 
So if you want to report some bug then please do it in a proper form.

> As I said before there is no reason why and architectures should be 
> forced to implement sched_clock() .. Is there some specific reason why 
> you think it should be mandatory?

Easy: it's not mandatory, but it's certainly "nice" even today, even 
without lockstat. It will get you:

 - better scheduling
 - better printk timestamps
 - higher-quality blktrace timestamps

With lockstat, append "more finegrained lockstat output" to that list of 
benefits too. That's why every sane server architecture has a 
sched_clock() implementation - go check the kernel source. Now I wouldn't 
mind cleaning the API up and calling it get_stat_clock() or whatever - but 
that was not your suggestion at all - your suggestion was flawed: to 
implement sched_clock() via the GTOD clocksource.

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/5] lockstat: core infrastructure
  2007-06-01 16:11                       ` Daniel Walker
  2007-06-01 18:30                         ` Ingo Molnar
@ 2007-06-01 18:43                         ` Peter Zijlstra
  2007-06-01 18:51                           ` Ingo Molnar
  2007-06-01 19:30                           ` Daniel Walker
  1 sibling, 2 replies; 32+ messages in thread
From: Peter Zijlstra @ 2007-06-01 18:43 UTC (permalink / raw)
  To: Daniel Walker
  Cc: Ingo Molnar, Steven Rostedt, linux-kernel, Andrew Morton,
	Jason Baron, Thomas Gleixner

On Fri, 2007-06-01 at 09:11 -0700, Daniel Walker wrote:
> On Fri, 2007-06-01 at 17:52 +0200, Peter Zijlstra wrote:

> > The whole issue is that you don't have any control over what clocksource
> > you'll end up with. If it so happens that pmtimer gets selected your
> > whole box will crawl if its used liberaly, like the patch under
> > discussion does.
> 
> You can have control over it, which I think the whole point of this
> discussion ..

No you don't; the clocksource code will gladly discard the TSC when it's
not found stable enough (the majority of systems today), while it would
still be good enough for sched_clock().




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/5] lockstat: core infrastructure
  2007-06-01 18:43                         ` Peter Zijlstra
@ 2007-06-01 18:51                           ` Ingo Molnar
  2007-06-01 19:30                           ` Daniel Walker
  1 sibling, 0 replies; 32+ messages in thread
From: Ingo Molnar @ 2007-06-01 18:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Daniel Walker, Steven Rostedt, linux-kernel, Andrew Morton,
	Jason Baron, Thomas Gleixner


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Fri, 2007-06-01 at 09:11 -0700, Daniel Walker wrote:
> > On Fri, 2007-06-01 at 17:52 +0200, Peter Zijlstra wrote:
> 
> > > The whole issue is that you don't have any control over what clocksource
> > > you'll end up with. If it so happens that pmtimer gets selected your
> > > whole box will crawl if its used liberaly, like the patch under
> > > discussion does.
> > 
> > You can have control over it, which I think the whole point of this
> > discussion ..
> 
> No you don't, clocksource will gladly discard the TSC when its not 
> found stable enough (the majority of the systems today). While it 
> would be good enough for sched_clock().

yeah, precisely. [ There is another thing as well: most embedded 
architectures do not even implement LOCKDEP_SUPPORT today, so it wouldn't 
be possible to enable lockstat on them anyway. So this whole topic is 
ridiculous to begin with. How about fixing some real, non-imaginary bugs 
instead? ;-) ]

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/5] lockstat: core infrastructure
  2007-06-01 18:30                         ` Ingo Molnar
@ 2007-06-01 19:25                           ` Matt Mackall
  2007-06-01 19:30                           ` Daniel Walker
  1 sibling, 0 replies; 32+ messages in thread
From: Matt Mackall @ 2007-06-01 19:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Daniel Walker, Peter Zijlstra, Steven Rostedt, linux-kernel,
	Andrew Morton, Jason Baron, Thomas Gleixner

On Fri, Jun 01, 2007 at 08:30:53PM +0200, Ingo Molnar wrote:
>  - better scheduling
>  - better printk timestamps
>  - higher-quality blktrace timestamps
- more entropy in /dev/random

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/5] lockstat: core infrastructure
  2007-06-01 18:19                     ` Ingo Molnar
@ 2007-06-01 19:30                       ` Daniel Walker
  0 siblings, 0 replies; 32+ messages in thread
From: Daniel Walker @ 2007-06-01 19:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Steven Rostedt, linux-kernel, Andrew Morton,
	Jason Baron, Thomas Gleixner

On Fri, 2007-06-01 at 20:19 +0200, Ingo Molnar wrote:
> * Daniel Walker <dwalker@mvista.com> wrote:
> 
> > > > > I see sched_clock() as fast first, accurate second. Whereas the 
> > > > > clocksource thing is accurate first, fast second.
> > > > 
> > > > This is true .. However, if there is a speed different it's small.
> > > 
> > > Ugh. Have you ever compared pmtimer (or even hpet) against TSC based 
> > > sched_clock()? What you write is so wrong that it's not even funny. 
> > > You keep repeating this nonsense despite having been told multiple 
> > > times that you are dead wrong.
> > 
> > Yes I have, and your right there is a difference, and a big difference 
> > .. Above I was referring only to the TSC clocksource, since that's an 
> > apples to apples comparison .. I would never compare the TSC to the 
> > acpi_pm, that's no contest ..
> 
> You still dont get it i think: in real life we end up using the TSC in 
> sched_clock() _much more often_ than we end up using the TSC for 
> clocksource! So your flawed suggestion does not fix anything, it in fact 
> introduces a really bad regression: instead of using the TSC (or 
> jiffies) we'd end up using the pmtimer or hpet for every lock operation 
> when lockstat is enabled, bringing the box to a screeching halt in 
> essence.

My position isn't that we should use the high-level clocksource
interface as it is now, without changes .. That's never been my position
since I've been working with it .. The high-level interface will need to
evolve. 

I'm saying we should use the clocksource structure as the main hook into
the low level architecture code. 

> so what you suggest has a far worse effect on the _majority_ of systems 
> that are even interested in running lockstat, than the case you 
> mentioned that some seldom-used arch which is lazy about sched_clock() 
> falls back to jiffies granularity. It's not a big deal: the stats will 
> have the same granularity. (the op counts in lockstat will still be 
> quite useful)

My suggestion is only as good as the implementation .. You're making some
fairly sweeping assumptions about how lockstat _would_ use the
clocksources in its final form ..

So clearly lockstat has a constraint, which is that it can't use slow
clocks.. 

> sched_clock() is a 'fast but occasionally inaccurate clock', while the 
> GTOD clocksource is an accurate clock (but very often slow).

I think we're just taking different perspectives .. The TSC clocksource
is just as fast as the TSC sched_clock(); you can interchange the two
without ill effects .. That's one perspective ..

You can use a yet-to-be-written API that uses the GTOD and a
clocksource, and allows slow clocksources to be used in place of fast
ones, with really bad effects; that's another perspective ..

Daniel


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/5] lockstat: core infrastructure
  2007-06-01 18:30                         ` Ingo Molnar
  2007-06-01 19:25                           ` Matt Mackall
@ 2007-06-01 19:30                           ` Daniel Walker
  1 sibling, 0 replies; 32+ messages in thread
From: Daniel Walker @ 2007-06-01 19:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Steven Rostedt, linux-kernel, Andrew Morton,
	Jason Baron, Thomas Gleixner

On Fri, 2007-06-01 at 20:30 +0200, Ingo Molnar wrote:
> * Daniel Walker <dwalker@mvista.com> wrote:
> 
> > > So, having two interfaces, one fast and one accurate is the right 
> > > answer IMHO.
> > 
> > In the case of lockstat you have two cases fast and functional, and 
> > non-functional .. Right now your patch has no slow and functional 
> > state.
> 
> let me explain it to you:
> 
> 1) there is absolutely no problem here to begin with. If a rare 
> architecture is lazy enough to not bother implementing a finegrained 
> sched_clock() then it certainly does not care about the granularity of 
> lockstat fields either. If it does, it can improve scheduling and get 
> more finegrained lockstat by implementing a proper sched_clock() 
> function - all for the same price! ;-)

There is a problem, which we are discussing ... sched_clock() can be
lowres in lots of different situations, and lockstat fails to account
for that .. That in turn makes its timing non-functional.

> 2) the 'solution' you suggested for this non-problem is _far worse_ than 
> the granularity non-problem, on the _majority_ of server systems today! 
> Think about it! Your suggestion would make lockstat _totally unusable_. 
> Not "slow and functional" like you claim but "dead-slow and unusable".

I'm not sure how to respond to this.. You're taking a big ball of
assumptions and molding it into whatever you want ..

> in light of all this it is puzzling to me how you can still call Peter's 
> code "non-functional" with a straight face. I have just tried lockstat 
> with jiffies granular sched_clock() and it was still fully functional. 
> So if you want to report some bug then please do it in a proper form.

Clearly you can't have sane microsecond-level timestamps with a clock
that doesn't support microsecond resolution.. This is even something
Peter acknowledged in his first email to me.

> > As I said before there is no reason why and architectures should be 
> > forced to implement sched_clock() .. Is there some specific reason why 
> > you think it should be mandatory?
> 
> Easy: it's not mandatory, but it's certainly "nice" even today, even 
> without lockstat. It will get you:
> 
>  - better scheduling
>  - better printk timestamps
>  - higher-quality blktrace timestamps
> 
> With lockstat, append "more finegrained lockstat output" to that list of 
> benefits too. That's why every sane server architecture has a 
> sched_clock() implementation - go check the kernel source. Now i wouldnt 
> mind to clean the API up and call it get_stat_clock() or whatever - but 
> that was not your suggestion at all - your suggestion was flawed: to 
> implement sched_clock() via the GTOD clocksource.

At this point it's not clear to me that you know what my suggestion
was .. You're saying you want a better API for sched_clock(), and yes, I
agree with that 100%: sched_clock() needs a better API .. In the
paragraph above it looks like you're on the verge of agreeing with me ..

You think my words are puzzling; try it from this end.. 

Daniel


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/5] lockstat: core infrastructure
  2007-06-01 18:43                         ` Peter Zijlstra
  2007-06-01 18:51                           ` Ingo Molnar
@ 2007-06-01 19:30                           ` Daniel Walker
  1 sibling, 0 replies; 32+ messages in thread
From: Daniel Walker @ 2007-06-01 19:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Steven Rostedt, linux-kernel, Andrew Morton,
	Jason Baron, Thomas Gleixner

On Fri, 2007-06-01 at 20:43 +0200, Peter Zijlstra wrote:
> On Fri, 2007-06-01 at 09:11 -0700, Daniel Walker wrote:
> > On Fri, 2007-06-01 at 17:52 +0200, Peter Zijlstra wrote:
> 
> > > The whole issue is that you don't have any control over what clocksource
> > > you'll end up with. If it so happens that pmtimer gets selected your
> > > whole box will crawl if its used liberaly, like the patch under
> > > discussion does.
> > 
> > You can have control over it, which I think the whole point of this
> > discussion ..
> 
> No you don't, clocksource will gladly discard the TSC when its not found
> stable enough (the majority of the systems today). While it would be
> good enough for sched_clock().

You're misreading the sentence above, "You can have control over it";
this means that you _can_ make lockstat use the TSC or disable itself
when the TSC is unstable.. Clock management is secondary to me, and we
can change it.. What matters more is whether the "struct clocksource"
provides a better method for accessing lowlevel clocks than
sched_clock() .. My contention is that it does provide a better method.

Daniel


^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2007-06-01 19:33 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-05-29 12:52 [PATCH 0/5] lock contention tracking -v3 Peter Zijlstra
2007-05-29 12:52 ` [PATCH 1/5] fix raw_spinlock_t vs lockdep Peter Zijlstra
2007-05-29 12:52 ` [PATCH 2/5] lockdep: sanitise CONFIG_PROVE_LOCKING Peter Zijlstra
2007-05-29 13:21   ` Christoph Hellwig
2007-05-29 14:16     ` Ingo Molnar
2007-05-30  3:14       ` Andrew Morton
2007-05-29 12:52 ` [PATCH 3/5] lockstat: core infrastructure Peter Zijlstra
2007-05-29 20:28   ` Daniel Walker
2007-05-30 13:03     ` Peter Zijlstra
2007-05-30 13:24     ` Ingo Molnar
2007-05-30 13:40       ` Steven Rostedt
2007-05-30 13:49         ` Ingo Molnar
2007-05-30 17:06           ` Daniel Walker
2007-05-30 17:16             ` Peter Zijlstra
2007-05-30 17:25               ` Daniel Walker
2007-06-01 13:12                 ` Ingo Molnar
2007-06-01 15:26                   ` Daniel Walker
2007-06-01 15:52                     ` Peter Zijlstra
2007-06-01 16:11                       ` Daniel Walker
2007-06-01 18:30                         ` Ingo Molnar
2007-06-01 19:25                           ` Matt Mackall
2007-06-01 19:30                           ` Daniel Walker
2007-06-01 18:43                         ` Peter Zijlstra
2007-06-01 18:51                           ` Ingo Molnar
2007-06-01 19:30                           ` Daniel Walker
2007-06-01 18:19                     ` Ingo Molnar
2007-06-01 19:30                       ` Daniel Walker
2007-06-01 14:25                 ` Andi Kleen
2007-05-30 15:20       ` Daniel Walker
2007-05-30  3:43   ` Andrew Morton
2007-05-29 12:52 ` [PATCH 4/5] lockstat: human readability tweaks Peter Zijlstra
2007-05-29 12:52 ` [PATCH 5/5] lockstat: hook into spinlock_t, rwlock_t, rwsem and mutex Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox