public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 00/21] [GIT PULL][v3.2] tracing: queued up stuff waiting for k.org to come back on line
@ 2011-09-22 22:09 Steven Rostedt
  2011-09-22 22:09 ` [PATCH 01/21] tracing: Clean up tb_fmt to not give faulty compile warning Steven Rostedt
                   ` (20 more replies)
  0 siblings, 21 replies; 34+ messages in thread
From: Steven Rostedt @ 2011-09-22 22:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Thomas Gleixner


Ingo,

I've had lots of these patches for a while now and was holding off
till kernel.org came back on line. But that is taking much longer
than I expected, and the patches have been building up even though
I've been working more on -rt lately. I may start working on mainline
again soon, so I need to flush the queue I currently have.

Sorry for the big patch set, but it is really more a bunch of smaller
patch sets that have accumulated.

I bit the bullet and created a github account. You can pull from that
for now.

Thanks!

-- Steve

Please pull the latest tip/perf/core tree, which can be found at:

  git://github.com/rostedt/linux.git
tip/perf/core

Head SHA1: 963616e48ea9cf58a1c3217457612cfa2be70b2c


Jiri Olsa (10):
      tracing/filter: Use static allocation for filter predicates
      tracing/filter: Separate predicate init and filter addition
      tracing/filter: Remove field_name from filter_pred struct
      tracing/filter: Simplify tracepoint event lookup
      tracing/filter: Unify predicate tree walking, change check_pred_tree function to use it
      tracing/filter: Change count_leafs function to use walk_pred_tree
      tracing/filter: Change fold_pred_tree function to use walk_pred_tree
      tracing/filter: Change fold_pred function to use walk_pred_tree
      tracing/filter: Change filter_match_preds function to use walk_pred_tree
      tracing/filter: Add startup tests for events filter

Kevin Winchester (1):
      x86: jump_label: arch_jump_label_text_poke_early: add missing __init

Mathieu Desnoyers (1):
      Tracepoint: Dissociate from module mutex

Steven Rostedt (7):
      tracing: Clean up tb_fmt to not give faulty compile warning
      tracing: Add preempt disable for filter self test
      tracing: Add a counter clock for those that do not trust clocks
      tracing: Fix preemptirqsoff tracer to not stop at preempt off
      tracing: Account for preempt off in preempt_schedule()
      tracing: Do not allocate buffer for trace_marker
      tracing: Add optional percpu buffers for trace_printk()

Vaibhav Nagarnaik (2):
      trace: Add a new readonly entry to report total buffer size
      trace: Add ring buffer stats to measure rate of events

----
 Documentation/kernel-parameters.txt     |   11 +
 arch/x86/kernel/jump_label.c            |    2 +-
 include/linux/module.h                  |   12 -
 include/linux/ring_buffer.h             |    2 +
 include/linux/trace_clock.h             |    1 +
 include/linux/tracepoint.h              |   25 +-
 kernel/module.c                         |   47 --
 kernel/sched.c                          |    9 +
 kernel/trace/Makefile                   |    2 +
 kernel/trace/ring_buffer.c              |   70 +++-
 kernel/trace/trace.c                    |  183 ++++++-
 kernel/trace/trace.h                    |   15 +-
 kernel/trace/trace_clock.c              |   12 +
 kernel/trace/trace_events_filter.c      |  795 ++++++++++++++++++-------------
 kernel/trace/trace_events_filter_test.h |   50 ++
 kernel/trace/trace_irqsoff.c            |    4 +-
 kernel/trace/trace_printk.c             |   19 +-
 kernel/tracepoint.c                     |  169 ++++++-
 18 files changed, 953 insertions(+), 475 deletions(-)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 01/21] tracing: Clean up tb_fmt to not give faulty compile warning
  2011-09-22 22:09 [PATCH 00/21] [GIT PULL][v3.2] tracing: queued up stuff waiting for k.org to come back on line Steven Rostedt
@ 2011-09-22 22:09 ` Steven Rostedt
  2011-09-22 22:09 ` [PATCH 02/21] Tracepoint: Dissociate from module mutex Steven Rostedt
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2011-09-22 22:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Thomas Gleixner

[-- Attachment #1: 0001-tracing-Clean-up-tb_fmt-to-not-give-faulty-compile-w.patch --]
[-- Type: text/plain, Size: 1347 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

gcc incorrectly states that the variable "fmt" may be used
uninitialized when CC_OPTIMIZE_FOR_SIZE is set.

Instead of just blindly setting fmt to NULL, the code is cleaned up
a little so that it is easier for humans to follow, and so that gcc
can see the variables are initialized.
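
The allocation pattern the patch switches to can be sketched in isolation
as plain user-space C (names like tb_fmt_like and save_fmt are illustrative,
not the kernel's): the inner allocation is only attempted once the outer
one succeeded, the partially built record is freed on failure, and the
result pointer is therefore never half-initialized.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct tb_fmt_like {
	char *fmt;
};

/*
 * Save a copy of `src`. Returns the saved string, or NULL if any
 * allocation failed. `fmt` starts out NULL, so the single assignment
 * at the end of the caller's loop (`*iter = fmt;` in the patch) is
 * correct on both the success and failure paths.
 */
static char *save_fmt(const char *src, struct tb_fmt_like **out)
{
	struct tb_fmt_like *rec;
	char *fmt = NULL;

	*out = NULL;
	rec = malloc(sizeof(*rec));
	if (rec) {
		fmt = malloc(strlen(src) + 1);
		if (fmt) {
			strcpy(fmt, src);
			rec->fmt = fmt;
			*out = rec;
		} else {
			/* inner allocation failed: undo the outer one */
			free(rec);
		}
	}
	return fmt;
}
```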

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/trace_printk.c |   19 ++++++++++---------
 1 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/kernel/trace/trace_printk.c b/kernel/trace/trace_printk.c
index 1f06468..6fd4ffd 100644
--- a/kernel/trace/trace_printk.c
+++ b/kernel/trace/trace_printk.c
@@ -59,18 +59,19 @@ void hold_module_trace_bprintk_format(const char **start, const char **end)
 			continue;
 		}
 
+		fmt = NULL;
 		tb_fmt = kmalloc(sizeof(*tb_fmt), GFP_KERNEL);
-		if (tb_fmt)
+		if (tb_fmt) {
 			fmt = kmalloc(strlen(*iter) + 1, GFP_KERNEL);
-		if (tb_fmt && fmt) {
-			list_add_tail(&tb_fmt->list, &trace_bprintk_fmt_list);
-			strcpy(fmt, *iter);
-			tb_fmt->fmt = fmt;
-			*iter = tb_fmt->fmt;
-		} else {
-			kfree(tb_fmt);
-			*iter = NULL;
+			if (fmt) {
+				list_add_tail(&tb_fmt->list, &trace_bprintk_fmt_list);
+				strcpy(fmt, *iter);
+				tb_fmt->fmt = fmt;
+			} else
+				kfree(tb_fmt);
 		}
+		*iter = fmt;
+
 	}
 	mutex_unlock(&btrace_mutex);
 }
-- 
1.7.5.4



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 02/21] Tracepoint: Dissociate from module mutex
  2011-09-22 22:09 [PATCH 00/21] [GIT PULL][v3.2] tracing: queued up stuff waiting for k.org to come back on line Steven Rostedt
  2011-09-22 22:09 ` [PATCH 01/21] tracing: Clean up tb_fmt to not give faulty compile warning Steven Rostedt
@ 2011-09-22 22:09 ` Steven Rostedt
  2011-09-22 22:09 ` [PATCH 03/21] x86: jump_label: arch_jump_label_text_poke_early: add missing __init Steven Rostedt
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2011-09-22 22:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Thomas Gleixner, Mathieu Desnoyers, Jason Baron, Lai Jiangshan,
	Peter Zijlstra, Masami Hiramatsu

[-- Attachment #1: 0002-Tracepoint-Dissociate-from-module-mutex.patch --]
[-- Type: text/plain, Size: 14453 bytes --]

From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

Copy the information needed from struct module into a local module list
held within tracepoint.c from within the module coming/going notifier.

This vastly simplifies locking of tracepoint registration /
unregistration, because we don't have to take the module mutex to
register and unregister tracepoints anymore. Steven Rostedt ran into
dependency problems related to the module mutex vs the kprobes, ftrace
and tracepoint mutexes that seem hard to fix without removing this
dependency between the tracepoint and module mutexes. (Note: it should
be investigated whether kprobes could benefit from being dissociated
from the module mutex too.)

This also fixes module handling of tracepoint list iterators, because
it was expecting the list to be sorted by pointer address. Given we
have control over our own list now, it's OK to sort this list, which
has tracepoints as its only purpose. This sorting is required because
seq files (and any read() operation from user-space) cannot hold the
tracepoint mutex across multiple calls, so list entries may vanish
between calls. With sorting, the tracepoint iterator remains usable
even if the list no longer contains the exact item the iterator points
to.
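
The sorted-by-address list and the resumable iterator can be sketched in
user-space C (a minimal stand-in, not the kernel's list_head code; `node`,
`insert_sorted` and `resume` are illustrative names): because nodes are
kept ordered by their own addresses, an iterator that only remembers a
pointer can find its place again even if that exact node has since been
removed.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct tp_module. */
struct node {
	struct node *next;
};

static struct node *head;

/* Insert keeping the list sorted by node address, as the patch does
 * for tracepoint_module_list. */
static void insert_sorted(struct node *n)
{
	struct node **pos = &head;

	while (*pos && *pos < n)	/* compare node addresses */
		pos = &(*pos)->next;
	n->next = *pos;
	*pos = n;
}

/* Resume iteration at the first node whose address is >= cursor.
 * If cursor's node was removed between calls, we simply land on the
 * next surviving node instead of dereferencing a stale pointer. */
static struct node *resume(struct node *cursor)
{
	struct node *n;

	for (n = head; n && n < cursor; n = n->next)
		;
	return n;
}
```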

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Jason Baron <jbaron@redhat.com>
CC: Ingo Molnar <mingo@elte.hu>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Link: http://lkml.kernel.org/r/20110810191839.GC8525@Krystal
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 include/linux/module.h     |   12 ---
 include/linux/tracepoint.h |   25 +++----
 kernel/module.c            |   47 ------------
 kernel/tracepoint.c        |  169 ++++++++++++++++++++++++++++++++++++++------
 4 files changed, 158 insertions(+), 95 deletions(-)

diff --git a/include/linux/module.h b/include/linux/module.h
index 1c30087..8639216 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -580,9 +580,6 @@ int unregister_module_notifier(struct notifier_block * nb);
 
 extern void print_modules(void);
 
-extern void module_update_tracepoints(void);
-extern int module_get_iter_tracepoints(struct tracepoint_iter *iter);
-
 #else /* !CONFIG_MODULES... */
 #define EXPORT_SYMBOL(sym)
 #define EXPORT_SYMBOL_GPL(sym)
@@ -698,15 +695,6 @@ static inline int unregister_module_notifier(struct notifier_block * nb)
 static inline void print_modules(void)
 {
 }
-
-static inline void module_update_tracepoints(void)
-{
-}
-
-static inline int module_get_iter_tracepoints(struct tracepoint_iter *iter)
-{
-	return 0;
-}
 #endif /* CONFIG_MODULES */
 
 #ifdef CONFIG_SYSFS
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index d530a44..df0a779 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -54,8 +54,18 @@ extern int tracepoint_probe_unregister_noupdate(const char *name, void *probe,
 						void *data);
 extern void tracepoint_probe_update_all(void);
 
+#ifdef CONFIG_MODULES
+struct tp_module {
+	struct list_head list;
+	unsigned int num_tracepoints;
+	struct tracepoint * const *tracepoints_ptrs;
+};
+#endif /* CONFIG_MODULES */
+
 struct tracepoint_iter {
-	struct module *module;
+#ifdef CONFIG_MODULES
+	struct tp_module *module;
+#endif /* CONFIG_MODULES */
 	struct tracepoint * const *tracepoint;
 };
 
@@ -63,8 +73,6 @@ extern void tracepoint_iter_start(struct tracepoint_iter *iter);
 extern void tracepoint_iter_next(struct tracepoint_iter *iter);
 extern void tracepoint_iter_stop(struct tracepoint_iter *iter);
 extern void tracepoint_iter_reset(struct tracepoint_iter *iter);
-extern int tracepoint_get_iter_range(struct tracepoint * const **tracepoint,
-	struct tracepoint * const *begin, struct tracepoint * const *end);
 
 /*
  * tracepoint_synchronize_unregister must be called between the last tracepoint
@@ -78,17 +86,6 @@ static inline void tracepoint_synchronize_unregister(void)
 
 #define PARAMS(args...) args
 
-#ifdef CONFIG_TRACEPOINTS
-extern
-void tracepoint_update_probe_range(struct tracepoint * const *begin,
-	struct tracepoint * const *end);
-#else
-static inline
-void tracepoint_update_probe_range(struct tracepoint * const *begin,
-	struct tracepoint * const *end)
-{ }
-#endif /* CONFIG_TRACEPOINTS */
-
 #endif /* _LINUX_TRACEPOINT_H */
 
 /*
diff --git a/kernel/module.c b/kernel/module.c
index 04379f92..93342d9 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3487,50 +3487,3 @@ void module_layout(struct module *mod,
 }
 EXPORT_SYMBOL(module_layout);
 #endif
-
-#ifdef CONFIG_TRACEPOINTS
-void module_update_tracepoints(void)
-{
-	struct module *mod;
-
-	mutex_lock(&module_mutex);
-	list_for_each_entry(mod, &modules, list)
-		if (!mod->taints)
-			tracepoint_update_probe_range(mod->tracepoints_ptrs,
-				mod->tracepoints_ptrs + mod->num_tracepoints);
-	mutex_unlock(&module_mutex);
-}
-
-/*
- * Returns 0 if current not found.
- * Returns 1 if current found.
- */
-int module_get_iter_tracepoints(struct tracepoint_iter *iter)
-{
-	struct module *iter_mod;
-	int found = 0;
-
-	mutex_lock(&module_mutex);
-	list_for_each_entry(iter_mod, &modules, list) {
-		if (!iter_mod->taints) {
-			/*
-			 * Sorted module list
-			 */
-			if (iter_mod < iter->module)
-				continue;
-			else if (iter_mod > iter->module)
-				iter->tracepoint = NULL;
-			found = tracepoint_get_iter_range(&iter->tracepoint,
-				iter_mod->tracepoints_ptrs,
-				iter_mod->tracepoints_ptrs
-					+ iter_mod->num_tracepoints);
-			if (found) {
-				iter->module = iter_mod;
-				break;
-			}
-		}
-	}
-	mutex_unlock(&module_mutex);
-	return found;
-}
-#endif
diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index b219f14..db110b8 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -34,11 +34,16 @@ extern struct tracepoint * const __stop___tracepoints_ptrs[];
 static const int tracepoint_debug;
 
 /*
- * tracepoints_mutex nests inside module_mutex. Tracepoints mutex protects the
- * builtin and module tracepoints and the hash table.
+ * Tracepoints mutex protects the builtin and module tracepoints and the hash
+ * table, as well as the local module list.
  */
 static DEFINE_MUTEX(tracepoints_mutex);
 
+#ifdef CONFIG_MODULES
+/* Local list of struct module */
+static LIST_HEAD(tracepoint_module_list);
+#endif /* CONFIG_MODULES */
+
 /*
  * Tracepoint hash table, containing the active tracepoints.
  * Protected by tracepoints_mutex.
@@ -292,9 +297,10 @@ static void disable_tracepoint(struct tracepoint *elem)
  * @end: end of the range
  *
  * Updates the probe callback corresponding to a range of tracepoints.
+ * Called with tracepoints_mutex held.
  */
-void tracepoint_update_probe_range(struct tracepoint * const *begin,
-				   struct tracepoint * const *end)
+static void tracepoint_update_probe_range(struct tracepoint * const *begin,
+					  struct tracepoint * const *end)
 {
 	struct tracepoint * const *iter;
 	struct tracepoint_entry *mark_entry;
@@ -302,7 +308,6 @@ void tracepoint_update_probe_range(struct tracepoint * const *begin,
 	if (!begin)
 		return;
 
-	mutex_lock(&tracepoints_mutex);
 	for (iter = begin; iter < end; iter++) {
 		mark_entry = get_tracepoint((*iter)->name);
 		if (mark_entry) {
@@ -312,11 +317,27 @@ void tracepoint_update_probe_range(struct tracepoint * const *begin,
 			disable_tracepoint(*iter);
 		}
 	}
-	mutex_unlock(&tracepoints_mutex);
 }
 
+#ifdef CONFIG_MODULES
+void module_update_tracepoints(void)
+{
+	struct tp_module *tp_mod;
+
+	list_for_each_entry(tp_mod, &tracepoint_module_list, list)
+		tracepoint_update_probe_range(tp_mod->tracepoints_ptrs,
+			tp_mod->tracepoints_ptrs + tp_mod->num_tracepoints);
+}
+#else /* CONFIG_MODULES */
+void module_update_tracepoints(void)
+{
+}
+#endif /* CONFIG_MODULES */
+
+
 /*
  * Update probes, removing the faulty probes.
+ * Called with tracepoints_mutex held.
  */
 static void tracepoint_update_probes(void)
 {
@@ -359,11 +380,12 @@ int tracepoint_probe_register(const char *name, void *probe, void *data)
 
 	mutex_lock(&tracepoints_mutex);
 	old = tracepoint_add_probe(name, probe, data);
-	mutex_unlock(&tracepoints_mutex);
-	if (IS_ERR(old))
+	if (IS_ERR(old)) {
+		mutex_unlock(&tracepoints_mutex);
 		return PTR_ERR(old);
-
+	}
 	tracepoint_update_probes();		/* may update entry */
+	mutex_unlock(&tracepoints_mutex);
 	release_probes(old);
 	return 0;
 }
@@ -402,11 +424,12 @@ int tracepoint_probe_unregister(const char *name, void *probe, void *data)
 
 	mutex_lock(&tracepoints_mutex);
 	old = tracepoint_remove_probe(name, probe, data);
-	mutex_unlock(&tracepoints_mutex);
-	if (IS_ERR(old))
+	if (IS_ERR(old)) {
+		mutex_unlock(&tracepoints_mutex);
 		return PTR_ERR(old);
-
+	}
 	tracepoint_update_probes();		/* may update entry */
+	mutex_unlock(&tracepoints_mutex);
 	release_probes(old);
 	return 0;
 }
@@ -489,9 +512,8 @@ void tracepoint_probe_update_all(void)
 	if (!list_empty(&old_probes))
 		list_replace_init(&old_probes, &release_probes);
 	need_update = 0;
-	mutex_unlock(&tracepoints_mutex);
-
 	tracepoint_update_probes();
+	mutex_unlock(&tracepoints_mutex);
 	list_for_each_entry_safe(pos, next, &release_probes, u.list) {
 		list_del(&pos->u.list);
 		call_rcu_sched(&pos->u.rcu, rcu_free_old_probes);
@@ -509,7 +531,7 @@ EXPORT_SYMBOL_GPL(tracepoint_probe_update_all);
  * Will return the first tracepoint in the range if the input tracepoint is
  * NULL.
  */
-int tracepoint_get_iter_range(struct tracepoint * const **tracepoint,
+static int tracepoint_get_iter_range(struct tracepoint * const **tracepoint,
 	struct tracepoint * const *begin, struct tracepoint * const *end)
 {
 	if (!*tracepoint && begin != end) {
@@ -520,11 +542,12 @@ int tracepoint_get_iter_range(struct tracepoint * const **tracepoint,
 		return 1;
 	return 0;
 }
-EXPORT_SYMBOL_GPL(tracepoint_get_iter_range);
 
+#ifdef CONFIG_MODULES
 static void tracepoint_get_iter(struct tracepoint_iter *iter)
 {
 	int found = 0;
+	struct tp_module *iter_mod;
 
 	/* Core kernel tracepoints */
 	if (!iter->module) {
@@ -534,12 +557,43 @@ static void tracepoint_get_iter(struct tracepoint_iter *iter)
 		if (found)
 			goto end;
 	}
-	/* tracepoints in modules. */
-	found = module_get_iter_tracepoints(iter);
+	/* Tracepoints in modules */
+	mutex_lock(&tracepoints_mutex);
+	list_for_each_entry(iter_mod, &tracepoint_module_list, list) {
+		/*
+		 * Sorted module list
+		 */
+		if (iter_mod < iter->module)
+			continue;
+		else if (iter_mod > iter->module)
+			iter->tracepoint = NULL;
+		found = tracepoint_get_iter_range(&iter->tracepoint,
+			iter_mod->tracepoints_ptrs,
+			iter_mod->tracepoints_ptrs
+				+ iter_mod->num_tracepoints);
+		if (found) {
+			iter->module = iter_mod;
+			break;
+		}
+	}
+	mutex_unlock(&tracepoints_mutex);
 end:
 	if (!found)
 		tracepoint_iter_reset(iter);
 }
+#else /* CONFIG_MODULES */
+static void tracepoint_get_iter(struct tracepoint_iter *iter)
+{
+	int found = 0;
+
+	/* Core kernel tracepoints */
+	found = tracepoint_get_iter_range(&iter->tracepoint,
+			__start___tracepoints_ptrs,
+			__stop___tracepoints_ptrs);
+	if (!found)
+		tracepoint_iter_reset(iter);
+}
+#endif /* CONFIG_MODULES */
 
 void tracepoint_iter_start(struct tracepoint_iter *iter)
 {
@@ -566,26 +620,98 @@ EXPORT_SYMBOL_GPL(tracepoint_iter_stop);
 
 void tracepoint_iter_reset(struct tracepoint_iter *iter)
 {
+#ifdef CONFIG_MODULES
 	iter->module = NULL;
+#endif /* CONFIG_MODULES */
 	iter->tracepoint = NULL;
 }
 EXPORT_SYMBOL_GPL(tracepoint_iter_reset);
 
 #ifdef CONFIG_MODULES
+static int tracepoint_module_coming(struct module *mod)
+{
+	struct tp_module *tp_mod, *iter;
+	int ret = 0;
+
+	/*
+	 * We skip modules that taint the kernel, especially those with different
+	 * module header (for forced load), to make sure we don't cause a crash.
+	 */
+	if (mod->taints)
+		return 0;
+	mutex_lock(&tracepoints_mutex);
+	tp_mod = kmalloc(sizeof(struct tp_module), GFP_KERNEL);
+	if (!tp_mod) {
+		ret = -ENOMEM;
+		goto end;
+	}
+	tp_mod->num_tracepoints = mod->num_tracepoints;
+	tp_mod->tracepoints_ptrs = mod->tracepoints_ptrs;
+
+	/*
+	 * tracepoint_module_list is kept sorted by struct module pointer
+	 * address for iteration on tracepoints from a seq_file that can release
+	 * the mutex between calls.
+	 */
+	list_for_each_entry_reverse(iter, &tracepoint_module_list, list) {
+		BUG_ON(iter == tp_mod);	/* Should never be in the list twice */
+		if (iter < tp_mod) {
+			/* We belong to the location right after iter. */
+			list_add(&tp_mod->list, &iter->list);
+			goto module_added;
+		}
+	}
+	/* We belong to the beginning of the list */
+	list_add(&tp_mod->list, &tracepoint_module_list);
+module_added:
+	tracepoint_update_probe_range(mod->tracepoints_ptrs,
+		mod->tracepoints_ptrs + mod->num_tracepoints);
+end:
+	mutex_unlock(&tracepoints_mutex);
+	return ret;
+}
+
+static int tracepoint_module_going(struct module *mod)
+{
+	struct tp_module *pos;
+
+	mutex_lock(&tracepoints_mutex);
+	tracepoint_update_probe_range(mod->tracepoints_ptrs,
+		mod->tracepoints_ptrs + mod->num_tracepoints);
+	list_for_each_entry(pos, &tracepoint_module_list, list) {
+		if (pos->tracepoints_ptrs == mod->tracepoints_ptrs) {
+			list_del(&pos->list);
+			kfree(pos);
+			break;
+		}
+	}
+	/*
+	 * In the case of modules that were tainted at "coming", we'll simply
+	 * walk through the list without finding it. We cannot use the "tainted"
+	 * flag on "going", in case a module taints the kernel only after being
+	 * loaded.
+	 */
+	mutex_unlock(&tracepoints_mutex);
+	return 0;
+}
 
 int tracepoint_module_notify(struct notifier_block *self,
 			     unsigned long val, void *data)
 {
 	struct module *mod = data;
+	int ret = 0;
 
 	switch (val) {
 	case MODULE_STATE_COMING:
+		ret = tracepoint_module_coming(mod);
+		break;
+	case MODULE_STATE_LIVE:
+		break;
 	case MODULE_STATE_GOING:
-		tracepoint_update_probe_range(mod->tracepoints_ptrs,
-			mod->tracepoints_ptrs + mod->num_tracepoints);
+		ret = tracepoint_module_going(mod);
 		break;
 	}
-	return 0;
+	return ret;
 }
 
 struct notifier_block tracepoint_module_nb = {
@@ -598,7 +724,6 @@ static int init_tracepoints(void)
 	return register_module_notifier(&tracepoint_module_nb);
 }
 __initcall(init_tracepoints);
-
 #endif /* CONFIG_MODULES */
 
 #ifdef CONFIG_HAVE_SYSCALL_TRACEPOINTS
-- 
1.7.5.4



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 03/21] x86: jump_label: arch_jump_label_text_poke_early: add missing __init
  2011-09-22 22:09 [PATCH 00/21] [GIT PULL][v3.2] tracing: queued up stuff waiting for k.org to come back on line Steven Rostedt
  2011-09-22 22:09 ` [PATCH 01/21] tracing: Clean up tb_fmt to not give faulty compile warning Steven Rostedt
  2011-09-22 22:09 ` [PATCH 02/21] Tracepoint: Dissociate from module mutex Steven Rostedt
@ 2011-09-22 22:09 ` Steven Rostedt
  2011-09-22 22:09 ` [PATCH 04/21] tracing/filter: Use static allocation for filter predicates Steven Rostedt
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2011-09-22 22:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Thomas Gleixner, Kevin Winchester, jbaron, mingo, hpa, x86

[-- Attachment #1: 0003-x86-jump_label-arch_jump_label_text_poke_early-add-m.patch --]
[-- Type: text/plain, Size: 1137 bytes --]

From: Kevin Winchester <kjwinchester@gmail.com>

arch_jump_label_text_poke_early calls text_poke_early, which is
an __init function.  Thus arch_jump_label_text_poke_early should
be the same.

Signed-off-by: Kevin Winchester <kjwinchester@gmail.com>
Cc: jbaron@redhat.com
Cc: tglx@linutronix.de
Cc: mingo@redhat.com
Cc: hpa@zytor.com
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/1313539478-30303-1-git-send-email-kjwinchester@gmail.com

[ Use __init_or_module instead of __init ]

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 arch/x86/kernel/jump_label.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/jump_label.c b/arch/x86/kernel/jump_label.c
index 3fee346..cacdd46 100644
--- a/arch/x86/kernel/jump_label.c
+++ b/arch/x86/kernel/jump_label.c
@@ -42,7 +42,7 @@ void arch_jump_label_transform(struct jump_entry *entry,
 	put_online_cpus();
 }
 
-void arch_jump_label_text_poke_early(jump_label_t addr)
+void __init_or_module arch_jump_label_text_poke_early(jump_label_t addr)
 {
 	text_poke_early((void *)addr, ideal_nops[NOP_ATOMIC5],
 			JUMP_LABEL_NOP_SIZE);
-- 
1.7.5.4



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 04/21] tracing/filter: Use static allocation for filter predicates
  2011-09-22 22:09 [PATCH 00/21] [GIT PULL][v3.2] tracing: queued up stuff waiting for k.org to come back on line Steven Rostedt
                   ` (2 preceding siblings ...)
  2011-09-22 22:09 ` [PATCH 03/21] x86: jump_label: arch_jump_label_text_poke_early: add missing __init Steven Rostedt
@ 2011-09-22 22:09 ` Steven Rostedt
  2011-09-22 22:09 ` [PATCH 05/21] tracing/filter: Separate predicate init and filter addition Steven Rostedt
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2011-09-22 22:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Thomas Gleixner, Jiri Olsa

[-- Attachment #1: 0004-tracing-filter-Use-static-allocation-for-filter-pred.patch --]
[-- Type: text/plain, Size: 2900 bytes --]

From: Jiri Olsa <jolsa@redhat.com>

Don't dynamically allocate the filter_pred struct; use static memory.
This way we can get rid of the code managing the dynamic filter_pred
object.

The create_pred function now integrates the create_logical_pred
function, so the static predicate memory is returned from only one
place.
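
The pattern can be sketched in plain C (a simplified stand-in; the
struct layout and names here are illustrative, not the kernel's): the
builder fills a function-local static scratch object and returns its
address, which is safe as long as the caller consumes the result before
the next call, as replace_preds does one predicate at a time.

```c
#include <string.h>

struct pred {
	int op;
	char pattern[32];
};

/*
 * Like the patched create_pred: build the predicate in static scratch
 * storage and return its address, or NULL on invalid input. The
 * caller must copy the result before calling again.
 */
static struct pred *create_pred(int op, const char *operand)
{
	static struct pred pred;

	memset(&pred, 0, sizeof(pred));
	pred.op = op;
	if (operand) {
		if (strlen(operand) >= sizeof(pred.pattern))
			return NULL;	/* operand too long */
		strcpy(pred.pattern, operand);
	}
	return &pred;
}
```

The trade-off is that create_pred is no longer reentrant, which is
acceptable here because predicate creation is serialized anyway.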

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1313072754-4620-2-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/trace_events_filter.c |   57 ++++++++++--------------------------
 1 files changed, 16 insertions(+), 41 deletions(-)

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 256764e..cb295a1 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -630,11 +630,7 @@ find_event_field(struct ftrace_event_call *call, char *name)
 
 static void filter_free_pred(struct filter_pred *pred)
 {
-	if (!pred)
-		return;
-
 	kfree(pred->field_name);
-	kfree(pred);
 }
 
 static void filter_clear_pred(struct filter_pred *pred)
@@ -1302,39 +1298,30 @@ parse_operand:
 	return 0;
 }
 
-static struct filter_pred *create_pred(int op, char *operand1, char *operand2)
+static struct filter_pred *create_pred(struct filter_parse_state *ps,
+				       int op, char *operand1, char *operand2)
 {
-	struct filter_pred *pred;
+	static struct filter_pred pred;
 
-	pred = kzalloc(sizeof(*pred), GFP_KERNEL);
-	if (!pred)
-		return NULL;
+	memset(&pred, 0, sizeof(pred));
+	pred.op = op;
 
-	pred->field_name = kstrdup(operand1, GFP_KERNEL);
-	if (!pred->field_name) {
-		kfree(pred);
+	if (op == OP_AND || op == OP_OR)
+		return &pred;
+
+	if (!operand1 || !operand2) {
+		parse_error(ps, FILT_ERR_MISSING_FIELD, 0);
 		return NULL;
 	}
 
-	strcpy(pred->regex.pattern, operand2);
-	pred->regex.len = strlen(pred->regex.pattern);
-
-	pred->op = op;
-
-	return pred;
-}
-
-static struct filter_pred *create_logical_pred(int op)
-{
-	struct filter_pred *pred;
-
-	pred = kzalloc(sizeof(*pred), GFP_KERNEL);
-	if (!pred)
+	pred.field_name = kstrdup(operand1, GFP_KERNEL);
+	if (!pred.field_name)
 		return NULL;
 
-	pred->op = op;
+	strcpy(pred.regex.pattern, operand2);
+	pred.regex.len = strlen(pred.regex.pattern);
 
-	return pred;
+	return &pred;
 }
 
 static int check_preds(struct filter_parse_state *ps)
@@ -1643,19 +1630,7 @@ static int replace_preds(struct ftrace_event_call *call,
 			goto fail;
 		}
 
-		if (elt->op == OP_AND || elt->op == OP_OR) {
-			pred = create_logical_pred(elt->op);
-			goto add_pred;
-		}
-
-		if (!operand1 || !operand2) {
-			parse_error(ps, FILT_ERR_MISSING_FIELD, 0);
-			err = -EINVAL;
-			goto fail;
-		}
-
-		pred = create_pred(elt->op, operand1, operand2);
-add_pred:
+		pred = create_pred(ps, elt->op, operand1, operand2);
 		if (!pred) {
 			err = -ENOMEM;
 			goto fail;
-- 
1.7.5.4



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 05/21] tracing/filter: Separate predicate init and filter addition
  2011-09-22 22:09 [PATCH 00/21] [GIT PULL][v3.2] tracing: queued up stuff waiting for k.org to come back on line Steven Rostedt
                   ` (3 preceding siblings ...)
  2011-09-22 22:09 ` [PATCH 04/21] tracing/filter: Use static allocation for filter predicates Steven Rostedt
@ 2011-09-22 22:09 ` Steven Rostedt
  2011-09-22 22:09 ` [PATCH 06/21] tracing/filter: Remove field_name from filter_pred struct Steven Rostedt
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2011-09-22 22:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Thomas Gleixner, Jiri Olsa

[-- Attachment #1: 0005-tracing-filter-Separate-predicate-init-and-filter-ad.patch --]
[-- Type: text/plain, Size: 4477 bytes --]

From: Jiri Olsa <jolsa@redhat.com>

Make the code cleaner by having one function fully prepare the
predicate (create_pred) and another add the predicate to the filter
(filter_add_pred).

As a benefit, the dry_run flag now stays inside the replace_preds
function and is not passed any deeper.
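
The two-phase split can be sketched in miniature (struct fields and the
replace_pred wrapper below are illustrative, not the kernel's):
preparation always runs, so validation errors are still caught on a dry
run, while the addition step is simply skipped, keeping dry_run in the
caller.

```c
struct filter {
	int n_preds;
	int ops[8];
};

struct pred {
	int op;
};

/* Phase 1: fully prepare and validate the predicate
 * (cf. create_pred + init_pred in the patch). */
static int init_pred(struct pred *p, int op)
{
	if (op < 0)
		return -1;	/* validation failure */
	p->op = op;
	return 0;
}

/* Phase 2: add a prepared predicate to the filter
 * (cf. filter_add_pred). */
static int filter_add_pred(struct filter *f, const struct pred *p)
{
	if (f->n_preds >= 8)
		return -1;
	f->ops[f->n_preds++] = p->op;
	return 0;
}

/* The dry_run decision stays at this level and is never passed down. */
static int replace_pred(struct filter *f, int op, int dry_run)
{
	struct pred p;

	if (init_pred(&p, op))
		return -1;
	if (!dry_run)
		return filter_add_pred(f, &p);
	return 0;
}
```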

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1313072754-4620-3-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/trace_events_filter.c |   56 +++++++++++++++---------------------
 1 files changed, 23 insertions(+), 33 deletions(-)

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index cb295a1..61c8dec 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -685,8 +685,7 @@ __pop_pred_stack(struct pred_stack *stack)
 static int filter_set_pred(struct event_filter *filter,
 			   int idx,
 			   struct pred_stack *stack,
-			   struct filter_pred *src,
-			   filter_pred_fn_t fn)
+			   struct filter_pred *src)
 {
 	struct filter_pred *dest = &filter->preds[idx];
 	struct filter_pred *left;
@@ -698,7 +697,6 @@ static int filter_set_pred(struct event_filter *filter,
 		if (!dest->field_name)
 			return -ENOMEM;
 	}
-	dest->fn = fn;
 	dest->index = idx;
 
 	if (dest->op == OP_OR || dest->op == OP_AND) {
@@ -836,12 +834,10 @@ static void filter_free_subsystem_filters(struct event_subsystem *system)
 	}
 }
 
-static int filter_add_pred_fn(struct filter_parse_state *ps,
-			      struct ftrace_event_call *call,
-			      struct event_filter *filter,
-			      struct filter_pred *pred,
-			      struct pred_stack *stack,
-			      filter_pred_fn_t fn)
+static int filter_add_pred(struct filter_parse_state *ps,
+			   struct event_filter *filter,
+			   struct filter_pred *pred,
+			   struct pred_stack *stack)
 {
 	int idx, err;
 
@@ -852,7 +848,7 @@ static int filter_add_pred_fn(struct filter_parse_state *ps,
 
 	idx = filter->n_preds;
 	filter_clear_pred(&filter->preds[idx]);
-	err = filter_set_pred(filter, idx, stack, pred, fn);
+	err = filter_set_pred(filter, idx, stack, pred);
 	if (err)
 		return err;
 
@@ -933,25 +929,16 @@ static filter_pred_fn_t select_comparison_fn(int op, int field_size,
 	return fn;
 }
 
-static int filter_add_pred(struct filter_parse_state *ps,
-			   struct ftrace_event_call *call,
-			   struct event_filter *filter,
-			   struct filter_pred *pred,
-			   struct pred_stack *stack,
-			   bool dry_run)
+static int init_pred(struct filter_parse_state *ps,
+		     struct ftrace_event_call *call,
+		     struct filter_pred *pred)
+
 {
 	struct ftrace_event_field *field;
-	filter_pred_fn_t fn;
+	filter_pred_fn_t fn = filter_pred_none;
 	unsigned long long val;
 	int ret;
 
-	fn = pred->fn = filter_pred_none;
-
-	if (pred->op == OP_AND)
-		goto add_pred_fn;
-	else if (pred->op == OP_OR)
-		goto add_pred_fn;
-
 	field = find_event_field(call, pred->field_name);
 	if (!field) {
 		parse_error(ps, FILT_ERR_FIELD_NOT_FOUND, 0);
@@ -997,9 +984,7 @@ static int filter_add_pred(struct filter_parse_state *ps,
 	if (pred->op == OP_NE)
 		pred->not = 1;
 
-add_pred_fn:
-	if (!dry_run)
-		return filter_add_pred_fn(ps, call, filter, pred, stack, fn);
+	pred->fn = fn;
 	return 0;
 }
 
@@ -1299,6 +1284,7 @@ parse_operand:
 }
 
 static struct filter_pred *create_pred(struct filter_parse_state *ps,
+				       struct ftrace_event_call *call,
 				       int op, char *operand1, char *operand2)
 {
 	static struct filter_pred pred;
@@ -1321,7 +1307,7 @@ static struct filter_pred *create_pred(struct filter_parse_state *ps,
 	strcpy(pred.regex.pattern, operand2);
 	pred.regex.len = strlen(pred.regex.pattern);
 
-	return &pred;
+	return init_pred(ps, call, &pred) ? NULL : &pred;
 }
 
 static int check_preds(struct filter_parse_state *ps)
@@ -1630,16 +1616,20 @@ static int replace_preds(struct ftrace_event_call *call,
 			goto fail;
 		}
 
-		pred = create_pred(ps, elt->op, operand1, operand2);
+		pred = create_pred(ps, call, elt->op, operand1, operand2);
 		if (!pred) {
 			err = -ENOMEM;
 			goto fail;
 		}
-		err = filter_add_pred(ps, call, filter, pred, &stack, dry_run);
-		filter_free_pred(pred);
-		if (err)
-			goto fail;
+		if (!dry_run) {
+			err = filter_add_pred(ps, filter, pred, &stack);
+			if (err) {
+				filter_free_pred(pred);
+				goto fail;
+			}
+		}
 
+		filter_free_pred(pred);
 		operand1 = operand2 = NULL;
 	}
 
-- 
1.7.5.4



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 06/21] tracing/filter: Remove field_name from filter_pred struct
  2011-09-22 22:09 [PATCH 00/21] [GIT PULL][v3.2] tracing: queued up stuff waiting for k.org to come back on line Steven Rostedt
                   ` (4 preceding siblings ...)
  2011-09-22 22:09 ` [PATCH 05/21] tracing/filter: Separate predicate init and filter addition Steven Rostedt
@ 2011-09-22 22:09 ` Steven Rostedt
  2011-09-22 22:09 ` [PATCH 07/21] tracing/filter: Simplify tracepoint event lookup Steven Rostedt
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2011-09-22 22:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Thomas Gleixner, Jiri Olsa

[-- Attachment #1: 0006-tracing-filter-Remove-field_name-from-filter_pred-st.patch --]
[-- Type: text/plain, Size: 4890 bytes --]

From: Jiri Olsa <jolsa@redhat.com>

The field_name was used only to look up the event's field. By doing
that lookup when the predicate is created, we no longer need to care
about field_name allocation/freeing.

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1313072754-4620-4-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/trace.h               |   11 +-------
 kernel/trace/trace_events_filter.c |   53 ++++++++---------------------------
 2 files changed, 13 insertions(+), 51 deletions(-)

diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 616846b..2eb3cf6 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -761,16 +761,7 @@ struct filter_pred {
 	filter_pred_fn_t 	fn;
 	u64 			val;
 	struct regex		regex;
-	/*
-	 * Leaf nodes use field_name, ops is used by AND and OR
-	 * nodes. The field_name is always freed when freeing a pred.
-	 * We can overload field_name for ops and have it freed
-	 * as well.
-	 */
-	union {
-		char		*field_name;
-		unsigned short	*ops;
-	};
+	unsigned short		*ops;
 	int 			offset;
 	int 			not;
 	int 			op;
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 61c8dec..97b93f3 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -628,18 +628,6 @@ find_event_field(struct ftrace_event_call *call, char *name)
 	return __find_event_field(head, name);
 }
 
-static void filter_free_pred(struct filter_pred *pred)
-{
-	kfree(pred->field_name);
-}
-
-static void filter_clear_pred(struct filter_pred *pred)
-{
-	kfree(pred->field_name);
-	pred->field_name = NULL;
-	pred->regex.len = 0;
-}
-
 static int __alloc_pred_stack(struct pred_stack *stack, int n_preds)
 {
 	stack->preds = kzalloc(sizeof(*stack->preds)*(n_preds + 1), GFP_KERNEL);
@@ -692,11 +680,6 @@ static int filter_set_pred(struct event_filter *filter,
 	struct filter_pred *right;
 
 	*dest = *src;
-	if (src->field_name) {
-		dest->field_name = kstrdup(src->field_name, GFP_KERNEL);
-		if (!dest->field_name)
-			return -ENOMEM;
-	}
 	dest->index = idx;
 
 	if (dest->op == OP_OR || dest->op == OP_AND) {
@@ -737,11 +720,7 @@ static int filter_set_pred(struct event_filter *filter,
 
 static void __free_preds(struct event_filter *filter)
 {
-	int i;
-
 	if (filter->preds) {
-		for (i = 0; i < filter->a_preds; i++)
-			kfree(filter->preds[i].field_name);
 		kfree(filter->preds);
 		filter->preds = NULL;
 	}
@@ -839,16 +818,14 @@ static int filter_add_pred(struct filter_parse_state *ps,
 			   struct filter_pred *pred,
 			   struct pred_stack *stack)
 {
-	int idx, err;
+	int err;
 
 	if (WARN_ON(filter->n_preds == filter->a_preds)) {
 		parse_error(ps, FILT_ERR_TOO_MANY_PREDS, 0);
 		return -ENOSPC;
 	}
 
-	idx = filter->n_preds;
-	filter_clear_pred(&filter->preds[idx]);
-	err = filter_set_pred(filter, idx, stack, pred);
+	err = filter_set_pred(filter, filter->n_preds, stack, pred);
 	if (err)
 		return err;
 
@@ -930,21 +907,14 @@ static filter_pred_fn_t select_comparison_fn(int op, int field_size,
 }
 
 static int init_pred(struct filter_parse_state *ps,
-		     struct ftrace_event_call *call,
+		     struct ftrace_event_field *field,
 		     struct filter_pred *pred)
 
 {
-	struct ftrace_event_field *field;
 	filter_pred_fn_t fn = filter_pred_none;
 	unsigned long long val;
 	int ret;
 
-	field = find_event_field(call, pred->field_name);
-	if (!field) {
-		parse_error(ps, FILT_ERR_FIELD_NOT_FOUND, 0);
-		return -EINVAL;
-	}
-
 	pred->offset = field->offset;
 
 	if (!is_legal_op(field, pred->op)) {
@@ -1287,6 +1257,7 @@ static struct filter_pred *create_pred(struct filter_parse_state *ps,
 				       struct ftrace_event_call *call,
 				       int op, char *operand1, char *operand2)
 {
+	struct ftrace_event_field *field;
 	static struct filter_pred pred;
 
 	memset(&pred, 0, sizeof(pred));
@@ -1300,14 +1271,16 @@ static struct filter_pred *create_pred(struct filter_parse_state *ps,
 		return NULL;
 	}
 
-	pred.field_name = kstrdup(operand1, GFP_KERNEL);
-	if (!pred.field_name)
+	field = find_event_field(call, operand1);
+	if (!field) {
+		parse_error(ps, FILT_ERR_FIELD_NOT_FOUND, 0);
 		return NULL;
+	}
 
 	strcpy(pred.regex.pattern, operand2);
 	pred.regex.len = strlen(pred.regex.pattern);
 
-	return init_pred(ps, call, &pred) ? NULL : &pred;
+	return init_pred(ps, field, &pred) ? NULL : &pred;
 }
 
 static int check_preds(struct filter_parse_state *ps)
@@ -1618,18 +1591,16 @@ static int replace_preds(struct ftrace_event_call *call,
 
 		pred = create_pred(ps, call, elt->op, operand1, operand2);
 		if (!pred) {
-			err = -ENOMEM;
+			err = -EINVAL;
 			goto fail;
 		}
+
 		if (!dry_run) {
 			err = filter_add_pred(ps, filter, pred, &stack);
-			if (err) {
-				filter_free_pred(pred);
+			if (err)
 				goto fail;
-			}
 		}
 
-		filter_free_pred(pred);
 		operand1 = operand2 = NULL;
 	}
 
-- 
1.7.5.4



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 07/21] tracing/filter: Simplify tracepoint event lookup
  2011-09-22 22:09 [PATCH 00/21] [GIT PULL][v3.2] tracing: queued up stuff waiting for k.org to come back on line Steven Rostedt
                   ` (5 preceding siblings ...)
  2011-09-22 22:09 ` [PATCH 06/21] tracing/filter: Remove field_name from filter_pred struct Steven Rostedt
@ 2011-09-22 22:09 ` Steven Rostedt
  2011-09-22 22:09 ` [PATCH 08/21] tracing/filter: Unify predicate tree walking, change check_pred_tree Steven Rostedt
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2011-09-22 22:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Thomas Gleixner, Jiri Olsa

[-- Attachment #1: 0007-tracing-filter-Simplify-tracepoint-event-lookup.patch --]
[-- Type: text/plain, Size: 1278 bytes --]

From: Jiri Olsa <jolsa@redhat.com>

We don't need to perform a lookup through the ftrace_events list;
instead we can use the 'tp_event' field.

Each perf_event contains a tracepoint event field, 'tp_event', which
is initialized during tracepoint event initialization.

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1313072754-4620-5-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/trace_events_filter.c |    9 +++------
 1 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 97b93f3..0948905 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -1894,17 +1894,14 @@ int ftrace_profile_set_filter(struct perf_event *event, int event_id,
 	int err;
 	struct event_filter *filter;
 	struct filter_parse_state *ps;
-	struct ftrace_event_call *call = NULL;
+	struct ftrace_event_call *call;
 
 	mutex_lock(&event_mutex);
 
-	list_for_each_entry(call, &ftrace_events, list) {
-		if (call->event.type == event_id)
-			break;
-	}
+	call = event->tp_event;
 
 	err = -EINVAL;
-	if (&call->list == &ftrace_events)
+	if (!call)
 		goto out_unlock;
 
 	err = -EEXIST;
-- 
1.7.5.4



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 08/21] tracing/filter: Unify predicate tree walking, change check_pred_tree
  2011-09-22 22:09 [PATCH 00/21] [GIT PULL][v3.2] tracing: queued up stuff waiting for k.org to come back on line Steven Rostedt
                   ` (6 preceding siblings ...)
  2011-09-22 22:09 ` [PATCH 07/21] tracing/filter: Simplify tracepoint event lookup Steven Rostedt
@ 2011-09-22 22:09 ` Steven Rostedt
  2011-09-22 22:09 ` [PATCH 09/21] tracing/filter: Change count_leafs function to use walk_pred_tree Steven Rostedt
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2011-09-22 22:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Thomas Gleixner, Jiri Olsa

[-- Attachment #1: 0008-tracing-filter-Unify-predicate-tree-walking-change-c.patch --]
[-- Type: text/plain, Size: 4623 bytes --]

From: Jiri Olsa <jolsa@redhat.com>

Adding a walk_pred_tree function to be used for walking through
the filter predicates.

For each predicate the callback function is called, allowing
users to add their own functionality or customize their way
through the filter predicates.

Changing check_pred_tree function to use walk_pred_tree.

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1313072754-4620-6-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/trace_events_filter.c |  137 ++++++++++++++++++++++-------------
 1 files changed, 86 insertions(+), 51 deletions(-)

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 0948905..5b889d4 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -381,6 +381,63 @@ get_pred_parent(struct filter_pred *pred, struct filter_pred *preds,
 	return pred;
 }
 
+enum walk_return {
+	WALK_PRED_ABORT,
+	WALK_PRED_PARENT,
+	WALK_PRED_DEFAULT,
+};
+
+typedef int (*filter_pred_walkcb_t) (enum move_type move,
+				     struct filter_pred *pred,
+				     int *err, void *data);
+
+static int walk_pred_tree(struct filter_pred *preds,
+			  struct filter_pred *root,
+			  filter_pred_walkcb_t cb, void *data)
+{
+	struct filter_pred *pred = root;
+	enum move_type move = MOVE_DOWN;
+	int done = 0;
+
+	if  (!preds)
+		return -EINVAL;
+
+	do {
+		int err = 0, ret;
+
+		ret = cb(move, pred, &err, data);
+		if (ret == WALK_PRED_ABORT)
+			return err;
+		if (ret == WALK_PRED_PARENT)
+			goto get_parent;
+
+		switch (move) {
+		case MOVE_DOWN:
+			if (pred->left != FILTER_PRED_INVALID) {
+				pred = &preds[pred->left];
+				continue;
+			}
+			goto get_parent;
+		case MOVE_UP_FROM_LEFT:
+			pred = &preds[pred->right];
+			move = MOVE_DOWN;
+			continue;
+		case MOVE_UP_FROM_RIGHT:
+ get_parent:
+			if (pred == root)
+				break;
+			pred = get_pred_parent(pred, preds,
+					       pred->parent,
+					       &move);
+			continue;
+		}
+		done = 1;
+	} while (!done);
+
+	/* We are fine. */
+	return 0;
+}
+
 /*
  * A series of AND or ORs where found together. Instead of
  * climbing up and down the tree branches, an array of the
@@ -1321,6 +1378,23 @@ static int count_preds(struct filter_parse_state *ps)
 	return n_preds;
 }
 
+struct check_pred_data {
+	int count;
+	int max;
+};
+
+static int check_pred_tree_cb(enum move_type move, struct filter_pred *pred,
+			      int *err, void *data)
+{
+	struct check_pred_data *d = data;
+
+	if (WARN_ON(d->count++ > d->max)) {
+		*err = -EINVAL;
+		return WALK_PRED_ABORT;
+	}
+	return WALK_PRED_DEFAULT;
+}
+
 /*
  * The tree is walked at filtering of an event. If the tree is not correctly
  * built, it may cause an infinite loop. Check here that the tree does
@@ -1329,58 +1403,19 @@ static int count_preds(struct filter_parse_state *ps)
 static int check_pred_tree(struct event_filter *filter,
 			   struct filter_pred *root)
 {
-	struct filter_pred *preds;
-	struct filter_pred *pred;
-	enum move_type move = MOVE_DOWN;
-	int count = 0;
-	int done = 0;
-	int max;
-
-	/*
-	 * The max that we can hit a node is three times.
-	 * Once going down, once coming up from left, and
-	 * once coming up from right. This is more than enough
-	 * since leafs are only hit a single time.
-	 */
-	max = 3 * filter->n_preds;
-
-	preds = filter->preds;
-	if  (!preds)
-		return -EINVAL;
-	pred = root;
-
-	do {
-		if (WARN_ON(count++ > max))
-			return -EINVAL;
-
-		switch (move) {
-		case MOVE_DOWN:
-			if (pred->left != FILTER_PRED_INVALID) {
-				pred = &preds[pred->left];
-				continue;
-			}
-			/* A leaf at the root is just a leaf in the tree */
-			if (pred == root)
-				break;
-			pred = get_pred_parent(pred, preds,
-					       pred->parent, &move);
-			continue;
-		case MOVE_UP_FROM_LEFT:
-			pred = &preds[pred->right];
-			move = MOVE_DOWN;
-			continue;
-		case MOVE_UP_FROM_RIGHT:
-			if (pred == root)
-				break;
-			pred = get_pred_parent(pred, preds,
-					       pred->parent, &move);
-			continue;
-		}
-		done = 1;
-	} while (!done);
+	struct check_pred_data data = {
+		/*
+		 * The max that we can hit a node is three times.
+		 * Once going down, once coming up from left, and
+		 * once coming up from right. This is more than enough
+		 * since leafs are only hit a single time.
+		 */
+		.max   = 3 * filter->n_preds,
+		.count = 0,
+	};
 
-	/* We are fine. */
-	return 0;
+	return walk_pred_tree(filter->preds, root,
+			      check_pred_tree_cb, &data);
 }
 
 static int count_leafs(struct filter_pred *preds, struct filter_pred *root)
-- 
1.7.5.4



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 09/21] tracing/filter: Change count_leafs function to use walk_pred_tree
  2011-09-22 22:09 [PATCH 00/21] [GIT PULL][v3.2] tracing: queued up stuff waiting for k.org to come back on line Steven Rostedt
                   ` (7 preceding siblings ...)
  2011-09-22 22:09 ` [PATCH 08/21] tracing/filter: Unify predicate tree walking, change check_pred_tree Steven Rostedt
@ 2011-09-22 22:09 ` Steven Rostedt
  2011-09-22 22:09 ` [PATCH 10/21] tracing/filter: Change fold_pred_tree " Steven Rostedt
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2011-09-22 22:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Thomas Gleixner, Jiri Olsa

[-- Attachment #1: 0009-tracing-filter-Change-count_leafs-function-to-use-wa.patch --]
[-- Type: text/plain, Size: 2018 bytes --]

From: Jiri Olsa <jolsa@redhat.com>

Changing count_leafs function to use unified predicates tree
processing.

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1313072754-4620-7-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/trace_events_filter.c |   47 ++++++++++-------------------------
 1 files changed, 14 insertions(+), 33 deletions(-)

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 5b889d4..ebbb261 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -1418,43 +1418,24 @@ static int check_pred_tree(struct event_filter *filter,
 			      check_pred_tree_cb, &data);
 }
 
-static int count_leafs(struct filter_pred *preds, struct filter_pred *root)
+static int count_leafs_cb(enum move_type move, struct filter_pred *pred,
+			  int *err, void *data)
 {
-	struct filter_pred *pred;
-	enum move_type move = MOVE_DOWN;
-	int count = 0;
-	int done = 0;
+	int *count = data;
 
-	pred = root;
+	if ((move == MOVE_DOWN) &&
+	    (pred->left == FILTER_PRED_INVALID))
+		(*count)++;
 
-	do {
-		switch (move) {
-		case MOVE_DOWN:
-			if (pred->left != FILTER_PRED_INVALID) {
-				pred = &preds[pred->left];
-				continue;
-			}
-			/* A leaf at the root is just a leaf in the tree */
-			if (pred == root)
-				return 1;
-			count++;
-			pred = get_pred_parent(pred, preds,
-					       pred->parent, &move);
-			continue;
-		case MOVE_UP_FROM_LEFT:
-			pred = &preds[pred->right];
-			move = MOVE_DOWN;
-			continue;
-		case MOVE_UP_FROM_RIGHT:
-			if (pred == root)
-				break;
-			pred = get_pred_parent(pred, preds,
-					       pred->parent, &move);
-			continue;
-		}
-		done = 1;
-	} while (!done);
+	return WALK_PRED_DEFAULT;
+}
+
+static int count_leafs(struct filter_pred *preds, struct filter_pred *root)
+{
+	int count = 0, ret;
 
+	ret = walk_pred_tree(preds, root, count_leafs_cb, &count);
+	WARN_ON(ret);
 	return count;
 }
 
-- 
1.7.5.4



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 10/21] tracing/filter: Change fold_pred_tree function to use walk_pred_tree
  2011-09-22 22:09 [PATCH 00/21] [GIT PULL][v3.2] tracing: queued up stuff waiting for k.org to come back on line Steven Rostedt
                   ` (8 preceding siblings ...)
  2011-09-22 22:09 ` [PATCH 09/21] tracing/filter: Change count_leafs function to use walk_pred_tree Steven Rostedt
@ 2011-09-22 22:09 ` Steven Rostedt
  2011-09-22 22:09 ` [PATCH 11/21] tracing/filter: Change fold_pred " Steven Rostedt
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2011-09-22 22:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Thomas Gleixner, Jiri Olsa

[-- Attachment #1: 0010-tracing-filter-Change-fold_pred_tree-function-to-use.patch --]
[-- Type: text/plain, Size: 2649 bytes --]

From: Jiri Olsa <jolsa@redhat.com>

Changing fold_pred_tree function to use unified predicates tree
processing.

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1313072754-4620-8-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/trace_events_filter.c |   65 +++++++++++-------------------------
 1 files changed, 20 insertions(+), 45 deletions(-)

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index ebbb261..d8aa100 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -1496,6 +1496,24 @@ static int fold_pred(struct filter_pred *preds, struct filter_pred *root)
 	return 0;
 }
 
+static int fold_pred_tree_cb(enum move_type move, struct filter_pred *pred,
+			     int *err, void *data)
+{
+	struct filter_pred *preds = data;
+
+	if (move != MOVE_DOWN)
+		return WALK_PRED_DEFAULT;
+	if (!(pred->index & FILTER_PRED_FOLD))
+		return WALK_PRED_DEFAULT;
+
+	*err = fold_pred(preds, pred);
+	if (*err)
+		return WALK_PRED_ABORT;
+
+	/* everything below is folded, continue with parent */
+	return WALK_PRED_PARENT;
+}
+
 /*
  * To optimize the processing of the ops, if we have several "ors" or
  * "ands" together, we can put them in an array and process them all
@@ -1504,51 +1522,8 @@ static int fold_pred(struct filter_pred *preds, struct filter_pred *root)
 static int fold_pred_tree(struct event_filter *filter,
 			   struct filter_pred *root)
 {
-	struct filter_pred *preds;
-	struct filter_pred *pred;
-	enum move_type move = MOVE_DOWN;
-	int done = 0;
-	int err;
-
-	preds = filter->preds;
-	if  (!preds)
-		return -EINVAL;
-	pred = root;
-
-	do {
-		switch (move) {
-		case MOVE_DOWN:
-			if (pred->index & FILTER_PRED_FOLD) {
-				err = fold_pred(preds, pred);
-				if (err)
-					return err;
-				/* Folded nodes are like leafs */
-			} else if (pred->left != FILTER_PRED_INVALID) {
-				pred = &preds[pred->left];
-				continue;
-			}
-
-			/* A leaf at the root is just a leaf in the tree */
-			if (pred == root)
-				break;
-			pred = get_pred_parent(pred, preds,
-					       pred->parent, &move);
-			continue;
-		case MOVE_UP_FROM_LEFT:
-			pred = &preds[pred->right];
-			move = MOVE_DOWN;
-			continue;
-		case MOVE_UP_FROM_RIGHT:
-			if (pred == root)
-				break;
-			pred = get_pred_parent(pred, preds,
-					       pred->parent, &move);
-			continue;
-		}
-		done = 1;
-	} while (!done);
-
-	return 0;
+	return walk_pred_tree(filter->preds, root, fold_pred_tree_cb,
+			      filter->preds);
 }
 
 static int replace_preds(struct ftrace_event_call *call,
-- 
1.7.5.4



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 11/21] tracing/filter: Change fold_pred function to use walk_pred_tree
  2011-09-22 22:09 [PATCH 00/21] [GIT PULL][v3.2] tracing: queued up stuff waiting for k.org to come back on line Steven Rostedt
                   ` (9 preceding siblings ...)
  2011-09-22 22:09 ` [PATCH 10/21] tracing/filter: Change fold_pred_tree " Steven Rostedt
@ 2011-09-22 22:09 ` Steven Rostedt
  2011-09-22 22:09 ` [PATCH 12/21] tracing/filter: Change filter_match_preds function to use Steven Rostedt
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2011-09-22 22:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Thomas Gleixner, Jiri Olsa

[-- Attachment #1: 0011-tracing-filter-Change-fold_pred-function-to-use-walk.patch --]
[-- Type: text/plain, Size: 2677 bytes --]

From: Jiri Olsa <jolsa@redhat.com>

Changing fold_pred function to use unified predicates tree
processing.

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1313072754-4620-9-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/trace_events_filter.c |   68 +++++++++++++++++------------------
 1 files changed, 33 insertions(+), 35 deletions(-)

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index d8aa100..f44e68b 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -1439,13 +1439,40 @@ static int count_leafs(struct filter_pred *preds, struct filter_pred *root)
 	return count;
 }
 
+struct fold_pred_data {
+	struct filter_pred *root;
+	int count;
+	int children;
+};
+
+static int fold_pred_cb(enum move_type move, struct filter_pred *pred,
+			int *err, void *data)
+{
+	struct fold_pred_data *d = data;
+	struct filter_pred *root = d->root;
+
+	if (move != MOVE_DOWN)
+		return WALK_PRED_DEFAULT;
+	if (pred->left != FILTER_PRED_INVALID)
+		return WALK_PRED_DEFAULT;
+
+	if (WARN_ON(d->count == d->children)) {
+		*err = -EINVAL;
+		return WALK_PRED_ABORT;
+	}
+
+	pred->index &= ~FILTER_PRED_FOLD;
+	root->ops[d->count++] = pred->index;
+	return WALK_PRED_DEFAULT;
+}
+
 static int fold_pred(struct filter_pred *preds, struct filter_pred *root)
 {
-	struct filter_pred *pred;
-	enum move_type move = MOVE_DOWN;
-	int count = 0;
+	struct fold_pred_data data = {
+		.root  = root,
+		.count = 0,
+	};
 	int children;
-	int done = 0;
 
 	/* No need to keep the fold flag */
 	root->index &= ~FILTER_PRED_FOLD;
@@ -1463,37 +1490,8 @@ static int fold_pred(struct filter_pred *preds, struct filter_pred *root)
 		return -ENOMEM;
 
 	root->val = children;
-
-	pred = root;
-	do {
-		switch (move) {
-		case MOVE_DOWN:
-			if (pred->left != FILTER_PRED_INVALID) {
-				pred = &preds[pred->left];
-				continue;
-			}
-			if (WARN_ON(count == children))
-				return -EINVAL;
-			pred->index &= ~FILTER_PRED_FOLD;
-			root->ops[count++] = pred->index;
-			pred = get_pred_parent(pred, preds,
-					       pred->parent, &move);
-			continue;
-		case MOVE_UP_FROM_LEFT:
-			pred = &preds[pred->right];
-			move = MOVE_DOWN;
-			continue;
-		case MOVE_UP_FROM_RIGHT:
-			if (pred == root)
-				break;
-			pred = get_pred_parent(pred, preds,
-					       pred->parent, &move);
-			continue;
-		}
-		done = 1;
-	} while (!done);
-
-	return 0;
+	data.children = children;
+	return walk_pred_tree(preds, root, fold_pred_cb, &data);
 }
 
 static int fold_pred_tree_cb(enum move_type move, struct filter_pred *pred,
-- 
1.7.5.4



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 12/21] tracing/filter: Change filter_match_preds function to use
  2011-09-22 22:09 [PATCH 00/21] [GIT PULL][v3.2] tracing: queued up stuff waiting for k.org to come back on line Steven Rostedt
                   ` (10 preceding siblings ...)
  2011-09-22 22:09 ` [PATCH 11/21] tracing/filter: Change fold_pred " Steven Rostedt
@ 2011-09-22 22:09 ` Steven Rostedt
  2011-09-22 22:09 ` [PATCH 13/21] tracing/filter: Add startup tests for events filter Steven Rostedt
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2011-09-22 22:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Thomas Gleixner, Jiri Olsa

[-- Attachment #1: 0012-tracing-filter-Change-filter_match_preds-function-to.patch --]
[-- Type: text/plain, Size: 4481 bytes --]

From: Jiri Olsa <jolsa@redhat.com>

Changing filter_match_preds function to use unified predicates tree
processing.

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1313072754-4620-10-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/trace_events_filter.c |  124 +++++++++++++++++-------------------
 1 files changed, 58 insertions(+), 66 deletions(-)

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index f44e68b..319c3ca 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -467,99 +467,91 @@ static int process_ops(struct filter_pred *preds,
 
 	for (i = 0; i < op->val; i++) {
 		pred = &preds[op->ops[i]];
-		match = pred->fn(pred, rec);
+		if (!WARN_ON_ONCE(!pred->fn))
+			match = pred->fn(pred, rec);
 		if (!!match == type)
 			return match;
 	}
 	return match;
 }
 
+struct filter_match_preds_data {
+	struct filter_pred *preds;
+	int match;
+	void *rec;
+};
+
+static int filter_match_preds_cb(enum move_type move, struct filter_pred *pred,
+				 int *err, void *data)
+{
+	struct filter_match_preds_data *d = data;
+
+	*err = 0;
+	switch (move) {
+	case MOVE_DOWN:
+		/* only AND and OR have children */
+		if (pred->left != FILTER_PRED_INVALID) {
+			/* If ops is set, then it was folded. */
+			if (!pred->ops)
+				return WALK_PRED_DEFAULT;
+			/* We can treat folded ops as a leaf node */
+			d->match = process_ops(d->preds, pred, d->rec);
+		} else {
+			if (!WARN_ON_ONCE(!pred->fn))
+				d->match = pred->fn(pred, d->rec);
+		}
+
+		return WALK_PRED_PARENT;
+	case MOVE_UP_FROM_LEFT:
+		/*
+		 * Check for short circuits.
+		 *
+		 * Optimization: !!match == (pred->op == OP_OR)
+		 *   is the same as:
+		 * if ((match && pred->op == OP_OR) ||
+		 *     (!match && pred->op == OP_AND))
+		 */
+		if (!!d->match == (pred->op == OP_OR))
+			return WALK_PRED_PARENT;
+		break;
+	case MOVE_UP_FROM_RIGHT:
+		break;
+	}
+
+	return WALK_PRED_DEFAULT;
+}
+
 /* return 1 if event matches, 0 otherwise (discard) */
 int filter_match_preds(struct event_filter *filter, void *rec)
 {
-	int match = -1;
-	enum move_type move = MOVE_DOWN;
 	struct filter_pred *preds;
-	struct filter_pred *pred;
 	struct filter_pred *root;
-	int n_preds;
-	int done = 0;
+	struct filter_match_preds_data data = {
+		/* match is currently meaningless */
+		.match = -1,
+		.rec   = rec,
+	};
+	int n_preds, ret;
 
 	/* no filter is considered a match */
 	if (!filter)
 		return 1;
 
 	n_preds = filter->n_preds;
-
 	if (!n_preds)
 		return 1;
 
 	/*
 	 * n_preds, root and filter->preds are protect with preemption disabled.
 	 */
-	preds = rcu_dereference_sched(filter->preds);
 	root = rcu_dereference_sched(filter->root);
 	if (!root)
 		return 1;
 
-	pred = root;
-
-	/* match is currently meaningless */
-	match = -1;
-
-	do {
-		switch (move) {
-		case MOVE_DOWN:
-			/* only AND and OR have children */
-			if (pred->left != FILTER_PRED_INVALID) {
-				/* If ops is set, then it was folded. */
-				if (!pred->ops) {
-					/* keep going to down the left side */
-					pred = &preds[pred->left];
-					continue;
-				}
-				/* We can treat folded ops as a leaf node */
-				match = process_ops(preds, pred, rec);
-			} else
-				match = pred->fn(pred, rec);
-			/* If this pred is the only pred */
-			if (pred == root)
-				break;
-			pred = get_pred_parent(pred, preds,
-					       pred->parent, &move);
-			continue;
-		case MOVE_UP_FROM_LEFT:
-			/*
-			 * Check for short circuits.
-			 *
-			 * Optimization: !!match == (pred->op == OP_OR)
-			 *   is the same as:
-			 * if ((match && pred->op == OP_OR) ||
-			 *     (!match && pred->op == OP_AND))
-			 */
-			if (!!match == (pred->op == OP_OR)) {
-				if (pred == root)
-					break;
-				pred = get_pred_parent(pred, preds,
-						       pred->parent, &move);
-				continue;
-			}
-			/* now go down the right side of the tree. */
-			pred = &preds[pred->right];
-			move = MOVE_DOWN;
-			continue;
-		case MOVE_UP_FROM_RIGHT:
-			/* We finished this equation. */
-			if (pred == root)
-				break;
-			pred = get_pred_parent(pred, preds,
-					       pred->parent, &move);
-			continue;
-		}
-		done = 1;
-	} while (!done);
-
-	return match;
+	data.preds = preds = rcu_dereference_sched(filter->preds);
+	ret = walk_pred_tree(preds, root, filter_match_preds_cb, &data);
+	WARN_ON(ret);
+	return data.match;
 }
 EXPORT_SYMBOL_GPL(filter_match_preds);
 
-- 
1.7.5.4



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 13/21] tracing/filter: Add startup tests for events filter
  2011-09-22 22:09 [PATCH 00/21] [GIT PULL][v3.2] tracing: queued up stuff waiting for k.org to come back on line Steven Rostedt
                   ` (11 preceding siblings ...)
  2011-09-22 22:09 ` [PATCH 12/21] tracing/filter: Change filter_match_preds function to use Steven Rostedt
@ 2011-09-22 22:09 ` Steven Rostedt
  2011-09-22 22:09 ` [PATCH 14/21] tracing: Add preempt disable for filter self test Steven Rostedt
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2011-09-22 22:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Thomas Gleixner, Jiri Olsa

[-- Attachment #1: 0013-tracing-filter-Add-startup-tests-for-events-filter.patch --]
[-- Type: text/plain, Size: 9282 bytes --]

From: Jiri Olsa <jolsa@redhat.com>

Adding automated tests that run as a late_initcall. The tests are
compiled in with the CONFIG_FTRACE_STARTUP_TEST option.

Adding a test event "ftrace_test_filter" used to simulate
filter processing during event occurrence.

String filters are compiled and tested against several
test events with different values.

Also testing that evaluation of explicit predicates is omitted
due to lazy filter evaluation.

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Link: http://lkml.kernel.org/r/1313072754-4620-11-git-send-email-jolsa@redhat.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/Makefile                   |    2 +
 kernel/trace/trace.h                    |    3 +
 kernel/trace/trace_events_filter.c      |  209 +++++++++++++++++++++++++++++++
 kernel/trace/trace_events_filter_test.h |   50 ++++++++
 4 files changed, 264 insertions(+), 0 deletions(-)
 create mode 100644 kernel/trace/trace_events_filter_test.h

diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 761c510..b384ed5 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -15,6 +15,8 @@ ifdef CONFIG_TRACING_BRANCHES
 KBUILD_CFLAGS += -DDISABLE_BRANCH_PROFILING
 endif
 
+CFLAGS_trace_events_filter.o := -I$(src)
+
 #
 # Make the trace clocks available generally: it's infrastructure
 # relied on by ptrace for example:
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 2eb3cf6..4c7540a 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -762,6 +762,9 @@ struct filter_pred {
 	u64 			val;
 	struct regex		regex;
 	unsigned short		*ops;
+#ifdef CONFIG_FTRACE_STARTUP_TEST
+	struct ftrace_event_field *field;
+#endif
 	int 			offset;
 	int 			not;
 	int 			op;
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 319c3ca..6a642e2 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -1329,6 +1329,9 @@ static struct filter_pred *create_pred(struct filter_parse_state *ps,
 	strcpy(pred.regex.pattern, operand2);
 	pred.regex.len = strlen(pred.regex.pattern);
 
+#ifdef CONFIG_FTRACE_STARTUP_TEST
+	pred.field = field;
+#endif
 	return init_pred(ps, field, &pred) ? NULL : &pred;
 }
 
@@ -1926,3 +1929,209 @@ out_unlock:
 
 #endif /* CONFIG_PERF_EVENTS */
 
+#ifdef CONFIG_FTRACE_STARTUP_TEST
+
+#include <linux/types.h>
+#include <linux/tracepoint.h>
+
+#define CREATE_TRACE_POINTS
+#include "trace_events_filter_test.h"
+
+static int test_get_filter(char *filter_str, struct ftrace_event_call *call,
+			   struct event_filter **pfilter)
+{
+	struct event_filter *filter;
+	struct filter_parse_state *ps;
+	int err = -ENOMEM;
+
+	filter = __alloc_filter();
+	if (!filter)
+		goto out;
+
+	ps = kzalloc(sizeof(*ps), GFP_KERNEL);
+	if (!ps)
+		goto free_filter;
+
+	parse_init(ps, filter_ops, filter_str);
+	err = filter_parse(ps);
+	if (err)
+		goto free_ps;
+
+	err = replace_preds(call, filter, ps, filter_str, false);
+	if (!err)
+		*pfilter = filter;
+
+ free_ps:
+	filter_opstack_clear(ps);
+	postfix_clear(ps);
+	kfree(ps);
+
+ free_filter:
+	if (err)
+		__free_filter(filter);
+
+ out:
+	return err;
+}
+
+#define DATA_REC(m, va, vb, vc, vd, ve, vf, vg, vh, nvisit) \
+{ \
+	.filter = FILTER, \
+	.rec    = { .a = va, .b = vb, .c = vc, .d = vd, \
+		    .e = ve, .f = vf, .g = vg, .h = vh }, \
+	.match  = m, \
+	.not_visited = nvisit, \
+}
+#define YES 1
+#define NO  0
+
+static struct test_filter_data_t {
+	char *filter;
+	struct ftrace_raw_ftrace_test_filter rec;
+	int match;
+	char *not_visited;
+} test_filter_data[] = {
+#define FILTER "a == 1 && b == 1 && c == 1 && d == 1 && " \
+	       "e == 1 && f == 1 && g == 1 && h == 1"
+	DATA_REC(YES, 1, 1, 1, 1, 1, 1, 1, 1, ""),
+	DATA_REC(NO,  0, 1, 1, 1, 1, 1, 1, 1, "bcdefgh"),
+	DATA_REC(NO,  1, 1, 1, 1, 1, 1, 1, 0, ""),
+#undef FILTER
+#define FILTER "a == 1 || b == 1 || c == 1 || d == 1 || " \
+	       "e == 1 || f == 1 || g == 1 || h == 1"
+	DATA_REC(NO,  0, 0, 0, 0, 0, 0, 0, 0, ""),
+	DATA_REC(YES, 0, 0, 0, 0, 0, 0, 0, 1, ""),
+	DATA_REC(YES, 1, 0, 0, 0, 0, 0, 0, 0, "bcdefgh"),
+#undef FILTER
+#define FILTER "(a == 1 || b == 1) && (c == 1 || d == 1) && " \
+	       "(e == 1 || f == 1) && (g == 1 || h == 1)"
+	DATA_REC(NO,  0, 0, 1, 1, 1, 1, 1, 1, "dfh"),
+	DATA_REC(YES, 0, 1, 0, 1, 0, 1, 0, 1, ""),
+	DATA_REC(YES, 1, 0, 1, 0, 0, 1, 0, 1, "bd"),
+	DATA_REC(NO,  1, 0, 1, 0, 0, 1, 0, 0, "bd"),
+#undef FILTER
+#define FILTER "(a == 1 && b == 1) || (c == 1 && d == 1) || " \
+	       "(e == 1 && f == 1) || (g == 1 && h == 1)"
+	DATA_REC(YES, 1, 0, 1, 1, 1, 1, 1, 1, "efgh"),
+	DATA_REC(YES, 0, 0, 0, 0, 0, 0, 1, 1, ""),
+	DATA_REC(NO,  0, 0, 0, 0, 0, 0, 0, 1, ""),
+#undef FILTER
+#define FILTER "(a == 1 && b == 1) && (c == 1 && d == 1) && " \
+	       "(e == 1 && f == 1) || (g == 1 && h == 1)"
+	DATA_REC(YES, 1, 1, 1, 1, 1, 1, 0, 0, "gh"),
+	DATA_REC(NO,  0, 0, 0, 0, 0, 0, 0, 1, ""),
+	DATA_REC(YES, 1, 1, 1, 1, 1, 0, 1, 1, ""),
+#undef FILTER
+#define FILTER "((a == 1 || b == 1) || (c == 1 || d == 1) || " \
+	       "(e == 1 || f == 1)) && (g == 1 || h == 1)"
+	DATA_REC(YES, 1, 1, 1, 1, 1, 1, 0, 1, "bcdef"),
+	DATA_REC(NO,  0, 0, 0, 0, 0, 0, 0, 0, ""),
+	DATA_REC(YES, 1, 1, 1, 1, 1, 0, 1, 1, "h"),
+#undef FILTER
+#define FILTER "((((((((a == 1) && (b == 1)) || (c == 1)) && (d == 1)) || " \
+	       "(e == 1)) && (f == 1)) || (g == 1)) && (h == 1))"
+	DATA_REC(YES, 1, 1, 1, 1, 1, 1, 1, 1, "ceg"),
+	DATA_REC(NO,  0, 1, 0, 1, 0, 1, 0, 1, ""),
+	DATA_REC(NO,  1, 0, 1, 0, 1, 0, 1, 0, ""),
+#undef FILTER
+#define FILTER "((((((((a == 1) || (b == 1)) && (c == 1)) || (d == 1)) && " \
+	       "(e == 1)) || (f == 1)) && (g == 1)) || (h == 1))"
+	DATA_REC(YES, 1, 1, 1, 1, 1, 1, 1, 1, "bdfh"),
+	DATA_REC(YES, 0, 1, 0, 1, 0, 1, 0, 1, ""),
+	DATA_REC(YES, 1, 0, 1, 0, 1, 0, 1, 0, "bdfh"),
+};
+
+#undef DATA_REC
+#undef FILTER
+#undef YES
+#undef NO
+
+#define DATA_CNT (sizeof(test_filter_data)/sizeof(struct test_filter_data_t))
+
+static int test_pred_visited;
+
+static int test_pred_visited_fn(struct filter_pred *pred, void *event)
+{
+	struct ftrace_event_field *field = pred->field;
+
+	test_pred_visited = 1;
+	printk(KERN_INFO "\npred visited %s\n", field->name);
+	return 1;
+}
+
+static int test_walk_pred_cb(enum move_type move, struct filter_pred *pred,
+			     int *err, void *data)
+{
+	char *fields = data;
+
+	if ((move == MOVE_DOWN) &&
+	    (pred->left == FILTER_PRED_INVALID)) {
+		struct ftrace_event_field *field = pred->field;
+
+		if (!field) {
+			WARN(1, "all leafs should have field defined");
+			return WALK_PRED_DEFAULT;
+		}
+		if (!strchr(fields, *field->name))
+			return WALK_PRED_DEFAULT;
+
+		WARN_ON(!pred->fn);
+		pred->fn = test_pred_visited_fn;
+	}
+	return WALK_PRED_DEFAULT;
+}
+
+static __init int ftrace_test_event_filter(void)
+{
+	int i;
+
+	printk(KERN_INFO "Testing ftrace filter: ");
+
+	for (i = 0; i < DATA_CNT; i++) {
+		struct event_filter *filter = NULL;
+		struct test_filter_data_t *d = &test_filter_data[i];
+		int err;
+
+		err = test_get_filter(d->filter, &event_ftrace_test_filter,
+				      &filter);
+		if (err) {
+			printk(KERN_INFO
+			       "Failed to get filter for '%s', err %d\n",
+			       d->filter, err);
+			break;
+		}
+
+		if (*d->not_visited)
+			walk_pred_tree(filter->preds, filter->root,
+				       test_walk_pred_cb,
+				       d->not_visited);
+
+		test_pred_visited = 0;
+		err = filter_match_preds(filter, &d->rec);
+
+		__free_filter(filter);
+
+		if (test_pred_visited) {
+			printk(KERN_INFO
+			       "Failed, unwanted pred visited for filter %s\n",
+			       d->filter);
+			break;
+		}
+
+		if (err != d->match) {
+			printk(KERN_INFO
+			       "Failed to match filter '%s', expected %d\n",
+			       d->filter, d->match);
+			break;
+		}
+	}
+
+	if (i == DATA_CNT)
+		printk(KERN_CONT "OK\n");
+
+	return 0;
+}
+
+late_initcall(ftrace_test_event_filter);
+
+#endif /* CONFIG_FTRACE_STARTUP_TEST */
diff --git a/kernel/trace/trace_events_filter_test.h b/kernel/trace/trace_events_filter_test.h
new file mode 100644
index 0000000..bfd4dba
--- /dev/null
+++ b/kernel/trace/trace_events_filter_test.h
@@ -0,0 +1,50 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM test
+
+#if !defined(_TRACE_TEST_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_TEST_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(ftrace_test_filter,
+
+	TP_PROTO(int a, int b, int c, int d, int e, int f, int g, int h),
+
+	TP_ARGS(a, b, c, d, e, f, g, h),
+
+	TP_STRUCT__entry(
+		__field(int, a)
+		__field(int, b)
+		__field(int, c)
+		__field(int, d)
+		__field(int, e)
+		__field(int, f)
+		__field(int, g)
+		__field(int, h)
+	),
+
+	TP_fast_assign(
+		__entry->a = a;
+		__entry->b = b;
+		__entry->c = c;
+		__entry->d = d;
+		__entry->e = e;
+		__entry->f = f;
+		__entry->g = g;
+		__entry->h = h;
+	),
+
+	TP_printk("a %d, b %d, c %d, d %d, e %d, f %d, g %d, h %d",
+		  __entry->a, __entry->b, __entry->c, __entry->d,
+		  __entry->e, __entry->f, __entry->g, __entry->h)
+);
+
+#endif /* _TRACE_TEST_H || TRACE_HEADER_MULTI_READ */
+
+#undef TRACE_INCLUDE_PATH
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_PATH .
+#define TRACE_INCLUDE_FILE trace_events_filter_test
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
-- 
1.7.5.4



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 14/21] tracing: Add preempt disable for filter self test
  2011-09-22 22:09 [PATCH 00/21] [GIT PULL][v3.2] tracing: queued up stuff waiting for k.org to come back on line Steven Rostedt
                   ` (12 preceding siblings ...)
  2011-09-22 22:09 ` [PATCH 13/21] tracing/filter: Add startup tests for events filter Steven Rostedt
@ 2011-09-22 22:09 ` Steven Rostedt
  2011-09-22 22:09 ` [PATCH 15/21] trace: Add a new readonly entry to report total buffer size Steven Rostedt
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2011-09-22 22:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Thomas Gleixner, Jiri Olsa

[-- Attachment #1: 0014-tracing-Add-preempt-disable-for-filter-self-test.patch --]
[-- Type: text/plain, Size: 1223 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

The self tests for event filters do not really need preemption
disabled, as there are no races at the time of testing, but the
functions they call use rcu_dereference_sched(), which will complain
if preemption is not disabled.

Cc: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/trace_events_filter.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 6a642e2..816d3d0 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -2101,6 +2101,11 @@ static __init int ftrace_test_event_filter(void)
 			break;
 		}
 
+		/*
+		 * The preemption disabling is not really needed for self
+		 * tests, but the rcu dereference will complain without it.
+		 */
+		preempt_disable();
 		if (*d->not_visited)
 			walk_pred_tree(filter->preds, filter->root,
 				       test_walk_pred_cb,
@@ -2108,6 +2113,7 @@ static __init int ftrace_test_event_filter(void)
 
 		test_pred_visited = 0;
 		err = filter_match_preds(filter, &d->rec);
+		preempt_enable();
 
 		__free_filter(filter);
 
-- 
1.7.5.4



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 15/21] trace: Add a new readonly entry to report total buffer size
  2011-09-22 22:09 [PATCH 00/21] [GIT PULL][v3.2] tracing: queued up stuff waiting for k.org to come back on line Steven Rostedt
                   ` (13 preceding siblings ...)
  2011-09-22 22:09 ` [PATCH 14/21] tracing: Add preempt disable for filter self test Steven Rostedt
@ 2011-09-22 22:09 ` Steven Rostedt
  2011-09-22 22:09 ` [PATCH 16/21] trace: Add ring buffer stats to measure rate of events Steven Rostedt
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2011-09-22 22:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Thomas Gleixner, Vaibhav Nagarnaik, Michael Rubin, David Sharp

[-- Attachment #1: 0015-trace-Add-a-new-readonly-entry-to-report-total-buffe.patch --]
[-- Type: text/plain, Size: 2541 bytes --]

From: Vaibhav Nagarnaik <vnagarnaik@google.com>

The current file "buffer_size_kb" reports the size of a single
per-cpu buffer and not the overall memory allocated, which can be
misleading. A new file "buffer_total_size_kb" adds up the sizes of
all enabled CPU buffers and reports the total. This entry is
read-only.

Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: David Sharp <dhsharp@google.com>
Link: http://lkml.kernel.org/r/1313531179-9323-2-git-send-email-vnagarnaik@google.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/trace.c |   33 +++++++++++++++++++++++++++++++++
 1 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index e5df02c..0117678 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -3569,6 +3569,30 @@ tracing_entries_write(struct file *filp, const char __user *ubuf,
 }
 
 static ssize_t
+tracing_total_entries_read(struct file *filp, char __user *ubuf,
+				size_t cnt, loff_t *ppos)
+{
+	struct trace_array *tr = filp->private_data;
+	char buf[64];
+	int r, cpu;
+	unsigned long size = 0, expanded_size = 0;
+
+	mutex_lock(&trace_types_lock);
+	for_each_tracing_cpu(cpu) {
+		size += tr->entries >> 10;
+		if (!ring_buffer_expanded)
+			expanded_size += trace_buf_size >> 10;
+	}
+	if (ring_buffer_expanded)
+		r = sprintf(buf, "%lu\n", size);
+	else
+		r = sprintf(buf, "%lu (expanded: %lu)\n", size, expanded_size);
+	mutex_unlock(&trace_types_lock);
+
+	return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
+}
+
+static ssize_t
 tracing_free_buffer_write(struct file *filp, const char __user *ubuf,
 			  size_t cnt, loff_t *ppos)
 {
@@ -3739,6 +3763,12 @@ static const struct file_operations tracing_entries_fops = {
 	.llseek		= generic_file_llseek,
 };
 
+static const struct file_operations tracing_total_entries_fops = {
+	.open		= tracing_open_generic,
+	.read		= tracing_total_entries_read,
+	.llseek		= generic_file_llseek,
+};
+
 static const struct file_operations tracing_free_buffer_fops = {
 	.write		= tracing_free_buffer_write,
 	.release	= tracing_free_buffer_release,
@@ -4450,6 +4480,9 @@ static __init int tracer_init_debugfs(void)
 	trace_create_file("buffer_size_kb", 0644, d_tracer,
 			&global_trace, &tracing_entries_fops);
 
+	trace_create_file("buffer_total_size_kb", 0444, d_tracer,
+			&global_trace, &tracing_total_entries_fops);
+
 	trace_create_file("free_buffer", 0644, d_tracer,
 			&global_trace, &tracing_free_buffer_fops);
 
-- 
1.7.5.4



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 16/21] trace: Add ring buffer stats to measure rate of events
  2011-09-22 22:09 [PATCH 00/21] [GIT PULL][v3.2] tracing: queued up stuff waiting for k.org to come back on line Steven Rostedt
                   ` (14 preceding siblings ...)
  2011-09-22 22:09 ` [PATCH 15/21] trace: Add a new readonly entry to report total buffer size Steven Rostedt
@ 2011-09-22 22:09 ` Steven Rostedt
  2011-09-22 22:09 ` [PATCH 17/21] tracing: Add a counter clock for those that do not trust clocks Steven Rostedt
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2011-09-22 22:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Thomas Gleixner, Vaibhav Nagarnaik, Michael Rubin, David Sharp

[-- Attachment #1: 0016-trace-Add-ring-buffer-stats-to-measure-rate-of-event.patch --]
[-- Type: text/plain, Size: 8149 bytes --]

From: Vaibhav Nagarnaik <vnagarnaik@google.com>

The stats file under the per_cpu folder provides the number of
entries, overruns, and other statistics about the CPU ring buffer.
However, these numbers do not indicate how full the ring buffer is
in bytes compared to its overall size in bytes. It is also helpful
to know the rate at which the CPU buffer is filling up.

This patch adds an entry "bytes: " in printed stats for per_cpu ring
buffer which provides the actual bytes consumed in the ring buffer. This
field includes the number of bytes used by recorded events and the
padding bytes added when moving the tail pointer to next page.

It also adds the following time stamps:
"oldest event ts:" - the oldest timestamp in the ring buffer
"now ts:"  - the timestamp at the time of reading

The field "now ts" provides a consistent time snapshot to userspace
at the moment the file is read. It is taken from the same trace
clock that is used for event timestamps.

Together, these values provide the rate at which the buffer is filling
up, from the formula:
bytes / (now_ts - oldest_event_ts)

Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: David Sharp <dhsharp@google.com>
Link: http://lkml.kernel.org/r/1313531179-9323-3-git-send-email-vnagarnaik@google.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 include/linux/ring_buffer.h |    2 +
 kernel/trace/ring_buffer.c  |   70 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/trace/trace.c        |   13 ++++++++
 3 files changed, 84 insertions(+), 1 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index b891de9..67be037 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -154,6 +154,8 @@ void ring_buffer_record_enable(struct ring_buffer *buffer);
 void ring_buffer_record_disable_cpu(struct ring_buffer *buffer, int cpu);
 void ring_buffer_record_enable_cpu(struct ring_buffer *buffer, int cpu);
 
+unsigned long ring_buffer_oldest_event_ts(struct ring_buffer *buffer, int cpu);
+unsigned long ring_buffer_bytes_cpu(struct ring_buffer *buffer, int cpu);
 unsigned long ring_buffer_entries(struct ring_buffer *buffer);
 unsigned long ring_buffer_overruns(struct ring_buffer *buffer);
 unsigned long ring_buffer_entries_cpu(struct ring_buffer *buffer, int cpu);
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 731201b..acf6b68 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -488,12 +488,14 @@ struct ring_buffer_per_cpu {
 	struct buffer_page		*reader_page;
 	unsigned long			lost_events;
 	unsigned long			last_overrun;
+	local_t				entries_bytes;
 	local_t				commit_overrun;
 	local_t				overrun;
 	local_t				entries;
 	local_t				committing;
 	local_t				commits;
 	unsigned long			read;
+	unsigned long			read_bytes;
 	u64				write_stamp;
 	u64				read_stamp;
 };
@@ -1708,6 +1710,7 @@ rb_handle_head_page(struct ring_buffer_per_cpu *cpu_buffer,
 		 * the counters.
 		 */
 		local_add(entries, &cpu_buffer->overrun);
+		local_sub(BUF_PAGE_SIZE, &cpu_buffer->entries_bytes);
 
 		/*
 		 * The entries will be zeroed out when we move the
@@ -1863,6 +1866,9 @@ rb_reset_tail(struct ring_buffer_per_cpu *cpu_buffer,
 	event = __rb_page_index(tail_page, tail);
 	kmemcheck_annotate_bitfield(event, bitfield);
 
+	/* account for padding bytes */
+	local_add(BUF_PAGE_SIZE - tail, &cpu_buffer->entries_bytes);
+
 	/*
 	 * Save the original length to the meta data.
 	 * This will be used by the reader to add lost event
@@ -2054,6 +2060,9 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
 	if (!tail)
 		tail_page->page->time_stamp = ts;
 
+	/* account for these added bytes */
+	local_add(length, &cpu_buffer->entries_bytes);
+
 	return event;
 }
 
@@ -2076,6 +2085,7 @@ rb_try_to_discard(struct ring_buffer_per_cpu *cpu_buffer,
 	if (bpage->page == (void *)addr && rb_page_write(bpage) == old_index) {
 		unsigned long write_mask =
 			local_read(&bpage->write) & ~RB_WRITE_MASK;
+		unsigned long event_length = rb_event_length(event);
 		/*
 		 * This is on the tail page. It is possible that
 		 * a write could come in and move the tail page
@@ -2085,8 +2095,11 @@ rb_try_to_discard(struct ring_buffer_per_cpu *cpu_buffer,
 		old_index += write_mask;
 		new_index += write_mask;
 		index = local_cmpxchg(&bpage->write, old_index, new_index);
-		if (index == old_index)
+		if (index == old_index) {
+			/* update counters */
+			local_sub(event_length, &cpu_buffer->entries_bytes);
 			return 1;
+		}
 	}
 
 	/* could not discard */
@@ -2661,6 +2674,58 @@ rb_num_of_entries(struct ring_buffer_per_cpu *cpu_buffer)
 }
 
 /**
+ * ring_buffer_oldest_event_ts - get the oldest event timestamp from the buffer
+ * @buffer: The ring buffer
+ * @cpu: The per CPU buffer to read from.
+ */
+unsigned long ring_buffer_oldest_event_ts(struct ring_buffer *buffer, int cpu)
+{
+	unsigned long flags;
+	struct ring_buffer_per_cpu *cpu_buffer;
+	struct buffer_page *bpage;
+	unsigned long ret;
+
+	if (!cpumask_test_cpu(cpu, buffer->cpumask))
+		return 0;
+
+	cpu_buffer = buffer->buffers[cpu];
+	spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
+	/*
+	 * if the tail is on reader_page, oldest time stamp is on the reader
+	 * page
+	 */
+	if (cpu_buffer->tail_page == cpu_buffer->reader_page)
+		bpage = cpu_buffer->reader_page;
+	else
+		bpage = rb_set_head_page(cpu_buffer);
+	ret = bpage->page->time_stamp;
+	spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(ring_buffer_oldest_event_ts);
+
+/**
+ * ring_buffer_bytes_cpu - get the number of bytes consumed in a cpu buffer
+ * @buffer: The ring buffer
+ * @cpu: The per CPU buffer to read from.
+ */
+unsigned long ring_buffer_bytes_cpu(struct ring_buffer *buffer, int cpu)
+{
+	struct ring_buffer_per_cpu *cpu_buffer;
+	unsigned long ret;
+
+	if (!cpumask_test_cpu(cpu, buffer->cpumask))
+		return 0;
+
+	cpu_buffer = buffer->buffers[cpu];
+	ret = local_read(&cpu_buffer->entries_bytes) - cpu_buffer->read_bytes;
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(ring_buffer_bytes_cpu);
+
+/**
  * ring_buffer_entries_cpu - get the number of entries in a cpu buffer
  * @buffer: The ring buffer
  * @cpu: The per CPU buffer to get the entries from.
@@ -3527,11 +3592,13 @@ rb_reset_cpu(struct ring_buffer_per_cpu *cpu_buffer)
 	cpu_buffer->reader_page->read = 0;
 
 	local_set(&cpu_buffer->commit_overrun, 0);
+	local_set(&cpu_buffer->entries_bytes, 0);
 	local_set(&cpu_buffer->overrun, 0);
 	local_set(&cpu_buffer->entries, 0);
 	local_set(&cpu_buffer->committing, 0);
 	local_set(&cpu_buffer->commits, 0);
 	cpu_buffer->read = 0;
+	cpu_buffer->read_bytes = 0;
 
 	cpu_buffer->write_stamp = 0;
 	cpu_buffer->read_stamp = 0;
@@ -3918,6 +3985,7 @@ int ring_buffer_read_page(struct ring_buffer *buffer,
 	} else {
 		/* update the entry counter */
 		cpu_buffer->read += rb_page_entries(reader);
+		cpu_buffer->read_bytes += BUF_PAGE_SIZE;
 
 		/* swap the pages */
 		rb_init_page(bpage);
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 0117678..b419070 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4056,6 +4056,8 @@ tracing_stats_read(struct file *filp, char __user *ubuf,
 	struct trace_array *tr = &global_trace;
 	struct trace_seq *s;
 	unsigned long cnt;
+	unsigned long long t;
+	unsigned long usec_rem;
 
 	s = kmalloc(sizeof(*s), GFP_KERNEL);
 	if (!s)
@@ -4072,6 +4074,17 @@ tracing_stats_read(struct file *filp, char __user *ubuf,
 	cnt = ring_buffer_commit_overrun_cpu(tr->buffer, cpu);
 	trace_seq_printf(s, "commit overrun: %ld\n", cnt);
 
+	cnt = ring_buffer_bytes_cpu(tr->buffer, cpu);
+	trace_seq_printf(s, "bytes: %ld\n", cnt);
+
+	t = ns2usecs(ring_buffer_oldest_event_ts(tr->buffer, cpu));
+	usec_rem = do_div(t, USEC_PER_SEC);
+	trace_seq_printf(s, "oldest event ts: %5llu.%06lu\n", t, usec_rem);
+
+	t = ns2usecs(ring_buffer_time_stamp(tr->buffer, cpu));
+	usec_rem = do_div(t, USEC_PER_SEC);
+	trace_seq_printf(s, "now ts: %5llu.%06lu\n", t, usec_rem);
+
 	count = simple_read_from_buffer(ubuf, count, ppos, s->buffer, s->len);
 
 	kfree(s);
-- 
1.7.5.4



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 17/21] tracing: Add a counter clock for those that do not trust clocks
  2011-09-22 22:09 [PATCH 00/21] [GIT PULL][v3.2] tracing: queued up stuff waiting for k.org to come back on line Steven Rostedt
                   ` (15 preceding siblings ...)
  2011-09-22 22:09 ` [PATCH 16/21] trace: Add ring buffer stats to measure rate of events Steven Rostedt
@ 2011-09-22 22:09 ` Steven Rostedt
  2011-09-22 22:09 ` [PATCH 18/21] tracing: Fix preemptirqsoff tracer to not stop at preempt off Steven Rostedt
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2011-09-22 22:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Thomas Gleixner, Peter Zijlstra, Valdis Kletnieks

[-- Attachment #1: 0017-tracing-Add-a-counter-clock-for-those-that-do-not-tr.patch --]
[-- Type: text/plain, Size: 2475 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

When debugging tight race conditions, it can be helpful to have a
synchronized tracing method. Although in most cases the global clock
provides this functionality, when timing is not the issue it is more
comforting to know the precise order in which events happened.

Instead of using a clock, add a "counter" that is simply an incrementing
atomic 64bit counter that orders the events as they are perceived to
happen.

The trace_clock_counter() code comes from Peter Zijlstra's attempt
to convert trace_clock_global() to a counter. I took Peter's counter
code, made it trace_clock_counter() instead, and added it to the
choice of clocks. Just echo counter > /debug/tracing/trace_clock to
activate it.

Requested-by: Thomas Gleixner <tglx@linutronix.de>
Requested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-By: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 include/linux/trace_clock.h |    1 +
 kernel/trace/trace.c        |    1 +
 kernel/trace/trace_clock.c  |   12 ++++++++++++
 3 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/include/linux/trace_clock.h b/include/linux/trace_clock.h
index 7a81303..4eb4902 100644
--- a/include/linux/trace_clock.h
+++ b/include/linux/trace_clock.h
@@ -15,5 +15,6 @@
 extern u64 notrace trace_clock_local(void);
 extern u64 notrace trace_clock(void);
 extern u64 notrace trace_clock_global(void);
+extern u64 notrace trace_clock_counter(void);
 
 #endif /* _LINUX_TRACE_CLOCK_H */
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index b419070..4b8df0d 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -435,6 +435,7 @@ static struct {
 } trace_clocks[] = {
 	{ trace_clock_local,	"local" },
 	{ trace_clock_global,	"global" },
+	{ trace_clock_counter,	"counter" },
 };
 
 int trace_clock_id;
diff --git a/kernel/trace/trace_clock.c b/kernel/trace/trace_clock.c
index 6302747..3947835 100644
--- a/kernel/trace/trace_clock.c
+++ b/kernel/trace/trace_clock.c
@@ -113,3 +113,15 @@ u64 notrace trace_clock_global(void)
 
 	return now;
 }
+
+static atomic64_t trace_counter;
+
+/*
+ * trace_clock_counter(): simply an atomic counter.
+ * Use the trace_counter "counter" for cases where you do not care
+ * about timings, but are interested in strict ordering.
+ */
+u64 notrace trace_clock_counter(void)
+{
+	return atomic64_add_return(1, &trace_counter);
+}
-- 
1.7.5.4



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 18/21] tracing: Fix preemptirqsoff tracer to not stop at preempt off
  2011-09-22 22:09 [PATCH 00/21] [GIT PULL][v3.2] tracing: queued up stuff waiting for k.org to come back on line Steven Rostedt
                   ` (16 preceding siblings ...)
  2011-09-22 22:09 ` [PATCH 17/21] tracing: Add a counter clock for those that do not trust clocks Steven Rostedt
@ 2011-09-22 22:09 ` Steven Rostedt
  2011-09-22 22:09 ` [PATCH 19/21] tracing: Account for preempt off in preempt_schedule() Steven Rostedt
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2011-09-22 22:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Thomas Gleixner

[-- Attachment #1: 0018-tracing-Fix-preemptirqsoff-tracer-to-not-stop-at-pre.patch --]
[-- Type: text/plain, Size: 1065 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

If irqs are disabled when the preempt count reaches zero, the
preemptirqsoff tracer should not flag that as the end of the
critical section.

When interrupts are enabled while the preempt count is not zero,
the preemptirqsoff tracer correctly continues its tracing.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/trace_irqsoff.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c
index 667aa8c..a1a3359 100644
--- a/kernel/trace/trace_irqsoff.c
+++ b/kernel/trace/trace_irqsoff.c
@@ -505,13 +505,13 @@ EXPORT_SYMBOL(trace_hardirqs_off_caller);
 #ifdef CONFIG_PREEMPT_TRACER
 void trace_preempt_on(unsigned long a0, unsigned long a1)
 {
-	if (preempt_trace())
+	if (preempt_trace() && !irq_trace())
 		stop_critical_timing(a0, a1);
 }
 
 void trace_preempt_off(unsigned long a0, unsigned long a1)
 {
-	if (preempt_trace())
+	if (preempt_trace() && !irq_trace())
 		start_critical_timing(a0, a1);
 }
 #endif /* CONFIG_PREEMPT_TRACER */
-- 
1.7.5.4



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 19/21] tracing: Account for preempt off in preempt_schedule()
  2011-09-22 22:09 [PATCH 00/21] [GIT PULL][v3.2] tracing: queued up stuff waiting for k.org to come back on line Steven Rostedt
                   ` (17 preceding siblings ...)
  2011-09-22 22:09 ` [PATCH 18/21] tracing: Fix preemptirqsoff tracer to not stop at preempt off Steven Rostedt
@ 2011-09-22 22:09 ` Steven Rostedt
  2011-09-23 11:00   ` Peter Zijlstra
  2011-09-22 22:09 ` [PATCH 20/21] tracing: Do not allocate buffer for trace_marker Steven Rostedt
  2011-09-22 22:09 ` [PATCH 21/21] tracing: Add optional percpu buffers for trace_printk() Steven Rostedt
  20 siblings, 1 reply; 34+ messages in thread
From: Steven Rostedt @ 2011-09-22 22:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Thomas Gleixner

[-- Attachment #1: 0019-tracing-Account-for-preempt-off-in-preempt_schedule.patch --]
[-- Type: text/plain, Size: 1637 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

preempt_schedule() uses the preempt_disable_notrace() version
because the traced version can cause infinite recursion with the
function tracer: the function tracer uses preempt_enable_notrace(),
which may call back into preempt_schedule() while NEED_RESCHED is
still set and PREEMPT_ACTIVE has not been set yet.

See commit d1f74e20b5b064a130cd0743a256c2d3cfe84010, which made this
change.

The preemptoff and preemptirqsoff latency tracers require the first
and last preempt count modifiers to enable tracing, but the notrace
versions skip those checks. Since we cannot convert them back to the
non-notrace versions, we can use the idle hooks for the latency
tracers here. That is, start/stop_critical_timings() works well to
manually start and stop the latency tracer for preempt-off timings.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/sched.c |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index ccacdbd..4b096cc 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4435,7 +4435,16 @@ asmlinkage void __sched notrace preempt_schedule(void)
 
 	do {
 		add_preempt_count_notrace(PREEMPT_ACTIVE);
+		/*
+		 * The add/subtract must not be traced by the function
+		 * tracer. But we still want to account for the
+		 * preempt off latency tracer. Since the _notrace versions
+		 * of add/subtract skip the accounting for latency tracer
+		 * we must force it manually.
+		 */
+		start_critical_timings();
 		schedule();
+		stop_critical_timings();
 		sub_preempt_count_notrace(PREEMPT_ACTIVE);
 
 		/*
-- 
1.7.5.4



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 20/21] tracing: Do not allocate buffer for trace_marker
  2011-09-22 22:09 [PATCH 00/21] [GIT PULL][v3.2] tracing: queued up stuff waiting for k.org to come back on line Steven Rostedt
                   ` (18 preceding siblings ...)
  2011-09-22 22:09 ` [PATCH 19/21] tracing: Account for preempt off in preempt_schedule() Steven Rostedt
@ 2011-09-22 22:09 ` Steven Rostedt
  2011-09-22 22:09 ` [PATCH 21/21] tracing: Add optional percpu buffers for trace_printk() Steven Rostedt
  20 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2011-09-22 22:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Thomas Gleixner

[-- Attachment #1: 0020-tracing-Do-not-allocate-buffer-for-trace_marker.patch --]
[-- Type: text/plain, Size: 3191 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

When doing intense tracing, the kmalloc inside trace_marker can
introduce side effects to what is being traced. It is best to simply
use a static buffer and grab a mutex to write to it. This keeps
the impact of using the trace_marker() to a minimum.
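The pattern the patch applies (one static buffer plus a mutex, instead of a
kmalloc() per write) can be sketched in plain userspace C. This is only an
illustration: marker_write() is a hypothetical stand-in for the kernel's
tracing_mark_write(), and the pthread mutex stands in for the kernel mutex.

```c
#include <string.h>
#include <sys/types.h>
#include <pthread.h>

#define BUF_SIZE 1024

/* One static buffer shared by all writers, as in the patch:
 * no per-call allocation; a mutex serializes access. */
static char marker_buf[BUF_SIZE];
static pthread_mutex_t marker_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical stand-in for tracing_mark_write(): copy the string
 * into the static buffer, make sure it ends in "\n\0", and report
 * how many bytes the "trace" would contain. */
static ssize_t marker_write(const char *ubuf, size_t cnt)
{
	ssize_t written;

	if (cnt >= BUF_SIZE)
		cnt = BUF_SIZE - 2;	/* leave room for '\n' and '\0' */

	pthread_mutex_lock(&marker_lock);
	memcpy(marker_buf, ubuf, cnt);
	if (cnt == 0 || marker_buf[cnt - 1] != '\n') {
		marker_buf[cnt] = '\n';
		marker_buf[cnt + 1] = '\0';
	} else {
		marker_buf[cnt] = '\0';
	}
	written = (ssize_t)strlen(marker_buf);
	pthread_mutex_unlock(&marker_lock);

	return written;
}
```

The trade-off is the same as in the patch: writes are serialized, but the
write path no longer perturbs the allocator while tracing is active.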

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 kernel/trace/trace.c |   33 +++++++++++++++++----------------
 1 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 4b8df0d..e463125 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1442,15 +1442,19 @@ static void __trace_userstack(struct trace_array *tr, unsigned long flags)
 
 #endif /* CONFIG_STACKTRACE */
 
+static char trace_buf[TRACE_BUF_SIZE];
+static char trace_ubuf[TRACE_BUF_SIZE];
+
+static arch_spinlock_t trace_buf_lock =
+	(arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
+static DEFINE_MUTEX(trace_ubuf_mutex);
+
 /**
  * trace_vbprintk - write binary msg to tracing buffer
  *
  */
 int trace_vbprintk(unsigned long ip, const char *fmt, va_list args)
 {
-	static arch_spinlock_t trace_buf_lock =
-		(arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
-	static u32 trace_buf[TRACE_BUF_SIZE];
 
 	struct ftrace_event_call *call = &event_bprint;
 	struct ring_buffer_event *event;
@@ -1480,7 +1484,7 @@ int trace_vbprintk(unsigned long ip, const char *fmt, va_list args)
 	/* Lockdep uses trace_printk for lock tracing */
 	local_irq_save(flags);
 	arch_spin_lock(&trace_buf_lock);
-	len = vbin_printf(trace_buf, TRACE_BUF_SIZE, fmt, args);
+	len = vbin_printf((u32 *)trace_buf, TRACE_BUF_SIZE/sizeof(int), fmt, args);
 
 	if (len > TRACE_BUF_SIZE || len < 0)
 		goto out_unlock;
@@ -1532,9 +1536,6 @@ int trace_array_printk(struct trace_array *tr,
 int trace_array_vprintk(struct trace_array *tr,
 			unsigned long ip, const char *fmt, va_list args)
 {
-	static arch_spinlock_t trace_buf_lock = __ARCH_SPIN_LOCK_UNLOCKED;
-	static char trace_buf[TRACE_BUF_SIZE];
-
 	struct ftrace_event_call *call = &event_print;
 	struct ring_buffer_event *event;
 	struct ring_buffer *buffer;
@@ -3633,23 +3634,22 @@ static ssize_t
 tracing_mark_write(struct file *filp, const char __user *ubuf,
 					size_t cnt, loff_t *fpos)
 {
-	char *buf;
 	size_t written;
+	char *buf = trace_ubuf;
 
 	if (tracing_disabled)
 		return -EINVAL;
 
-	if (cnt > TRACE_BUF_SIZE)
-		cnt = TRACE_BUF_SIZE;
+	if (cnt >= TRACE_BUF_SIZE)
+		cnt = TRACE_BUF_SIZE - 2; /* \n\0 */
 
-	buf = kmalloc(cnt + 2, GFP_KERNEL);
-	if (buf == NULL)
-		return -ENOMEM;
+	mutex_lock(&trace_ubuf_mutex);
 
 	if (copy_from_user(buf, ubuf, cnt)) {
-		kfree(buf);
-		return -EFAULT;
+		written = -EFAULT;
+		goto out;
 	}
+
 	if (buf[cnt-1] != '\n') {
 		buf[cnt] = '\n';
 		buf[cnt+1] = '\0';
@@ -3657,12 +3657,13 @@ tracing_mark_write(struct file *filp, const char __user *ubuf,
 		buf[cnt] = '\0';
 
 	written = mark_printk("%s", buf);
-	kfree(buf);
 	*fpos += written;
 
 	/* don't tell userspace we wrote more - it might confuse them */
 	if (written > cnt)
 		written = cnt;
+ out:
+	mutex_unlock(&trace_ubuf_mutex);
 
 	return written;
 }
-- 
1.7.5.4



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 21/21] tracing: Add optional percpu buffers for trace_printk()
  2011-09-22 22:09 [PATCH 00/21] [GIT PULL][v3.2] tracing: queued up stuff waiting for k.org to come back on line Steven Rostedt
                   ` (19 preceding siblings ...)
  2011-09-22 22:09 ` [PATCH 20/21] tracing: Do not allocate buffer for trace_marker Steven Rostedt
@ 2011-09-22 22:09 ` Steven Rostedt
  2011-09-23 11:02   ` Peter Zijlstra
  2011-09-23 11:07   ` Peter Zijlstra
  20 siblings, 2 replies; 34+ messages in thread
From: Steven Rostedt @ 2011-09-22 22:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ingo Molnar, Andrew Morton, Peter Zijlstra, Frederic Weisbecker,
	Thomas Gleixner

[-- Attachment #1: 0021-tracing-Add-optional-percpu-buffers-for-trace_printk.patch --]
[-- Type: text/plain, Size: 8792 bytes --]

From: Steven Rostedt <srostedt@redhat.com>

Currently, trace_printk() uses a single buffer to write into
to calculate the size and format needed to save the trace. To
do this safely in an SMP environment, a spin_lock() is taken
to only allow one writer at a time to the buffer. But this could
also affect what is being traced, and add synchronization that
would not be there otherwise.

Ideally, using percpu buffers would be useful, but since trace_printk()
is only used in development, having per cpu buffers for something
never used is a waste of space. We could make this a compile option
but trace_printk() may also be used for developing modules on a
distro kernel.

The approach taken here is to dynamically allocate percpu buffers
with a new tracing/options switch "trace_printk_percpu". It can
be allocated and freed at runtime, or "trace_printk_percpu" may also
be specified on the command line and it will be allocated at boot up.

This allows a developer to create percpu buffers for trace_printk()
on a running system, and also free them when not in use.
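The runtime alloc/free scheme described above can be approximated in
userspace. All names here (get_trace_buf(), enable_percpu(), ...) are
illustrative stand-ins, with a pthread mutex in place of the arch spinlock
and a plain array in place of real percpu data:

```c
#include <stdlib.h>
#include <pthread.h>

#define NCPUS    4
#define BUF_SIZE 1024

struct trace_buffer_struct {
	char buffer[BUF_SIZE];
};

/* When non-NULL, per-"cpu" buffers exist and writers take the
 * lockless path; when NULL, everyone falls back to one shared,
 * mutex-protected buffer. */
static struct trace_buffer_struct *percpu_bufs;	/* NCPUS entries */
static char shared_buf[BUF_SIZE];
static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;

static char *get_trace_buf(int cpu, int *locked)
{
	struct trace_buffer_struct *bufs = percpu_bufs;	/* read once */

	if (bufs) {
		*locked = 0;
		return bufs[cpu].buffer;
	}
	*locked = 1;
	pthread_mutex_lock(&shared_lock);
	return shared_buf;
}

static void put_trace_buf(int locked)
{
	if (locked)
		pthread_mutex_unlock(&shared_lock);
}

/* Runtime toggle, mirroring the trace_printk_percpu option. */
static int enable_percpu(void)
{
	percpu_bufs = calloc(NCPUS, sizeof(*percpu_bufs));
	return percpu_bufs != NULL;
}

static void free_percpu_bufs(void)
{
	struct trace_buffer_struct *bufs = percpu_bufs;

	percpu_bufs = NULL;
	/* the kernel patch does a synchronize_sched() here before
	 * freeing, so in-flight users are done with the memory */
	free(bufs);
}
```

The key point is that the pointer is read once per write: a writer either
sees the percpu buffers and goes lockless, or sees NULL and takes the lock.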

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
 Documentation/kernel-parameters.txt |   11 ++++
 kernel/trace/trace.c                |  105 ++++++++++++++++++++++++++++++++---
 kernel/trace/trace.h                |    1 +
 3 files changed, 108 insertions(+), 9 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index e279b72..8fc3e58 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2521,6 +2521,17 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			to facilitate early boot debugging.
 			See also Documentation/trace/events.txt
 
+	trace_printk_percpu
+			[FTRACE] When using trace_printk(), by default the
+			temporary buffer is a single instance with a spinlock
+			to protect it. By adding this switch or enabling it at
+			runtime with:
+			echo 1 > /sys/kernel/debug/tracing/options/trace_printk_percpu
+			Per CPU buffers are allocated and trace_printk() will write
+			to them without taking any locks. Echo 0 into the above
+			file and the percpu buffers will safely be freed, even
+			if they were allocated with this kernel command line.
+
 	tsc=		Disable clocksource stability checks for TSC.
 			Format: <string>
 			[x86] reliable: mark tsc clocksource as reliable, this
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index e463125..8975461 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -343,6 +343,13 @@ unsigned long trace_flags = TRACE_ITER_PRINT_PARENT | TRACE_ITER_PRINTK |
 static int trace_stop_count;
 static DEFINE_SPINLOCK(tracing_start_lock);
 
+static int __init set_percpu_trace_buffer(char *str)
+{
+	trace_flags |= TRACE_ITER_PRINTK_PERCPU;
+	return 1;
+}
+__setup("trace_printk_percpu", set_percpu_trace_buffer);
+
 static void wakeup_work_handler(struct work_struct *work)
 {
 	wake_up(&trace_wait);
@@ -426,6 +433,7 @@ static const char *trace_options[] = {
 	"record-cmd",
 	"overwrite",
 	"disable_on_free",
+	"trace_printk_percpu",
 	NULL
 };
 
@@ -1449,6 +1457,45 @@ static arch_spinlock_t trace_buf_lock =
 	(arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
 static DEFINE_MUTEX(trace_ubuf_mutex);
 
+/* created for use with alloc_percpu */
+struct trace_buffer_struct {
+	char buffer[TRACE_BUF_SIZE];
+};
+
+static struct trace_buffer_struct *trace_percpu_buffer;
+
+/*
+ * Since the buffers may be freed, we access the trace_percpu_buffer
+ * once, and then use it or not depending on if it exists.
+ * The code that frees it will NULL out trace_percpu_buffer first
+ * and then perform a synchronize_sched() to ensure all users
+ * are done with it.
+ *
+ * Interrupts are disabled around the get and put operations below.
+ */
+static char *get_trace_buf(int *put)
+{
+	struct trace_buffer_struct *percpu_buffer =
+		rcu_dereference_raw(trace_percpu_buffer);
+	struct trace_buffer_struct *buffer;
+
+	if (percpu_buffer) {
+		*put = 1;
+		buffer = per_cpu_ptr(percpu_buffer, smp_processor_id());
+		return buffer->buffer;
+	}
+	*put = 0;
+	arch_spin_lock(&trace_buf_lock);
+	return trace_buf;
+}
+
+static void put_trace_buf(int put)
+{
+	if (put)
+		return;
+	arch_spin_unlock(&trace_buf_lock);
+}
+
 /**
  * trace_vbprintk - write binary msg to tracing buffer
  *
@@ -1462,9 +1509,11 @@ int trace_vbprintk(unsigned long ip, const char *fmt, va_list args)
 	struct trace_array *tr = &global_trace;
 	struct trace_array_cpu *data;
 	struct bprint_entry *entry;
+	char *tbuffer;
 	unsigned long flags;
 	int disable;
 	int cpu, len = 0, size, pc;
+	int put;
 
 	if (unlikely(tracing_selftest_running || tracing_disabled))
 		return 0;
@@ -1483,8 +1532,8 @@ int trace_vbprintk(unsigned long ip, const char *fmt, va_list args)
 
 	/* Lockdep uses trace_printk for lock tracing */
 	local_irq_save(flags);
-	arch_spin_lock(&trace_buf_lock);
-	len = vbin_printf((u32 *)trace_buf, TRACE_BUF_SIZE/sizeof(int), fmt, args);
+	tbuffer = get_trace_buf(&put);
+	len = vbin_printf((u32 *)tbuffer, TRACE_BUF_SIZE/sizeof(int), fmt, args);
 
 	if (len > TRACE_BUF_SIZE || len < 0)
 		goto out_unlock;
@@ -1499,14 +1548,14 @@ int trace_vbprintk(unsigned long ip, const char *fmt, va_list args)
 	entry->ip			= ip;
 	entry->fmt			= fmt;
 
-	memcpy(entry->buf, trace_buf, sizeof(u32) * len);
+	memcpy(entry->buf, tbuffer, sizeof(u32) * len);
 	if (!filter_check_discard(call, entry, buffer, event)) {
 		ring_buffer_unlock_commit(buffer, event);
 		ftrace_trace_stack(buffer, flags, 6, pc);
 	}
 
 out_unlock:
-	arch_spin_unlock(&trace_buf_lock);
+	put_trace_buf(put);
 	local_irq_restore(flags);
 
 out:
@@ -1543,7 +1592,9 @@ int trace_array_vprintk(struct trace_array *tr,
 	int cpu, len = 0, size, pc;
 	struct print_entry *entry;
 	unsigned long irq_flags;
+	char *tbuffer;
 	int disable;
+	int put;
 
 	if (tracing_disabled || tracing_selftest_running)
 		return 0;
@@ -1559,19 +1610,19 @@ int trace_array_vprintk(struct trace_array *tr,
 
 	pause_graph_tracing();
 	raw_local_irq_save(irq_flags);
-	arch_spin_lock(&trace_buf_lock);
-	len = vsnprintf(trace_buf, TRACE_BUF_SIZE, fmt, args);
+	tbuffer = get_trace_buf(&put);
+	len = vsnprintf(tbuffer, TRACE_BUF_SIZE, fmt, args);
 
 	size = sizeof(*entry) + len + 1;
 	buffer = tr->buffer;
-	event = trace_buffer_lock_reserve(buffer, TRACE_PRINT, size,
+	event = trace_buffer_lock_reserve(buffer, TRACE_PRINT, size,
 					  irq_flags, pc);
 	if (!event)
 		goto out_unlock;
 	entry = ring_buffer_event_data(event);
 	entry->ip = ip;
 
-	memcpy(&entry->buf, trace_buf, len);
+	memcpy(&entry->buf, tbuffer, len);
 	entry->buf[len] = '\0';
 	if (!filter_check_discard(call, entry, buffer, event)) {
 		ring_buffer_unlock_commit(buffer, event);
@@ -1579,7 +1630,7 @@ int trace_array_vprintk(struct trace_array *tr,
 	}
 
  out_unlock:
-	arch_spin_unlock(&trace_buf_lock);
+	put_trace_buf(put);
 	raw_local_irq_restore(irq_flags);
 	unpause_graph_tracing();
  out:
@@ -2621,6 +2672,32 @@ static int set_tracer_option(struct tracer *trace, char *cmp, int neg)
 	return -EINVAL;
 }
 
+static void alloc_percpu_trace_buffer(void)
+{
+	struct trace_buffer_struct *buffers;
+
+	if (!(trace_flags & TRACE_ITER_PRINTK_PERCPU))
+		return; /* Warn? */
+
+	buffers = alloc_percpu(struct trace_buffer_struct);
+	if (WARN(!buffers, "Could not allocate percpu trace_printk buffer")) {
+		trace_flags &= ~TRACE_ITER_PRINTK_PERCPU;
+		return;
+	}
+
+	trace_percpu_buffer = buffers;
+}
+
+static void free_percpu_trace_buffer(void)
+{
+	struct trace_buffer_struct *buffers;
+
+	buffers = trace_percpu_buffer;
+	trace_percpu_buffer = NULL;
+	synchronize_sched();
+	free_percpu(buffers);
+}
+
 static void set_tracer_flags(unsigned int mask, int enabled)
 {
 	/* do nothing if flag is already set */
@@ -2637,6 +2714,13 @@ static void set_tracer_flags(unsigned int mask, int enabled)
 
 	if (mask == TRACE_ITER_OVERWRITE)
 		ring_buffer_change_overwrite(global_trace.buffer, enabled);
+
+	if (mask == TRACE_ITER_PRINTK_PERCPU) {
+		if (enabled)
+			alloc_percpu_trace_buffer();
+		else
+			free_percpu_trace_buffer();
+	}
 }
 
 static ssize_t
@@ -4766,6 +4850,9 @@ __init static int tracer_alloc_buffers(void)
 	/* All seems OK, enable tracing */
 	tracing_disabled = 0;
 
+	if (trace_flags & TRACE_ITER_PRINTK_PERCPU)
+		alloc_percpu_trace_buffer();
+
 	atomic_notifier_chain_register(&panic_notifier_list,
 				       &trace_panic_notifier);
 
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 4c7540a..ca4df2b 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -652,6 +652,7 @@ enum trace_iterator_flags {
 	TRACE_ITER_RECORD_CMD		= 0x100000,
 	TRACE_ITER_OVERWRITE		= 0x200000,
 	TRACE_ITER_STOP_ON_FREE		= 0x400000,
+	TRACE_ITER_PRINTK_PERCPU	= 0x800000,
 };
 
 /*
-- 
1.7.5.4



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 19/21] tracing: Account for preempt off in preempt_schedule()
  2011-09-22 22:09 ` [PATCH 19/21] tracing: Account for preempt off in preempt_schedule() Steven Rostedt
@ 2011-09-23 11:00   ` Peter Zijlstra
  2011-09-23 11:19     ` Steven Rostedt
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2011-09-23 11:00 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Frederic Weisbecker,
	Thomas Gleixner

On Thu, 2011-09-22 at 18:09 -0400, Steven Rostedt wrote:
> plain text document attachment
> (0019-tracing-Account-for-preempt-off-in-preempt_schedule.patch)
> From: Steven Rostedt <srostedt@redhat.com>
> 
> The preempt_schedule() uses the preempt_disable_notrace() version
> because it can cause infinite recursion by the function tracer as
> the function tracer uses preempt_enable_notrace() which may call
> back into the preempt_schedule() code as the NEED_RESCHED is still
> set and the PREEMPT_ACTIVE has not been set yet.
> 
> See commit: d1f74e20b5b064a130cd0743a256c2d3cfe84010 that made this
> change.
> 
> The preemptoff and preemptirqsoff latency tracers require the first
> and last preempt count modifiers to enable tracing. But this skips
> the checks. Since we can not convert them back to the non notrace
> version, we can use the idle() hooks for the latency tracers here.
> That is, the start/stop_critical_timings() works well to manually
> start and stop the latency tracer for preempt off timings.
> 
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> ---
>  kernel/sched.c |    9 +++++++++
>  1 files changed, 9 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index ccacdbd..4b096cc 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -4435,7 +4435,16 @@ asmlinkage void __sched notrace preempt_schedule(void)
>  
>  	do {
>  		add_preempt_count_notrace(PREEMPT_ACTIVE);
> +		/*
> +		 * The add/subtract must not be traced by the function
> +		 * tracer. But we still want to account for the
> +		 * preempt off latency tracer. Since the _notrace versions
> +		 * of add/subtract skip the accounting for latency tracer
> +		 * we must force it manually.
> +		 */
> +		start_critical_timings();
>  		schedule();
> +		stop_critical_timings();
>  		sub_preempt_count_notrace(PREEMPT_ACTIVE);
>  
>  		/*

This won't apply, you're patching ancient code.

Anyway, this all stinks, and reading the changelog of
d1f74e20b5b064a130cd0743a256c2d3cfe84010 and the above just makes me
confused.



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 21/21] tracing: Add optional percpu buffers for trace_printk()
  2011-09-22 22:09 ` [PATCH 21/21] tracing: Add optional percpu buffers for trace_printk() Steven Rostedt
@ 2011-09-23 11:02   ` Peter Zijlstra
  2011-09-23 11:07   ` Peter Zijlstra
  1 sibling, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2011-09-23 11:02 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Frederic Weisbecker,
	Thomas Gleixner

On Thu, 2011-09-22 at 18:09 -0400, Steven Rostedt wrote:
> 
> Currently, trace_printk() uses a single buffer to write into
> to calculate the size and format needed to save the trace. To
> do this safely in an SMP environment, a spin_lock() is taken
> to only allow one writer at a time to the buffer. But this could
> also affect what is being traced, and add synchronization that
> would not be there otherwise. 

so trace_printk() isn't NMI safe? #$%@^%@@$%@

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 21/21] tracing: Add optional percpu buffers for trace_printk()
  2011-09-22 22:09 ` [PATCH 21/21] tracing: Add optional percpu buffers for trace_printk() Steven Rostedt
  2011-09-23 11:02   ` Peter Zijlstra
@ 2011-09-23 11:07   ` Peter Zijlstra
  2011-09-23 11:16     ` Steven Rostedt
  1 sibling, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2011-09-23 11:07 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Frederic Weisbecker,
	Thomas Gleixner

On Fri, 2011-09-23 at 13:02 +0200, Peter Zijlstra wrote:
> On Thu, 2011-09-22 at 18:09 -0400, Steven Rostedt wrote:
> > 
> > Currently, trace_printk() uses a single buffer to write into
> > to calculate the size and format needed to save the trace. To
> > do this safely in an SMP environment, a spin_lock() is taken
> > to only allow one writer at a time to the buffer. But this could
> > also affect what is being traced, and add synchronization that
> > would not be there otherwise. 
> 
> so trace_printk() isn't NMI safe? #$%@^%@@$%@

better to make all of trace_printk() depend on that extra config, there
is absolutely 0 point in having a broken and fully serialized trace
'fail^wfeature'.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 21/21] tracing: Add optional percpu buffers for trace_printk()
  2011-09-23 11:07   ` Peter Zijlstra
@ 2011-09-23 11:16     ` Steven Rostedt
  2011-09-23 11:28       ` Peter Zijlstra
  0 siblings, 1 reply; 34+ messages in thread
From: Steven Rostedt @ 2011-09-23 11:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Frederic Weisbecker,
	Thomas Gleixner

On Fri, 2011-09-23 at 13:07 +0200, Peter Zijlstra wrote:
> On Fri, 2011-09-23 at 13:02 +0200, Peter Zijlstra wrote:
> > On Thu, 2011-09-22 at 18:09 -0400, Steven Rostedt wrote:
> > > 
> > > Currently, trace_printk() uses a single buffer to write into
> > > to calculate the size and format needed to save the trace. To
> > > do this safely in an SMP environment, a spin_lock() is taken
> > > to only allow one writer at a time to the buffer. But this could
> > > also affect what is being traced, and add synchronization that
> > > would not be there otherwise. 
> > 
> > so trace_printk() isn't NMI safe? #$%@^%@@$%@

It is NMI safe, always was (I use it there too). It has a percpu
recursion detection (always has), thus if an NMI interrupts a current
trace_printk(), the NMI trace_printk() will not print. I could add an
NMI buffer to allow NMIs to print, but so far, we don't usually have
issues with trace_printk(). Heck, I'm not sure printk() won't cause
issues in NMIs. I think trace_printk() is still safer than printk.

> 
> better to make all of trace_printk() depend on that extra config, there
> is absolutely 0 point in having a broken and fully serialized trace
> 'fail^wfeature'.

No, having per cpu buffers still doesn't allow NMIs to interrupt
trace_printk(). Otherwise the NMI would just corrupt the current percpu
buffer.
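The per-cpu recursion detection described here can be sketched as follows.
This is a single-threaded userspace illustration with hypothetical names,
not the kernel's implementation (which uses percpu data and runs with
interrupts disabled):

```c
#include <stdbool.h>

#define NCPUS 4

/* One recursion flag per "cpu". */
static int trace_recursion[NCPUS];

/* Returns false when a trace_printk() on this CPU is already in
 * flight (e.g. we were interrupted by an NMI that also traces);
 * the nested caller drops its output instead of corrupting the
 * buffer in use. */
static bool trace_printk_enter(int cpu)
{
	if (trace_recursion[cpu])
		return false;
	trace_recursion[cpu] = 1;
	return true;
}

static void trace_printk_exit(int cpu)
{
	trace_recursion[cpu] = 0;
}
```

This is why the NMI case is "safe but lossy": the inner call bails rather
than scribbling over a buffer mid-format.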

-- Steve



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 19/21] tracing: Account for preempt off in preempt_schedule()
  2011-09-23 11:00   ` Peter Zijlstra
@ 2011-09-23 11:19     ` Steven Rostedt
  2011-09-23 11:22       ` Peter Zijlstra
  2011-09-23 13:24       ` Peter Zijlstra
  0 siblings, 2 replies; 34+ messages in thread
From: Steven Rostedt @ 2011-09-23 11:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Frederic Weisbecker,
	Thomas Gleixner

On Fri, 2011-09-23 at 13:00 +0200, Peter Zijlstra wrote:
>  
> > Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
> > ---
> >  kernel/sched.c |    9 +++++++++
> >  1 files changed, 9 insertions(+), 0 deletions(-)
> > 
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index ccacdbd..4b096cc 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -4435,7 +4435,16 @@ asmlinkage void __sched notrace preempt_schedule(void)
> >  
> >  	do {
> >  		add_preempt_count_notrace(PREEMPT_ACTIVE);
> > +		/*
> > +		 * The add/subtract must not be traced by the function
> > +		 * tracer. But we still want to account for the
> > +		 * preempt off latency tracer. Since the _notrace versions
> > +		 * of add/subtract skip the accounting for latency tracer
> > +		 * we must force it manually.
> > +		 */
> > +		start_critical_timings();
> >  		schedule();
> > +		stop_critical_timings();
> >  		sub_preempt_count_notrace(PREEMPT_ACTIVE);
> >  
> >  		/*
> 
> This won't apply, you're patching ancient code.

Yes, that's the way git usually works. I could start a new branch for
this. But since we are not doing many pulls while kernel.org is down, my
kernel base has become a bit out of date. I thought that's the way Linus
wants it. I'm sure this would apply with a trivial merge fixup.

But if you want to take this patch, then I could rebase my tree and take
it out.

> 
> Anyway, this all stinks, and reading the changelog of
> d1f74e20b5b064a130cd0743a256c2d3cfe84010 and the above just makes me
> confused.
> 

What would you suggest? Just ignore the latencies that schedule
produces, even though it's been one of the top causes of latencies?

-- Steve



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 19/21] tracing: Account for preempt off in preempt_schedule()
  2011-09-23 11:19     ` Steven Rostedt
@ 2011-09-23 11:22       ` Peter Zijlstra
  2011-09-23 12:24         ` Steven Rostedt
  2011-09-23 13:24       ` Peter Zijlstra
  1 sibling, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2011-09-23 11:22 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Frederic Weisbecker,
	Thomas Gleixner

On Fri, 2011-09-23 at 07:19 -0400, Steven Rostedt wrote:

> What would you suggest? Just ignore the latencies that schedule
> produces, even though its been one of the top causes of latencies?

I would like to actually understand the issue first.. so far all I've
got is confusion.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 21/21] tracing: Add optional percpu buffers for trace_printk()
  2011-09-23 11:16     ` Steven Rostedt
@ 2011-09-23 11:28       ` Peter Zijlstra
  2011-09-23 11:39         ` Steven Rostedt
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2011-09-23 11:28 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Frederic Weisbecker,
	Thomas Gleixner

On Fri, 2011-09-23 at 07:16 -0400, Steven Rostedt wrote:
> On Fri, 2011-09-23 at 13:07 +0200, Peter Zijlstra wrote:
> > On Fri, 2011-09-23 at 13:02 +0200, Peter Zijlstra wrote:
> > > On Thu, 2011-09-22 at 18:09 -0400, Steven Rostedt wrote:
> > > > 
> > > > Currently, trace_printk() uses a single buffer to write into
> > > > to calculate the size and format needed to save the trace. To
> > > > do this safely in an SMP environment, a spin_lock() is taken
> > > > to only allow one writer at a time to the buffer. But this could
> > > > also affect what is being traced, and add synchronization that
> > > > would not be there otherwise. 
> > > 
> > > so trace_printk() isn't NMI safe? #$%@^%@@$%@
> 
> It is NMI safe, always was (I use it there too). It has a percpu
> recursion detection (always has), thus if an NMI interrupts a current
> trace_printk(), the NMI trace_printk() will not print. I could add an
> NMI buffer to allow NMIs to print, but so far, we don't usually have
> issues with trace_printk(). Heck, I'm not sure printk() wont cause
> issues in NMIs. I think trace_printk() is still safer than printk.

Of course, printk() is most useless, it's too slow, and it's definitely
not NMI-safe.

If you've only got a single buffer, I don't see why you would need
per-cpu recursion detection, just spin_try_lock() the thing and if you
fail, bail. But that's not what the changelog said, or at least implied.

The possibility of dropping trace output like that is most worrying
though. I can imagine myself tearing my hair out trying to make sense of
a trace, and then wanting to kick you after finding out it lost the
crucial bit.

> > better to make all of trace_printk() depend on that extra config, there
> > is absolutely 0 point in having a broken and fully serialized trace
> > 'fail^wfeature'.
> 
> Not, having per cpu buffers still doesn't allow NMIs to interrupt
> trace_printk(). Otherwise the NMI would just corrupt the current percpu
> buffer.

Multiple buffers sounds about right, not sure if you disable interrupts
over the normal trace path, but ISTR you don't, so you need
task/softirq/irq/nmi buffers per cpu.

Really losing trace output is not an option, ever (aside from stuff
falling off the end of the buffer). Reliability first, performance
second.
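The task/softirq/irq/nmi scheme suggested here could be modeled with a
per-cpu nesting level that selects a buffer. This is a rough, hypothetical
userspace sketch of the idea, not kernel code:

```c
#include <stddef.h>

#define NCPUS    4
#define NCTX     4	/* task, softirq, irq, nmi */
#define BUF_SIZE 1024

static char ctx_buf[NCPUS][NCTX][BUF_SIZE];
static int  ctx_level[NCPUS];

/* Each nested context (task interrupted by softirq, interrupted
 * by irq, interrupted by NMI) gets the next buffer, so an inner
 * trace never scribbles over a buffer still in use below it. */
static char *get_ctx_buf(int cpu)
{
	int level = ctx_level[cpu]++;

	if (level >= NCTX)
		return NULL;	/* nested deeper than planned for */
	return ctx_buf[cpu][level];
}

static void put_ctx_buf(int cpu)
{
	ctx_level[cpu]--;
}
```

With one buffer per nesting level, nothing is dropped and nothing needs a
lock, at the cost of NCPUS * NCTX * BUF_SIZE of memory.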



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 21/21] tracing: Add optional percpu buffers for trace_printk()
  2011-09-23 11:28       ` Peter Zijlstra
@ 2011-09-23 11:39         ` Steven Rostedt
  0 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2011-09-23 11:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Frederic Weisbecker,
	Thomas Gleixner

On Fri, 2011-09-23 at 13:28 +0200, Peter Zijlstra wrote:

> Of course, printk() is most useless, its too slow, and its definitely
> not NMI-safe.
> 
> If you've only got a single buffer, I don't see why you would need
> per-cpu recursion detection, just spin_try_lock() the thing and if you
> fail, bail. But that's not what the changelog said, or at least implied.

Not sure if the raw_spin_locks() had a try lock back when this was
written. The recursion detection was there from day one, and has nothing
to do with this patch. Which is why the changelog said nothing about it.
This patch is only about not having trace_printk() cause extra
synchronization. This bit me once, as trace_printk() caused a race to
disappear.


> 
> The possibility of dropping trace output like that is most worrying
> though. I can imagine myself tearing my hair out trying to make sense of
> a trace, and then wanting to kick you after finding out it lost the
> crucial bit.

Understood, but NMIs are always something to be worried about causing
strange issues.

> 
> > > better to make all of trace_printk() depend on that extra config, there
> > > is absolutely 0 point in having a broken and fully serialized trace
> > > 'fail^wfeature'.
> > 
> > Not, having per cpu buffers still doesn't allow NMIs to interrupt
> > trace_printk(). Otherwise the NMI would just corrupt the current percpu
> > buffer.
> 
> Multiple buffers sounds about right, not sure if you disable interrupts
> over the normal trace path, but ISTR you don't, so you need
> task/softirq/irq/nmi buffers per cpu.

trace_printk() does disable interrupts, so it is only an NMI issue.


> 
> Really loosing trace output is not an option, ever (aside from stuff
> falling of the end of the buffer). Reliability first, performance
> second.
> 

Fine, then I'll add (on top of this patch) a CONFIG_TRACE_PRINTK_FULL or
some such name that will create the per cpu NMI and normal buffers,
which will not miss any interrupts. I just want to limit the unused
memory for 99.99999% of users.

I could make the TRACE_PRINTK_FULL not even disable interrupts and have
the normal/softirq/irq/nmi per cpu buffers. This will be something that
would *not* be enabled in production.

-- Steve



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 19/21] tracing: Account for preempt off in preempt_schedule()
  2011-09-23 11:22       ` Peter Zijlstra
@ 2011-09-23 12:24         ` Steven Rostedt
  2011-09-23 14:00           ` Peter Zijlstra
  2011-09-23 14:07           ` Peter Zijlstra
  0 siblings, 2 replies; 34+ messages in thread
From: Steven Rostedt @ 2011-09-23 12:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Frederic Weisbecker,
	Thomas Gleixner

On Fri, 2011-09-23 at 13:22 +0200, Peter Zijlstra wrote:
> On Fri, 2011-09-23 at 07:19 -0400, Steven Rostedt wrote:
> 
> > What would you suggest? Just ignore the latencies that schedule
> > produces, even though its been one of the top causes of latencies?
> 
> I would like to actually understand the issue first.. so far all I've
> got is confusion.

Simple. The preemptoff and even preemptirqsoff latency tracers record
every time preemption is disabled and enabled. For preemptoff, this only
includes modification of the preempt count. For preemptirqsoff, it
includes both preempt count increment and interrupts being
disabled/enabled.


Currently, the preempt check is done in add/sub_preempt_count(). But in
preempt_schedule() we call add/sub_preempt_count_notrace() which updates
the preempt_count directly without any of the preempt off/on checks.

The changelog I referenced talked about why we use the notrace versions.
Some function tracing hooks use the preempt_enable/disable_notrace().
Function tracer is not the only user of the function tracing facility.
With the original preempt_disable(), when we have preempt tracing
enabled, the add/sub_preempt_count()s become traced by the function
tracer (which is also a good thing as I've used that info). The issue is
in preempt_schedule() which is called by preempt_enable() if
NEED_RESCHED is set and PREEMPT_ACTIVE is not set. One of the first
things that preempt_schedule() does is call
add_preempt_count(PREEMPT_ACTIVE), to add the PREEMPT_ACTIVE to preempt
count and not come back into preempt_schedule() when interrupted again.

But! If add_preempt_count(PREEMPT_ACTIVE) is traced, we call into the
function tracing mechanism *before* it adds PREEMPT_ACTIVE, and when the
function hook calls preempt_enable_notrace() it will notice the
NEED_RESCHED set and PREEMPT_ACTIVE not set and recurse back into the
preempt_schedule() and boom!

By making preempt_schedule() use notrace we avoid this issue with the
function tracing hooks, but in the mean time, we just lost the check
that preemption was disabled. Since we know that preemption and
interrupts were both enabled before calling into preempt_schedule()
(otherwise it is a bug), we can just tell the latency tracers that
preemption is being disabled manually with the
start/stop_critical_timings(). Note, these function names comes from the
original latency_tracer that was in -rt.

There's another location in the kernel that we need to manually call
into the latency tracer, and that's in idle. The cpu_idle() code
disables preemption, then disables interrupts, and may call an assembly
instruction that puts the system into idle but wakes up on interrupts.
Then on return, interrupts are enabled and preemption is again enabled.

Since we don't know about this wakeup on interrupts, the latency tracers
would count this idle wait as a latency, which obviously is not what we
want. This is what the start/stop_critical_timings() calls were created for.
The preempt_schedule() case is similar in an opposite way. Instead of
not wanting to trace, we want to trace, and the code works for this
location too.
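For reference, the idle-side usage looks roughly like this (a simplified sketch of the x86 cpu_idle() loop of that era, not verbatim kernel code):

```c
/* Simplified sketch of the idle loop described above. */
while (1) {
	while (!need_resched()) {
		local_irq_disable();
		stop_critical_timings();  /* don't count the idle wait  */
		safe_halt();              /* sti; hlt -- wakes on irq,
					   * returns with irqs enabled  */
		start_critical_timings(); /* timing resumes on wakeup   */
	}
	preempt_enable_no_resched();
	schedule();
	preempt_disable();
}
```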

-- Steve



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 19/21] tracing: Account for preempt off in preempt_schedule()
  2011-09-23 11:19     ` Steven Rostedt
  2011-09-23 11:22       ` Peter Zijlstra
@ 2011-09-23 13:24       ` Peter Zijlstra
  1 sibling, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2011-09-23 13:24 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Frederic Weisbecker,
	Thomas Gleixner

On Fri, 2011-09-23 at 07:19 -0400, Steven Rostedt wrote:
> Yes, that's the way git usually works

It's also one of the reasons I hate git. In this case the fixup will live
in some merge commit and be totally lost to the normal way of things.

I absolutely detest non-empty merge commits.


* Re: [PATCH 19/21] tracing: Account for preempt off in preempt_schedule()
  2011-09-23 12:24         ` Steven Rostedt
@ 2011-09-23 14:00           ` Peter Zijlstra
  2011-09-23 14:07           ` Peter Zijlstra
  1 sibling, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2011-09-23 14:00 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Frederic Weisbecker,
	Thomas Gleixner

On Fri, 2011-09-23 at 08:24 -0400, Steven Rostedt wrote:
> On Fri, 2011-09-23 at 13:22 +0200, Peter Zijlstra wrote:
> > On Fri, 2011-09-23 at 07:19 -0400, Steven Rostedt wrote:
> > 
> > > What would you suggest? Just ignore the latencies that schedule
> > > produces, even though its been one of the top causes of latencies?
> > 
> > I would like to actually understand the issue first.. so far all I've
> > got is confusion.
> 
> Simple. The preemptoff and even preemptirqsoff latency tracers record
> every time preemption is disabled and enabled. For preemptoff, this only
> includes modification of the preempt count. For preemptirqsoff, it
> includes both preempt count increment and interrupts being
> disabled/enabled.

Simple, except for this blob below..

> Currently, the preempt check is done in add/sub_preempt_count(). But in
> preempt_schedule() we call add/sub_preempt_count_notrace() which updates
> the preempt_count directly without any of the preempt off/on checks.
> 
> The changelog I referenced talked about why we use the notrace versions.

Which is the important bit, because that's icky and obviously buggy
(otherwise you wouldn't need to fix it), so one cannot judge the fix
without understanding the initial problem. 

> Some function tracing hooks use the preempt_enable/disable_notrace().

That's pretty much a non-statement in that it doesn't explain anything.

> Function tracer is not the only user of the function tracing facility.

So what?!

> With the original preempt_disable(), when we have preempt tracing
> enabled, the add/sub_preempt_count()s become traced by the function
> tracer (which is also a good thing, as I've used that info).

Why and how? It would seem to me marking those notrace would cure some
of the issues.

> The issue is
> in preempt_schedule() which is called by preempt_enable() if
> NEED_RESCHED is set and PREEMPT_ACTIVE is not set. One of the first
> things that preempt_schedule() does is call
> add_preempt_count(PREEMPT_ACTIVE), to add the PREEMPT_ACTIVE to preempt
> count and not come back into preempt_schedule() when interrupted again.

That sentence doesn't really parse for me, but I think you mean to say
that PREEMPT_ACTIVE guards against recursive preemption?

> But! If add_preempt_count(PREEMPT_ACTIVE) is traced, we call into the
> function tracing mechanism *before* it adds PREEMPT_ACTIVE, and when the
> function hook calls preempt_enable_notrace() it will notice the
> NEED_RESCHED set and PREEMPT_ACTIVE not set and recurse back into the
> preempt_schedule() and boom!

Right, so you've got the mess that there's two different notrace options
for the {add,sub}_preempt_count(). One is not doing the
trace_preempt_{off,on} thing, and one is marking the function notrace.

The normal {add,sub}_preempt_count() muck does everything, the
_notrace() one doesn't do either, and here you need one that does one
but not the other.
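In other words (summarizing with the names used in this thread; this is not a real kernel header, just an annotation):

```c
/*
 * The behaviours in play:
 *
 *   add_preempt_count()          seen by the function tracer,
 *                                and calls trace_preempt_off()
 *   add_preempt_count_notrace()  hidden from the function tracer,
 *                                and skips trace_preempt_off()
 *
 * preempt_schedule() needs the missing combination: hidden from the
 * function tracer, but still reported to the latency tracer -- which
 * the patch approximates with {start,stop}_critical_timings().
 */
```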

So you then open code things, but instead of using
trace_preempt_{off,on} you use {start,stop}_critical_timings(), which is
a totally different beast, confusing matters more.

What's worse, there's not a single comment near preempt_schedule()
explaining any of this mess.

> By making preempt_schedule() use notrace we avoid this issue with the
> function tracing hooks, but in the meantime we just lost the check
> that preemption was disabled.

Uhm, no, by marking preempt_schedule() notrace you don't call into the
function tracer for this call. The preemption check was lost by using
the _notrace() preempt_count fiddling.

Hiding preempt_schedule() from the function tracer isn't exactly nice
though. Although I think we can reconstruct that it was called by seeing
schedule() called and seeing N set (except of course that the default
function tracer output doesn't show that; one more reason the default
output format is broken).

>  Since we know that preemption and
> interrupts were both enabled before calling into preempt_schedule()
> (otherwise it is a bug), we can just tell the latency tracers that
> preemption is being disabled manually with the
> start/stop_critical_timings(). Note, these function names come from the
> original latency_tracer that was in -rt.

Doesn't matter where they come from, they're still confusing for anybody
trying to make sense of the {add,sub}_preempt_count() vs the _notrace()
functions.

> There's another location in the kernel where we need to manually call
> into the latency tracer, and that's in idle. cpu_idle() disables
> preemption, then disables interrupts, and may execute an assembly
> instruction that puts the system into idle but wakes up on interrupts.
> Then on return, interrupts are enabled and preemption is enabled again.

> Since we don't know about this wakeup on interrupts, the latency tracers
> would count this idle wait as a latency, which obviously is not what we
> want. That is what start/stop_critical_timings() was created for.
> The preempt_schedule() case is similar but in the opposite direction:
> instead of not wanting to trace, we want to trace, and the code works
> for this location too.

Wouldn't it be much more sensible to mark all that preempt/irq fiddling
in the idle path with _notrace instead? And stick a comment near.

Having two things to do the same thing just doesn't make sense.



* Re: [PATCH 19/21] tracing: Account for preempt off in preempt_schedule()
  2011-09-23 12:24         ` Steven Rostedt
  2011-09-23 14:00           ` Peter Zijlstra
@ 2011-09-23 14:07           ` Peter Zijlstra
  1 sibling, 0 replies; 34+ messages in thread
From: Peter Zijlstra @ 2011-09-23 14:07 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Frederic Weisbecker,
	Thomas Gleixner

On Fri, 2011-09-23 at 16:00 +0200, Peter Zijlstra wrote:
> 
> Wouldn't it be much more sensible to mark all that preempt/irq fiddling
> in the idle path with _notrace instead? And stick a comment near.
> 
> Having two things to do the same thing just doesn't make sense.

That is, it seems to me that avoiding having the idle path be seen as a
non-preempt region at all is cleaner than (mostly) fixing it up afterwards.


end of thread, other threads:[~2011-09-23 14:08 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
2011-09-22 22:09 [PATCH 00/21] [GIT PULL][v3.2] tracing: queued up stuff waiting for k.org to come back on line Steven Rostedt
2011-09-22 22:09 ` [PATCH 01/21] tracing: Clean up tb_fmt to not give faulty compile warning Steven Rostedt
2011-09-22 22:09 ` [PATCH 02/21] Tracepoint: Dissociate from module mutex Steven Rostedt
2011-09-22 22:09 ` [PATCH 03/21] x86: jump_label: arch_jump_label_text_poke_early: add missing __init Steven Rostedt
2011-09-22 22:09 ` [PATCH 04/21] tracing/filter: Use static allocation for filter predicates Steven Rostedt
2011-09-22 22:09 ` [PATCH 05/21] tracing/filter: Separate predicate init and filter addition Steven Rostedt
2011-09-22 22:09 ` [PATCH 06/21] tracing/filter: Remove field_name from filter_pred struct Steven Rostedt
2011-09-22 22:09 ` [PATCH 07/21] tracing/filter: Simplify tracepoint event lookup Steven Rostedt
2011-09-22 22:09 ` [PATCH 08/21] tracing/filter: Unify predicate tree walking, change check_pred_tree Steven Rostedt
2011-09-22 22:09 ` [PATCH 09/21] tracing/filter: Change count_leafs function to use walk_pred_tree Steven Rostedt
2011-09-22 22:09 ` [PATCH 10/21] tracing/filter: Change fold_pred_tree " Steven Rostedt
2011-09-22 22:09 ` [PATCH 11/21] tracing/filter: Change fold_pred " Steven Rostedt
2011-09-22 22:09 ` [PATCH 12/21] tracing/filter: Change filter_match_preds function to use Steven Rostedt
2011-09-22 22:09 ` [PATCH 13/21] tracing/filter: Add startup tests for events filter Steven Rostedt
2011-09-22 22:09 ` [PATCH 14/21] tracing: Add preempt disable for filter self test Steven Rostedt
2011-09-22 22:09 ` [PATCH 15/21] trace: Add a new readonly entry to report total buffer size Steven Rostedt
2011-09-22 22:09 ` [PATCH 16/21] trace: Add ring buffer stats to measure rate of events Steven Rostedt
2011-09-22 22:09 ` [PATCH 17/21] tracing: Add a counter clock for those that do not trust clocks Steven Rostedt
2011-09-22 22:09 ` [PATCH 18/21] tracing: Fix preemptirqsoff tracer to not stop at preempt off Steven Rostedt
2011-09-22 22:09 ` [PATCH 19/21] tracing: Account for preempt off in preempt_schedule() Steven Rostedt
2011-09-23 11:00   ` Peter Zijlstra
2011-09-23 11:19     ` Steven Rostedt
2011-09-23 11:22       ` Peter Zijlstra
2011-09-23 12:24         ` Steven Rostedt
2011-09-23 14:00           ` Peter Zijlstra
2011-09-23 14:07           ` Peter Zijlstra
2011-09-23 13:24       ` Peter Zijlstra
2011-09-22 22:09 ` [PATCH 20/21] tracing: Do not allocate buffer for trace_marker Steven Rostedt
2011-09-22 22:09 ` [PATCH 21/21] tracing: Add optional percpu buffers for trace_printk() Steven Rostedt
2011-09-23 11:02   ` Peter Zijlstra
2011-09-23 11:07   ` Peter Zijlstra
2011-09-23 11:16     ` Steven Rostedt
2011-09-23 11:28       ` Peter Zijlstra
2011-09-23 11:39         ` Steven Rostedt
