[GIT PULL] SLAB include file dependency fixes + kmemtrace updates

public inbox for linux-kernel@vger.kernel.org

From: Ingo Molnar @ 2009-04-05 19:39 UTC
  To: Linus Torvalds
  Cc: linux-kernel, Pekka Enberg, Steven Rostedt, Andrew Morton,
	Thomas Gleixner, Eduard - Gabriel Munteanu

Linus,

Please pull the latest kmemtrace-for-linus git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git kmemtrace-for-linus

We kept this topic separate from the main tracing tree due to the 
unexpectedly wide and messy-looking scope of the fixes Pekka needed 
to do to untangle various slab*.h, rcu*.h and fs.h dependency 
chains.
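
Several of the preparatory commits move an inline function out of a
widely included header and leave only a prototype behind, so the header
no longer needs the includes the function body relied on
(simple_transaction_set() moving from linux/fs.h to fs/libfs.c in the
diff below is one such case). The following is a userspace sketch of
that pattern only, not kernel code: struct txn_buf, TXN_LIMIT and
txn_set_size() are made-up names, the "header" and ".c" parts are shown
in one file for brevity, and the sketch returns an error where the
kernel function uses BUG_ON() and omits the memory barrier.

```c
#include <assert.h>
#include <stddef.h>

/* ---- what would live in the header (hypothetical txn.h) ---- */

struct txn_buf;			/* forward declaration is enough here */
#define TXN_LIMIT 4096u		/* hypothetical size limit */

/* Prototype only: users of the header no longer see the body. */
int txn_set_size(struct txn_buf *t, size_t n);

/* ---- what would live in the .c file (hypothetical txn.c) ---- */

struct txn_buf {
	size_t size;
	char data[TXN_LIMIT];
};

/* Returns 0 on success, -1 if n exceeds TXN_LIMIT (size is left as-is). */
int txn_set_size(struct txn_buf *t, size_t n)
{
	assert(t != NULL);

	if (n > TXN_LIMIT)
		return -1;

	t->size = n;
	return 0;
}
```

The cost is a real function call instead of an inlined body; the gain is
that the header only needs the forward declaration and size_t.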

After the many preparatory fixes, the tree makes a few
straightforward changes to update kmemtrace to the new-style tracing
APIs and to sync up the user-tool ABI.
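
For readers unfamiliar with the tracepoint model that kmemtrace moves
to: the instrumented site calls a trace hook, and a tracer attaches or
detaches probe functions at run time (compare register_trace_kmalloc()
and unregister_trace_kmalloc() in the diff below). The following is a
simplified userspace sketch of that idea, not the kernel's DECLARE_TRACE
machinery; every name in it (trace_alloc(), register_alloc_probe(),
count_alloc() and so on) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Probe signature for a hypothetical "alloc" event. */
typedef void (*alloc_probe_t)(unsigned long call_site, const void *ptr,
			      size_t bytes_req, size_t bytes_alloc);

#define MAX_PROBES 4
static alloc_probe_t alloc_probes[MAX_PROBES];

/* Attach a probe; returns 0 on success, -1 if the table is full. */
int register_alloc_probe(alloc_probe_t probe)
{
	assert(probe != NULL);

	for (int i = 0; i < MAX_PROBES; i++) {
		if (!alloc_probes[i]) {
			alloc_probes[i] = probe;
			return 0;
		}
	}
	return -1;
}

/* Detach a previously attached probe (no-op if it is not attached). */
void unregister_alloc_probe(alloc_probe_t probe)
{
	for (int i = 0; i < MAX_PROBES; i++)
		if (alloc_probes[i] == probe)
			alloc_probes[i] = NULL;
}

/* Hook placed at the instrumented site; does nothing with no probes. */
void trace_alloc(unsigned long call_site, const void *ptr,
		 size_t bytes_req, size_t bytes_alloc)
{
	for (int i = 0; i < MAX_PROBES; i++)
		if (alloc_probes[i])
			alloc_probes[i](call_site, ptr, bytes_req, bytes_alloc);
}

/* Example probe: accumulates the bytes actually allocated. */
static size_t total_alloc;

void count_alloc(unsigned long call_site, const void *ptr,
		 size_t bytes_req, size_t bytes_alloc)
{
	(void)call_site;
	(void)ptr;
	(void)bytes_req;
	total_alloc += bytes_alloc;
}
```

In this model the allocator only ever calls trace_alloc(); whether
anything is recorded depends on whether a tracer has attached a probe.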

( Please note that I rebased the tree exactly once, shortly after it 
  got finished, to make it all bisectable and reviewable: the 
  perfect insight shown in the tree now was IMHO not humanly 
  possible to achieve in advance.

  The actual development flow made these changes in exactly the
  reverse order: first we did the obvious change, then all the
  non-obvious (and impossible to find via code reading) hidden
  dependency fixes, based on extensive testing.

  It ended up being and looking quite a mess so the rebase was 
  justified IMHO. This looked to me to be the exception that 
  strengthens the rule, but please let me know if this is frowned
  upon ... )

Known risks:

  - Cross-arch or rare-config build breakages

No known regressions.

 Thanks,

	Ingo

------------------>
Eduard - Gabriel Munteanu (4):
      kmemtrace, rcu: don't include unnecessary headers, allow kmemtrace w/ tracepoints
      kmemtrace: use tracepoints
      kmemtrace: kmemtrace_alloc() must fill type_id
      kmemtrace: restore original tracing data binary format, improve ABI

Ingo Molnar (6):
      kmemtrace, fs: uninline simple_transaction_set()
      kmemtrace, fs: fix linux/fdtable.h header file dependencies
      kmemtrace, rcu: fix linux/rcutree.h and linux/rcuclassic.h dependencies
      kmemtrace, rcu: fix rcu_tree_trace.c data structure dependencies
      kmemtrace, rcu: fix rcupreempt.c data structure dependencies
      kmemtrace: small cleanups

Pekka Enberg (9):
      kmemtrace, fs, security: move alloc_secdata() and free_secdata() to linux/security.h
      kmemtrace, security: fix linux/key.h header file dependencies
      kmemtrace, befs: fix slab.h dependency problem
      kmemtrace, squashfs: fix slab.h dependency problem in squasfs
      kmemtrace, kbuild: fix slab.h dependency problem in lib/decompress_inflate.c
      kmemtrace, kbuild: fix slab.h dependency problem in lib/decompress_bunzip2.c
      kmemtrace, kbuild: fix slab.h dependency problem in lib/decompress_unlzma.c
      kmemtrace, mm: fix slab.h dependency problem in mm/failslab.c
      kmemtrace: trace kfree() calls with NULL or zero-length objects


 fs/befs/debug.c            |    1 +
 fs/libfs.c                 |   16 +++
 fs/squashfs/export.c       |    1 +
 include/linux/fdtable.h    |    4 +-
 include/linux/fs.h         |   35 +-----
 include/linux/key.h        |    1 +
 include/linux/rcuclassic.h |   16 +--
 include/linux/rcupdate.h   |    1 -
 include/linux/rcupreempt.h |   53 ++------
 include/linux/rcutree.h    |   27 +----
 include/linux/security.h   |   24 ++++
 include/linux/slab_def.h   |   10 +-
 include/linux/slub_def.h   |   12 +-
 include/trace/kmemtrace.h  |   92 ++++++-------
 kernel/rcuclassic.c        |   23 +++-
 kernel/rcupreempt.c        |   48 ++++++-
 kernel/rcutree.c           |   20 +++
 kernel/rcutree.h           |   10 ++
 kernel/rcutree_trace.c     |    2 +
 kernel/trace/kmemtrace.c   |  319 ++++++++++++++++++++++++++++++-------------
 kernel/trace/trace.h       |    6 +
 lib/decompress_bunzip2.c   |    1 +
 lib/decompress_inflate.c   |    1 +
 lib/decompress_unlzma.c    |    1 +
 mm/failslab.c              |    1 +
 mm/slab.c                  |   26 ++--
 mm/slob.c                  |   30 ++---
 mm/slub.c                  |   32 ++---
 mm/util.c                  |   16 +++
 29 files changed, 493 insertions(+), 336 deletions(-)
 create mode 100644 kernel/rcutree.h

diff --git a/fs/befs/debug.c b/fs/befs/debug.c
index b8e304a..622e737 100644
--- a/fs/befs/debug.c
+++ b/fs/befs/debug.c
@@ -17,6 +17,7 @@
 #include <linux/spinlock.h>
 #include <linux/kernel.h>
 #include <linux/fs.h>
+#include <linux/slab.h>
 
 #endif				/* __KERNEL__ */
 
diff --git a/fs/libfs.c b/fs/libfs.c
index 4910a36..cd22319 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -575,6 +575,21 @@ ssize_t memory_read_from_buffer(void *to, size_t count, loff_t *ppos,
  * possibly a read which collects the result - which is stored in a
  * file-local buffer.
  */
+
+void simple_transaction_set(struct file *file, size_t n)
+{
+	struct simple_transaction_argresp *ar = file->private_data;
+
+	BUG_ON(n > SIMPLE_TRANSACTION_LIMIT);
+
+	/*
+	 * The barrier ensures that ar->size will really remain zero until
+	 * ar->data is ready for reading.
+	 */
+	smp_mb();
+	ar->size = n;
+}
+
 char *simple_transaction_get(struct file *file, const char __user *buf, size_t size)
 {
 	struct simple_transaction_argresp *ar;
@@ -820,6 +835,7 @@ EXPORT_SYMBOL(simple_sync_file);
 EXPORT_SYMBOL(simple_unlink);
 EXPORT_SYMBOL(simple_read_from_buffer);
 EXPORT_SYMBOL(memory_read_from_buffer);
+EXPORT_SYMBOL(simple_transaction_set);
 EXPORT_SYMBOL(simple_transaction_get);
 EXPORT_SYMBOL(simple_transaction_read);
 EXPORT_SYMBOL(simple_transaction_release);
diff --git a/fs/squashfs/export.c b/fs/squashfs/export.c
index 69e971d..2b1b8fe 100644
--- a/fs/squashfs/export.c
+++ b/fs/squashfs/export.c
@@ -40,6 +40,7 @@
 #include <linux/dcache.h>
 #include <linux/exportfs.h>
 #include <linux/zlib.h>
+#include <linux/slab.h>
 
 #include "squashfs_fs.h"
 #include "squashfs_fs_sb.h"
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 09d6c5b..a2ec74b 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -5,12 +5,14 @@
 #ifndef __LINUX_FDTABLE_H
 #define __LINUX_FDTABLE_H
 
-#include <asm/atomic.h>
 #include <linux/posix_types.h>
 #include <linux/compiler.h>
 #include <linux/spinlock.h>
 #include <linux/rcupdate.h>
 #include <linux/types.h>
+#include <linux/init.h>
+
+#include <asm/atomic.h>
 
 /*
  * The default fd array needs to be at least BITS_PER_LONG,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 61211ad..e4de2b5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2323,19 +2323,7 @@ ssize_t simple_transaction_read(struct file *file, char __user *buf,
 				size_t size, loff_t *pos);
 int simple_transaction_release(struct inode *inode, struct file *file);
 
-static inline void simple_transaction_set(struct file *file, size_t n)
-{
-	struct simple_transaction_argresp *ar = file->private_data;
-
-	BUG_ON(n > SIMPLE_TRANSACTION_LIMIT);
-
-	/*
-	 * The barrier ensures that ar->size will really remain zero until
-	 * ar->data is ready for reading.
-	 */
-	smp_mb();
-	ar->size = n;
-}
+void simple_transaction_set(struct file *file, size_t n);
 
 /*
  * simple attribute files
@@ -2382,27 +2370,6 @@ ssize_t simple_attr_read(struct file *file, char __user *buf,
 ssize_t simple_attr_write(struct file *file, const char __user *buf,
 			  size_t len, loff_t *ppos);
 
-
-#ifdef CONFIG_SECURITY
-static inline char *alloc_secdata(void)
-{
-	return (char *)get_zeroed_page(GFP_KERNEL);
-}
-
-static inline void free_secdata(void *secdata)
-{
-	free_page((unsigned long)secdata);
-}
-#else
-static inline char *alloc_secdata(void)
-{
-	return (char *)1;
-}
-
-static inline void free_secdata(void *secdata)
-{ }
-#endif	/* CONFIG_SECURITY */
-
 struct ctl_table;
 int proc_nr_files(struct ctl_table *table, int write, struct file *filp,
 		  void __user *buffer, size_t *lenp, loff_t *ppos);
diff --git a/include/linux/key.h b/include/linux/key.h
index 21d32a1..e544f46 100644
--- a/include/linux/key.h
+++ b/include/linux/key.h
@@ -20,6 +20,7 @@
 #include <linux/rbtree.h>
 #include <linux/rcupdate.h>
 #include <linux/sysctl.h>
+#include <linux/rwsem.h>
 #include <asm/atomic.h>
 
 #ifdef __KERNEL__
diff --git a/include/linux/rcuclassic.h b/include/linux/rcuclassic.h
index 80044a4..bfd92e1 100644
--- a/include/linux/rcuclassic.h
+++ b/include/linux/rcuclassic.h
@@ -36,7 +36,6 @@
 #include <linux/cache.h>
 #include <linux/spinlock.h>
 #include <linux/threads.h>
-#include <linux/percpu.h>
 #include <linux/cpumask.h>
 #include <linux/seqlock.h>
 
@@ -108,25 +107,14 @@ struct rcu_data {
 	struct rcu_head barrier;
 };
 
-DECLARE_PER_CPU(struct rcu_data, rcu_data);
-DECLARE_PER_CPU(struct rcu_data, rcu_bh_data);
-
 /*
  * Increment the quiescent state counter.
  * The counter is a bit degenerated: We do not need to know
  * how many quiescent states passed, just if there was at least
  * one since the start of the grace period. Thus just a flag.
  */
-static inline void rcu_qsctr_inc(int cpu)
-{
-	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
-	rdp->passed_quiesc = 1;
-}
-static inline void rcu_bh_qsctr_inc(int cpu)
-{
-	struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
-	rdp->passed_quiesc = 1;
-}
+extern void rcu_qsctr_inc(int cpu);
+extern void rcu_bh_qsctr_inc(int cpu);
 
 extern int rcu_pending(int cpu);
 extern int rcu_needs_cpu(int cpu);
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 528343e..15fbb3c 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -36,7 +36,6 @@
 #include <linux/cache.h>
 #include <linux/spinlock.h>
 #include <linux/threads.h>
-#include <linux/percpu.h>
 #include <linux/cpumask.h>
 #include <linux/seqlock.h>
 #include <linux/lockdep.h>
diff --git a/include/linux/rcupreempt.h b/include/linux/rcupreempt.h
index 74304b4..fce5227 100644
--- a/include/linux/rcupreempt.h
+++ b/include/linux/rcupreempt.h
@@ -36,34 +36,19 @@
 #include <linux/cache.h>
 #include <linux/spinlock.h>
 #include <linux/threads.h>
-#include <linux/percpu.h>
+#include <linux/smp.h>
 #include <linux/cpumask.h>
 #include <linux/seqlock.h>
 
-struct rcu_dyntick_sched {
-	int dynticks;
-	int dynticks_snap;
-	int sched_qs;
-	int sched_qs_snap;
-	int sched_dynticks_snap;
-};
-
-DECLARE_PER_CPU(struct rcu_dyntick_sched, rcu_dyntick_sched);
-
-static inline void rcu_qsctr_inc(int cpu)
-{
-	struct rcu_dyntick_sched *rdssp = &per_cpu(rcu_dyntick_sched, cpu);
-
-	rdssp->sched_qs++;
-}
-#define rcu_bh_qsctr_inc(cpu)
+extern void rcu_qsctr_inc(int cpu);
+static inline void rcu_bh_qsctr_inc(int cpu) { }
 
 /*
  * Someone might want to pass call_rcu_bh as a function pointer.
  * So this needs to just be a rename and not a macro function.
  *  (no parentheses)
  */
-#define call_rcu_bh	 	call_rcu
+#define call_rcu_bh		call_rcu
 
 /**
  * call_rcu_sched - Queue RCU callback for invocation after sched grace period.
@@ -117,30 +102,12 @@ extern struct rcupreempt_trace *rcupreempt_trace_cpu(int cpu);
 struct softirq_action;
 
 #ifdef CONFIG_NO_HZ
-
-static inline void rcu_enter_nohz(void)
-{
-	static DEFINE_RATELIMIT_STATE(rs, 10 * HZ, 1);
-
-	smp_mb(); /* CPUs seeing ++ must see prior RCU read-side crit sects */
-	__get_cpu_var(rcu_dyntick_sched).dynticks++;
-	WARN_ON_RATELIMIT(__get_cpu_var(rcu_dyntick_sched).dynticks & 0x1, &rs);
-}
-
-static inline void rcu_exit_nohz(void)
-{
-	static DEFINE_RATELIMIT_STATE(rs, 10 * HZ, 1);
-
-	__get_cpu_var(rcu_dyntick_sched).dynticks++;
-	smp_mb(); /* CPUs seeing ++ must see later RCU read-side crit sects */
-	WARN_ON_RATELIMIT(!(__get_cpu_var(rcu_dyntick_sched).dynticks & 0x1),
-				&rs);
-}
-
-#else /* CONFIG_NO_HZ */
-#define rcu_enter_nohz()	do { } while (0)
-#define rcu_exit_nohz()		do { } while (0)
-#endif /* CONFIG_NO_HZ */
+extern void rcu_enter_nohz(void);
+extern void rcu_exit_nohz(void);
+#else
+# define rcu_enter_nohz()	do { } while (0)
+# define rcu_exit_nohz()	do { } while (0)
+#endif
 
 /*
  * A context switch is a grace period for rcupreempt synchronize_rcu()
diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index a722fb6..0cdda00 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -33,7 +33,6 @@
 #include <linux/cache.h>
 #include <linux/spinlock.h>
 #include <linux/threads.h>
-#include <linux/percpu.h>
 #include <linux/cpumask.h>
 #include <linux/seqlock.h>
 
@@ -236,30 +235,8 @@ struct rcu_state {
 #endif /* #ifdef CONFIG_NO_HZ */
 };
 
-extern struct rcu_state rcu_state;
-DECLARE_PER_CPU(struct rcu_data, rcu_data);
-
-extern struct rcu_state rcu_bh_state;
-DECLARE_PER_CPU(struct rcu_data, rcu_bh_data);
-
-/*
- * Increment the quiescent state counter.
- * The counter is a bit degenerated: We do not need to know
- * how many quiescent states passed, just if there was at least
- * one since the start of the grace period. Thus just a flag.
- */
-static inline void rcu_qsctr_inc(int cpu)
-{
-	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
-	rdp->passed_quiesc = 1;
-	rdp->passed_quiesc_completed = rdp->completed;
-}
-static inline void rcu_bh_qsctr_inc(int cpu)
-{
-	struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
-	rdp->passed_quiesc = 1;
-	rdp->passed_quiesc_completed = rdp->completed;
-}
+extern void rcu_qsctr_inc(int cpu);
+extern void rcu_bh_qsctr_inc(int cpu);
 
 extern int rcu_pending(int cpu);
 extern int rcu_needs_cpu(int cpu);
diff --git a/include/linux/security.h b/include/linux/security.h
index 54ed157..d5fd616 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -32,6 +32,7 @@
 #include <linux/sched.h>
 #include <linux/key.h>
 #include <linux/xfrm.h>
+#include <linux/gfp.h>
 #include <net/flow.h>
 
 /* Maximum number of letters for an LSM name string */
@@ -2953,5 +2954,28 @@ static inline void securityfs_remove(struct dentry *dentry)
 
 #endif
 
+#ifdef CONFIG_SECURITY
+
+static inline char *alloc_secdata(void)
+{
+	return (char *)get_zeroed_page(GFP_KERNEL);
+}
+
+static inline void free_secdata(void *secdata)
+{
+	free_page((unsigned long)secdata);
+}
+
+#else
+
+static inline char *alloc_secdata(void)
+{
+        return (char *)1;
+}
+
+static inline void free_secdata(void *secdata)
+{ }
+#endif /* CONFIG_SECURITY */
+
 #endif /* ! __LINUX_SECURITY_H */
 
diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index f452365..5ac9b0b 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -73,8 +73,8 @@ found:
 
 		ret = kmem_cache_alloc_notrace(cachep, flags);
 
-		kmemtrace_mark_alloc(KMEMTRACE_TYPE_KMALLOC, _THIS_IP_, ret,
-				     size, slab_buffer_size(cachep), flags);
+		trace_kmalloc(_THIS_IP_, ret,
+			      size, slab_buffer_size(cachep), flags);
 
 		return ret;
 	}
@@ -128,9 +128,9 @@ found:
 
 		ret = kmem_cache_alloc_node_notrace(cachep, flags, node);
 
-		kmemtrace_mark_alloc_node(KMEMTRACE_TYPE_KMALLOC, _THIS_IP_,
-					  ret, size, slab_buffer_size(cachep),
-					  flags, node);
+		trace_kmalloc_node(_THIS_IP_, ret,
+				   size, slab_buffer_size(cachep),
+				   flags, node);
 
 		return ret;
 	}
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index a1f9052..5046f90 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -233,8 +233,7 @@ static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
 	unsigned int order = get_order(size);
 	void *ret = (void *) __get_free_pages(flags | __GFP_COMP, order);
 
-	kmemtrace_mark_alloc(KMEMTRACE_TYPE_KMALLOC, _THIS_IP_, ret,
-			     size, PAGE_SIZE << order, flags);
+	trace_kmalloc(_THIS_IP_, ret, size, PAGE_SIZE << order, flags);
 
 	return ret;
 }
@@ -255,9 +254,7 @@ static __always_inline void *kmalloc(size_t size, gfp_t flags)
 
 			ret = kmem_cache_alloc_notrace(s, flags);
 
-			kmemtrace_mark_alloc(KMEMTRACE_TYPE_KMALLOC,
-					     _THIS_IP_, ret,
-					     size, s->size, flags);
+			trace_kmalloc(_THIS_IP_, ret, size, s->size, flags);
 
 			return ret;
 		}
@@ -296,9 +293,8 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
 
 		ret = kmem_cache_alloc_node_notrace(s, flags, node);
 
-		kmemtrace_mark_alloc_node(KMEMTRACE_TYPE_KMALLOC,
-					  _THIS_IP_, ret,
-					  size, s->size, flags, node);
+		trace_kmalloc_node(_THIS_IP_, ret,
+				   size, s->size, flags, node);
 
 		return ret;
 	}
diff --git a/include/trace/kmemtrace.h b/include/trace/kmemtrace.h
index ad8b785..28ee69f 100644
--- a/include/trace/kmemtrace.h
+++ b/include/trace/kmemtrace.h
@@ -9,65 +9,53 @@
 
 #ifdef __KERNEL__
 
+#include <linux/tracepoint.h>
 #include <linux/types.h>
-#include <linux/marker.h>
-
-enum kmemtrace_type_id {
-	KMEMTRACE_TYPE_KMALLOC = 0,	/* kmalloc() or kfree(). */
-	KMEMTRACE_TYPE_CACHE,		/* kmem_cache_*(). */
-	KMEMTRACE_TYPE_PAGES,		/* __get_free_pages() and friends. */
-};
 
 #ifdef CONFIG_KMEMTRACE
-
 extern void kmemtrace_init(void);
-
-extern void kmemtrace_mark_alloc_node(enum kmemtrace_type_id type_id,
-					     unsigned long call_site,
-					     const void *ptr,
-					     size_t bytes_req,
-					     size_t bytes_alloc,
-					     gfp_t gfp_flags,
-					     int node);
-
-extern void kmemtrace_mark_free(enum kmemtrace_type_id type_id,
-				       unsigned long call_site,
-				       const void *ptr);
-
-#else /* CONFIG_KMEMTRACE */
-
+#else
 static inline void kmemtrace_init(void)
 {
 }
-
-static inline void kmemtrace_mark_alloc_node(enum kmemtrace_type_id type_id,
-					     unsigned long call_site,
-					     const void *ptr,
-					     size_t bytes_req,
-					     size_t bytes_alloc,
-					     gfp_t gfp_flags,
-					     int node)
-{
-}
-
-static inline void kmemtrace_mark_free(enum kmemtrace_type_id type_id,
-				       unsigned long call_site,
-				       const void *ptr)
-{
-}
-
-#endif /* CONFIG_KMEMTRACE */
-
-static inline void kmemtrace_mark_alloc(enum kmemtrace_type_id type_id,
-					unsigned long call_site,
-					const void *ptr,
-					size_t bytes_req,
-					size_t bytes_alloc,
-					gfp_t gfp_flags)
-{
-	kmemtrace_mark_alloc_node(type_id, call_site, ptr,
-				  bytes_req, bytes_alloc, gfp_flags, -1);
-}
+#endif
+
+DECLARE_TRACE(kmalloc,
+	      TP_PROTO(unsigned long call_site,
+		      const void *ptr,
+		      size_t bytes_req,
+		      size_t bytes_alloc,
+		      gfp_t gfp_flags),
+	      TP_ARGS(call_site, ptr, bytes_req, bytes_alloc, gfp_flags));
+DECLARE_TRACE(kmem_cache_alloc,
+	      TP_PROTO(unsigned long call_site,
+		      const void *ptr,
+		      size_t bytes_req,
+		      size_t bytes_alloc,
+		      gfp_t gfp_flags),
+	      TP_ARGS(call_site, ptr, bytes_req, bytes_alloc, gfp_flags));
+DECLARE_TRACE(kmalloc_node,
+	      TP_PROTO(unsigned long call_site,
+		      const void *ptr,
+		      size_t bytes_req,
+		      size_t bytes_alloc,
+		      gfp_t gfp_flags,
+		      int node),
+	      TP_ARGS(call_site, ptr, bytes_req, bytes_alloc, gfp_flags, node));
+DECLARE_TRACE(kmem_cache_alloc_node,
+	      TP_PROTO(unsigned long call_site,
+		      const void *ptr,
+		      size_t bytes_req,
+		      size_t bytes_alloc,
+		      gfp_t gfp_flags,
+		      int node),
+	      TP_ARGS(call_site, ptr, bytes_req, bytes_alloc, gfp_flags, node));
+DECLARE_TRACE(kfree,
+	      TP_PROTO(unsigned long call_site, const void *ptr),
+	      TP_ARGS(call_site, ptr));
+DECLARE_TRACE(kmem_cache_free,
+	      TP_PROTO(unsigned long call_site, const void *ptr),
+	      TP_ARGS(call_site, ptr));
 
 #endif /* __KERNEL__ */
 
diff --git a/kernel/rcuclassic.c b/kernel/rcuclassic.c
index 654c640..0f2b0b3 100644
--- a/kernel/rcuclassic.c
+++ b/kernel/rcuclassic.c
@@ -65,6 +65,7 @@ static struct rcu_ctrlblk rcu_ctrlblk = {
 	.lock = __SPIN_LOCK_UNLOCKED(&rcu_ctrlblk.lock),
 	.cpumask = CPU_BITS_NONE,
 };
+
 static struct rcu_ctrlblk rcu_bh_ctrlblk = {
 	.cur = -300,
 	.completed = -300,
@@ -73,8 +74,26 @@ static struct rcu_ctrlblk rcu_bh_ctrlblk = {
 	.cpumask = CPU_BITS_NONE,
 };
 
-DEFINE_PER_CPU(struct rcu_data, rcu_data) = { 0L };
-DEFINE_PER_CPU(struct rcu_data, rcu_bh_data) = { 0L };
+static DEFINE_PER_CPU(struct rcu_data, rcu_data);
+static DEFINE_PER_CPU(struct rcu_data, rcu_bh_data);
+
+/*
+ * Increment the quiescent state counter.
+ * The counter is a bit degenerated: We do not need to know
+ * how many quiescent states passed, just if there was at least
+ * one since the start of the grace period. Thus just a flag.
+ */
+void rcu_qsctr_inc(int cpu)
+{
+	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
+	rdp->passed_quiesc = 1;
+}
+
+void rcu_bh_qsctr_inc(int cpu)
+{
+	struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
+	rdp->passed_quiesc = 1;
+}
 
 static int blimit = 10;
 static int qhimark = 10000;
diff --git a/kernel/rcupreempt.c b/kernel/rcupreempt.c
index 5d59e85..ce97a4d 100644
--- a/kernel/rcupreempt.c
+++ b/kernel/rcupreempt.c
@@ -147,7 +147,51 @@ struct rcu_ctrlblk {
 	wait_queue_head_t sched_wq;	/* Place for rcu_sched to sleep. */
 };
 
+struct rcu_dyntick_sched {
+	int dynticks;
+	int dynticks_snap;
+	int sched_qs;
+	int sched_qs_snap;
+	int sched_dynticks_snap;
+};
+
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_dyntick_sched, rcu_dyntick_sched) = {
+	.dynticks = 1,
+};
+
+void rcu_qsctr_inc(int cpu)
+{
+	struct rcu_dyntick_sched *rdssp = &per_cpu(rcu_dyntick_sched, cpu);
+
+	rdssp->sched_qs++;
+}
+
+#ifdef CONFIG_NO_HZ
+
+void rcu_enter_nohz(void)
+{
+	static DEFINE_RATELIMIT_STATE(rs, 10 * HZ, 1);
+
+	smp_mb(); /* CPUs seeing ++ must see prior RCU read-side crit sects */
+	__get_cpu_var(rcu_dyntick_sched).dynticks++;
+	WARN_ON_RATELIMIT(__get_cpu_var(rcu_dyntick_sched).dynticks & 0x1, &rs);
+}
+
+void rcu_exit_nohz(void)
+{
+	static DEFINE_RATELIMIT_STATE(rs, 10 * HZ, 1);
+
+	__get_cpu_var(rcu_dyntick_sched).dynticks++;
+	smp_mb(); /* CPUs seeing ++ must see later RCU read-side crit sects */
+	WARN_ON_RATELIMIT(!(__get_cpu_var(rcu_dyntick_sched).dynticks & 0x1),
+				&rs);
+}
+
+#endif /* CONFIG_NO_HZ */
+
+
 static DEFINE_PER_CPU(struct rcu_data, rcu_data);
+
 static struct rcu_ctrlblk rcu_ctrlblk = {
 	.fliplock = __SPIN_LOCK_UNLOCKED(rcu_ctrlblk.fliplock),
 	.completed = 0,
@@ -427,10 +471,6 @@ static void __rcu_advance_callbacks(struct rcu_data *rdp)
 	}
 }
 
-DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_dyntick_sched, rcu_dyntick_sched) = {
-	.dynticks = 1,
-};
-
 #ifdef CONFIG_NO_HZ
 static DEFINE_PER_CPU(int, rcu_update_flag);
 
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 97ce315..7f32669 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -78,6 +78,26 @@ DEFINE_PER_CPU(struct rcu_data, rcu_data);
 struct rcu_state rcu_bh_state = RCU_STATE_INITIALIZER(rcu_bh_state);
 DEFINE_PER_CPU(struct rcu_data, rcu_bh_data);
 
+/*
+ * Increment the quiescent state counter.
+ * The counter is a bit degenerated: We do not need to know
+ * how many quiescent states passed, just if there was at least
+ * one since the start of the grace period. Thus just a flag.
+ */
+void rcu_qsctr_inc(int cpu)
+{
+	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
+	rdp->passed_quiesc = 1;
+	rdp->passed_quiesc_completed = rdp->completed;
+}
+
+void rcu_bh_qsctr_inc(int cpu)
+{
+	struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
+	rdp->passed_quiesc = 1;
+	rdp->passed_quiesc_completed = rdp->completed;
+}
+
 #ifdef CONFIG_NO_HZ
 DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks) = {
 	.dynticks_nesting = 1,
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
new file mode 100644
index 0000000..5e872bb
--- /dev/null
+++ b/kernel/rcutree.h
@@ -0,0 +1,10 @@
+
+/*
+ * RCU implementation internal declarations:
+ */
+extern struct rcu_state rcu_state;
+DECLARE_PER_CPU(struct rcu_data, rcu_data);
+
+extern struct rcu_state rcu_bh_state;
+DECLARE_PER_CPU(struct rcu_data, rcu_bh_data);
+
diff --git a/kernel/rcutree_trace.c b/kernel/rcutree_trace.c
index d6db3e8..4ee954f 100644
--- a/kernel/rcutree_trace.c
+++ b/kernel/rcutree_trace.c
@@ -43,6 +43,8 @@
 #include <linux/debugfs.h>
 #include <linux/seq_file.h>
 
+#include "rcutree.h"
+
 static void print_one_rcu_data(struct seq_file *m, struct rcu_data *rdp)
 {
 	if (!rdp->beenonline)
diff --git a/kernel/trace/kmemtrace.c b/kernel/trace/kmemtrace.c
index ae201b3..5011f4d 100644
--- a/kernel/trace/kmemtrace.c
+++ b/kernel/trace/kmemtrace.c
@@ -6,14 +6,16 @@
  * Copyright (C) 2008 Frederic Weisbecker <fweisbec@gmail.com>
  */
 
-#include <linux/dcache.h>
+#include <linux/tracepoint.h>
+#include <linux/seq_file.h>
 #include <linux/debugfs.h>
+#include <linux/dcache.h>
 #include <linux/fs.h>
-#include <linux/seq_file.h>
+
 #include <trace/kmemtrace.h>
 
-#include "trace.h"
 #include "trace_output.h"
+#include "trace.h"
 
 /* Select an alternative, minimalistic output than the original one */
 #define TRACE_KMEM_OPT_MINIMAL	0x1
@@ -25,14 +27,156 @@ static struct tracer_opt kmem_opts[] = {
 };
 
 static struct tracer_flags kmem_tracer_flags = {
-	.val = 0,
-	.opts = kmem_opts
+	.val			= 0,
+	.opts			= kmem_opts
 };
 
-
-static bool kmem_tracing_enabled __read_mostly;
 static struct trace_array *kmemtrace_array;
 
+/* Trace allocations */
+static inline void kmemtrace_alloc(enum kmemtrace_type_id type_id,
+				   unsigned long call_site,
+				   const void *ptr,
+				   size_t bytes_req,
+				   size_t bytes_alloc,
+				   gfp_t gfp_flags,
+				   int node)
+{
+	struct trace_array *tr = kmemtrace_array;
+	struct kmemtrace_alloc_entry *entry;
+	struct ring_buffer_event *event;
+
+	event = ring_buffer_lock_reserve(tr->buffer, sizeof(*entry));
+	if (!event)
+		return;
+
+	entry = ring_buffer_event_data(event);
+	tracing_generic_entry_update(&entry->ent, 0, 0);
+
+	entry->ent.type		= TRACE_KMEM_ALLOC;
+	entry->type_id		= type_id;
+	entry->call_site	= call_site;
+	entry->ptr		= ptr;
+	entry->bytes_req	= bytes_req;
+	entry->bytes_alloc	= bytes_alloc;
+	entry->gfp_flags	= gfp_flags;
+	entry->node		= node;
+
+	ring_buffer_unlock_commit(tr->buffer, event);
+
+	trace_wake_up();
+}
+
+static inline void kmemtrace_free(enum kmemtrace_type_id type_id,
+				  unsigned long call_site,
+				  const void *ptr)
+{
+	struct trace_array *tr = kmemtrace_array;
+	struct kmemtrace_free_entry *entry;
+	struct ring_buffer_event *event;
+
+	event = ring_buffer_lock_reserve(tr->buffer, sizeof(*entry));
+	if (!event)
+		return;
+	entry	= ring_buffer_event_data(event);
+	tracing_generic_entry_update(&entry->ent, 0, 0);
+
+	entry->ent.type		= TRACE_KMEM_FREE;
+	entry->type_id		= type_id;
+	entry->call_site	= call_site;
+	entry->ptr		= ptr;
+
+	ring_buffer_unlock_commit(tr->buffer, event);
+
+	trace_wake_up();
+}
+
+static void kmemtrace_kmalloc(unsigned long call_site,
+			      const void *ptr,
+			      size_t bytes_req,
+			      size_t bytes_alloc,
+			      gfp_t gfp_flags)
+{
+	kmemtrace_alloc(KMEMTRACE_TYPE_KMALLOC, call_site, ptr,
+			bytes_req, bytes_alloc, gfp_flags, -1);
+}
+
+static void kmemtrace_kmem_cache_alloc(unsigned long call_site,
+				       const void *ptr,
+				       size_t bytes_req,
+				       size_t bytes_alloc,
+				       gfp_t gfp_flags)
+{
+	kmemtrace_alloc(KMEMTRACE_TYPE_CACHE, call_site, ptr,
+			bytes_req, bytes_alloc, gfp_flags, -1);
+}
+
+static void kmemtrace_kmalloc_node(unsigned long call_site,
+				   const void *ptr,
+				   size_t bytes_req,
+				   size_t bytes_alloc,
+				   gfp_t gfp_flags,
+				   int node)
+{
+	kmemtrace_alloc(KMEMTRACE_TYPE_KMALLOC, call_site, ptr,
+			bytes_req, bytes_alloc, gfp_flags, node);
+}
+
+static void kmemtrace_kmem_cache_alloc_node(unsigned long call_site,
+					    const void *ptr,
+					    size_t bytes_req,
+					    size_t bytes_alloc,
+					    gfp_t gfp_flags,
+					    int node)
+{
+	kmemtrace_alloc(KMEMTRACE_TYPE_CACHE, call_site, ptr,
+			bytes_req, bytes_alloc, gfp_flags, node);
+}
+
+static void kmemtrace_kfree(unsigned long call_site, const void *ptr)
+{
+	kmemtrace_free(KMEMTRACE_TYPE_KMALLOC, call_site, ptr);
+}
+
+static void kmemtrace_kmem_cache_free(unsigned long call_site, const void *ptr)
+{
+	kmemtrace_free(KMEMTRACE_TYPE_CACHE, call_site, ptr);
+}
+
+static int kmemtrace_start_probes(void)
+{
+	int err;
+
+	err = register_trace_kmalloc(kmemtrace_kmalloc);
+	if (err)
+		return err;
+	err = register_trace_kmem_cache_alloc(kmemtrace_kmem_cache_alloc);
+	if (err)
+		return err;
+	err = register_trace_kmalloc_node(kmemtrace_kmalloc_node);
+	if (err)
+		return err;
+	err = register_trace_kmem_cache_alloc_node(kmemtrace_kmem_cache_alloc_node);
+	if (err)
+		return err;
+	err = register_trace_kfree(kmemtrace_kfree);
+	if (err)
+		return err;
+	err = register_trace_kmem_cache_free(kmemtrace_kmem_cache_free);
+
+	return err;
+}
+
+static void kmemtrace_stop_probes(void)
+{
+	unregister_trace_kmalloc(kmemtrace_kmalloc);
+	unregister_trace_kmem_cache_alloc(kmemtrace_kmem_cache_alloc);
+	unregister_trace_kmalloc_node(kmemtrace_kmalloc_node);
+	unregister_trace_kmem_cache_alloc_node(kmemtrace_kmem_cache_alloc_node);
+	unregister_trace_kfree(kmemtrace_kfree);
+	unregister_trace_kmem_cache_free(kmemtrace_kmem_cache_free);
+}
+
 static int kmem_trace_init(struct trace_array *tr)
 {
 	int cpu;
@@ -41,14 +185,14 @@ static int kmem_trace_init(struct trace_array *tr)
 	for_each_cpu_mask(cpu, cpu_possible_map)
 		tracing_reset(tr, cpu);
 
-	kmem_tracing_enabled = true;
+	kmemtrace_start_probes();
 
 	return 0;
 }
 
 static void kmem_trace_reset(struct trace_array *tr)
 {
-	kmem_tracing_enabled = false;
+	kmemtrace_stop_probes();
 }
 
 static void kmemtrace_headers(struct seq_file *s)
@@ -66,47 +210,84 @@ static void kmemtrace_headers(struct seq_file *s)
 }
 
 /*
- * The two following functions give the original output from kmemtrace,
- * or something close to....perhaps they need some missing things
+ * The following functions give the original output from kmemtrace,
+ * plus the origin CPU, since reordering occurs in-kernel now.
  */
+
+#define KMEMTRACE_USER_ALLOC	0
+#define KMEMTRACE_USER_FREE	1
+
+struct kmemtrace_user_event {
+	u8			event_id;
+	u8			type_id;
+	u16			event_size;
+	u32			cpu;
+	u64			timestamp;
+	unsigned long		call_site;
+	unsigned long		ptr;
+};
+
+struct kmemtrace_user_event_alloc {
+	size_t			bytes_req;
+	size_t			bytes_alloc;
+	unsigned		gfp_flags;
+	int			node;
+};
+
 static enum print_line_t
-kmemtrace_print_alloc_original(struct trace_iterator *iter,
-				struct kmemtrace_alloc_entry *entry)
+kmemtrace_print_alloc_user(struct trace_iterator *iter,
+			   struct kmemtrace_alloc_entry *entry)
 {
+	struct kmemtrace_user_event_alloc *ev_alloc;
 	struct trace_seq *s = &iter->seq;
-	int ret;
+	struct kmemtrace_user_event *ev;
+
+	ev = trace_seq_reserve(s, sizeof(*ev));
+	if (!ev)
+		return TRACE_TYPE_PARTIAL_LINE;
 
-	/* Taken from the old linux/kmemtrace.h */
-	ret = trace_seq_printf(s, "type_id %d call_site %lu ptr %lu "
-	  "bytes_req %lu bytes_alloc %lu gfp_flags %lu node %d\n",
-	   entry->type_id, entry->call_site, (unsigned long) entry->ptr,
-	   (unsigned long) entry->bytes_req, (unsigned long) entry->bytes_alloc,
-	   (unsigned long) entry->gfp_flags, entry->node);
+	ev->event_id		= KMEMTRACE_USER_ALLOC;
+	ev->type_id		= entry->type_id;
+	ev->event_size		= sizeof(*ev) + sizeof(*ev_alloc);
+	ev->cpu			= iter->cpu;
+	ev->timestamp		= iter->ts;
+	ev->call_site		= entry->call_site;
+	ev->ptr			= (unsigned long)entry->ptr;
 
-	if (!ret)
+	ev_alloc = trace_seq_reserve(s, sizeof(*ev_alloc));
+	if (!ev_alloc)
 		return TRACE_TYPE_PARTIAL_LINE;
 
+	ev_alloc->bytes_req	= entry->bytes_req;
+	ev_alloc->bytes_alloc	= entry->bytes_alloc;
+	ev_alloc->gfp_flags	= entry->gfp_flags;
+	ev_alloc->node		= entry->node;
+
 	return TRACE_TYPE_HANDLED;
 }
 
 static enum print_line_t
-kmemtrace_print_free_original(struct trace_iterator *iter,
-				struct kmemtrace_free_entry *entry)
+kmemtrace_print_free_user(struct trace_iterator *iter,
+			  struct kmemtrace_free_entry *entry)
 {
 	struct trace_seq *s = &iter->seq;
-	int ret;
+	struct kmemtrace_user_event *ev;
 
-	/* Taken from the old linux/kmemtrace.h */
-	ret = trace_seq_printf(s, "type_id %d call_site %lu ptr %lu\n",
-	   entry->type_id, entry->call_site, (unsigned long) entry->ptr);
-
-	if (!ret)
+	ev = trace_seq_reserve(s, sizeof(*ev));
+	if (!ev)
 		return TRACE_TYPE_PARTIAL_LINE;
 
+	ev->event_id		= KMEMTRACE_USER_FREE;
+	ev->type_id		= entry->type_id;
+	ev->event_size		= sizeof(*ev);
+	ev->cpu			= iter->cpu;
+	ev->timestamp		= iter->ts;
+	ev->call_site		= entry->call_site;
+	ev->ptr			= (unsigned long)entry->ptr;
+
 	return TRACE_TYPE_HANDLED;
 }
 
-
 /* The two other following provide a more minimalistic output */
 static enum print_line_t
 kmemtrace_print_alloc_compress(struct trace_iterator *iter,
@@ -178,7 +359,7 @@ kmemtrace_print_alloc_compress(struct trace_iterator *iter,
 
 static enum print_line_t
 kmemtrace_print_free_compress(struct trace_iterator *iter,
-				struct kmemtrace_free_entry *entry)
+			      struct kmemtrace_free_entry *entry)
 {
 	struct trace_seq *s = &iter->seq;
 	int ret;
@@ -239,20 +420,22 @@ static enum print_line_t kmemtrace_print_line(struct trace_iterator *iter)
 	switch (entry->type) {
 	case TRACE_KMEM_ALLOC: {
 		struct kmemtrace_alloc_entry *field;
+
 		trace_assign_type(field, entry);
 		if (kmem_tracer_flags.val & TRACE_KMEM_OPT_MINIMAL)
 			return kmemtrace_print_alloc_compress(iter, field);
 		else
-			return kmemtrace_print_alloc_original(iter, field);
+			return kmemtrace_print_alloc_user(iter, field);
 	}
 
 	case TRACE_KMEM_FREE: {
 		struct kmemtrace_free_entry *field;
+
 		trace_assign_type(field, entry);
 		if (kmem_tracer_flags.val & TRACE_KMEM_OPT_MINIMAL)
 			return kmemtrace_print_free_compress(iter, field);
 		else
-			return kmemtrace_print_free_original(iter, field);
+			return kmemtrace_print_free_user(iter, field);
 	}
 
 	default:
@@ -260,70 +443,13 @@ static enum print_line_t kmemtrace_print_line(struct trace_iterator *iter)
 	}
 }
 
-/* Trace allocations */
-void kmemtrace_mark_alloc_node(enum kmemtrace_type_id type_id,
-			     unsigned long call_site,
-			     const void *ptr,
-			     size_t bytes_req,
-			     size_t bytes_alloc,
-			     gfp_t gfp_flags,
-			     int node)
-{
-	struct ring_buffer_event *event;
-	struct kmemtrace_alloc_entry *entry;
-	struct trace_array *tr = kmemtrace_array;
-
-	if (!kmem_tracing_enabled)
-		return;
-
-	event = trace_buffer_lock_reserve(tr, TRACE_KMEM_ALLOC,
-					  sizeof(*entry), 0, 0);
-	if (!event)
-		return;
-	entry	= ring_buffer_event_data(event);
-
-	entry->call_site = call_site;
-	entry->ptr = ptr;
-	entry->bytes_req = bytes_req;
-	entry->bytes_alloc = bytes_alloc;
-	entry->gfp_flags = gfp_flags;
-	entry->node	=	node;
-
-	trace_buffer_unlock_commit(tr, event, 0, 0);
-}
-EXPORT_SYMBOL(kmemtrace_mark_alloc_node);
-
-void kmemtrace_mark_free(enum kmemtrace_type_id type_id,
-		       unsigned long call_site,
-		       const void *ptr)
-{
-	struct ring_buffer_event *event;
-	struct kmemtrace_free_entry *entry;
-	struct trace_array *tr = kmemtrace_array;
-
-	if (!kmem_tracing_enabled)
-		return;
-
-	event = trace_buffer_lock_reserve(tr, TRACE_KMEM_FREE,
-					  sizeof(*entry), 0, 0);
-	if (!event)
-		return;
-	entry	= ring_buffer_event_data(event);
-	entry->type_id	= type_id;
-	entry->call_site = call_site;
-	entry->ptr = ptr;
-
-	trace_buffer_unlock_commit(tr, event, 0, 0);
-}
-EXPORT_SYMBOL(kmemtrace_mark_free);
-
 static struct tracer kmem_tracer __read_mostly = {
-	.name		= "kmemtrace",
-	.init		= kmem_trace_init,
-	.reset		= kmem_trace_reset,
-	.print_line	= kmemtrace_print_line,
-	.print_header = kmemtrace_headers,
-	.flags		= &kmem_tracer_flags
+	.name			= "kmemtrace",
+	.init			= kmem_trace_init,
+	.reset			= kmem_trace_reset,
+	.print_line		= kmemtrace_print_line,
+	.print_header		= kmemtrace_headers,
+	.flags			= &kmem_tracer_flags
 };
 
 void kmemtrace_init(void)
@@ -335,5 +461,4 @@ static int __init init_kmem_tracer(void)
 {
 	return register_tracer(&kmem_tracer);
 }
-
 device_initcall(init_kmem_tracer);
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index cb0ce3f..cbc168f 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -182,6 +182,12 @@ struct trace_power {
 	struct power_trace	state_data;
 };
 
+enum kmemtrace_type_id {
+	KMEMTRACE_TYPE_KMALLOC = 0,	/* kmalloc() or kfree(). */
+	KMEMTRACE_TYPE_CACHE,		/* kmem_cache_*(). */
+	KMEMTRACE_TYPE_PAGES,		/* __get_free_pages() and friends. */
+};
+
 struct kmemtrace_alloc_entry {
 	struct trace_entry	ent;
 	enum kmemtrace_type_id type_id;
diff --git a/lib/decompress_bunzip2.c b/lib/decompress_bunzip2.c
index 5d3ddb5..708e2a8 100644
--- a/lib/decompress_bunzip2.c
+++ b/lib/decompress_bunzip2.c
@@ -50,6 +50,7 @@
 #endif /* !STATIC */
 
 #include <linux/decompress/mm.h>
+#include <linux/slab.h>
 
 #ifndef INT_MAX
 #define INT_MAX 0x7fffffff
diff --git a/lib/decompress_inflate.c b/lib/decompress_inflate.c
index 839a329..e36b296 100644
--- a/lib/decompress_inflate.c
+++ b/lib/decompress_inflate.c
@@ -23,6 +23,7 @@
 #endif /* STATIC */
 
 #include <linux/decompress/mm.h>
+#include <linux/slab.h>
 
 #define INBUF_LEN (16*1024)
 
diff --git a/lib/decompress_unlzma.c b/lib/decompress_unlzma.c
index 546f2f4..32123a1 100644
--- a/lib/decompress_unlzma.c
+++ b/lib/decompress_unlzma.c
@@ -34,6 +34,7 @@
 #endif /* STATIC */
 
 #include <linux/decompress/mm.h>
+#include <linux/slab.h>
 
 #define	MIN(a, b) (((a) < (b)) ? (a) : (b))
 
diff --git a/mm/failslab.c b/mm/failslab.c
index 7c6ea64..9339de5 100644
--- a/mm/failslab.c
+++ b/mm/failslab.c
@@ -1,4 +1,5 @@
 #include <linux/fault-inject.h>
+#include <linux/gfp.h>
 
 static struct {
 	struct fault_attr attr;
diff --git a/mm/slab.c b/mm/slab.c
index 9ec66c3..b584002 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3565,8 +3565,8 @@ void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
 {
 	void *ret = __cache_alloc(cachep, flags, __builtin_return_address(0));
 
-	kmemtrace_mark_alloc(KMEMTRACE_TYPE_CACHE, _RET_IP_, ret,
-			     obj_size(cachep), cachep->buffer_size, flags);
+	trace_kmem_cache_alloc(_RET_IP_, ret,
+			       obj_size(cachep), cachep->buffer_size, flags);
 
 	return ret;
 }
@@ -3627,9 +3627,9 @@ void *kmem_cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 	void *ret = __cache_alloc_node(cachep, flags, nodeid,
 				       __builtin_return_address(0));
 
-	kmemtrace_mark_alloc_node(KMEMTRACE_TYPE_CACHE, _RET_IP_, ret,
-				  obj_size(cachep), cachep->buffer_size,
-				  flags, nodeid);
+	trace_kmem_cache_alloc_node(_RET_IP_, ret,
+				    obj_size(cachep), cachep->buffer_size,
+				    flags, nodeid);
 
 	return ret;
 }
@@ -3657,9 +3657,8 @@ __do_kmalloc_node(size_t size, gfp_t flags, int node, void *caller)
 		return cachep;
 	ret = kmem_cache_alloc_node_notrace(cachep, flags, node);
 
-	kmemtrace_mark_alloc_node(KMEMTRACE_TYPE_KMALLOC,
-				  (unsigned long) caller, ret,
-				  size, cachep->buffer_size, flags, node);
+	trace_kmalloc_node((unsigned long) caller, ret,
+			   size, cachep->buffer_size, flags, node);
 
 	return ret;
 }
@@ -3709,9 +3708,8 @@ static __always_inline void *__do_kmalloc(size_t size, gfp_t flags,
 		return cachep;
 	ret = __cache_alloc(cachep, flags, caller);
 
-	kmemtrace_mark_alloc(KMEMTRACE_TYPE_KMALLOC,
-			     (unsigned long) caller, ret,
-			     size, cachep->buffer_size, flags);
+	trace_kmalloc((unsigned long) caller, ret,
+		      size, cachep->buffer_size, flags);
 
 	return ret;
 }
@@ -3757,7 +3755,7 @@ void kmem_cache_free(struct kmem_cache *cachep, void *objp)
 	__cache_free(cachep, objp);
 	local_irq_restore(flags);
 
-	kmemtrace_mark_free(KMEMTRACE_TYPE_CACHE, _RET_IP_, objp);
+	trace_kmem_cache_free(_RET_IP_, objp);
 }
 EXPORT_SYMBOL(kmem_cache_free);
 
@@ -3775,6 +3773,8 @@ void kfree(const void *objp)
 	struct kmem_cache *c;
 	unsigned long flags;
 
+	trace_kfree(_RET_IP_, objp);
+
 	if (unlikely(ZERO_OR_NULL_PTR(objp)))
 		return;
 	local_irq_save(flags);
@@ -3784,8 +3784,6 @@ void kfree(const void *objp)
 	debug_check_no_obj_freed(objp, obj_size(c));
 	__cache_free(c, (void *)objp);
 	local_irq_restore(flags);
-
-	kmemtrace_mark_free(KMEMTRACE_TYPE_KMALLOC, _RET_IP_, objp);
 }
 EXPORT_SYMBOL(kfree);
 
diff --git a/mm/slob.c b/mm/slob.c
index 4dd6516..a2d4ab3 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -490,9 +490,8 @@ void *__kmalloc_node(size_t size, gfp_t gfp, int node)
 		*m = size;
 		ret = (void *)m + align;
 
-		kmemtrace_mark_alloc_node(KMEMTRACE_TYPE_KMALLOC,
-					  _RET_IP_, ret,
-					  size, size + align, gfp, node);
+		trace_kmalloc_node(_RET_IP_, ret,
+				   size, size + align, gfp, node);
 	} else {
 		unsigned int order = get_order(size);
 
@@ -503,9 +502,8 @@ void *__kmalloc_node(size_t size, gfp_t gfp, int node)
 			page->private = size;
 		}
 
-		kmemtrace_mark_alloc_node(KMEMTRACE_TYPE_KMALLOC,
-					  _RET_IP_, ret,
-					  size, PAGE_SIZE << order, gfp, node);
+		trace_kmalloc_node(_RET_IP_, ret,
+				   size, PAGE_SIZE << order, gfp, node);
 	}
 
 	return ret;
@@ -516,6 +514,8 @@ void kfree(const void *block)
 {
 	struct slob_page *sp;
 
+	trace_kfree(_RET_IP_, block);
+
 	if (unlikely(ZERO_OR_NULL_PTR(block)))
 		return;
 
@@ -526,8 +526,6 @@ void kfree(const void *block)
 		slob_free(m, *m + align);
 	} else
 		put_page(&sp->page);
-
-	kmemtrace_mark_free(KMEMTRACE_TYPE_KMALLOC, _RET_IP_, block);
 }
 EXPORT_SYMBOL(kfree);
 
@@ -599,16 +597,14 @@ void *kmem_cache_alloc_node(struct kmem_cache *c, gfp_t flags, int node)
 
 	if (c->size < PAGE_SIZE) {
 		b = slob_alloc(c->size, flags, c->align, node);
-		kmemtrace_mark_alloc_node(KMEMTRACE_TYPE_CACHE,
-					  _RET_IP_, b, c->size,
-					  SLOB_UNITS(c->size) * SLOB_UNIT,
-					  flags, node);
+		trace_kmem_cache_alloc_node(_RET_IP_, b, c->size,
+					    SLOB_UNITS(c->size) * SLOB_UNIT,
+					    flags, node);
 	} else {
 		b = slob_new_pages(flags, get_order(c->size), node);
-		kmemtrace_mark_alloc_node(KMEMTRACE_TYPE_CACHE,
-					  _RET_IP_, b, c->size,
-					  PAGE_SIZE << get_order(c->size),
-					  flags, node);
+		trace_kmem_cache_alloc_node(_RET_IP_, b, c->size,
+					    PAGE_SIZE << get_order(c->size),
+					    flags, node);
 	}
 
 	if (c->ctor)
@@ -646,7 +642,7 @@ void kmem_cache_free(struct kmem_cache *c, void *b)
 		__kmem_cache_free(b, c->size);
 	}
 
-	kmemtrace_mark_free(KMEMTRACE_TYPE_CACHE, _RET_IP_, b);
+	trace_kmem_cache_free(_RET_IP_, b);
 }
 EXPORT_SYMBOL(kmem_cache_free);
 
diff --git a/mm/slub.c b/mm/slub.c
index 7aaa121..7ab54ec 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1621,8 +1621,7 @@ void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
 {
 	void *ret = slab_alloc(s, gfpflags, -1, _RET_IP_);
 
-	kmemtrace_mark_alloc(KMEMTRACE_TYPE_CACHE, _RET_IP_, ret,
-			     s->objsize, s->size, gfpflags);
+	trace_kmem_cache_alloc(_RET_IP_, ret, s->objsize, s->size, gfpflags);
 
 	return ret;
 }
@@ -1641,8 +1640,8 @@ void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node)
 {
 	void *ret = slab_alloc(s, gfpflags, node, _RET_IP_);
 
-	kmemtrace_mark_alloc_node(KMEMTRACE_TYPE_CACHE, _RET_IP_, ret,
-				  s->objsize, s->size, gfpflags, node);
+	trace_kmem_cache_alloc_node(_RET_IP_, ret,
+				    s->objsize, s->size, gfpflags, node);
 
 	return ret;
 }
@@ -1767,7 +1766,7 @@ void kmem_cache_free(struct kmem_cache *s, void *x)
 
 	slab_free(s, page, x, _RET_IP_);
 
-	kmemtrace_mark_free(KMEMTRACE_TYPE_CACHE, _RET_IP_, x);
+	trace_kmem_cache_free(_RET_IP_, x);
 }
 EXPORT_SYMBOL(kmem_cache_free);
 
@@ -2702,8 +2701,7 @@ void *__kmalloc(size_t size, gfp_t flags)
 
 	ret = slab_alloc(s, flags, -1, _RET_IP_);
 
-	kmemtrace_mark_alloc(KMEMTRACE_TYPE_KMALLOC, _RET_IP_, ret,
-			     size, s->size, flags);
+	trace_kmalloc(_RET_IP_, ret, size, s->size, flags);
 
 	return ret;
 }
@@ -2729,10 +2727,9 @@ void *__kmalloc_node(size_t size, gfp_t flags, int node)
 	if (unlikely(size > SLUB_MAX_SIZE)) {
 		ret = kmalloc_large_node(size, flags, node);
 
-		kmemtrace_mark_alloc_node(KMEMTRACE_TYPE_KMALLOC,
-					  _RET_IP_, ret,
-					  size, PAGE_SIZE << get_order(size),
-					  flags, node);
+		trace_kmalloc_node(_RET_IP_, ret,
+				   size, PAGE_SIZE << get_order(size),
+				   flags, node);
 
 		return ret;
 	}
@@ -2744,8 +2741,7 @@ void *__kmalloc_node(size_t size, gfp_t flags, int node)
 
 	ret = slab_alloc(s, flags, node, _RET_IP_);
 
-	kmemtrace_mark_alloc_node(KMEMTRACE_TYPE_KMALLOC, _RET_IP_, ret,
-				  size, s->size, flags, node);
+	trace_kmalloc_node(_RET_IP_, ret, size, s->size, flags, node);
 
 	return ret;
 }
@@ -2796,6 +2792,8 @@ void kfree(const void *x)
 	struct page *page;
 	void *object = (void *)x;
 
+	trace_kfree(_RET_IP_, x);
+
 	if (unlikely(ZERO_OR_NULL_PTR(x)))
 		return;
 
@@ -2806,8 +2804,6 @@ void kfree(const void *x)
 		return;
 	}
 	slab_free(page->slab, page, object, _RET_IP_);
-
-	kmemtrace_mark_free(KMEMTRACE_TYPE_KMALLOC, _RET_IP_, x);
 }
 EXPORT_SYMBOL(kfree);
 
@@ -3290,8 +3286,7 @@ void *__kmalloc_track_caller(size_t size, gfp_t gfpflags, unsigned long caller)
 	ret = slab_alloc(s, gfpflags, -1, caller);
 
 	/* Honor the call site pointer we recieved. */
-	kmemtrace_mark_alloc(KMEMTRACE_TYPE_KMALLOC, caller, ret, size,
-			     s->size, gfpflags);
+	trace_kmalloc(caller, ret, size, s->size, gfpflags);
 
 	return ret;
 }
@@ -3313,8 +3308,7 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags,
 	ret = slab_alloc(s, gfpflags, node, caller);
 
 	/* Honor the call site pointer we recieved. */
-	kmemtrace_mark_alloc_node(KMEMTRACE_TYPE_KMALLOC, caller, ret,
-				  size, s->size, gfpflags, node);
+	trace_kmalloc_node(caller, ret, size, s->size, gfpflags, node);
 
 	return ret;
 }
diff --git a/mm/util.c b/mm/util.c
index 7c122e4..2599e83 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -4,6 +4,7 @@
 #include <linux/module.h>
 #include <linux/err.h>
 #include <linux/sched.h>
+#include <linux/tracepoint.h>
 #include <asm/uaccess.h>
 
 /**
@@ -236,3 +237,18 @@ int __attribute__((weak)) get_user_pages_fast(unsigned long start,
 	return ret;
 }
 EXPORT_SYMBOL_GPL(get_user_pages_fast);
+
+/* Tracepoints definitions. */
+DEFINE_TRACE(kmalloc);
+DEFINE_TRACE(kmem_cache_alloc);
+DEFINE_TRACE(kmalloc_node);
+DEFINE_TRACE(kmem_cache_alloc_node);
+DEFINE_TRACE(kfree);
+DEFINE_TRACE(kmem_cache_free);
+
+EXPORT_TRACEPOINT_SYMBOL(kmalloc);
+EXPORT_TRACEPOINT_SYMBOL(kmem_cache_alloc);
+EXPORT_TRACEPOINT_SYMBOL(kmalloc_node);
+EXPORT_TRACEPOINT_SYMBOL(kmem_cache_alloc_node);
+EXPORT_TRACEPOINT_SYMBOL(kfree);
+EXPORT_TRACEPOINT_SYMBOL(kmem_cache_free);


* Re: [GIT PULL] SLAB include file dependency fixes + kmemtrace updates
  2009-04-05 19:39 [GIT PULL] SLAB include file dependency fixes + kmemtrace updates Ingo Molnar
@ 2009-04-05 19:56 ` Linus Torvalds
  2009-04-05 20:02   ` Linus Torvalds
  2009-04-07  1:51 ` Linus Torvalds
  1 sibling, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2009-04-05 19:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Pekka Enberg, Steven Rostedt, Andrew Morton,
	Thomas Gleixner, Eduard - Gabriel Munteanu



On Sun, 5 Apr 2009, Ingo Molnar wrote:
> 
> ( Please note that i rebased the tree exactly once, shortly after it 
>   got finished, to make it all bisectable and reviewable: the 
>   perfect insight shown in the tree now was IMHO not humanly 
>   possible to achieve in advance.

This is actually the right thing to do. Rebasing is not wrong, if it is 
done judiciously (and not on already-exposed stuff).

Rebasing is bad if
 - you do it so late in the game that all test experience is basically 
   worthless
 - after you've pushed out and other people have seen and depend on that 
   branch (and you didn't warn them)
 - you do it to other people's git commits, so that their tree (that was 
   the source of the commit) now has the same commit duplicated as 
   something else.

I wrote a posting about rebasing to Dave Airlie on dri-devel, that might 
be googleable. Hmm. Here:

	http://www.mail-archive.com/dri-devel@lists.sourceforge.net/msg39091.html

so rebase isn't bad, it's often a great way to fix things as you go along. 
It just goes with a few basic caveats.

			Linus



* Re: [GIT PULL] SLAB include file dependency fixes + kmemtrace updates
  2009-04-05 19:56 ` Linus Torvalds
@ 2009-04-05 20:02   ` Linus Torvalds
  0 siblings, 0 replies; 17+ messages in thread
From: Linus Torvalds @ 2009-04-05 20:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Pekka Enberg, Steven Rostedt, Andrew Morton,
	Thomas Gleixner, Eduard - Gabriel Munteanu

On Sun, 5 Apr 2009, Linus Torvalds wrote:
> 
> so rebase isn't bad, it's often a great way to fix things as you go along. 
> It just goes with a few basic caveats.

Btw, rebasing gets a really bad name exactly because when those rules are 
violated, it ends up being _really_ painful and almost impossible for 
people to work together because the end result just ends up being some 
kind of patch-queue thing, and that just doesn't work once the number of 
queue entries reach a certain pain threshold.

And it's happened in ACPI, DRI, ALSA, x86, etc, so those few basic caveats 
are easy to violate and then have everybody hate you. 'git rebase' is very 
useful (especially with '-i'), but it's really a dangerous tool in that it 
can really wreak havoc too.

			Linus


* Re: [GIT PULL] SLAB include file dependency fixes + kmemtrace updates
  2009-04-05 19:39 [GIT PULL] SLAB include file dependency fixes + kmemtrace updates Ingo Molnar
  2009-04-05 19:56 ` Linus Torvalds
@ 2009-04-07  1:51 ` Linus Torvalds
  2009-04-07  3:51   ` Linus Torvalds
  2009-04-07  4:58   ` Ingo Molnar
  1 sibling, 2 replies; 17+ messages in thread
From: Linus Torvalds @ 2009-04-07  1:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Pekka Enberg, Steven Rostedt, Andrew Morton,
	Thomas Gleixner, Eduard - Gabriel Munteanu



On Sun, 5 Apr 2009, Ingo Molnar wrote:
> 
> Please pull the latest kmemtrace-for-linus git tree from:
> 
>    git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git kmemtrace-for-linus
> 
> We kept this topic separate from the main tracing tree due to the 
> unexpectedly wide and messy-looking scope of the fixes Pekka needed 
> to do to untangle various slab*.h, rcu*.h and fs.h dependency 
> chains.

I'm not sure this is the tree that brings in the problem, but my wife's 
Mac Mini won't boot any more, and it looks like some slub or percpu issue, 
so regardless, roughly the right people are involved in the cc here 
already.

I get odd NUL page faults or GP faults in either __kmalloc, 
__kmalloc_track_caller or kmem_cache_alloc, and they all seem to happen on 
roughly the same code, ie it's something like this:

        movq    752(%r13,%rax,8), %rdx  # <variable>.cpu_slab, c
        movl    24(%rdx), %eax  # <variable>.objsize,
        movl    %eax, -44(%rbp) #, objsize
        movq    (%rdx), %r12    # <variable>.freelist, object
        testq   %r12, %r12      # object
        je      .L617   #,
        mov     20(%rdx), %eax  # <variable>.offset, <variable>.offset
->      movq    (%r12,%rax,8), %rax     #* object, tmp79
        movq    %rax, (%rdx)    # tmp79, <variable>.freelist

where that arrow points to the instruction that seems to be faulting.

I think it's this code:

                object = c->freelist;
                c->freelist = object[c->offset];

and that "object[c->offset]" in particular.
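
A minimal userspace sketch of that freelist pop, for illustration only. The
struct and function names below are hypothetical stand-ins, not the kernel's
definitions; the assumption is that each free object stores the address of
the next free object at word index `offset`, which matches the scaled-by-8
load in the disassembly above. If the freelist head has been corrupted, the
marked load dereferences garbage, which would look like the faults described.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-CPU slab state; not the kernel's struct. */
struct cpu_slab_sketch {
	void **freelist;	/* first free object, or NULL when empty */
	unsigned int offset;	/* word index of the "next free" link in an object */
};

/* Pop one object off the freelist, mirroring the two quoted lines. */
static void *freelist_pop(struct cpu_slab_sketch *c)
{
	void **object = c->freelist;

	if (!object)
		return NULL;	/* the real allocator would take a slow path here */
	c->freelist = object[c->offset];	/* faults if 'object' is garbage */
	return object;
}

/* Build a three-object chain and count how many pops succeed. */
static int freelist_demo(void)
{
	void *objs[3][4] = { { NULL } };	/* three fake four-word objects */
	struct cpu_slab_sketch c = { .freelist = objs[0], .offset = 1 };
	int popped = 0;

	objs[0][1] = objs[1];	/* object 0 links to object 1 */
	objs[1][1] = objs[2];	/* object 1 links to object 2 */
	objs[2][1] = NULL;	/* object 2 ends the chain */

	while (freelist_pop(&c))
		popped++;
	return popped;
}
```

Because the link lives inside the free object itself, anything that writes to
an object after it has been freed, or frees it twice, can plant a bad link
that only faults on some later, unrelated allocation.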

I have not tried to bisect it yet, and I'll do that, but if this sounds 
familiar to anybody, please holler before I waste a lot of time on it.

		Linus


* Re: [GIT PULL] SLAB include file dependency fixes + kmemtrace updates
  2009-04-07  1:51 ` Linus Torvalds
@ 2009-04-07  3:51   ` Linus Torvalds
  2009-04-07  4:20     ` Linus Torvalds
  2009-04-07  4:58   ` Ingo Molnar
  1 sibling, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2009-04-07  3:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel Mailing List, Pekka Enberg, Steven Rostedt,
	Ingo Molnar, Thomas Gleixner, Eduard - Gabriel Munteanu

On Mon, 6 Apr 2009, Linus Torvalds wrote:
> 
> I'm not sure this is the tree that brings in the problem, but my wife's 
> Mac Mini won't boot any more, and it looks like some slub or percpu issue, 
> so regardless, roughly the right people are involved in the cc here 
> already.
> 
> I get odd NUL page faults or GP faults in either __kmalloc, 
> __kmalloc_track_caller or kmem_cache_alloc

Hmm. Bisected to Andrew's big chunk of merges on April 1st. Right now I 
have

 - bad: 527410ff7fc5d45fe41523c0ba061113dea22017 ("cirrusfb: GD5446 
   fixes")

 - good: 63cd885426872254e82dac2d9e13ea4f720c21dc ("ntfs: remove private 
   wrapper of endian helpers")

and all the commits in between are all from that same -mm series. Very 
interesting. 

Anyway, the SLUB errors and the per-cpu'ness of the thing seems to have 
been a false lead and irrelevant. 

Andrew, I'll continue to bisect. Looks like it might be the epoll changes 
(that's the only really 'core' thing there). Although I don't understand 
why those would make that Mac Mini unhappy, but not affect the other 
machines. So maybe I should just stop guessing until the bisection ends..

			Linus


* Re: [GIT PULL] SLAB include file dependency fixes + kmemtrace updates
  2009-04-07  3:51   ` Linus Torvalds
@ 2009-04-07  4:20     ` Linus Torvalds
  2009-04-07  4:45       ` Linus Torvalds
  0 siblings, 1 reply; 17+ messages in thread
From: Linus Torvalds @ 2009-04-07  4:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linux Kernel Mailing List, Pekka Enberg, Steven Rostedt,
	Ingo Molnar, Thomas Gleixner, Eduard - Gabriel Munteanu



On Mon, 6 Apr 2009, Linus Torvalds wrote:
> 
> Andrew, I'll continue to bisect. Looks like it might be the epoll changes 
> (that's the only really 'core' thing there). Although I don't understand 
> why those would make that Mac Mini unhappy, but not affect the other 
> machines. So maybe I should just stop guessing until the bisection ends..

It bisected past them. I'm getting worried that it's timing-related, 
because nothing that remains looks even remotely interesting for that Mac 
mini, but right now:

 - bad: 56fcef75117a153f298b3fe54af31053f53997dd
 - good: bb233fdfc7b7cefe45bfa2e8d1b24e79c60a48e5

and there's not a whole lot of commits in between.

		Linus


* Re: [GIT PULL] SLAB include file dependency fixes + kmemtrace updates
  2009-04-07  4:20     ` Linus Torvalds
@ 2009-04-07  4:45       ` Linus Torvalds
  2009-04-07  5:02         ` Wu Fengguang
                           ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Linus Torvalds @ 2009-04-07  4:45 UTC (permalink / raw)
  To: Andrew Morton, Avan Anishchuk, Wu Fengguang
  Cc: Linux Kernel Mailing List, Pekka Enberg, Steven Rostedt,
	Ingo Molnar, Thomas Gleixner, Eduard - Gabriel Munteanu



On Mon, 6 Apr 2009, Linus Torvalds wrote:
> 
> It bisected past them. I'm getting worried that it's timing-related, 
> because nothing that remains looks even remotely interesting for that Mac 
> mini, but right now:
> 
>  - bad: 56fcef75117a153f298b3fe54af31053f53997dd
>  - good: bb233fdfc7b7cefe45bfa2e8d1b24e79c60a48e5
> 
> and there's not a whole lot of commits in between.

It's c3b1b1cbf002e65a3cabd479e68b5f35886a26db: 'ramfs: add support for 
"mode=" mount option'.

And I checked. Reverting it at the tip fixes it. So no random timing 
fluctuations.

So that commit causes some random SLAB corruption, that then (depending 
apparently on luck) may or may not crash in some odd random places later.

Wu?

		Linus


* Re: [GIT PULL] SLAB include file dependency fixes + kmemtrace updates
  2009-04-07  4:45       ` Linus Torvalds
@ 2009-04-07  5:02         ` Wu Fengguang
  2009-04-07  5:28         ` [patch] ramfs: add support for "mode=" mount option, fix Ingo Molnar
  2009-04-07  5:30         ` [GIT PULL] SLAB include file dependency fixes + kmemtrace updates Wu Fengguang
  2 siblings, 0 replies; 17+ messages in thread
From: Wu Fengguang @ 2009-04-07  5:02 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Avan Anishchuk, Linux Kernel Mailing List,
	Pekka Enberg, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Eduard - Gabriel Munteanu

On Tue, Apr 07, 2009 at 12:45:37PM +0800, Linus Torvalds wrote:
> 
> 
> On Mon, 6 Apr 2009, Linus Torvalds wrote:
> > 
> > It bisected past them. I'm getting worried that it's timing-related, 
> > because nothing that remains looks even remotely interesting for that Mac 
> > mini, but right now:
> > 
> >  - bad: 56fcef75117a153f298b3fe54af31053f53997dd
> >  - good: bb233fdfc7b7cefe45bfa2e8d1b24e79c60a48e5
> > 
> > and there's not a whole lot of commits in between.
> 
> It's c3b1b1cbf002e65a3cabd479e68b5f35886a26db: 'ramfs: add support for 
> "mode=" mount option'.
> 
> And I checked. Reverting it at the tip fixes it. So no random timing 
> fluctuations.
> 
> So that commit causes some random SLAB corruption, that then (depending 
> apparently on luck) may or may not crash in some odd random places later.
> 
> Wu?

DON'T PANIC!  -- The Hitchhiker's Guide to the Galaxy

I'm looking into this :)

Thanks,
Fengguang



* [patch] ramfs: add support for "mode=" mount option, fix
  2009-04-07  4:45       ` Linus Torvalds
  2009-04-07  5:02         ` Wu Fengguang
@ 2009-04-07  5:28         ` Ingo Molnar
  2009-04-07  5:55           ` Wu Fengguang
  2009-04-07  5:30         ` [GIT PULL] SLAB include file dependency fixes + kmemtrace updates Wu Fengguang
  2 siblings, 1 reply; 17+ messages in thread
From: Ingo Molnar @ 2009-04-07  5:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Avan Anishchuk, Wu Fengguang,
	Linux Kernel Mailing List, Pekka Enberg, Steven Rostedt,
	Thomas Gleixner, Eduard - Gabriel Munteanu


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, 6 Apr 2009, Linus Torvalds wrote:
> > 
> > It bisected past them. I'm getting worried that it's timing-related, 
> > because nothing that remains looks even remotely interesting for that Mac 
> > mini, but right now:
> > 
> >  - bad: 56fcef75117a153f298b3fe54af31053f53997dd
> >  - good: bb233fdfc7b7cefe45bfa2e8d1b24e79c60a48e5
> > 
> > and there's not a whole lot of commits in between.
> 
> It's c3b1b1cbf002e65a3cabd479e68b5f35886a26db: 'ramfs: add support 
> for "mode=" mount option'.
> 
> And I checked. Reverting it at the tip fixes it. So no random 
> timing fluctuations.
> 
> So that commit causes some random SLAB corruption, that then 
> (depending apparently on luck) may or may not crash in some odd 
> random places later.

ah - forget my previous mail then.

This commit does have a couple of genuinely odd looking lines.

For example:

+       sb->s_fs_info = fsi;
+
+       err = ramfs_parse_options(data, &fsi->mount_opts);
+       if (err)
+               goto fail;

Say we fail in ramfs_parse_options() and do the 'fail' pattern:

+fail:
+       kfree(fsi);
+       iput(inode);
+       return err;

so we have 'fsi' kfree()'d but don't clear sb->s_fs_info! That's 
almost always a bad practice. And indeed, in the kill_super 
callback:

+static void ramfs_kill_sb(struct super_block *sb)
+{
+       kfree(sb->s_fs_info);

What ensures that this cannot be a double kfree() memory corruption? 
That pointer should have been cleared with something like the patch 
below. (totally untested)

And there's also another, probably just theoretical worry about 
another failure path:

+       fsi = kzalloc(sizeof(struct ramfs_fs_info), GFP_KERNEL);
+       if (!fsi) {
+               err = -ENOMEM;
+               goto fail;
+       }
+       sb->s_fs_info = fsi;

leaves sb->s_fs_info uninitialized in the failure case, and might 
hit this code unconditionally:

+static void ramfs_kill_sb(struct super_block *sb)
+{
+       kfree(sb->s_fs_info);
+       kill_litter_super(sb);
+}

Leaving this code at the mercy of the external call environment 
initializing sb->s_fs_info. Which if it does not do (or stops 
doing in the future), can trigger a kfree of a random pointer.

(I think ->kill_super() gets called even if ->fill_super() fails, 
but i have not checked closely.)

These kinds of asymmetric failure paths are really a red flag during 
review.
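
The symmetric shape argued for above can be sketched in plain userspace C.
This is an illustration under stated assumptions, not the ramfs code: the
`*_sketch` names are hypothetical, `calloc()`/`free()` stand in for
`kzalloc()`/`kfree()`, and the `parse_fails` flag stands in for an option
parsing error. The pointer is published even when the allocation fails, and
the failure path forgets it after freeing it, so the teardown function never
sees a stale or undefined value.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the per-filesystem info and the superblock. */
struct fs_info_sketch {
	unsigned int mode;
};

struct sb_sketch {
	struct fs_info_sketch *s_fs_info;
};

/* Returns 0 on success, -1 on failure; 'parse_fails' simulates a bad option. */
static int fill_super_sketch(struct sb_sketch *sb, int parse_fails)
{
	struct fs_info_sketch *fsi = calloc(1, sizeof(*fsi));

	sb->s_fs_info = fsi;	/* publish even when NULL: teardown sees a defined value */
	if (!fsi)
		return -1;
	if (parse_fails)
		goto fail;
	return 0;
fail:
	free(fsi);
	sb->s_fs_info = NULL;	/* drop the stale pointer so teardown cannot free it again */
	return -1;
}

/* Teardown that runs whether or not fill_super_sketch() succeeded. */
static void kill_sb_sketch(struct sb_sketch *sb)
{
	free(sb->s_fs_info);	/* free(NULL) is a no-op, as is kfree(NULL) */
	sb->s_fs_info = NULL;
}
```

With this shape, calling `kill_sb_sketch()` after a failed
`fill_super_sketch()` is harmless, and it no longer matters whether the
caller zeroed the superblock beforehand.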

VFS infrastructure nit: we have 20 other similar looking but 
slightly differently implemented filesystem options parsers, in each 
filesystem. Might make sense to factor that out a bit and 
standardize it across all filesystems and make it all a bit safer. 
Duplicating code like that is never good IMHO.

	Ingo

diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index a404fb8..3a6b193 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -221,22 +221,23 @@ static int ramfs_fill_super(struct super_block * sb, void * data, int silent)
 	save_mount_options(sb, data);
 
 	fsi = kzalloc(sizeof(struct ramfs_fs_info), GFP_KERNEL);
+	sb->s_fs_info = fsi;
 	if (!fsi) {
 		err = -ENOMEM;
 		goto fail;
 	}
-	sb->s_fs_info = fsi;
 
 	err = ramfs_parse_options(data, &fsi->mount_opts);
 	if (err)
 		goto fail;
 
-	sb->s_maxbytes = MAX_LFS_FILESIZE;
-	sb->s_blocksize = PAGE_CACHE_SIZE;
-	sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
-	sb->s_magic = RAMFS_MAGIC;
-	sb->s_op = &ramfs_ops;
-	sb->s_time_gran = 1;
+	sb->s_maxbytes		= MAX_LFS_FILESIZE;
+	sb->s_blocksize		= PAGE_CACHE_SIZE;
+	sb->s_blocksize_bits	= PAGE_CACHE_SHIFT;
+	sb->s_magic		= RAMFS_MAGIC;
+	sb->s_op		= &ramfs_ops;
+	sb->s_time_gran		= 1;
+
 	inode = ramfs_get_inode(sb, S_IFDIR | fsi->mount_opts.mode, 0);
 	if (!inode) {
 		err = -ENOMEM;
@@ -244,14 +245,16 @@ static int ramfs_fill_super(struct super_block * sb, void * data, int silent)
 	}
 
 	root = d_alloc_root(inode);
+	sb->s_root = root;
 	if (!root) {
 		err = -ENOMEM;
 		goto fail;
 	}
-	sb->s_root = root;
+
 	return 0;
 fail:
 	kfree(fsi);
+	sb->s_fs_info = NULL;
 	iput(inode);
 	return err;
 }


* Re: [patch] ramfs: add support for "mode=" mount option, fix
  2009-04-07  5:28         ` [patch] ramfs: add support for "mode=" mount option, fix Ingo Molnar
@ 2009-04-07  5:55           ` Wu Fengguang
  2009-04-07  6:03             ` Ingo Molnar
  2009-04-07  6:20             ` [PATCH] ramfs: add support for "mode=" mount option, fix Ingo Molnar
  0 siblings, 2 replies; 17+ messages in thread
From: Wu Fengguang @ 2009-04-07  5:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Andrew Morton, Avan Anishchuk,
	Linux Kernel Mailing List, Pekka Enberg, Steven Rostedt,
	Thomas Gleixner, Eduard - Gabriel Munteanu

On Tue, Apr 07, 2009 at 01:28:01PM +0800, Ingo Molnar wrote:
> 
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > On Mon, 6 Apr 2009, Linus Torvalds wrote:
> > > 
> > > It bisected past them. I'm getting worried that it's timing-related, 
> > > because nothing that remains looks even remotely interesting for that Mac 
> > > mini, but right now:
> > > 
> > >  - bad: 56fcef75117a153f298b3fe54af31053f53997dd
> > >  - good: bb233fdfc7b7cefe45bfa2e8d1b24e79c60a48e5
> > > 
> > > and there's not a whole lot of commits in between.
> > 
> > It's c3b1b1cbf002e65a3cabd479e68b5f35886a26db: 'ramfs: add support 
> > for "mode=" mount option'.
> > 
> > And I checked. Reverting it at the tip fixes it. So no random 
> > timing fluctuations.
> > 
> > So that commit causes some random SLAB corruption, that then 
> > (depending apparently on luck) may or may not crash in some odd 
> > random places later.
> 
> ah - forget my previous mail then.
> 
> This commit does have a couple of genuinely odd looking lines.
> 
> For example:
> 
> +       sb->s_fs_info = fsi;
> +
> +       err = ramfs_parse_options(data, &fsi->mount_opts);
> +       if (err)
> +               goto fail;
> 
> Say we fail in ramfs_parse_options() and do the 'fail' pattern:
> 
> +fail:
> +       kfree(fsi);
> +       iput(inode);
> +       return err;
> 
> so we have 'fsi' kfree()'d but don't clear sb->s_fs_info! That's 
> almost always a bad practice. And indeed, in the kill_super 

Sorry - yes, the double kfree() must be the root cause!

get_sb_nodev() calls deactivate_super(), and through it ->kill_sb(),
after a failed fill_super():

        error = fill_super(s, data, flags & MS_SILENT ? 1 : 0);
        if (error) {
                up_write(&s->s_umount);
                deactivate_super(s);
                return error;
        }

> callback:
> 
> +static void ramfs_kill_sb(struct super_block *sb)
> +{
> +       kfree(sb->s_fs_info);
> 
> What ensures that this cannot be a double kfree() memory corruption? 
> That pointer should have been cleared with something like the patch 
> below. (totally untested)
> 
> And there's also another, probably just theoretical worry about 
> another failure path:
> 
> +       fsi = kzalloc(sizeof(struct ramfs_fs_info), GFP_KERNEL);
> +       if (!fsi) {
> +               err = -ENOMEM;
> +               goto fail;
> +       }
> +       sb->s_fs_info = fsi;
> 
> leaves sb->s_fs_info uninitialized in the failure case, and might 
> hit this code unconditionally:
> 
> +static void ramfs_kill_sb(struct super_block *sb)
> +{
> +       kfree(sb->s_fs_info);
> +       kill_litter_super(sb);
> +}
> 
> Leaving this code at the mercy of the external call environment 
> initializing sb->s_fs_info. Which if it does not do (or stops 
> doing in the future), can trigger a kfree of a random pointer.
> 
> (I think ->kill_super() gets called even if ->fill_super() fails, 
> but i have not checked closely.)

You are right, see above.

> These kinds of asymmetric failure paths are really a red flag during 
> review.
> 
> VFS infrastructure nit: we have 20 other similar looking but 
> slightly differently implemented filesystem options parsers, in each 
> filesystem. Might make sense to factor that out a bit and 
> standardize it across all filesystems and make it all a bit safer. 
> Duplicating code like that is never good IMHO.
> 
> 	Ingo
> 

Acked-by: Wu Fengguang <fengguang.wu@intel.com>

The patch looks pretty good and runs OK here.

Thanks,
Fengguang

> diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
> index a404fb8..3a6b193 100644
> --- a/fs/ramfs/inode.c
> +++ b/fs/ramfs/inode.c
> @@ -221,22 +221,23 @@ static int ramfs_fill_super(struct super_block * sb, void * data, int silent)
>  	save_mount_options(sb, data);
>  
>  	fsi = kzalloc(sizeof(struct ramfs_fs_info), GFP_KERNEL);
> +	sb->s_fs_info = fsi;
>  	if (!fsi) {
>  		err = -ENOMEM;
>  		goto fail;
>  	}
> -	sb->s_fs_info = fsi;
>  
>  	err = ramfs_parse_options(data, &fsi->mount_opts);
>  	if (err)
>  		goto fail;
>  
> -	sb->s_maxbytes = MAX_LFS_FILESIZE;
> -	sb->s_blocksize = PAGE_CACHE_SIZE;
> -	sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
> -	sb->s_magic = RAMFS_MAGIC;
> -	sb->s_op = &ramfs_ops;
> -	sb->s_time_gran = 1;
> +	sb->s_maxbytes		= MAX_LFS_FILESIZE;
> +	sb->s_blocksize		= PAGE_CACHE_SIZE;
> +	sb->s_blocksize_bits	= PAGE_CACHE_SHIFT;
> +	sb->s_magic		= RAMFS_MAGIC;
> +	sb->s_op		= &ramfs_ops;
> +	sb->s_time_gran		= 1;
> +
>  	inode = ramfs_get_inode(sb, S_IFDIR | fsi->mount_opts.mode, 0);
>  	if (!inode) {
>  		err = -ENOMEM;
> @@ -244,14 +245,16 @@ static int ramfs_fill_super(struct super_block * sb, void * data, int silent)
>  	}
>  
>  	root = d_alloc_root(inode);
> +	sb->s_root = root;
>  	if (!root) {
>  		err = -ENOMEM;
>  		goto fail;
>  	}
> -	sb->s_root = root;
> +
>  	return 0;
>  fail:
>  	kfree(fsi);
> +	sb->s_fs_info = NULL;
>  	iput(inode);
>  	return err;
>  }
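
For readers following along, the failure mode can be reproduced in a small
userspace model. This is only a sketch under stated assumptions: the
fake_* names are hypothetical stand-ins (not kernel APIs), the "free"
only counts instead of releasing memory, and the option-parsing failure
is hard-wired. It shows that the kill callback running after a failed
fill produces a second free unless the pointer is cleared on the
failure path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins, not the kernel's types. */
struct fake_fs_info {
	int mode;
	int freed;	/* set by fake_kfree() so a second free can be counted */
};

struct fake_sb {
	struct fake_fs_info *s_fs_info;
};

static struct fake_fs_info pool[1];	/* static storage, never really freed */
static int double_frees;

static struct fake_fs_info *fake_kzalloc(void)
{
	pool[0].mode = 0;
	pool[0].freed = 0;
	return &pool[0];
}

/* Like kfree(), NULL is a no-op. Unlike kfree(), this only counts. */
static void fake_kfree(struct fake_fs_info *p)
{
	if (!p)
		return;
	if (p->freed)
		double_frees++;
	p->freed = 1;
}

/* Models ramfs_fill_super() failing in option parsing. */
static int fake_fill_super(struct fake_sb *sb, int clear_on_fail)
{
	struct fake_fs_info *fsi = fake_kzalloc();

	sb->s_fs_info = fsi;
	/* pretend ramfs_parse_options() returned an error here */
	fake_kfree(fsi);
	if (clear_on_fail)
		sb->s_fs_info = NULL;
	return -1;
}

/* Models ramfs_kill_sb(). */
static void fake_kill_sb(struct fake_sb *sb)
{
	fake_kfree(sb->s_fs_info);
}

/* Models the caller: the kill callback still runs after a failed fill. */
static int failed_mount_double_frees(int clear_on_fail)
{
	struct fake_sb sb = { NULL };

	double_frees = 0;
	if (fake_fill_super(&sb, clear_on_fail))
		fake_kill_sb(&sb);
	assert(double_frees <= 1);
	return double_frees;
}
```

Calling failed_mount_double_frees(0) models the code before the fix and
counts one double free; failed_mount_double_frees(1) models clearing
s_fs_info in the fail path and counts none.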

* Re: [patch] ramfs: add support for "mode=" mount option, fix
  2009-04-07  5:55           ` Wu Fengguang
@ 2009-04-07  6:03             ` Ingo Molnar
  2009-04-07  6:16               ` [PATCH] ramfs: fix double freeing s_fs_info on failed mount Wu Fengguang
  2009-04-07  6:20             ` [PATCH] ramfs: add support for "mode=" mount option, fix Ingo Molnar
  1 sibling, 1 reply; 17+ messages in thread
From: Ingo Molnar @ 2009-04-07  6:03 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Linus Torvalds, Andrew Morton, Avan Anishchuk,
	Linux Kernel Mailing List, Pekka Enberg, Steven Rostedt,
	Thomas Gleixner, Eduard - Gabriel Munteanu


* Wu Fengguang <fengguang.wu@intel.com> wrote:

> On Tue, Apr 07, 2009 at 01:28:01PM +0800, Ingo Molnar wrote:
> > 
> > * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > 
> > > On Mon, 6 Apr 2009, Linus Torvalds wrote:
> > > > 
> > > > It bisected past them. I'm getting worried that it's timing-related, 
> > > > because nothing that remains looks even remotely interesting for that Mac 
> > > > mini, but right now:
> > > > 
> > > >  - bad: 56fcef75117a153f298b3fe54af31053f53997dd
> > > >  - good: bb233fdfc7b7cefe45bfa2e8d1b24e79c60a48e5
> > > > 
> > > > and there's not a whole lot of commits in between.
> > > 
> > > It's c3b1b1cbf002e65a3cabd479e68b5f35886a26db: 'ramfs: add support 
> > > for "mode=" mount option'.
> > > 
> > > And I checked. Reverting it at the tip fixes it. So no random 
> > > timing fluctuations.
> > > 
> > > So that commit causes some random SLAB corruption, that then 
> > > (depending apparently on luck) may or may not crash in some odd 
> > > random places later.
> > 
> > ah - forget my previous mail then.
> > 
> > This commit does have a couple of genuinely odd looking lines.
> > 
> > For example:
> > 
> > +       sb->s_fs_info = fsi;
> > +
> > +       err = ramfs_parse_options(data, &fsi->mount_opts);
> > +       if (err)
> > +               goto fail;
> > 
> > Say we fail in ramfs_parse_options() and do the 'fail' pattern:
> > 
> > +fail:
> > +       kfree(fsi);
> > +       iput(inode);
> > +       return err;
> > 
> > so we have 'fsi' kfree()'d but don't clear sb->s_fs_info! That's 
> > almost always a bad practice. And indeed, in the kill_super 
> 
> Sorry - yes, the double kfree() must be the root cause!
> 
> get_sb_nodev() calls deactivate_super(), and through it ->kill_sb(),
> after a failed fill_super():
> 
>         error = fill_super(s, data, flags & MS_SILENT ? 1 : 0);
>         if (error) {
>                 up_write(&s->s_umount);
>                 deactivate_super(s);
>                 return error;
>         }
> 
> > callback:
> > 
> > +static void ramfs_kill_sb(struct super_block *sb)
> > +{
> > +       kfree(sb->s_fs_info);
> > 
> > What ensures that this cannot be a double kfree() memory corruption? 
> > That pointer should have been cleared with something like the patch 
> > below. (totally untested)
> > 
> > And there's also another, probably just theoretical worry about 
> > another failure path:
> > 
> > +       fsi = kzalloc(sizeof(struct ramfs_fs_info), GFP_KERNEL);
> > +       if (!fsi) {
> > +               err = -ENOMEM;
> > +               goto fail;
> > +       }
> > +       sb->s_fs_info = fsi;
> > 
> > leaves sb->s_fs_info uninitialized in the failure case, and might 
> > hit this code unconditionally:
> > 
> > +static void ramfs_kill_sb(struct super_block *sb)
> > +{
> > +       kfree(sb->s_fs_info);
> > +       kill_litter_super(sb);
> > +}
> > 
> > Leaving this code at the mercy of the external call environment 
> > initializing sb->s_fs_info. Which if it does not do (or stops 
> > doing in the future), can trigger a kfree of a random pointer.
> > 
> > (I think ->kill_super() gets called even if ->fill_super() fails, 
> > but i have not checked closely.)
> 
> You are right, see above.
> 
> > These kinds of asymmetric failure paths are really a red flag during 
> > review.
> > 
> > VFS infrastructure nit: we have 20 other similar looking but 
> > slightly differently implemented filesystem options parsers, in each 
> > filesystem. Might make sense to factor that out a bit and 
> > standardize it across all filesystems and make it all a bit safer. 
> > Duplicating code like that is never good IMHO.
> > 
> > 	Ingo
> > 
> 
> Acked-by: Wu Fengguang <fengguang.wu@intel.com>
> 
> The patch looks pretty good and runs OK here.

ok, good - i didn't even build it - you can add my signoff too:

  Signed-off-by: Ingo Molnar <mingo@elte.hu>

Thanks,

	Ingo

* [PATCH] ramfs: fix double freeing s_fs_info on failed mount
  2009-04-07  6:03             ` Ingo Molnar
@ 2009-04-07  6:16               ` Wu Fengguang
  2009-04-07  6:53                 ` Ingo Molnar
  0 siblings, 1 reply; 17+ messages in thread
From: Wu Fengguang @ 2009-04-07  6:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Andrew Morton, Avan Anishchuk,
	Linux Kernel Mailing List, Pekka Enberg, Steven Rostedt,
	Thomas Gleixner, Eduard - Gabriel Munteanu

From: Ingo Molnar <mingo@elte.hu>

If a ramfs mount fails, s_fs_info is freed twice: once in
ramfs_fill_super() and once in ramfs_kill_sb(), leading to a kernel oops.

Consolidate and beautify the code.
Make sure s_fs_info and s_root are in known good states.

Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 fs/ramfs/inode.c |   19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

--- mm.orig/fs/ramfs/inode.c
+++ mm/fs/ramfs/inode.c
@@ -221,22 +221,23 @@ static int ramfs_fill_super(struct super
 	save_mount_options(sb, data);
 
 	fsi = kzalloc(sizeof(struct ramfs_fs_info), GFP_KERNEL);
+	sb->s_fs_info = fsi;
 	if (!fsi) {
 		err = -ENOMEM;
 		goto fail;
 	}
-	sb->s_fs_info = fsi;
 
 	err = ramfs_parse_options(data, &fsi->mount_opts);
 	if (err)
 		goto fail;
 
-	sb->s_maxbytes = MAX_LFS_FILESIZE;
-	sb->s_blocksize = PAGE_CACHE_SIZE;
-	sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
-	sb->s_magic = RAMFS_MAGIC;
-	sb->s_op = &ramfs_ops;
-	sb->s_time_gran = 1;
+	sb->s_maxbytes		= MAX_LFS_FILESIZE;
+	sb->s_blocksize		= PAGE_CACHE_SIZE;
+	sb->s_blocksize_bits	= PAGE_CACHE_SHIFT;
+	sb->s_magic		= RAMFS_MAGIC;
+	sb->s_op		= &ramfs_ops;
+	sb->s_time_gran		= 1;
+
 	inode = ramfs_get_inode(sb, S_IFDIR | fsi->mount_opts.mode, 0);
 	if (!inode) {
 		err = -ENOMEM;
@@ -244,14 +245,16 @@ static int ramfs_fill_super(struct super
 	}
 
 	root = d_alloc_root(inode);
+	sb->s_root = root;
 	if (!root) {
 		err = -ENOMEM;
 		goto fail;
 	}
-	sb->s_root = root;
+
 	return 0;
 fail:
 	kfree(fsi);
+	sb->s_fs_info = NULL;
 	iput(inode);
 	return err;
 }

* Re: [PATCH] ramfs: fix double freeing s_fs_info on failed mount
  2009-04-07  6:16               ` [PATCH] ramfs: fix double freeing s_fs_info on failed mount Wu Fengguang
@ 2009-04-07  6:53                 ` Ingo Molnar
  2009-04-07  7:05                   ` Wu Fengguang
  0 siblings, 1 reply; 17+ messages in thread
From: Ingo Molnar @ 2009-04-07  6:53 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Linus Torvalds, Andrew Morton, Avan Anishchuk,
	Linux Kernel Mailing List, Pekka Enberg, Steven Rostedt,
	Thomas Gleixner, Eduard - Gabriel Munteanu

* Wu Fengguang <fengguang.wu@intel.com> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> 
> If ramfs mount fails, s_fs_info will be freed twice in 
> ramfs_fill_super() and ramfs_kill_sb(), leading to kernel oops.
> 
> Consolidate and beautify the code. Make sure s_fs_info and s_root 
> are in known good states.
> 
> Acked-by: Wu Fengguang <fengguang.wu@intel.com>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>

Nit: the commit is missing a Reported-by :)

Linus might not insist on seeing his name mentioned yet another time 
in a commit, but it's generally good practice to always add bug 
report info and names.

Note that in this case the really hard work was there: Linus had to 
spend at least 2 hours on tracking down and bisecting this bug. (and 
Linus probably did this super-fast compared to the average tester - 
most other bug reporters spend a day or more on bisection, limited 
by lack of practice and by the slowness of kernel builds on ordinary 
hardware.)

So the real human effort was spent there, not in my 5 minutes on 
fixing the bug that Linus served on a plate - while the commit only 
credits me. That's not fair :)

See the tip:out-of-tree local commit i made and sent out.

	Ingo

* Re: [PATCH] ramfs: fix double freeing s_fs_info on failed mount
  2009-04-07  6:53                 ` Ingo Molnar
@ 2009-04-07  7:05                   ` Wu Fengguang
  0 siblings, 0 replies; 17+ messages in thread
From: Wu Fengguang @ 2009-04-07  7:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Andrew Morton, Avan Anishchuk,
	Linux Kernel Mailing List, Pekka Enberg, Steven Rostedt,
	Thomas Gleixner, Eduard - Gabriel Munteanu

On Tue, Apr 07, 2009 at 02:53:08PM +0800, Ingo Molnar wrote:
> 
> * Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > From: Ingo Molnar <mingo@elte.hu>
> > 
> > If ramfs mount fails, s_fs_info will be freed twice in 
> > ramfs_fill_super() and ramfs_kill_sb(), leading to kernel oops.
> > 
> > Consolidate and beautify the code. Make sure s_fs_info and s_root 
> > are in known good states.
> > 
> > Acked-by: Wu Fengguang <fengguang.wu@intel.com>
> > Signed-off-by: Ingo Molnar <mingo@elte.hu>
> 
> Nit: the commit is missing a Reported-by :)
> 
> Linus might not insist on seeing his name mentioned yet another time 
> in a commit, but it's generally good practice to always add bug 
> report info and names.
> 
> Note that in this case the really hard work was there: Linus had to 
> spend at least 2 hours on tracking down and bisecting this bug. (and 
> Linus probably did this super-fast compared to the average tester - 
> most other bug reporters spend a day or more on bisection, limited 
> by lack of practice and by the slowness of kernel builds on ordinary 
> hardware.)
> 
> So the real human effort was spent there, not in my 5 minutes on 
> fixing the bug that Linus served on a plate - while the commit only 
> credits me. That's not fair :)

Good point! Thank you very much for mentoring me on this whole process!

> See the tip:out-of-tree local commit i made and sent out.

Yes I've seen that, very comprehensive changelog and solid&pretty code!

Best regards,
Fengguang Wu


* [PATCH] ramfs: add support for "mode=" mount option, fix
  2009-04-07  5:55           ` Wu Fengguang
  2009-04-07  6:03             ` Ingo Molnar
@ 2009-04-07  6:20             ` Ingo Molnar
  1 sibling, 0 replies; 17+ messages in thread
From: Ingo Molnar @ 2009-04-07  6:20 UTC (permalink / raw)
  To: Wu Fengguang, Linus Torvalds
  Cc: Andrew Morton, Avan Anishchuk, Linux Kernel Mailing List,
	Pekka Enberg, Steven Rostedt, Thomas Gleixner,
	Eduard - Gabriel Munteanu




From 7baa5532398708976ada2502ad11b37f872f6e9e Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@elte.hu>
Date: Tue, 7 Apr 2009 07:28:01 +0200
Subject: [PATCH] ramfs: add support for "mode=" mount option, fix

Linus reported odd boot failures that showed memory corruption
patterns in the SLUB code and bisected it down to
c3b1b1cbf002e65a3cabd479e68b5f35886a26db: 'ramfs: add support
for "mode=" mount option'.

That commit does have a couple of genuinely odd looking lines.

For example:

+       sb->s_fs_info = fsi;
+
+       err = ramfs_parse_options(data, &fsi->mount_opts);
+       if (err)
+               goto fail;

Say we fail in ramfs_parse_options() and do the 'fail' pattern:

+fail:
+       kfree(fsi);
+       iput(inode);
+       return err;

so we have 'fsi' kfree()'d but don't clear sb->s_fs_info! That's
almost always a bad practice. And indeed, in the kill_super
callback:

+static void ramfs_kill_sb(struct super_block *sb)
+{
+       kfree(sb->s_fs_info);

What ensures that this cannot be a double kfree() memory corruption?

->kill_super() gets called even if ->fill_super() fails, so this
is a double kfree() when there are no ramfs mount options - which
results in memory corruption.

And there's also another, probably just theoretical worry about
another failure path:

+       fsi = kzalloc(sizeof(struct ramfs_fs_info), GFP_KERNEL);
+       if (!fsi) {
+               err = -ENOMEM;
+               goto fail;
+       }
+       sb->s_fs_info = fsi;

leaves sb->s_fs_info uninitialized in the failure case, and might
hit this code unconditionally:

+static void ramfs_kill_sb(struct super_block *sb)
+{
+       kfree(sb->s_fs_info);
+       kill_litter_super(sb);
+}

Leaving this code at the mercy of the external call environment
initializing sb->s_fs_info. Which if it does not do (or stops
doing in the future), can trigger a kfree of a random pointer.

So improve this code a bit too, and improve the readability of other
initializations here a bit.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avan Anishchuk <matimatik@gmail.com>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 fs/ramfs/inode.c |   19 +++++++++++--------
 1 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index a404fb8..3a6b193 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -221,22 +221,23 @@ static int ramfs_fill_super(struct super_block * sb, void * data, int silent)
 	save_mount_options(sb, data);
 
 	fsi = kzalloc(sizeof(struct ramfs_fs_info), GFP_KERNEL);
+	sb->s_fs_info = fsi;
 	if (!fsi) {
 		err = -ENOMEM;
 		goto fail;
 	}
-	sb->s_fs_info = fsi;
 
 	err = ramfs_parse_options(data, &fsi->mount_opts);
 	if (err)
 		goto fail;
 
-	sb->s_maxbytes = MAX_LFS_FILESIZE;
-	sb->s_blocksize = PAGE_CACHE_SIZE;
-	sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
-	sb->s_magic = RAMFS_MAGIC;
-	sb->s_op = &ramfs_ops;
-	sb->s_time_gran = 1;
+	sb->s_maxbytes		= MAX_LFS_FILESIZE;
+	sb->s_blocksize		= PAGE_CACHE_SIZE;
+	sb->s_blocksize_bits	= PAGE_CACHE_SHIFT;
+	sb->s_magic		= RAMFS_MAGIC;
+	sb->s_op		= &ramfs_ops;
+	sb->s_time_gran		= 1;
+
 	inode = ramfs_get_inode(sb, S_IFDIR | fsi->mount_opts.mode, 0);
 	if (!inode) {
 		err = -ENOMEM;
@@ -244,14 +245,16 @@ static int ramfs_fill_super(struct super_block * sb, void * data, int silent)
 	}
 
 	root = d_alloc_root(inode);
+	sb->s_root = root;
 	if (!root) {
 		err = -ENOMEM;
 		goto fail;
 	}
-	sb->s_root = root;
+
 	return 0;
 fail:
 	kfree(fsi);
+	sb->s_fs_info = NULL;
 	iput(inode);
 	return err;
 }
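
The second, more theoretical failure path from the changelog (allocation
fails before s_fs_info is ever written) can be sketched the same way.
Assumptions, loudly: fake_sb and both fill_* helpers are hypothetical
userspace stand-ins, the allocation failure is forced by a flag, and a
"stale" pointer is planted by hand to play the role of a value the
caller did not initialize.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct super_block. */
struct fake_sb {
	void *s_fs_info;
};

/* Old ordering: the field is written only after the NULL check. */
static int fill_publish_late(struct fake_sb *sb, int alloc_fails)
{
	void *fsi = alloc_fails ? NULL : calloc(1, 16);

	if (!fsi)
		return -ENOMEM;	/* sb->s_fs_info is left untouched */
	sb->s_fs_info = fsi;
	return 0;
}

/* New ordering: the field is written first, so failure leaves it NULL. */
static int fill_publish_first(struct fake_sb *sb, int alloc_fails)
{
	void *fsi = alloc_fails ? NULL : calloc(1, 16);

	sb->s_fs_info = fsi;	/* NULL on failure, a later free is a no-op */
	if (!fsi)
		return -ENOMEM;
	return 0;
}

/* Returns 1 if a stale value survives a failed allocation, 0 otherwise. */
static int stale_pointer_survives(int (*fill)(struct fake_sb *, int))
{
	static int stale;
	struct fake_sb sb = { &stale };
	int err = fill(&sb, 1);	/* 1: force the allocation to fail */

	assert(err == -ENOMEM);
	return sb.s_fs_info == &stale;
}
```

With the old ordering the planted value survives and a later kill
callback would free it; with the new ordering the field is NULL and the
free is harmless, independent of what the caller put there beforehand.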

* Re: [GIT PULL] SLAB include file dependency fixes + kmemtrace updates
  2009-04-07  4:45       ` Linus Torvalds
  2009-04-07  5:02         ` Wu Fengguang
  2009-04-07  5:28         ` [patch] ramfs: add support for "mode=" mount option, fix Ingo Molnar
@ 2009-04-07  5:30         ` Wu Fengguang
  2 siblings, 0 replies; 17+ messages in thread
From: Wu Fengguang @ 2009-04-07  5:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Avan Anishchuk, Linux Kernel Mailing List,
	Pekka Enberg, Steven Rostedt, Ingo Molnar, Thomas Gleixner,
	Eduard - Gabriel Munteanu

On Tue, Apr 07, 2009 at 12:45:37PM +0800, Linus Torvalds wrote:
> 
> 
> On Mon, 6 Apr 2009, Linus Torvalds wrote:
> > 
> > It bisected past them. I'm getting worried that it's timing-related, 
> > because nothing that remains looks even remotely interesting for that Mac 
> > mini, but right now:
> > 
> >  - bad: 56fcef75117a153f298b3fe54af31053f53997dd
> >  - good: bb233fdfc7b7cefe45bfa2e8d1b24e79c60a48e5
> > 
> > and there's not a whole lot of commits in between.
> 
> It's c3b1b1cbf002e65a3cabd479e68b5f35886a26db: 'ramfs: add support for 
> "mode=" mount option'.
> 
> And I checked. Reverting it at the tip fixes it. So no random timing 
> fluctuations.
> 
> So that commit causes some random SLAB corruption, that then (depending 
> apparently on luck) may or may not crash in some odd random places later.
> 
> Wu?

Maybe this bug?

Thanks,
Fengguang
---
ramfs: fix double freeing s_fs_info

On a failed ramfs mount, s_fs_info will be freed twice in ramfs_fill_super()
and ramfs_kill_sb(). Fix it by saving s_fs_info only on success.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index a404fb8..7dbb433 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -225,7 +225,6 @@ static int ramfs_fill_super(struct super_block * sb, void * data, int silent)
 		err = -ENOMEM;
 		goto fail;
 	}
-	sb->s_fs_info = fsi;
 
 	err = ramfs_parse_options(data, &fsi->mount_opts);
 	if (err)
@@ -249,6 +248,7 @@ static int ramfs_fill_super(struct super_block * sb, void * data, int silent)
 		goto fail;
 	}
 	sb->s_root = root;
+	sb->s_fs_info = fsi;
 	return 0;
 fail:
 	kfree(fsi);

* Re: [GIT PULL] SLAB include file dependency fixes + kmemtrace updates
  2009-04-07  1:51 ` Linus Torvalds
  2009-04-07  3:51   ` Linus Torvalds
@ 2009-04-07  4:58   ` Ingo Molnar
  1 sibling, 0 replies; 17+ messages in thread
From: Ingo Molnar @ 2009-04-07  4:58 UTC (permalink / raw)
  To: Linus Torvalds, Tejun Heo, H. Peter Anvin, Rusty Russell,
	Peter Zijlstra, Vegard Nossum
  Cc: linux-kernel, Pekka Enberg, Steven Rostedt, Andrew Morton,
	Thomas Gleixner, Eduard - Gabriel Munteanu

(more folks Cc:-ed)

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> >    git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git kmemtrace-for-linus
> > 
> > We kept this topic separate from the main tracing tree due to 
> > the unexpectedly wide and messy-looking scope of the fixes Pekka 
> > needed to do to untangle various slab*.h, rcu*.h and fs.h 
> > dependency chains.
> 
> I'm not sure this is the tree that brings in the problem, but my 
> wife's Mac Mini won't boot any more, and it looks like some slub 
> or percpu issue, so regardless, roughly the right people are 
> involved in the cc here already.
> 
> I get odd NUL page faults or GP faults in either __kmalloc, 
> __kmalloc_track_caller or kmem_cache_alloc, and they all seem to 
> happen on roughly the same code, ie it's something like this:
> 
>         movq    752(%r13,%rax,8), %rdx  # <variable>.cpu_slab, c
>         movl    24(%rdx), %eax  # <variable>.objsize,
>         movl    %eax, -44(%rbp) #, objsize
>         movq    (%rdx), %r12    # <variable>.freelist, object
>         testq   %r12, %r12      # object
>         je      .L617   #,
>         mov     20(%rdx), %eax  # <variable>.offset, <variable>.offset
> ->      movq    (%r12,%rax,8), %rax     #* object, tmp79
>         movq    %rax, (%rdx)    # tmp79, <variable>.freelist
> 
> where that arrow points to the instruction that seems to be faulting.
> 
> I think it's this code:
> 
>                 object = c->freelist;
>                 c->freelist = object[c->offset];
> 
> and that "object[c->offset]" in particular.
> 
> I have not tried to bisect it yet, and I'll do that, but if this 
> sounds familiar to anybody, please holler before I waste a lot of 
> time on it.

Hm, this would suggest some sort of memory or data structure 
corruption.

There's no such pending bug (we wouldn't have pushed if there was 
anything of this severity). The historic track record:

 - the kmemtrace hooks have been 100% problem free since last 
   August. I mismerged them two times as SLUB changed upstream 
   frequently, but there was no runtime failure that i can 
   remember.

 - percpu changes have a more spotty past: and the #GP might
   suggest something there: we can get a #GP if we go outside the 
   %gs offset range and get a non-canonical address. All sorts of 
   bugs have been observed here: runtime failures with #GP and 
   memory corruption as well and linker bugs.

   To investigate+exclude this angle, a precise .config, sha1, gcc 
   and binutils version would be needed, to reproduce your exact
   kernel image. A dmesg would be helpful too - it's probably an EFI 
   bootup which is rare, but i can try to take your bootup memory 
   map dump in the dmesg and stuff it into an exactmap=<...> set of
   simulated memory environment - maybe that tickles the bug here 
   too.

 - [ stackprotector connects to percpu and got re-enabled - please
     double check you have it off in your config. ]

 - [ cpumask changes have sometimes produced runtime crashes, and 
     once a memory corruption - but only with 
     CONFIG_CPUMASK_OFFSTACK=y which i doubt you have enabled. ]

So unless you have a good crash pattern with a smoking gun, and if 
it's reproducible but bisection does not lead anywhere (which your 
later mails suggest), it might make sense to boot up with the full 
array of memory related debugging checks enabled:

  CONFIG_DEBUG_PAGEALLOC=y
  CONFIG_SLUB_DEBUG=y
  CONFIG_SLUB_DEBUG_ON=y

If the bug is timing or kernel image layout sensitive, this might 
hide it though.

Does it reproduce with maxcpus=1? If yes, it would weaken the percpu 
angle - paradoxically most of the percpu trouble we had during 
development was uniform and affected UP too.

Plus if it's a genuine memory corruptor and not timing sensitive and 
all other efforts fail, then there's also kmemcheck to try:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git kmemcheck

( Note: i merged this branch up to latest -git 30 seconds ago with 5 
  conflict resolutions half awake, but it will all be perfect, rest 
  assured. [ If not - a build failure or so - have a look at the 
  conflict resolutions ] If you try this you might have to tweak the 
  .config a bit to make CONFIG_KMEMCHECK appear - it's dependent on 
  a few things. Also - a kmemcheck false positive might hit the 
  bootup sooner than a genuine memory corruption so even if it emits 
  something, it might not be genuinely interesting. )

	Ingo
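
As an illustration of how freelist corruption of this kind can arise,
here is a toy userspace free list. It is a sketch only, not the SLUB
code: toy_cache, toy_free() and toy_alloc() are hypothetical, and the
"next free" pointer is assumed to live in word 0 of each free object
(the quoted code uses object[c->offset] instead). Freeing the same
object twice makes it point at itself, so two later allocations return
the same memory.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy cache: free objects are chained through their word 0. */
struct toy_cache {
	void **freelist;
};

static void toy_free(struct toy_cache *c, void **object)
{
	object[0] = c->freelist;
	c->freelist = object;
}

static void *toy_alloc(struct toy_cache *c)
{
	void **object = c->freelist;

	if (!object)
		return NULL;
	c->freelist = object[0];	/* cf. object[c->offset], offset 0 here */
	return object;
}

/* Returns 1 if two allocations after a double free alias each other. */
static int double_free_aliases_allocations(void)
{
	static void *objs[2][4];	/* two objects, four words each */
	struct toy_cache c = { NULL };
	void *a, *x, *y;

	toy_free(&c, objs[0]);
	toy_free(&c, objs[1]);
	a = toy_alloc(&c);	/* takes objs[1] off the list */
	toy_free(&c, a);
	toy_free(&c, a);	/* double free: a's word 0 now points at a */
	x = toy_alloc(&c);
	y = toy_alloc(&c);
	assert(x != NULL);
	return x == y;
}
```

Once either owner writes real data into the shared object, the list
head follows whatever was written there, which is one plausible route
to a fault on the next-pointer load.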

end of thread, other threads:[~2009-04-07  7:05 UTC | newest]

Thread overview: 17+ messages
2009-04-05 19:39 [GIT PULL] SLAB include file dependency fixes + kmemtrace updates Ingo Molnar
2009-04-05 19:56 ` Linus Torvalds
2009-04-05 20:02   ` Linus Torvalds
2009-04-07  1:51 ` Linus Torvalds
2009-04-07  3:51   ` Linus Torvalds
2009-04-07  4:20     ` Linus Torvalds
2009-04-07  4:45       ` Linus Torvalds
2009-04-07  5:02         ` Wu Fengguang
2009-04-07  5:28         ` [patch] ramfs: add support for "mode=" mount option, fix Ingo Molnar
2009-04-07  5:55           ` Wu Fengguang
2009-04-07  6:03             ` Ingo Molnar
2009-04-07  6:16               ` [PATCH] ramfs: fix double freeing s_fs_info on failed mount Wu Fengguang
2009-04-07  6:53                 ` Ingo Molnar
2009-04-07  7:05                   ` Wu Fengguang
2009-04-07  6:20             ` [PATCH] ramfs: add support for "mode=" mount option, fix Ingo Molnar
2009-04-07  5:30         ` [GIT PULL] SLAB include file dependency fixes + kmemtrace updates Wu Fengguang
2009-04-07  4:58   ` Ingo Molnar
