public inbox for linux-kernel@vger.kernel.org
* PREEMPT/PREEMPT_RT question
@ 2005-07-12 16:30 Paul E. McKenney
  2005-07-12 17:05 ` Steven Rostedt
  2005-07-12 19:26 ` Ingo Molnar
  0 siblings, 2 replies; 11+ messages in thread
From: Paul E. McKenney @ 2005-07-12 16:30 UTC (permalink / raw)
  To: linux-kernel; +Cc: mingo, dipankar, shemminger

Hello!

OK, counter-flip RCU actually survives a pair of overnight runs on
CONFIG_PREEMPT running on 4-CPU machines, and also survives five
kernbenches and an LTP on another 4-CPU machine.  (Overnight-run script
later in this message, FWIW.)

So, time to get serious about a bit of code cleanup:

o	The heavyweight atomic operations in rcu_read_lock() and
	rcu_read_unlock() are not needed in UP kernels, since
	interrupts are disabled.

	Is there already something like smp_atomic_inc() and
	smp_atomic_dec() that generate atomic_inc()/atomic_dec() in
	SMP kernels, but ++/-- in UP kernels?  If not, any reasons
	not to add them, for example, as follows?

		#ifdef CONFIG_SMP
		#define smp_atomic_inc(v) atomic_inc(v)
		#define smp_atomic_dec(v) atomic_dec(v)
		#else /* #ifdef CONFIG_SMP */
		#define smp_atomic_inc(v) ((v)->counter++)
		#define smp_atomic_dec(v) ((v)->counter--)
		#endif /* #else #ifdef CONFIG_SMP */

	Since interrupts must be disabled for these to be safe,
	my guess is that I should define them locally in rcupdate.c.
	If there turns out to be a general need for them, they can
	be moved somewhere more public.
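
	For concreteness, the intended usage in rcupdate.c would look
	something like the following (untested sketch; rcu_flipctr1 is
	from the patch later in this message):

		unsigned long flags;

		local_irq_save(flags);	/* makes the UP ++/-- variant safe */
		smp_atomic_inc(current->rcu_flipctr1);
		local_irq_restore(flags);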

	Objections?

o	In order to get things to work in both CONFIG_PREEMPT and
	CONFIG_PREEMPT_RT, I ended up using the following:

		#ifdef CONFIG_PREEMPT_RT

		#define rcu_spinlock_t _raw_spinlock_t
		#define rcu_spin_lock(l, f) _raw_spin_lock(l)
		#define rcu_spin_trylock(l, f) _raw_spin_trylock(l)
		#define rcu_spin_unlock(l, f) _raw_spin_unlock(l)
		#define RCU_SPIN_LOCK_UNLOCKED RAW_SPIN_LOCK_UNLOCKED

		#else /* #ifdef CONFIG_PREEMPT_RT */

		#define rcu_spinlock_t spinlock_t
		#define rcu_spin_lock(l, f) spin_lock_irqsave(l, f)
		#define rcu_spin_trylock(l, f) spin_trylock_irqsave(l, f)
		#define rcu_spin_unlock(l, f) spin_unlock_irqrestore(l, f)
		#define RCU_SPIN_LOCK_UNLOCKED SPIN_LOCK_UNLOCKED

		#endif /* #else #ifdef CONFIG_PREEMPT_RT */

	Then using rcu_spin_lock() &c everywhere.  The problem is
	that (as near as I can tell) the only way to prevent interrupts 
	from running on the current CPU in CONFIG_PREEMPT kernels is
	to use the _irq spinlock primitives, but _raw_spin_lock() does
	the job in CONFIG_PREEMPT_RT (since interrupts are run in process
	context, right).  I could use _irq in both, but that would
	unnecessarily degrade interrupt latency in CONFIG_PREEMPT_RT.

	Suggestions???

Some remaining shortcomings of the current code:

o	Untested on CONFIG_PREEMPT_RT (working on this, suggestions
	for 4-CPU-stable CONFIG_PREEMPT_RT versions most welcome).

o	Uses atomic operations and memory barriers in the
	rcu_read_lock() and rcu_read_unlock() fastpaths.
	Getting rid of the atomic operations in UP is easy,
	in SMP in the common case is reasonably straightforward,
	getting rid of memory barriers is harder, but is likely
	doable.

o	Global callback management slows things down on SMP
	(need to move to per-CPU callback management).

o	Gross debug code needs pruning (working on this).

o	Current implementation is probably too aggressive about
	finishing out grace periods -- one of the machines
	averaged one grace period every 29 milliseconds over a
	15-hour time period.  This likely understates the
	aggressiveness, since the test is quite bursty.

o	Given the small amount of testing, there are undoubtedly
	more bugs.

All that aside, only a moderately severe degree of insanity is
required to run this code.

Improvements since last time:

o	Boots and runs reasonably reliably.

o	Got rid of the workqueue stuff, simplifying grace-period
	detection code.  Also allows synchronize_sched() to be
	common code between "classic" and realtime RCU
	implementations.

o	Enabled preemption in RCU read-side critical sections
	(even in CONFIG_PREEMPT kernels).

Test script (assumes that you have a linux-2.6.11.tar.bz2 file in
the current directory); it relies on dcache's heavy use of RCU:

	#!/bin/sh

	while :
	do
		i=0
		while expr $i '<' 10
		do
			tar -xjf linux-2.6.11.tar.bz2
			echo tar status $?
			mv linux-2.6.11 argh$i
			echo $i : `grep dentry_cache /proc/slabinfo`
			i=`expr $i + 1`
		done
		for i in argh[0-9]
		do
			rm -rf $i &
		done
		wait
		echo rm : `grep dentry_cache /proc/slabinfo`
	done

Current (ugly) patch:

Not-signed-off-by: <paulmck@us.ibm.com>

diff -urpN -X dontdiff linux-2.6.12-rc6/arch/i386/Kconfig linux-2.6.12-rc6-ctrRCU/arch/i386/Kconfig
--- linux-2.6.12-rc6/arch/i386/Kconfig	2005-06-17 16:34:16.000000000 -0700
+++ linux-2.6.12-rc6-ctrRCU/arch/i386/Kconfig	2005-07-11 10:50:47.000000000 -0700
@@ -523,6 +523,31 @@ config PREEMPT
 	  Say Y here if you are building a kernel for a desktop, embedded
 	  or real-time system.  Say N if you are unsure.
 
+config PREEMPT_RCU
+	bool "Preemptible RCU read-side critical sections"
+	depends on PREEMPT
+	default y
+	help
+	  This option reduces the latency of the kernel when reacting to
+	  real-time or interactive events by allowing a low priority process to
+	  be preempted even if it is in an RCU read-side critical section.
+	  This allows applications to run more reliably even when the system is
+	  under load.
+
+	  Say Y here if you enjoy debugging random oopses.
+	  Say N if, for whatever reason, you want your kernel to actually work.
+
+config RCU_STATS
+	bool "/proc stats for preemptible RCU read-side critical sections"
+	depends on PREEMPT_RCU
+	default y
+	help
+	  This option provides /proc stats to provide debugging info for
+	  the preemptible realtime RCU implementation.
+
+	  Say Y here if you want to see RCU stats in /proc
+	  Say N if you are unsure.
+
 config PREEMPT_BKL
 	bool "Preempt The Big Kernel Lock"
 	depends on PREEMPT
diff -urpN -X dontdiff linux-2.6.12-rc6/fs/proc/proc_misc.c linux-2.6.12-rc6-ctrRCU/fs/proc/proc_misc.c
--- linux-2.6.12-rc6/fs/proc/proc_misc.c	2005-06-17 16:35:03.000000000 -0700
+++ linux-2.6.12-rc6-ctrRCU/fs/proc/proc_misc.c	2005-07-11 11:46:04.000000000 -0700
@@ -549,6 +549,38 @@ void create_seq_entry(char *name, mode_t
 		entry->proc_fops = f;
 }
 
+#ifdef CONFIG_RCU_STATS
+int rcu_read_proc(char *page, char **start, off_t off,
+		  int count, int *eof, void *data)
+{
+	int len;
+	extern int rcu_read_proc_data(char *page);
+
+	len = rcu_read_proc_data(page);
+	return proc_calc_metrics(page, start, off, count, eof, len);
+}
+
+int rcu_read_proc_gp(char *page, char **start, off_t off,
+		     int count, int *eof, void *data)
+{
+	int len;
+	extern int rcu_read_proc_gp_data(char *page);
+
+	len = rcu_read_proc_gp_data(page);
+	return proc_calc_metrics(page, start, off, count, eof, len);
+}
+
+int rcu_read_proc_ptrs(char *page, char **start, off_t off,
+		       int count, int *eof, void *data)
+{
+	int len;
+	extern int rcu_read_proc_ptrs_data(char *page);
+
+	len = rcu_read_proc_ptrs_data(page);
+	return proc_calc_metrics(page, start, off, count, eof, len);
+}
+#endif /* #ifdef CONFIG_RCU_STATS */
+
 void __init proc_misc_init(void)
 {
 	struct proc_dir_entry *entry;
@@ -571,6 +603,11 @@ void __init proc_misc_init(void)
 		{"cmdline",	cmdline_read_proc},
 		{"locks",	locks_read_proc},
 		{"execdomains",	execdomains_read_proc},
+#ifdef CONFIG_RCU_STATS
+		{"rcustats",	rcu_read_proc},
+		{"rcugp",	rcu_read_proc_gp},
+		{"rcuptrs",	rcu_read_proc_ptrs},
+#endif /* #ifdef CONFIG_RCU_STATS */
 		{NULL,}
 	};
 	for (p = simple_ones; p->name; p++)
diff -urpN -X dontdiff linux-2.6.12-rc6/include/linux/rcupdate.h linux-2.6.12-rc6-ctrRCU/include/linux/rcupdate.h
--- linux-2.6.12-rc6/include/linux/rcupdate.h	2005-06-17 16:35:13.000000000 -0700
+++ linux-2.6.12-rc6-ctrRCU/include/linux/rcupdate.h	2005-07-01 14:55:35.000000000 -0700
@@ -59,6 +59,7 @@ struct rcu_head {
 } while (0)
 
 
+#ifndef CONFIG_PREEMPT_RCU
 
 /* Global control variables for rcupdate callback mechanism. */
 struct rcu_ctrlblk {
@@ -192,6 +193,18 @@ static inline int rcu_pending(int cpu)
  */
 #define rcu_read_unlock()	preempt_enable()
 
+#else /* #ifndef CONFIG_PREEMPT_RCU */
+
+#define rcu_qsctr_inc(cpu)
+#define rcu_bh_qsctr_inc(cpu)
+#define call_rcu_bh(head, rcu) call_rcu(head, rcu)
+
+extern void rcu_read_lock(void);
+extern void rcu_read_unlock(void);
+extern int rcu_pending(int cpu);
+
+#endif /* #else #ifndef CONFIG_PREEMPT_RCU */
+
 /*
  * So where is rcu_write_lock()?  It does not exist, as there is no
  * way for writers to lock out RCU readers.  This is a feature, not
@@ -213,14 +226,22 @@ static inline int rcu_pending(int cpu)
  * can use just rcu_read_lock().
  *
  */
+#ifndef CONFIG_PREEMPT_RCU
 #define rcu_read_lock_bh()	local_bh_disable()
+#else /* #ifndef CONFIG_PREEMPT_RCU */
+#define rcu_read_lock_bh()	do { rcu_read_lock(); local_bh_disable(); } while (0)
+#endif /* #else #ifndef CONFIG_PREEMPT_RCU */
 
 /*
  * rcu_read_unlock_bh - marks the end of a softirq-only RCU critical section
  *
  * See rcu_read_lock_bh() for more information.
  */
+#ifndef CONFIG_PREEMPT_RCU
 #define rcu_read_unlock_bh()	local_bh_enable()
+#else /* #ifndef CONFIG_PREEMPT_RCU */
+#define rcu_read_unlock_bh()	do { local_bh_enable(); rcu_read_unlock(); } while (0)
+#endif /* #else #ifndef CONFIG_PREEMPT_RCU */
 
 /**
  * rcu_dereference - fetch an RCU-protected pointer in an
@@ -269,7 +290,11 @@ static inline int rcu_pending(int cpu)
  * synchronize_kernel() API.  In contrast, synchronize_rcu() only
  * guarantees that rcu_read_lock() sections will have completed.
  */
+#ifndef CONFIG_PREEMPT_RCU
 #define synchronize_sched() synchronize_rcu()
+#else /* #ifndef CONFIG_PREEMPT_RCU */
+extern void synchronize_sched(void);
+#endif /* #else #ifndef CONFIG_PREEMPT_RCU */
 
 extern void rcu_init(void);
 extern void rcu_check_callbacks(int cpu, int user);
diff -urpN -X dontdiff linux-2.6.12-rc6/include/linux/sched.h linux-2.6.12-rc6-ctrRCU/include/linux/sched.h
--- linux-2.6.12-rc6/include/linux/sched.h	2005-06-17 16:35:13.000000000 -0700
+++ linux-2.6.12-rc6-ctrRCU/include/linux/sched.h	2005-06-22 17:25:49.000000000 -0700
@@ -740,6 +740,11 @@ struct task_struct {
 	nodemask_t mems_allowed;
 	int cpuset_mems_generation;
 #endif
+#ifdef CONFIG_PREEMPT_RCU
+	int rcu_read_lock_nesting;
+	atomic_t *rcu_flipctr1;
+	atomic_t *rcu_flipctr2;
+#endif
 };
 
 static inline pid_t process_group(struct task_struct *tsk)
diff -urpN -X dontdiff linux-2.6.12-rc6/kernel/rcupdate.c linux-2.6.12-rc6-ctrRCU/kernel/rcupdate.c
--- linux-2.6.12-rc6/kernel/rcupdate.c	2005-06-17 16:35:15.000000000 -0700
+++ linux-2.6.12-rc6-ctrRCU/kernel/rcupdate.c	2005-07-11 16:27:10.000000000 -0700
@@ -47,6 +47,46 @@
 #include <linux/rcupdate.h>
 #include <linux/cpu.h>
 
+struct rcu_synchronize {
+	struct rcu_head head;
+	struct completion completion;
+};
+
+/* Because of FASTCALL declaration of complete, we use this wrapper */
+static void wakeme_after_rcu(struct rcu_head  *head)
+{
+	struct rcu_synchronize *rcu;
+
+	rcu = container_of(head, struct rcu_synchronize, head);
+	complete(&rcu->completion);
+}
+
+/**
+ * synchronize_rcu - wait until a grace period has elapsed.
+ *
+ * Control will return to the caller some time after a full grace
+ * period has elapsed, in other words after all currently executing RCU
+ * read-side critical sections have completed.  RCU read-side critical
+ * sections are delimited by rcu_read_lock() and rcu_read_unlock(),
+ * and may be nested.
+ *
+ * If your read-side code is not protected by rcu_read_lock(), do -not-
+ * use synchronize_rcu().
+ */
+void synchronize_rcu(void)
+{
+	struct rcu_synchronize rcu;
+
+	init_completion(&rcu.completion);
+	/* Will wake me after RCU finished */
+	call_rcu(&rcu.head, wakeme_after_rcu);
+
+	/* Wait for it */
+	wait_for_completion(&rcu.completion);
+}
+
+#ifndef CONFIG_PREEMPT_RCU
+
 /* Definition for rcupdate control block. */
 struct rcu_ctrlblk rcu_ctrlblk = 
 	{ .cur = -300, .completed = -300 };
@@ -429,42 +469,392 @@ void __init rcu_init(void)
 	register_cpu_notifier(&rcu_nb);
 }
 
-struct rcu_synchronize {
-	struct rcu_head head;
-	struct completion completion;
+/*
+ * Deprecated, use synchronize_rcu() or synchronize_sched() instead.
+ */
+void synchronize_kernel(void)
+{
+	synchronize_rcu();
+}
+
+module_param(maxbatch, int, 0);
+EXPORT_SYMBOL(call_rcu);  /* WARNING: GPL-only in April 2006. */
+EXPORT_SYMBOL(call_rcu_bh);  /* WARNING: GPL-only in April 2006. */
+EXPORT_SYMBOL_GPL(synchronize_rcu);
+EXPORT_SYMBOL(synchronize_kernel);  /* WARNING: GPL-only in April 2006. */
+
+#else /* #ifndef CONFIG_PREEMPT_RCU */
+
+#ifdef CONFIG_PREEMPT_RT
+
+#error "Hey!!!  This fails miserably even in straightforward testing!!!"
+
+#define rcu_spinlock_t _raw_spinlock_t
+#define rcu_spin_lock(l, f) _raw_spin_lock(l)
+#define rcu_spin_trylock(l, f) _raw_spin_trylock(l)
+#define rcu_spin_unlock(l, f) _raw_spin_unlock(l)
+#define RCU_SPIN_LOCK_UNLOCKED RAW_SPIN_LOCK_UNLOCKED
+
+#else /* #ifdef CONFIG_PREEMPT_RT */
+
+#define rcu_spinlock_t spinlock_t
+#define rcu_spin_lock(l, f) spin_lock_irqsave(l, f)
+#define rcu_spin_trylock(l, f) spin_trylock_irqsave(l, f)
+#define rcu_spin_unlock(l, f) spin_unlock_irqrestore(l, f)
+#define RCU_SPIN_LOCK_UNLOCKED SPIN_LOCK_UNLOCKED
+
+#endif /* #else #ifdef CONFIG_PREEMPT_RT */
+
+struct rcu_data {
+	rcu_spinlock_t	lock;
+	long		batch;
+	struct tasklet_struct rcu_tasklet;
+	struct rcu_head *nextlist;
+	struct rcu_head **nexttail;
+	struct rcu_head *waitlist;
+	struct rcu_head **waittail;
+	struct rcu_head *donelist;
+	struct rcu_head **donetail;
+	long		n_next_length;
+#ifdef CONFIG_RCU_STATS
+	long		n_next_add;
+	long		n_wait_length;
+	long		n_wait_add;
+	long		n_done_length;
+	long		n_done_add;
+	long		n_done_remove;
+	atomic_t	n_done_invoked;
+	long		n_rcu_process_callbacks;
+	long		n_rcu_check_callbacks;
+	long		n_rcu_advance_callbacks1;
+	long		n_rcu_advance_callbacks2;
+	long		n_rcu_advance_callbacks3;
+	long		n_rcu_advance_callbacks4;
+	long		n_rcu_advance_callbacks5;
+	atomic_t	n_synchronize_rcu;
+	atomic_t	n_synchronize_rcu_flip1; /*&&&&*/
+	atomic_t	n_synchronize_rcu_flip2; /*&&&&*/
+	atomic_t	n_synchronize_rcu_flip3; /*&&&&*/
+	atomic_t	n_synchronize_rcu_flip4; /*&&&&*/
+	long		n_synchronize_rcu_flip_early1;
+#endif /* #ifdef CONFIG_RCU_STATS */
+};
+struct rcu_ctrlblk {
+	spinlock_t	fliplock;
+	long		batch;
 };
+static struct rcu_data rcu_data;
+static struct rcu_ctrlblk rcu_ctrlblk = {
+	.fliplock = RCU_SPIN_LOCK_UNLOCKED,
+	.batch = 0,
+};
+static DEFINE_PER_CPU(atomic_t [2], rcu_flipctr) =
+	{ ATOMIC_INIT(0), ATOMIC_INIT(0) };
 
-/* Because of FASTCALL declaration of complete, we use this wrapper */
-static void wakeme_after_rcu(struct rcu_head  *head)
+void
+rcu_read_lock(void)
 {
-	struct rcu_synchronize *rcu;
+	int flipctr;
+	unsigned long oldirq;
 
-	rcu = container_of(head, struct rcu_synchronize, head);
-	complete(&rcu->completion);
+	local_irq_save(oldirq);	/* @@@ inside "if" for PREEMPT_RT... */
+	if (current->rcu_read_lock_nesting++ == 0) {
+
+		/*
+		 * Outermost nesting of rcu_read_lock(), so atomically
+		 * increment the current counter for the current CPU.
+		 */
+
+		flipctr = rcu_ctrlblk.batch & 0x1;
+		smp_read_barrier_depends();
+		current->rcu_flipctr1 = &(__get_cpu_var(rcu_flipctr)[flipctr]);
+		/* Can optimize to non-atomic on fastpath, but start simple. */
+		atomic_inc(current->rcu_flipctr1);
+		smp_mb__after_atomic_inc();  /* might optimize out... */
+		if (unlikely(flipctr != (rcu_ctrlblk.batch & 0x1))) {
+
+			/*
+			 * We raced with grace-period processing (flip).
+			 * Although we cannot be preempted here, there
+			 * could be interrupts, ECC errors and the like,
+			 * so just nail down both sides of the rcu_flipctr
+			 * array for the duration of our RCU read-side
+			 * critical section, preventing a second flip
+			 * from racing with us.  At some point, it would
+			 * be safe to decrement one of the counters, but
+			 * we have no way of knowing when that would be.
+			 * So just decrement them both in rcu_read_unlock().
+			 */
+
+			current->rcu_flipctr2 =
+				&(__get_cpu_var(rcu_flipctr)[!flipctr]);
+			/* Can again optimize to non-atomic on fastpath. */
+			atomic_inc(current->rcu_flipctr2);
+			smp_mb__after_atomic_inc();  /* might optimize out... */
+		}
+	}
+#ifndef CONFIG_PREEMPT_RT
+	/* preempt_disable(); */
+#endif /* #ifndef CONFIG_PREEMPT_RT */
+	local_irq_restore(oldirq);  /*@@@ should be able to precede. */
+}
+
+void
+rcu_read_unlock(void)
+{
+#ifndef CONFIG_PREEMPT_RT
+	unsigned long oldirq;
+
+	local_irq_save(oldirq);  /* @@@ should be able to reverse... */
+	/* preempt_enable(); */
+#endif /* #ifndef CONFIG_PREEMPT_RT */
+	if (--current->rcu_read_lock_nesting == 0) {
+
+		/*
+		 * Just atomically decrement whatever we incremented.
+		 * Might later want to awaken some task waiting for the
+		 * grace period to complete, but keep it simple for the
+		 * moment.
+		 */
+
+		smp_mb__before_atomic_dec();
+		atomic_dec(current->rcu_flipctr1);
+		current->rcu_flipctr1 = NULL;
+		if (unlikely(current->rcu_flipctr2 != NULL)) {
+			atomic_dec(current->rcu_flipctr2);
+			current->rcu_flipctr2 = NULL;
+		}
+	}
+#ifndef CONFIG_PREEMPT_RT
+	local_irq_restore(oldirq);
+#endif /* #ifndef CONFIG_PREEMPT_RT */
+}
+
+static void
+__rcu_advance_callbacks(void)
+{
+
+#ifdef CONFIG_RCU_STATS
+	rcu_data.n_rcu_advance_callbacks1++;
+#endif /* #ifdef CONFIG_RCU_STATS */
+	if (rcu_data.batch != rcu_ctrlblk.batch) {
+#ifdef CONFIG_RCU_STATS
+		rcu_data.n_rcu_advance_callbacks2++;
+#endif /* #ifdef CONFIG_RCU_STATS */
+		if (rcu_data.waitlist != NULL) {
+#ifdef CONFIG_RCU_STATS
+			rcu_data.n_rcu_advance_callbacks3++;
+#endif /* #ifdef CONFIG_RCU_STATS */
+			*rcu_data.donetail = rcu_data.waitlist;
+			rcu_data.donetail = rcu_data.waittail;
+#ifdef CONFIG_RCU_STATS
+			rcu_data.n_done_length += rcu_data.n_wait_length;
+			rcu_data.n_done_add += rcu_data.n_wait_length;
+			rcu_data.n_wait_length = 0;
+#endif /* #ifdef CONFIG_RCU_STATS */
+		}
+		if (rcu_data.nextlist != NULL) {
+#ifdef CONFIG_RCU_STATS
+			rcu_data.n_rcu_advance_callbacks4++;
+#endif /* #ifdef CONFIG_RCU_STATS */
+			rcu_data.waitlist = rcu_data.nextlist;
+			rcu_data.waittail = rcu_data.nexttail;
+			rcu_data.nextlist = NULL;
+			rcu_data.nexttail = &rcu_data.nextlist;
+#ifdef CONFIG_RCU_STATS
+			rcu_data.n_wait_length += rcu_data.n_next_length;
+			rcu_data.n_wait_add += rcu_data.n_next_length;
+#endif /* #ifdef CONFIG_RCU_STATS */
+			rcu_data.n_next_length = 0;
+		} else {
+#ifdef CONFIG_RCU_STATS
+			rcu_data.n_rcu_advance_callbacks5++;
+#endif /* #ifdef CONFIG_RCU_STATS */
+			rcu_data.waitlist = NULL;
+			rcu_data.waittail = &rcu_data.waitlist;
+		}
+		rcu_data.batch = rcu_ctrlblk.batch;
+	}
 }
 
-/**
- * synchronize_rcu - wait until a grace period has elapsed.
- *
- * Control will return to the caller some time after a full grace
- * period has elapsed, in other words after all currently executing RCU
- * read-side critical sections have completed.  RCU read-side critical
- * sections are delimited by rcu_read_lock() and rcu_read_unlock(),
- * and may be nested.
+/*
+ * Attempt a single flip of the counters.  Remember, a single flip does
+ * -not- constitute a grace period.  Instead, the interval between
+ * a pair of consecutive flips is a grace period.
  *
- * If your read-side code is not protected by rcu_read_lock(), do -not-
- * use synchronize_rcu().
+ * If anyone is nuts enough to run this CONFIG_PREEMPT_RCU implementation
+ * on a large SMP, they might want to use a hierarchical organization of
+ * the per-CPU-counter pairs.
+ */
+static void
+rcu_try_flip(void)
+{
+	int cpu;
+	long flipctr;
+
+#ifdef CONFIG_RCU_STATS
+	atomic_inc(&rcu_data.n_synchronize_rcu_flip1);
+#endif /* #ifdef CONFIG_RCU_STATS */
+	flipctr = rcu_data.batch;
+	if (unlikely(!spin_trylock(&rcu_ctrlblk.fliplock)))
+		return;
+#if 0
+	spin_lock(&rcu_ctrlblk.fliplock);
+#endif
+	if (unlikely(flipctr != rcu_data.batch)) {
+	
+		/* Our work is done!  ;-) */
+
+#ifdef CONFIG_RCU_STATS
+		rcu_data.n_synchronize_rcu_flip_early1++;
+#endif /* #ifdef CONFIG_RCU_STATS */
+		spin_unlock(&rcu_ctrlblk.fliplock);
+		return;
+	}
+	flipctr &= 0x1;
+
+	/*
+	 * Check for completion of all RCU read-side critical sections
+	 * that started prior to the previous flip.
+	 */
+
+#ifdef CONFIG_RCU_STATS
+	atomic_inc(&rcu_data.n_synchronize_rcu_flip2);
+#endif /* #ifdef CONFIG_RCU_STATS */
+	for_each_cpu(cpu) {
+		if (atomic_read(&per_cpu(rcu_flipctr, cpu)[!flipctr]) != 0) {
+			spin_unlock(&rcu_ctrlblk.fliplock);
+			return;
+		}
+	}
+
+	/* Do the flip. */
+
+#ifdef CONFIG_RCU_STATS
+	atomic_inc(&rcu_data.n_synchronize_rcu_flip3);
+#endif /* #ifdef CONFIG_RCU_STATS */
+	smp_mb();
+	rcu_ctrlblk.batch++;
+
+#ifdef CONFIG_RCU_STATS
+	atomic_inc(&rcu_data.n_synchronize_rcu_flip4);
+#endif /* #ifdef CONFIG_RCU_STATS */
+	
+	spin_unlock(&rcu_ctrlblk.fliplock);
+}
+
+void
+rcu_check_callbacks(int cpu, int user)
+{
+	unsigned long flags;
+
+	rcu_try_flip();
+	rcu_spin_lock(&rcu_data.lock, flags);
+#ifdef CONFIG_RCU_STATS
+	rcu_data.n_rcu_check_callbacks++;
+#endif /* #ifdef CONFIG_RCU_STATS */
+	__rcu_advance_callbacks();
+	if (rcu_data.donelist == NULL) {
+		rcu_spin_unlock(&rcu_data.lock, flags);
+	} else {
+		rcu_spin_unlock(&rcu_data.lock, flags);
+		tasklet_schedule(&rcu_data.rcu_tasklet);
+	}
+}
+
+static
+void rcu_process_callbacks(unsigned long data)
+{
+	unsigned long flags;
+	struct rcu_head *next, *list;
+
+	rcu_spin_lock(&rcu_data.lock, flags);
+#ifdef CONFIG_RCU_STATS
+	rcu_data.n_rcu_process_callbacks++;
+#endif /* #ifdef CONFIG_RCU_STATS */
+	list = rcu_data.donelist;
+	if (list == NULL) {
+		rcu_spin_unlock(&rcu_data.lock, flags);
+		return;
+	}
+	rcu_data.donelist = NULL;
+	rcu_data.donetail = &rcu_data.donelist;
+#ifdef CONFIG_RCU_STATS
+	rcu_data.n_done_remove += rcu_data.n_done_length;
+	rcu_data.n_done_length = 0;
+#endif /* #ifdef CONFIG_RCU_STATS */
+	rcu_spin_unlock(&rcu_data.lock, flags);
+	while (list) {
+		next = list->next;
+		list->func(list);
+		list = next;
+#ifdef CONFIG_RCU_STATS
+		atomic_inc(&rcu_data.n_done_invoked);
+#endif /* #ifdef CONFIG_RCU_STATS */
+	}
+}
+
+void fastcall
+call_rcu(struct rcu_head *head,
+	 void (*func)(struct rcu_head *rcu))
+{
+	unsigned long flags;
+
+	head->func = func;
+	head->next = NULL;
+	rcu_spin_lock(&rcu_data.lock, flags);
+	__rcu_advance_callbacks();
+	*rcu_data.nexttail = head;
+	rcu_data.nexttail = &head->next;
+#ifdef CONFIG_RCU_STATS
+	rcu_data.n_next_add++;
+#endif /* #ifdef CONFIG_RCU_STATS */
+	rcu_data.n_next_length++;
+	rcu_spin_unlock(&rcu_data.lock, flags);
+}
+
+/*
+ * Crude hack, reduces but does not eliminate possibility of failure.
+ * Needs to wait for all CPUs to pass through a -voluntary- context
+ * switch to eliminate possibility of failure.  (Maybe just crank
+ * priority down...)
  */
-void synchronize_rcu(void)
+void
+synchronize_sched(void)
 {
-	struct rcu_synchronize rcu;
+	cpumask_t oldmask;
+	int cpu;
 
-	init_completion(&rcu.completion);
-	/* Will wake me after RCU finished */
-	call_rcu(&rcu.head, wakeme_after_rcu);
+	if (sched_getaffinity(0, &oldmask) < 0) {
+		oldmask = cpu_possible_map; 
+	}
+	for_each_cpu(cpu) {
+		sched_setaffinity(0, cpumask_of_cpu(cpu));
+		schedule();
+	}
+	sched_setaffinity(0, oldmask);
+}
 
-	/* Wait for it */
-	wait_for_completion(&rcu.completion);
+int
+rcu_pending(int cpu)
+{
+	return (rcu_data.donelist != NULL ||
+		rcu_data.waitlist != NULL ||
+		rcu_data.nextlist != NULL);
+}
+
+void __init rcu_init(void)
+{
+/*&&&&*/printk("WARNING: experimental RCU implementation.\n");
+	rcu_data.lock = RCU_SPIN_LOCK_UNLOCKED;
+	rcu_data.batch = 0;
+	rcu_data.nextlist = NULL;
+	rcu_data.nexttail = &rcu_data.nextlist;
+	rcu_data.waitlist = NULL;
+	rcu_data.waittail = &rcu_data.waitlist;
+	rcu_data.donelist = NULL;
+	rcu_data.donetail = &rcu_data.donelist;
+	tasklet_init(&rcu_data.rcu_tasklet, rcu_process_callbacks, 0UL);
 }
 
 /*
@@ -472,11 +862,71 @@ void synchronize_rcu(void)
  */
 void synchronize_kernel(void)
 {
+	synchronize_sched();
+}
+
+#ifdef CONFIG_RCU_STATS
+int rcu_read_proc_data(char *page)
+{
+	return sprintf(page,
+		       "ggp=%ld lgp=%ld\n"
+		       "sr=%d srf1=%d srf2=%d srf3=%d srf4=%d\n"
+		       "ac1=%ld ac2=%ld ac3=%ld ac4=%ld ac5=%ld\n"
+		       "sre1=%ld\n"
+		       "na=%ld nl=%ld wa=%ld wl=%ld rcc=%ld rpc=%ld dl=%ld dr=%ld di=%d\n",
+		       rcu_ctrlblk.batch,
+		       rcu_data.batch,
+
+		       atomic_read(&rcu_data.n_synchronize_rcu),
+		       atomic_read(&rcu_data.n_synchronize_rcu_flip1),
+		       atomic_read(&rcu_data.n_synchronize_rcu_flip2),
+		       atomic_read(&rcu_data.n_synchronize_rcu_flip3),
+		       atomic_read(&rcu_data.n_synchronize_rcu_flip4),
+
+		       rcu_data.n_rcu_advance_callbacks1,
+		       rcu_data.n_rcu_advance_callbacks2,
+		       rcu_data.n_rcu_advance_callbacks3,
+		       rcu_data.n_rcu_advance_callbacks4,
+		       rcu_data.n_rcu_advance_callbacks5,
+
+		       rcu_data.n_synchronize_rcu_flip_early1,
+
+		       rcu_data.n_next_add,
+		       rcu_data.n_next_length,
+		       rcu_data.n_wait_add,
+		       rcu_data.n_wait_length,
+		       rcu_data.n_rcu_check_callbacks,
+		       rcu_data.n_rcu_process_callbacks,
+		       rcu_data.n_done_length,
+		       rcu_data.n_done_remove,
+		       atomic_read(&rcu_data.n_done_invoked));
+}
+
+int rcu_read_proc_gp_data(char *page)
+{
+	long oldgp = rcu_ctrlblk.batch;
+
 	synchronize_rcu();
+	return sprintf(page, "oldggp=%ld  newggp=%ld\n",
+		       oldgp, rcu_ctrlblk.batch);
 }
 
-module_param(maxbatch, int, 0);
-EXPORT_SYMBOL(call_rcu);  /* WARNING: GPL-only in April 2006. */
-EXPORT_SYMBOL(call_rcu_bh);  /* WARNING: GPL-only in April 2006. */
+int rcu_read_proc_ptrs_data(char *page)
+{
+	return sprintf(page,
+		       "nl=%p/%p nt=%p wl=%p/%p wt=%p dl=%p/%p dt=%p\n",
+		       &rcu_data.nextlist, rcu_data.nextlist, rcu_data.nexttail,
+		       &rcu_data.waitlist, rcu_data.waitlist, rcu_data.waittail,
+		       &rcu_data.donelist, rcu_data.donelist, rcu_data.donetail
+		      );
+}
+#endif /* #ifdef CONFIG_RCU_STATS */
+
+EXPORT_SYMBOL(call_rcu); /* WARNING: GPL-only in April 2006. */
 EXPORT_SYMBOL_GPL(synchronize_rcu);
-EXPORT_SYMBOL(synchronize_kernel);  /* WARNING: GPL-only in April 2006. */
+EXPORT_SYMBOL_GPL(synchronize_sched);
+EXPORT_SYMBOL(rcu_read_lock);  /* WARNING: GPL-only in April 2006. */
+EXPORT_SYMBOL(rcu_read_unlock);  /* WARNING: GPL-only in April 2006. */
+EXPORT_SYMBOL(synchronize_kernel);  /* WARNING: Removal in April 2006. */
+
+#endif /* #else #ifndef CONFIG_PREEMPT_RCU */


* Re: PREEMPT/PREEMPT_RT question
  2005-07-12 16:30 PREEMPT/PREEMPT_RT question Paul E. McKenney
@ 2005-07-12 17:05 ` Steven Rostedt
  2005-07-12 19:28   ` Paul E. McKenney
  2005-07-12 19:26 ` Ingo Molnar
  1 sibling, 1 reply; 11+ messages in thread
From: Steven Rostedt @ 2005-07-12 17:05 UTC (permalink / raw)
  To: paulmck; +Cc: shemminger, dipankar, mingo, linux-kernel

On Tue, 2005-07-12 at 09:30 -0700, Paul E. McKenney wrote:
> Hello!
> 
> OK, counter-flip RCU actually survives a pair of overnight runs on
> CONFIG_PREEMPT running on 4-CPU machines, and also survives five
> kernbenches and an LTP on another 4-CPU machine.  (Overnight-run script
> later in this message, FWIW.)
> 
> So, time to get serious about a bit of code cleanup:
> 
> o	The heavyweight atomic operations in rcu_read_lock() and
> 	rcu_read_unlock() are not needed in UP kernels, since
> 	interrupts are disabled.
> 
> 	Is there already something like smp_atomic_inc() and
> 	smp_atomic_dec() that generate atomic_inc()/atomic_dec() in
> 	SMP kernels, but ++/-- in UP kernels?  If not, any reasons
> 	not to add them, for example, as follows?
> 
> 		#ifdef CONFIG_SMP
> 		#define smp_atomic_inc(v) atomic_inc(v)
> 		#define smp_atomic_dec(v) atomic_dec(v)
> 		#else /* #ifdef CONFIG_SMP */
> 		#define smp_atomic_inc(v) ((v)->counter++)
> 		#define smp_atomic_dec(v) ((v)->counter--)
> 		#endif /* #else #ifdef CONFIG_SMP */

What's the problem with atomic_inc?  At least on x86, atomic inc is
defined as:

static __inline__ void atomic_inc(atomic_t *v)
{
	__asm__ __volatile__(
		LOCK "incl %0"
		:"=m" (v->counter)
		:"m" (v->counter));
}

With LOCK defined as:

#ifdef CONFIG_SMP
#define LOCK "lock ; "
#else
#define LOCK ""
#endif

So is there a difference on UP between x.counter++ and atomic_inc(&x)?

> 	Since interrupts must be disabled for these to be safe,
> 	my guess is that I should define them locally in rcupdate.c.
> 	If there turns out to be a general need for them, they can
> 	be moved somewhere more public.
> 
> 	Objections?
> 
> o	In order to get things to work in both CONFIG_PREEMPT and
> 	CONFIG_PREEMPT_RT, I ended up using the following:
> 
> 		#ifdef CONFIG_PREEMPT_RT
> 
> 		#define rcu_spinlock_t _raw_spinlock_t
> 		#define rcu_spin_lock(l, f) _raw_spin_lock(l)
> 		#define rcu_spin_trylock(l, f) _raw_spin_trylock(l)
> 		#define rcu_spin_unlock(l, f) _raw_spin_unlock(l)
> 		#define RCU_SPIN_LOCK_UNLOCKED RAW_SPIN_LOCK_UNLOCKED
> 
> 		#else /* #ifdef CONFIG_PREEMPT_RT */
> 
> 		#define rcu_spinlock_t spinlock_t
> 		#define rcu_spin_lock(l, f) spin_lock_irqsave(l, f)
> 		#define rcu_spin_trylock(l, f) spin_trylock_irqsave(l, f)
> 		#define rcu_spin_unlock(l, f) spin_unlock_irqrestore(l, f)
> 		#define RCU_SPIN_LOCK_UNLOCKED SPIN_LOCK_UNLOCKED
> 
> 		#endif /* #else #ifdef CONFIG_PREEMPT_RT */
> 
> 	Then using rcu_spin_lock() &c everywhere.  The problem is
> 	that (as near as I can tell) the only way to prevent interrupts 
> 	from running on the current CPU in CONFIG_PREEMPT kernels is
> 	to use the _irq spinlock primitives, but _raw_spin_lock() does
> 	the job in CONFIG_PREEMPT_RT (since interrupts are run in process
> 	context, right).  I could use _irq in both, but that would
> 	unnecessarily degrade interrupt latency in CONFIG_PREEMPT_RT.

Yep interrupts are threads in CONFIG_PREEMPT_RT.  I guess you could also
just use local_irq_save with spin_lock, since now local_irq_save no
longer disables interrupts in PREEMPT_RT.

-- Steve




* Re: PREEMPT/PREEMPT_RT question
  2005-07-12 16:30 PREEMPT/PREEMPT_RT question Paul E. McKenney
  2005-07-12 17:05 ` Steven Rostedt
@ 2005-07-12 19:26 ` Ingo Molnar
  2005-07-12 21:04   ` Paul E. McKenney
  1 sibling, 1 reply; 11+ messages in thread
From: Ingo Molnar @ 2005-07-12 19:26 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: linux-kernel, dipankar, shemminger


* Paul E. McKenney <paulmck@us.ibm.com> wrote:

> Hello!
> 
> OK, counter-flip RCU actually survives a pair of overnight runs on 
> CONFIG_PREEMPT running on 4-CPU machines, and also survives five 
> kernbenches and an LTP on another 4-CPU machine. [...]

cool!

> So, time to get serious about a bit of code cleanup:
> 
> o	The heavyweight atomic operations in rcu_read_lock() and
> 	rcu_read_unlock() are not needed in UP kernels, since
> 	interrupts are disabled.

atomic_*() ops should already be lightweight on UP.

> o	In order to get things to work in both CONFIG_PREEMPT and
> 	CONFIG_PREEMPT_RT, I ended up using the following:
> 
> 		#ifdef CONFIG_PREEMPT_RT
> 
> 		#define rcu_spinlock_t _raw_spinlock_t
> 		#define rcu_spin_lock(l, f) _raw_spin_lock(l)
> 		#define rcu_spin_trylock(l, f) _raw_spin_trylock(l)
> 		#define rcu_spin_unlock(l, f) _raw_spin_unlock(l)
> 		#define RCU_SPIN_LOCK_UNLOCKED RAW_SPIN_LOCK_UNLOCKED
> 
> 		#else /* #ifdef CONFIG_PREEMPT_RT */
> 
> 		#define rcu_spinlock_t spinlock_t
> 		#define rcu_spin_lock(l, f) spin_lock_irqsave(l, f)
> 		#define rcu_spin_trylock(l, f) spin_trylock_irqsave(l, f)
> 		#define rcu_spin_unlock(l, f) spin_unlock_irqrestore(l, f)
> 		#define RCU_SPIN_LOCK_UNLOCKED SPIN_LOCK_UNLOCKED
> 
> 		#endif /* #else #ifdef CONFIG_PREEMPT_RT */
> 
> 	Then using rcu_spin_lock() &c everywhere.  The problem is
> 	that (as near as I can tell) the only way to prevent interrupts 
> 	from running on the current CPU in CONFIG_PREEMPT kernels is
> 	to use the _irq spinlock primitives, but _raw_spin_lock() does
> 	the job in CONFIG_PREEMPT_RT (since interrupts are run in process
> 	context, right).  I could use _irq in both, but that would
> 	unnecessarily degrade interrupt latency in CONFIG_PREEMPT_RT.
> 
> 	Suggestions???

in PREEMPT_RT, if you define a spinlock type as raw_spinlock_t (via 
DEFINE_RAW_SPINLOCK(lock)) then the spin_lock*() APIs automatically 
switch over to that type. I.e. no need to do the rcu_ stuff AFAICT.  
(unless i missed some detail about what you are trying to do.) Raw 
spinlocks under PREEMPT_RT pair with the raw IRQ flag, i.e.  
spin_lock_irq() will disable hard interrupts. Generally i'd suggest to 
go for the raw spinlock and raw-irq-flag-disabling variant, because RCU 
locking is core enough functionality to warrant atomic treatment. If 
there's any latency path in it, we can work on breaking it up later.
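
i.e. something like this (sketch; made-up lock name):

	static DEFINE_RAW_SPINLOCK(rcu_lock);	/* type: raw_spinlock_t */

	unsigned long flags;

	spin_lock_irqsave(&rcu_lock, flags);	/* hard-disables irqs on -RT */
	/* ... short, O(1) critical section ... */
	spin_unlock_irqrestore(&rcu_lock, flags);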

it's perfectly fine to disable raw interrupts with the RCU code, as long 
as you do O(1) amount of work. (where the constant factor isnt 2^(2^32) 
;-)

> Some remaining shortcomings of the current code:
> 
> o	Untested on CONFIG_PREEMPT_RT (working on this, suggestions
> 	for 4-CPU-stable CONFIG_PREEMPT_RT versions most welcome).

latest (-51-28) has no known regressions, and i regularly boot on 4-way 
(and irregularly on 8-way) boxes. So if you see something strange on the 
latest -RT kernel, please report it.

	Ingo


* Re: PREEMPT/PREEMPT_RT question
  2005-07-12 17:05 ` Steven Rostedt
@ 2005-07-12 19:28   ` Paul E. McKenney
  2005-07-12 20:04     ` Steven Rostedt
  0 siblings, 1 reply; 11+ messages in thread
From: Paul E. McKenney @ 2005-07-12 19:28 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: shemminger, dipankar, mingo, linux-kernel

On Tue, Jul 12, 2005 at 01:05:24PM -0400, Steven Rostedt wrote:
> On Tue, 2005-07-12 at 09:30 -0700, Paul E. McKenney wrote:
> > Hello!
> > 
> > OK, counter-flip RCU actually survives a pair of overnight runs on
> > CONFIG_PREEMPT running on 4-CPU machines, and also survives five
> > kernbenches and an LTP on another 4-CPU machine.  (Overnight-run script
> > later in this message, FWIW.)
> > 
> > So, time to get serious about a bit of code cleanup:
> > 
> > o	The heavyweight atomic operations in rcu_read_lock() and
> > 	rcu_read_unlock() are not needed in UP kernels, since
> > 	interrupts are disabled.
> > 
> > 	Is there already something like smp_atomic_inc() and
> > 	smp_atomic_dec() that generate atomic_inc()/atomic_dec() in
> > 	SMP kernels, but ++/-- in UP kernels?  If not, any reasons
> > 	not to add them, for example, as follows?
> > 
> > 		#ifdef CONFIG_SMP
> > 		#define smp_atomic_inc(v) atomic_inc(v)
> > 		#define smp_atomic_dec(v) atomic_dec(v)
> > 		#else /* #ifdef CONFIG_SMP */
> > 		#define smp_atomic_inc(v) ((v)->counter++)
> > 		#define smp_atomic_dec(v) ((v)->counter--)
> > 		#endif /* #else #ifdef CONFIG_SMP */
> 
> What's the problem with atomic_inc?  At least on x86, atomic inc is
> defined as:
> 
> static __inline__ void atomic_inc(atomic_t *v)
> {
> 	__asm__ __volatile__(
> 		LOCK "incl %0"
> 		:"=m" (v->counter)
> 		:"m" (v->counter));
> }
> 
> With LOCK defined as:
> 
> #ifdef CONFIG_SMP
> #define LOCK "lock ; "
> #else
> #define LOCK ""
> #endif
> 
> So is there a difference on UP between x.counter++ and atomic_inc(&x)?

On x86, you are right.  The full list of architectures that are atomic
only in SMP is i386 (as you noted), parisc, sparc, and x86_64.

The architectures that appear to always be atomic are: alpha, ia64, m32r
(but unfamiliar with this one), mips, ppc, ppc64, s390 (I think...),
and sparc64.  Most of these architectures need to disable interrupts
to provide "universal" atomic_inc() semantics in UP kernels, due to
their RISC-style atomic instructions.  
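
For example, from memory (so treat it as a sketch rather than the
actual source), the ppc atomic_inc() is an LL/SC retry loop along
these lines:

	static __inline__ void atomic_inc(atomic_t *v)
	{
		int t;

		__asm__ __volatile__(
	"1:	lwarx	%0,0,%2\n"	/* load-reserve v->counter */
	"	addic	%0,%0,1\n"	/* increment */
	"	stwcx.	%0,0,%2\n"	/* store-conditional */
	"	bne-	1b"		/* lost the reservation, retry */
		: "=&r" (t), "=m" (v->counter)
		: "r" (&v->counter), "m" (v->counter)
		: "cc");
	}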

The following architectures avoid the issue entirely by refusing to
support SMP: arm, arm26, cris, frv, h8300, m68k, m68knommu, sh, sh64,
and v850.  Many of them disable interrupts in their atomic_inc()
implementations.

The advantage of smp_atomic_inc() is that architectures could dispense
with interrupt disabling.  The disadvantage is that it is yet another
contribution to Linux's combinatorial explosion of primitives.

For the moment, I will grit my teeth and keep atomic_inc() and atomic_dec().

Other thoughts?

> > 	Since interrupts must be disabled for these to be safe,
> > 	my guess is that I should define them locally in rcupdate.c.
> > 	If there turns out to be a general need for them, they can
> > 	be moved somewhere more public.
> > 
> > 	Objections?
> > 
> > o	In order to get things to work in both CONFIG_PREEMPT and
> > 	CONFIG_PREEMPT_RT, I ended up using the following:
> > 
> > 		#ifdef CONFIG_PREEMPT_RT
> > 
> > 		#define rcu_spinlock_t _raw_spinlock_t
> > 		#define rcu_spin_lock(l, f) _raw_spin_lock(l)
> > 		#define rcu_spin_trylock(l, f) _raw_spin_trylock(l)
> > 		#define rcu_spin_unlock(l, f) _raw_spin_unlock(l)
> > 		#define RCU_SPIN_LOCK_UNLOCKED RAW_SPIN_LOCK_UNLOCKED
> > 
> > 		#else /* #ifdef CONFIG_PREEMPT_RT */
> > 
> > 		#define rcu_spinlock_t spinlock_t
> > 		#define rcu_spin_lock(l, f) spin_lock_irqsave(l, f)
> > 		#define rcu_spin_trylock(l, f) spin_trylock_irqsave(l, f)
> > 		#define rcu_spin_unlock(l, f) spin_unlock_irqrestore(l, f)
> > 		#define RCU_SPIN_LOCK_UNLOCKED SPIN_LOCK_UNLOCKED
> > 
> > 		#endif /* #else #ifdef CONFIG_PREEMPT_RT */
> > 
> > 	Then using rcu_spin_lock() &c everywhere.  The problem is
> > 	that (as near as I can tell) the only way to prevent interrupts 
> > 	from running on the current CPU in CONFIG_PREEMPT kernels is
> > 	to use the _irq spinlock primitives, but _raw_spin_lock() does
> > 	the job in CONFIG_PREEMPT_RT (since interrupts are run in process
> > 	context, right).  I could use _irq in both, but that would
> > 	unnecessarily degrade interrupt latency in CONFIG_PREEMPT_RT.
> 
> Yep interrupts are threads in CONFIG_PREEMPT_RT.  I guess you could also
> just use local_irq_save with spin_lock, since now local_irq_save no
> longer disables interrupts in PREEMPT_RT.

By this you mean the following?

	local_irq_save(flags);
	_raw_spin_lock(&mylock);

	/* critical section */

	_raw_spin_unlock(&mylock);
	local_irq_restore(flags);

							Thanx, Paul


* Re: PREEMPT/PREEMPT_RT question
  2005-07-12 19:28   ` Paul E. McKenney
@ 2005-07-12 20:04     ` Steven Rostedt
  2005-07-12 21:34       ` Paul E. McKenney
  0 siblings, 1 reply; 11+ messages in thread
From: Steven Rostedt @ 2005-07-12 20:04 UTC (permalink / raw)
  To: paulmck; +Cc: shemminger, dipankar, mingo, linux-kernel

On Tue, 2005-07-12 at 12:28 -0700, Paul E. McKenney wrote:
> > 
> > So is there a difference on UP between x.counter++ and atomic_inc(&x)?
> 
> On x86, you are right.  The full list of architectures that are atomic
> only in SMP is i386 (as you noted), parisc, sparc, and x86_64.
> 
> The architectures that appear to always be atomic are: alpha, ia64, m32r
> (but unfamiliar with this one), mips, ppc, ppc64, s390 (I think...),
> and sparc64.  Most of these architectures need to disable interrupts
> to provide "universal" atomic_inc() semantics in UP kernels, due to
> their RISC-style atomic instructions.  
> 
> The following architectures avoid the issue entirely by refusing to
> support SMP: arm, arm26, cris, frv, h8300, m68k, m68knommu, sh, sh64,
> and v850.  Many of them disable interrupts in their atomic_inc()
> implementations.
> 
> The advantage of smp_atomic_inc() is that architectures could dispense
> with interrupt disabling.  The disadvantage is that it is yet another
> contribution to Linux's combinatorial explosion of primitives.
> 
> For the moment, I will grit my teeth and keep atomic_inc() and atomic_dec().
> 
> Other thoughts?

How did I know that you would mention mips and the like :-)

Yeah, mips has the crazy Load Linked and Store Conditional crap, so it
is a little more complex than the simple add one.  And I think PPC does
too, although it has been a while since I've used them.  And older mips
don't have the LL SC command so the only option is to turn off
interrupts (of course those that don't have the LL and SC are not SMP
compatible).  So, I will admit that having a smp_atomic_inc might be
nice. I was just being a narrow minded x86 hacker ;-)



> > Yep interrupts are threads in CONFIG_PREEMPT_RT.  I guess you could also
> > just use local_irq_save with spin_lock, since now local_irq_save no
> > longer disables interrupts in PREEMPT_RT.
> 
> By this you mean the following?
> 
> 	local_irq_save(flags);
> 	_raw_spin_lock(&mylock);
> 
> 	/* critical section */
> 
> 	_raw_spin_unlock(&mylock);
> 	local_irq_restore(flags);

Yeah, that on PREEMPT_RT would not turn off interrupts but just stops
preemption. This is fine as long as the mylock is not used in any
SA_NODELAY interrupt.

-- Steve




* Re: PREEMPT/PREEMPT_RT question
  2005-07-12 19:26 ` Ingo Molnar
@ 2005-07-12 21:04   ` Paul E. McKenney
  0 siblings, 0 replies; 11+ messages in thread
From: Paul E. McKenney @ 2005-07-12 21:04 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: linux-kernel, dipankar, shemminger

On Tue, Jul 12, 2005 at 09:26:35PM +0200, Ingo Molnar wrote:
> * Paul E. McKenney <paulmck@us.ibm.com> wrote:
> > So, time to get serious about a bit of code cleanup:
> > 
> > o	The heavyweight atomic operations in rcu_read_lock() and
> > 	rcu_read_unlock() are not needed in UP kernels, since
> > 	interrupts are disabled.
> 
> atomic_*() ops should already be lightweight on UP.

Agreed, will worry about architectures where they might not be later.

> > o	In order to get things to work in both CONFIG_PREEMPT and
> > 	CONFIG_PREEMPT_RT, I ended up using the following:
> > 
> > 		#ifdef CONFIG_PREEMPT_RT
> > 
> > 		#define rcu_spinlock_t _raw_spinlock_t
> > 		#define rcu_spin_lock(l, f) _raw_spin_lock(l)
> > 		#define rcu_spin_trylock(l, f) _raw_spin_trylock(l)
> > 		#define rcu_spin_unlock(l, f) _raw_spin_unlock(l)
> > 		#define RCU_SPIN_LOCK_UNLOCKED RAW_SPIN_LOCK_UNLOCKED
> > 
> > 		#else /* #ifdef CONFIG_PREEMPT_RT */
> > 
> > 		#define rcu_spinlock_t spinlock_t
> > 		#define rcu_spin_lock(l, f) spin_lock_irqsave(l, f)
> > 		#define rcu_spin_trylock(l, f) spin_trylock_irqsave(l, f)
> > 		#define rcu_spin_unlock(l, f) spin_unlock_irqrestore(l, f)
> > 		#define RCU_SPIN_LOCK_UNLOCKED SPIN_LOCK_UNLOCKED
> > 
> > 		#endif /* #else #ifdef CONFIG_PREEMPT_RT */
> > 
> > 	Then using rcu_spin_lock() &c everywhere.  The problem is
> > 	that (as near as I can tell) the only way to prevent interrupts 
> > 	from running on the current CPU in CONFIG_PREEMPT kernels is
> > 	to use the _irq spinlock primitives, but _raw_spin_lock() does
> > 	the job in CONFIG_PREEMPT_RT (since interrupts are run in process
> > 	context, right).  I could use _irq in both, but that would
> > 	unnecessarily degrade interrupt latency in CONFIG_PREEMPT_RT.
> > 
> > 	Suggestions???
> 
> in PREEMPT_RT, if you define a spinlock type as raw_spinlock_t (via 
> DEFINE_RAW_SPINLOCK(lock)) then the spin_lock*() APIs automatically 
> switch over to that type. I.e. no need to do the rcu_ stuff AFAICT.  
> (unless i missed some detail about what you are trying to do.) Raw 
> spinlocks under PREEMPT_RT pair with the raw IRQ flag, i.e.  
> spin_lock_irq() will disable hard interrupts. Generally i'd suggest to 
> go for the raw spinlock and raw-irq-flag-disabling variant, because RCU 
> locking is core enough functionality to warrant atomic treatment. If 
> there's any latency path in it, we can work on breaking it up later.

Fair enough!

The thing I am (perhaps foolishly) trying to do is get to a single
source base that provides PREEMPT_RCU under both CONFIG_PREEMPT and
CONFIG_PREEMPT_RT.  Since you are OK with RCU disabling interrupts,
I am happy.

> it's perfectly fine to disable raw interrupts with the RCU code, as long 
> as you do O(1) amount of work. (where the constant factor isnt 2^(2^32) 
> ;-)

Will try to keep it down to a dull roar.  ;-)

> > Some remaining shortcomings of the current code:
> > 
> > o	Untested on CONFIG_PREEMPT_RT (working on this, suggestions
> > 	for 4-CPU-stable CONFIG_PREEMPT_RT versions most welcome).
> 
> latest (-51-28) has no known regressions, and i regularly boot on 4-way 
> (and irregularly on 8-way) boxes. So if you see something strange on the 
> latest -RT kernel, please report it.

And stock -51-7 just passed nine rounds of kernbench + LTP on some 4-CPU
x86 machines, with one still running.  Good stuff!!!

							Thanx, Paul


* Re: PREEMPT/PREEMPT_RT question
  2005-07-12 20:04     ` Steven Rostedt
@ 2005-07-12 21:34       ` Paul E. McKenney
  2005-07-12 23:47         ` Steven Rostedt
  0 siblings, 1 reply; 11+ messages in thread
From: Paul E. McKenney @ 2005-07-12 21:34 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: shemminger, dipankar, mingo, linux-kernel

On Tue, Jul 12, 2005 at 04:04:17PM -0400, Steven Rostedt wrote:
> On Tue, 2005-07-12 at 12:28 -0700, Paul E. McKenney wrote:
> > > 
> > > So is there a difference on UP between x.counter++ and atomic_inc(&x)?
> > 
> > On x86, you are right.  The full list of architectures that are atomic
> > only in SMP is i386 (as you noted), parisc, sparc, and x86_64.
> > 
> > The architectures that appear to always be atomic are: alpha, ia64, m32r
> > (but unfamiliar with this one), mips, ppc, ppc64, s390 (I think...),
> > and sparc64.  Most of these architectures need to disable interrupts
> > to provide "universal" atomic_inc() semantics in UP kernels, due to
> > their RISC-style atomic instructions.  
> > 
> > The following architectures avoid the issue entirely by refusing to
> > support SMP: arm, arm26, cris, frv, h8300, m68k, m68knommu, sh, sh64,
> > and v850.  Many of them disable interrupts in their atomic_inc()
> > implementations.
> > 
> > The advantage of smp_atomic_inc() is that architectures could dispense
> > with interrupt disabling.  The disadvantage is that it is yet another
> > contribution to Linux's combinatorial explosion of primitives.
> > 
> > For the moment, I will grit my teeth and keep atomic_inc() and atomic_dec().
> > 
> > Other thoughts?
> 
> How did I know that you would mention mips and the like :-)
> 
> Yeah, mips has the crazy Load Linked and Store Conditional crap, so it
> is a little more complex than the simple add one.  And I think PPC does
> too, although it has been a while since I've used them.  And older mips
> don't have the LL SC command so the only option is to turn off
> interrupts (of course those that don't have the LL and SC are not SMP
> compatible).  So, I will admit that having a smp_atomic_inc might be
> nice. I was just being a narrow minded x86 hacker ;-)

I am sure that LL/SC seemed like a good idea at the time.  ;-)

To be fair, LL/SC does allow some things to be done more easily
than with cmpxchg, since it allows you to tell that the value changed
even if it later changed back.  Helps with some linked-list operations.
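
The classic example is the ABA problem in a lock-free stack pop.
An illustrative sketch (made-up names; kernel-style cmpxchg() that
returns the old value):

	struct node {
		struct node *next;
	};
	struct node *top;

	struct node *pop(void)
	{
		struct node *old, *new;

		do {
			old = top;
			if (old == NULL)
				return NULL;
			new = old->next; /* may be stale by cmpxchg time */
		} while (cmpxchg(&top, old, new) != old);

		/*
		 * If another CPU popped "old", pushed other nodes, and
		 * then pushed "old" back, the cmpxchg above still
		 * succeeds -- with "new" pointing into the wrong list.
		 * An LL/SC reservation on &top would have failed, since
		 * the intervening stores break the reservation even
		 * though the final value matches.
		 */
		return old;
	}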

> > > Yep interrupts are threads in CONFIG_PREEMPT_RT.  I guess you could also
> > > just use local_irq_save with spin_lock, since now local_irq_save no
> > > longer disables interrupts in PREEMPT_RT.
> > 
> > By this you mean the following?
> > 
> > 	local_irq_save(flags);
> > 	_raw_spin_lock(&mylock);
> > 
> > 	/* critical section */
> > 
> > 	_raw_spin_unlock(&mylock);
> > 	local_irq_restore(flags);
> 
> Yeah, that on PREEMPT_RT would not turn off interrupts but just stops
> preemption. This is fine as long as the mylock is not used in any
> SA_NODELAY interrupt.

Cool!  Is the scheduling-clock interrupt an SA_NODELAY interrupt?

							Thanx, Paul


* Re: PREEMPT/PREEMPT_RT question
  2005-07-12 21:34       ` Paul E. McKenney
@ 2005-07-12 23:47         ` Steven Rostedt
  2005-07-13  1:46           ` Paul E. McKenney
  0 siblings, 1 reply; 11+ messages in thread
From: Steven Rostedt @ 2005-07-12 23:47 UTC (permalink / raw)
  To: paulmck; +Cc: shemminger, dipankar, mingo, linux-kernel

On Tue, 2005-07-12 at 14:34 -0700, Paul E. McKenney wrote:

> > 
> > Yeah, mips has the crazy Load Linked and Store Conditional crap, so it
> > is a little more complex than the simple add one.  And I think PPC does
> > too, although it has been a while since I've used them.  And older mips
> > don't have the LL SC command so the only option is to turn off
> > interrupts (of course those that don't have the LL and SC are not SMP
> > compatible).  So, I will admit that having a smp_atomic_inc might be
> > nice. I was just being a narrow minded x86 hacker ;-)
> 
> I am sure that LL/SC seemed like a good idea at the time.  ;-)
> 
> To be fair, LL/SC does allow allow some things to be done more easily
> than with cmpxchg, since it allows you to tell that the value changed
> even if it later changed back.  Helps with some linked-list operations.

I was being a little harsh in my statements.  I didn't really mind the
LL and SC but a true atomic inc would have been nice.  I actually had to
port Linux to a MIPS board once that didn't have the LL or SC, and that
was even more painful.

> > > > Yep interrupts are threads in CONFIG_PREEMPT_RT.  I guess you could also
> > > > just use local_irq_save with spin_lock, since now local_irq_save no
> > > > longer disables interrupts in PREEMPT_RT.
> > > 
> > > By this you mean the following?
> > > 
> > > 	local_irq_save(flags);
> > > 	_raw_spin_lock(&mylock);
> > > 
> > > 	/* critical section */
> > > 
> > > 	_raw_spin_unlock(&mylock);
> > > 	local_irq_restore(flags);
> > 
> > Yeah, that on PREEMPT_RT would not turn off interrupts but just stops
> > preemption. This is fine as long as the mylock is not used in any
> > SA_NODELAY interrupt.
> 
> Cool!  Is the scheduling-clock interrupt an SA_NODELAY interrupt?

If you are talking about scheduler_tick, then yes, it is called by the
timer interrupt which is a SA_NODELAY interrupt.  If you don't want to
get interrupted by the timer interrupt, then you will need to disable
interrupts for both. Currently, the timer interrupt is the only
true hard interrupt in PREEMPT_RT, and that may not change.

-- Steve




* Re: PREEMPT/PREEMPT_RT question
  2005-07-12 23:47         ` Steven Rostedt
@ 2005-07-13  1:46           ` Paul E. McKenney
  2005-07-13  3:54             ` Steven Rostedt
  0 siblings, 1 reply; 11+ messages in thread
From: Paul E. McKenney @ 2005-07-13  1:46 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: shemminger, dipankar, mingo, linux-kernel

On Tue, Jul 12, 2005 at 07:47:15PM -0400, Steven Rostedt wrote:
> On Tue, 2005-07-12 at 14:34 -0700, Paul E. McKenney wrote:
> > > Yeah, mips has the crazy Load Linked and Store Conditional crap, so it
> > > is a little more complex than the simple add one.  And I think PPC does
> > > too, although it has been a while since I've used them.  And older mips
> > > don't have the LL SC command so the only option is to turn off
> > > interrupts (of course those that don't have the LL and SC are not SMP
> > > compatible).  So, I will admit that having a smp_atomic_inc might be
> > > nice. I was just being a narrow minded x86 hacker ;-)
> > 
> > I am sure that LL/SC seemed like a good idea at the time.  ;-)
> > 
> > To be fair, LL/SC does allow some things to be done more easily
> > than with cmpxchg, since it allows you to tell that the value changed
> > even if it later changed back.  Helps with some linked-list operations.
> 
> I was being a little harsh in my statements.  I didn't really mind the
> LL and SC but a true atomic inc would have been nice.  I actually had to
> port Linux to a MIPS board once that didn't have the LL or SC, and that
> was even more painful.

I could believe that!

> > > > > Yep interrupts are threads in CONFIG_PREEMPT_RT.  I guess you could also
> > > > > just use local_irq_save with spin_lock, since now local_irq_save no
> > > > > longer disables interrupts in PREEMPT_RT.
> > > > 
> > > > By this you mean the following?
> > > > 
> > > > 	local_irq_save(flags);
> > > > 	_raw_spin_lock(&mylock);
> > > > 
> > > > 	/* critical section */
> > > > 
> > > > 	_raw_spin_unlock(&mylock);
> > > > 	local_irq_restore(flags);
> > > 
> > > Yeah, that on PREEMPT_RT would not turn off interrupts but just stops
> > > preemption. This is fine as long as the mylock is not used in any
> > > SA_NODELAY interrupt.
> > 
> > Cool!  Is the scheduling-clock interrupt an SA_NODELAY interrupt?
> 
> If you are talking about scheduler_tick, then yes, it is called by the
> timer interrupt which is a SA_NODELAY interrupt.  If you don't want to
> get interrupted by the timer interrupt, then you will need to disable
> interrupts for both. Currently, the timer interrupt is the only
> true hard interrupt in PREEMPT_RT, and that may not change.

OK, so if I take a spinlock in something invoked from scheduler_tick(),
then any other acquisitions of that spinlock must disable hardware
interrupts, right?

							Thanx, Paul


* Re: PREEMPT/PREEMPT_RT question
  2005-07-13  1:46           ` Paul E. McKenney
@ 2005-07-13  3:54             ` Steven Rostedt
  2005-07-13 15:41               ` Paul E. McKenney
  0 siblings, 1 reply; 11+ messages in thread
From: Steven Rostedt @ 2005-07-13  3:54 UTC (permalink / raw)
  To: paulmck; +Cc: shemminger, dipankar, mingo, linux-kernel

On Tue, 2005-07-12 at 18:46 -0700, Paul E. McKenney wrote:

> > If you are talking about scheduler_tick, then yes, it is called by the
> > timer interrupt which is a SA_NODELAY interrupt.  If you don't want to
> > get interrupted by the timer interrupt, then you will need to disable
> > interrupts for both. Currently, the timer interrupt is the only
> > true hard interrupt in PREEMPT_RT, and that may not change.
> 
> OK, so if I take a spinlock in something invoked from scheduler_tick(),
> then any other acquisitions of that spinlock must disable hardware
> interrupts, right?

Yes, otherwise you could have a local CPU deadlock on an SMP machine.
I would say the same is true for any lock that is grabbed by the
timer interrupt or one of the functions it calls.
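
To illustrate (hypothetical lock and function names):

	static DEFINE_RAW_SPINLOCK(mylock);

	void process_level(void)
	{
		_raw_spin_lock(&mylock);	/* preemption off, irqs still on */
		/* <-- SA_NODELAY timer interrupt fires here, on this CPU */
		_raw_spin_unlock(&mylock);
	}

	/* called (indirectly) from the timer interrupt: */
	void timer_path(void)
	{
		_raw_spin_lock(&mylock);	/* spins forever: the holder
						 * cannot run again until this
						 * interrupt returns */
		_raw_spin_unlock(&mylock);
	}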

-- Steve




* Re: PREEMPT/PREEMPT_RT question
  2005-07-13  3:54             ` Steven Rostedt
@ 2005-07-13 15:41               ` Paul E. McKenney
  0 siblings, 0 replies; 11+ messages in thread
From: Paul E. McKenney @ 2005-07-13 15:41 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: shemminger, dipankar, mingo, linux-kernel

On Tue, Jul 12, 2005 at 11:54:50PM -0400, Steven Rostedt wrote:
> On Tue, 2005-07-12 at 18:46 -0700, Paul E. McKenney wrote:
> 
> > > If you are talking about scheduler_tick, then yes, it is called by the
> > > timer interrupt which is a SA_NODELAY interrupt.  If you don't want to
> > > get interrupted by the timer interrupt, then you will need to disable
> > > interrupts for both. Since currently, the timer interrupt is the only
> > > true hard interrupt in the PREEMPT_RT and that may not change.
> > 
> > OK, so if I take a spinlock in something invoked from scheduler_tick(),
> > then any other acquisitions of that spinlock must disable hardware
> > interrupts, right?
> 
> Yes, otherwise you could have a local CPU deadlock on a SMP machine. And
> I would also say the same is true for any lock that is grabbed by the
> timer interrupt or one of the functions it calls.

OK, I do indeed have critical sections spanning rcu_check_callbacks()
and process-level code.  Since rcu_check_callbacks() is called from
account_user_vtime() and update_process_times(), both of which call
scheduler_tick(), looks like I need to disable hardware interrupts.

Looks to me like I need to use _raw_spin_lock() and
_raw_spin_unlock() from within rcu_check_callbacks(), but to instead
use _raw_spin_lock_irqsave() and _raw_spin_unlock_irqrestore() outside of
rcu_check_callbacks() for locks acquired within rcu_check_callbacks().
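
In other words, roughly the following (untested sketch):

	/* Timer-interrupt path -- hard interrupts already disabled: */
	void rcu_check_callbacks(int cpu, int user)
	{
		_raw_spin_lock(&rcu_data.lock);
		/* ... advance callbacks ... */
		_raw_spin_unlock(&rcu_data.lock);
	}

	/* Process level must exclude the timer interrupt: */
	void fastcall call_rcu(struct rcu_head *head,
			       void (*func)(struct rcu_head *rcu))
	{
		unsigned long flags;

		_raw_spin_lock_irqsave(&rcu_data.lock, flags);
		/* ... enqueue the callback ... */
		_raw_spin_unlock_irqrestore(&rcu_data.lock, flags);
	}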

Of course, for fliplock, which is only acquired conditionally from
a rcu_check_callbacks() callee, I can just use spin_trylock() and
spin_unlock().  Though _raw_spin_lock()/_raw_spin_unlock() might be
faster?

No -wonder- I have been seeing hangs in CONFIG_PREEMPT_RT!!!  ;-)

						Thanx, Paul
