* [PATCH tip/core/rcu 0/4] rcu: preemptible expedited grace periods and cleanups
@ 2009-12-02 20:09 Paul E. McKenney
2009-12-02 20:10 ` [PATCH tip/core/rcu 1/4] rcu: rename "quiet" functions Paul E. McKenney
` (4 more replies)
0 siblings, 5 replies; 15+ messages in thread
From: Paul E. McKenney @ 2009-12-02 20:09 UTC (permalink / raw)
To: linux-kernel
Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, dvhltc,
niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells
This patchset includes some cleanups, improved diagnostics, and an
implementation of expedited preemptible RCU grace periods that is
actually expedited:

1.  Rename the "quiet" functions. The name rcu_quiet_cpu() was
    clear enough, but now that we have four flavors of quietness
    with more on the way, we need more meaningful names.

2.  Enable a fourth level of the rcu_node hierarchy. No, we really
    don't need the ability to run on million-CPU SMP systems at the
    moment, but the additional level allows more vigorous
    stress-testing on 16-CPU systems.

3.  Add an implementation of synchronize_rcu_expedited() that
    actually expedites preemptible-RCU grace periods.

4.  Make RCU_CPU_STALL_DETECTOR be on by default. If this works
    well, the #ifdefs will eventually be removed to reduce testing
    load.

This patchset is intended for 2.6.34, but has passed sufficient testing
that it could safely be included in 2.6.33, if desired.

b/kernel/rcutorture.c | 34 +++++--
b/kernel/rcutree.c | 66 ++++++++-------
b/kernel/rcutree.h | 3
b/kernel/rcutree_plugin.h | 13 +--
b/kernel/rcutree_trace.c | 11 +-
b/lib/Kconfig.debug | 3
kernel/rcutree.c | 16 ++-
kernel/rcutree.h | 51 ++++++++++-
kernel/rcutree_plugin.h | 198 +++++++++++++++++++++++++++++++++++++++++++---
9 files changed, 324 insertions(+), 71 deletions(-)
* [PATCH tip/core/rcu 1/4] rcu: rename "quiet" functions
2009-12-02 20:09 [PATCH tip/core/rcu 0/4] rcu: preemptible expedited grace periods and cleanups Paul E. McKenney
@ 2009-12-02 20:10 ` Paul E. McKenney
2009-12-02 23:20 ` Josh Triplett
2009-12-03 13:22 ` [tip:core/rcu] rcu: Rename " tip-bot for Paul E. McKenney
2009-12-02 20:10 ` [PATCH tip/core/rcu 2/4] rcu: enable fourth level of TREE_RCU hierarchy Paul E. McKenney
` (3 subsequent siblings)
4 siblings, 2 replies; 15+ messages in thread
From: Paul E. McKenney @ 2009-12-02 20:10 UTC (permalink / raw)
To: linux-kernel
Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, dvhltc,
niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
Paul E. McKenney
The number of "quiet" functions has grown recently, and the names are
no longer very descriptive. The point of all of these functions is to
do some portion of the task of reporting a quiescent state, so rename
them accordingly:

o   cpu_quiet() becomes rcu_report_qs_rdp(), which reports a
    quiescent state to the per-CPU rcu_data structure. If this
    turns out to be a new quiescent state for this grace period,
    then rcu_report_qs_rnp() will be invoked to propagate the
    quiescent state up the rcu_node hierarchy.

o   cpu_quiet_msk() becomes rcu_report_qs_rnp(), which reports
    a quiescent state for a given CPU (or possibly a set of CPUs)
    up the rcu_node hierarchy.

o   cpu_quiet_msk_finish() becomes rcu_report_qs_rsp(), which
    reports a full set of quiescent states to the global rcu_state
    structure.

o   task_quiet() becomes rcu_report_unblock_qs_rnp(), which reports
    a quiescent state due to a task exiting an RCU read-side critical
    section that had previously blocked in that same critical section.
    As indicated by the new name, this type of quiescent state is
    reported up the rcu_node hierarchy (using rcu_report_qs_rnp()
    to do so).

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
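For anyone trying to keep the three reporting layers straight, here is a
minimal user-space sketch of the chain described above: a per-CPU report
feeds a per-node report, which climbs the tree and finally reports to the
global state. This is a toy model, not the kernel code. All of the toy_*
names are hypothetical, and locking, the lastcomp check, and blocked
tasks are left out on purpose.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct rcu_node: one bit per child still owing a QS. */
struct toy_node {
	unsigned long qsmask;
	unsigned long grpmask;		/* this node's bit in parent->qsmask */
	struct toy_node *parent;	/* NULL at the root */
};

/* Toy stand-in for struct rcu_state. */
struct toy_state {
	int gp_in_progress;
	long completed;
};

/* Toy stand-in for struct rcu_data. */
struct toy_cpu {
	int qs_pending;			/* does this CPU still owe a QS? */
	unsigned long grpmask;		/* this CPU's bit in leaf->qsmask */
	struct toy_node *leaf;
};

/* Like rcu_report_qs_rsp(): the full set of quiescent states is in. */
static void toy_report_qs_state(struct toy_state *sp)
{
	assert(sp->gp_in_progress);
	sp->gp_in_progress = 0;
	sp->completed++;
}

/* Like rcu_report_qs_rnp(): clear bits, climbing while nodes go idle. */
static void toy_report_qs_node(unsigned long mask, struct toy_state *sp,
			       struct toy_node *np)
{
	for (;;) {
		if (!(np->qsmask & mask))
			return;		/* already reported */
		np->qsmask &= ~mask;
		if (np->qsmask != 0)
			return;		/* siblings still pending */
		if (np->parent == NULL)
			break;		/* root is now idle */
		mask = np->grpmask;
		np = np->parent;
	}
	toy_report_qs_state(sp);
}

/* Like rcu_report_qs_rdp(): record one CPU's QS, then tell its leaf. */
static void toy_report_qs_cpu(struct toy_state *sp, struct toy_cpu *cp)
{
	if (!cp->qs_pending)
		return;
	cp->qs_pending = 0;
	toy_report_qs_node(cp->grpmask, sp, cp->leaf);
}

/* Two leaves of two CPUs each; returns how many reports ended the GP. */
static int toy_demo(void)
{
	struct toy_state st = { 1, 0 };
	struct toy_node root = { 0x3, 0, NULL };
	struct toy_node leaf[2] = { { 0x3, 0x1, &root }, { 0x3, 0x2, &root } };
	struct toy_cpu cpu[4];
	int i;

	for (i = 0; i < 4; i++) {
		cpu[i].qs_pending = 1;
		cpu[i].grpmask = 1UL << (i % 2);
		cpu[i].leaf = &leaf[i / 2];
	}
	for (i = 0; i < 4; i++) {
		toy_report_qs_cpu(&st, &cpu[i]);
		toy_report_qs_cpu(&st, &cpu[i]);	/* duplicate: ignored */
		if (!st.gp_in_progress)
			return i + 1;
	}
	return -1;
}
```

In this model the grace period ends only when the last of the four CPUs
reports, and a repeated report from the same CPU is a no-op, which is the
behavior the new names are meant to make obvious.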
kernel/rcutree.c | 66 ++++++++++++++++++++++++++--------------------
kernel/rcutree.h | 3 +-
kernel/rcutree_plugin.h | 12 ++++----
3 files changed, 45 insertions(+), 36 deletions(-)
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 4ca7e02..a9f5103 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -740,11 +740,13 @@ rcu_start_gp(struct rcu_state *rsp, unsigned long flags)
}
/*
- * Clean up after the prior grace period and let rcu_start_gp() start up
- * the next grace period if one is needed. Note that the caller must
- * hold rnp->lock, as required by rcu_start_gp(), which will release it.
+ * Report a full set of quiescent states to the specified rcu_state
+ * data structure. This involves cleaning up after the prior grace
+ * period and letting rcu_start_gp() start up the next grace period
+ * if one is needed. Note that the caller must hold rnp->lock, as
+ * required by rcu_start_gp(), which will release it.
*/
-static void cpu_quiet_msk_finish(struct rcu_state *rsp, unsigned long flags)
+static void rcu_report_qs_rsp(struct rcu_state *rsp, unsigned long flags)
__releases(rcu_get_root(rsp)->lock)
{
WARN_ON_ONCE(!rcu_gp_in_progress(rsp));
@@ -754,15 +756,16 @@ static void cpu_quiet_msk_finish(struct rcu_state *rsp, unsigned long flags)
}
/*
- * Similar to cpu_quiet(), for which it is a helper function. Allows
- * a group of CPUs to be quieted at one go, though all the CPUs in the
- * group must be represented by the same leaf rcu_node structure.
- * That structure's lock must be held upon entry, and it is released
- * before return.
+ * Similar to rcu_report_qs_rdp(), for which it is a helper function.
+ * Allows quiescent states for a group of CPUs to be reported at one go
+ * to the specified rcu_node structure, though all the CPUs in the group
+ * must be represented by the same rcu_node structure (which need not be
+ * a leaf rcu_node structure, though it often will be). That structure's
+ * lock must be held upon entry, and it is released before return.
*/
static void
-cpu_quiet_msk(unsigned long mask, struct rcu_state *rsp, struct rcu_node *rnp,
- unsigned long flags)
+rcu_report_qs_rnp(unsigned long mask, struct rcu_state *rsp,
+ struct rcu_node *rnp, unsigned long flags)
__releases(rnp->lock)
{
struct rcu_node *rnp_c;
@@ -798,21 +801,23 @@ cpu_quiet_msk(unsigned long mask, struct rcu_state *rsp, struct rcu_node *rnp,
/*
* Get here if we are the last CPU to pass through a quiescent
- * state for this grace period. Invoke cpu_quiet_msk_finish()
+ * state for this grace period. Invoke rcu_report_qs_rsp()
* to clean up and start the next grace period if one is needed.
*/
- cpu_quiet_msk_finish(rsp, flags); /* releases rnp->lock. */
+ rcu_report_qs_rsp(rsp, flags); /* releases rnp->lock. */
}
/*
- * Record a quiescent state for the specified CPU, which must either be
- * the current CPU. The lastcomp argument is used to make sure we are
- * still in the grace period of interest. We don't want to end the current
- * grace period based on quiescent states detected in an earlier grace
- * period!
+ * Record a quiescent state for the specified CPU to that CPU's rcu_data
+ * structure. This must be either called from the specified CPU, or
+ * called when the specified CPU is known to be offline (and when it is
+ * also known that no other CPU is concurrently trying to help the offline
+ * CPU). The lastcomp argument is used to make sure we are still in the
+ * grace period of interest. We don't want to end the current grace period
+ * based on quiescent states detected in an earlier grace period!
*/
static void
-cpu_quiet(int cpu, struct rcu_state *rsp, struct rcu_data *rdp, long lastcomp)
+rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp, long lastcomp)
{
unsigned long flags;
unsigned long mask;
@@ -827,8 +832,8 @@ cpu_quiet(int cpu, struct rcu_state *rsp, struct rcu_data *rdp, long lastcomp)
* The race with GP start is resolved by the fact that we
* hold the leaf rcu_node lock, so that the per-CPU bits
* cannot yet be initialized -- so we would simply find our
- * CPU's bit already cleared in cpu_quiet_msk() if this race
- * occurred.
+ * CPU's bit already cleared in rcu_report_qs_rnp() if this
+ * race occurred.
*/
rdp->passed_quiesc = 0; /* try again later! */
spin_unlock_irqrestore(&rnp->lock, flags);
@@ -846,7 +851,7 @@ cpu_quiet(int cpu, struct rcu_state *rsp, struct rcu_data *rdp, long lastcomp)
*/
rdp->nxttail[RCU_NEXT_READY_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
- cpu_quiet_msk(mask, rsp, rnp, flags); /* releases rnp->lock */
+ rcu_report_qs_rnp(mask, rsp, rnp, flags); /* rlses rnp->lock */
}
}
@@ -877,8 +882,11 @@ rcu_check_quiescent_state(struct rcu_state *rsp, struct rcu_data *rdp)
if (!rdp->passed_quiesc)
return;
- /* Tell RCU we are done (but cpu_quiet() will be the judge of that). */
- cpu_quiet(rdp->cpu, rsp, rdp, rdp->passed_quiesc_completed);
+ /*
+ * Tell RCU we are done (but rcu_report_qs_rdp() will be the
+ * judge of that).
+ */
+ rcu_report_qs_rdp(rdp->cpu, rsp, rdp, rdp->passed_quiesc_completed);
}
#ifdef CONFIG_HOTPLUG_CPU
@@ -968,13 +976,13 @@ static void __rcu_offline_cpu(int cpu, struct rcu_state *rsp)
/*
* We still hold the leaf rcu_node structure lock here, and
* irqs are still disabled. The reason for this subterfuge is
- * because invoking task_quiet() with ->onofflock held leads
- * to deadlock.
+ * because invoking rcu_report_unblock_qs_rnp() with ->onofflock
+ * held leads to deadlock.
*/
spin_unlock(&rsp->onofflock); /* irqs remain disabled. */
rnp = rdp->mynode;
if (need_quiet)
- task_quiet(rnp, flags);
+ rcu_report_unblock_qs_rnp(rnp, flags);
else
spin_unlock_irqrestore(&rnp->lock, flags);
@@ -1164,8 +1172,8 @@ static int rcu_process_dyntick(struct rcu_state *rsp, long lastcomp,
}
if (mask != 0 && rnp->completed == lastcomp) {
- /* cpu_quiet_msk() releases rnp->lock. */
- cpu_quiet_msk(mask, rsp, rnp, flags);
+ /* rcu_report_qs_rnp() releases rnp->lock. */
+ rcu_report_qs_rnp(mask, rsp, rnp, flags);
continue;
}
spin_unlock_irqrestore(&rnp->lock, flags);
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index a81188c..8bb03cb 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -306,7 +306,8 @@ long rcu_batches_completed(void);
static void rcu_preempt_note_context_switch(int cpu);
static int rcu_preempted_readers(struct rcu_node *rnp);
#ifdef CONFIG_HOTPLUG_CPU
-static void task_quiet(struct rcu_node *rnp, unsigned long flags);
+static void rcu_report_unblock_qs_rnp(struct rcu_node *rnp,
+ unsigned long flags);
#endif /* #ifdef CONFIG_HOTPLUG_CPU */
#ifdef CONFIG_RCU_CPU_STALL_DETECTOR
static void rcu_print_task_stall(struct rcu_node *rnp);
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 1d295c7..c9f0c97 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -167,7 +167,7 @@ static int rcu_preempted_readers(struct rcu_node *rnp)
* irqs disabled, and this lock is released upon return, but irqs remain
* disabled.
*/
-static void task_quiet(struct rcu_node *rnp, unsigned long flags)
+static void rcu_report_unblock_qs_rnp(struct rcu_node *rnp, unsigned long flags)
__releases(rnp->lock)
{
unsigned long mask;
@@ -185,7 +185,7 @@ static void task_quiet(struct rcu_node *rnp, unsigned long flags)
* or tasks were kicked up to root rcu_node due to
* CPUs going offline.
*/
- cpu_quiet_msk_finish(&rcu_preempt_state, flags);
+ rcu_report_qs_rsp(&rcu_preempt_state, flags);
return;
}
@@ -193,7 +193,7 @@ static void task_quiet(struct rcu_node *rnp, unsigned long flags)
mask = rnp->grpmask;
spin_unlock(&rnp->lock); /* irqs remain disabled. */
spin_lock(&rnp_p->lock); /* irqs already disabled. */
- cpu_quiet_msk(mask, &rcu_preempt_state, rnp_p, flags);
+ rcu_report_qs_rnp(mask, &rcu_preempt_state, rnp_p, flags);
}
/*
@@ -253,12 +253,12 @@ static void rcu_read_unlock_special(struct task_struct *t)
/*
* If this was the last task on the current list, and if
* we aren't waiting on any CPUs, report the quiescent state.
- * Note that task_quiet() releases rnp->lock.
+ * Note that rcu_report_unblock_qs_rnp() releases rnp->lock.
*/
if (empty)
spin_unlock_irqrestore(&rnp->lock, flags);
else
- task_quiet(rnp, flags);
+ rcu_report_unblock_qs_rnp(rnp, flags);
} else {
local_irq_restore(flags);
}
@@ -566,7 +566,7 @@ static int rcu_preempted_readers(struct rcu_node *rnp)
#ifdef CONFIG_HOTPLUG_CPU
/* Because preemptible RCU does not exist, no quieting of tasks. */
-static void task_quiet(struct rcu_node *rnp, unsigned long flags)
+static void rcu_report_unblock_qs_rnp(struct rcu_node *rnp, unsigned long flags)
{
spin_unlock_irqrestore(&rnp->lock, flags);
}
--
1.5.2.5
* [PATCH tip/core/rcu 2/4] rcu: enable fourth level of TREE_RCU hierarchy
2009-12-02 20:09 [PATCH tip/core/rcu 0/4] rcu: preemptible expedited grace periods and cleanups Paul E. McKenney
2009-12-02 20:10 ` [PATCH tip/core/rcu 1/4] rcu: rename "quiet" functions Paul E. McKenney
@ 2009-12-02 20:10 ` Paul E. McKenney
2009-12-02 23:25 ` Josh Triplett
2009-12-03 13:22 ` [tip:core/rcu] rcu: Enable " tip-bot for Paul E. McKenney
2009-12-02 20:10 ` [PATCH tip/core/rcu 3/4] rcu: add expedited grace-period support for preemptible RCU Paul E. McKenney
` (2 subsequent siblings)
4 siblings, 2 replies; 15+ messages in thread
From: Paul E. McKenney @ 2009-12-02 20:10 UTC (permalink / raw)
To: linux-kernel
Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, dvhltc,
niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
Paul E. McKenney
From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Enable a fourth level of rcu_node hierarchy for TREE_RCU and
TREE_PREEMPT_RCU. This is for stress-testing and experimental
purposes only, although in theory it would allow 16,777,216 CPUs
on 64-bit systems, but only 1,048,576 CPUs on 32-bit systems.
Experimental use of this fourth level will normally set
CONFIG_RCU_FANOUT=2, requiring a 16-CPU system, though the more
adventurous (and more fortunate) experimenters may wish to choose
CONFIG_RCU_FANOUT=3 for 81-CPU systems or even CONFIG_RCU_FANOUT=4 for
256-CPU systems.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
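The CPU counts quoted above follow from the fanout raised to the number
of levels. A small user-space sketch of that arithmetic is below, for
anyone who wants to check a given NR_CPUS and CONFIG_RCU_FANOUT
combination by hand. The function names are hypothetical and this is
only a model of what the NUM_RCU_LVL_* macros in this patch compute at
build time, not code taken from the kernel.

```c
#include <assert.h>

#define TOY_MAX_LEVELS 4	/* mirrors MAX_RCU_LVLS after this patch */

/* Ceiling division, in the spirit of the kernel's DIV_ROUND_UP(). */
static unsigned long toy_div_round_up(unsigned long n, unsigned long d)
{
	return (n + d - 1) / d;
}

/*
 * Number of rcu_node levels needed for nr_cpus CPUs at the given fanout,
 * or 0 if more than TOY_MAX_LEVELS would be needed (the #error case).
 * Hypothetical helper; intended for fanouts up to 64.
 */
static int toy_rcu_levels_needed(unsigned long nr_cpus, unsigned long fanout)
{
	unsigned long capacity = fanout;
	int levels;

	assert(nr_cpus > 0 && fanout > 1);
	for (levels = 1; levels <= TOY_MAX_LEVELS; levels++) {
		if (nr_cpus <= capacity)
			return levels;
		capacity *= fanout;
	}
	return 0;
}

/*
 * Number of entries at level lvl (0 is the root) of a tree with the
 * given number of levels. Passing lvl == levels yields nr_cpus itself,
 * matching the last NUM_RCU_LVL_* entry in each #elif arm of the patch.
 */
static unsigned long toy_rcu_nodes_at_level(unsigned long nr_cpus,
					    unsigned long fanout,
					    int levels, int lvl)
{
	unsigned long span = 1;	/* CPUs covered by one entry at lvl */
	int i;

	for (i = lvl; i < levels; i++)
		span *= fanout;
	return toy_div_round_up(nr_cpus, span);
}
```

With a fanout of 2, eight CPUs still fit in three levels, so it takes
more than eight CPUs (up to sixteen) to exercise the fourth level.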
kernel/rcutree.c | 6 +++++-
kernel/rcutree.h | 15 +++++++++++++--
2 files changed, 18 insertions(+), 3 deletions(-)
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index a9f5103..d47e03e 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -60,7 +60,8 @@ static struct lock_class_key rcu_node_class[NUM_RCU_LVLS];
NUM_RCU_LVL_0, /* root of hierarchy. */ \
NUM_RCU_LVL_1, \
NUM_RCU_LVL_2, \
- NUM_RCU_LVL_3, /* == MAX_RCU_LVLS */ \
+ NUM_RCU_LVL_3, \
+ NUM_RCU_LVL_4, /* == MAX_RCU_LVLS */ \
}, \
.signaled = RCU_GP_IDLE, \
.gpnum = -300, \
@@ -1877,6 +1878,9 @@ void __init rcu_init(void)
#ifdef CONFIG_RCU_CPU_STALL_DETECTOR
printk(KERN_INFO "RCU-based detection of stalled CPUs is enabled.\n");
#endif /* #ifdef CONFIG_RCU_CPU_STALL_DETECTOR */
+#if NUM_RCU_LVL_4 != 0
+ printk(KERN_INFO "Experimental four-level hierarchy is enabled.\n");
+#endif /* #if NUM_RCU_LVL_4 != 0 */
RCU_INIT_FLAVOR(&rcu_sched_state, rcu_sched_data);
RCU_INIT_FLAVOR(&rcu_bh_state, rcu_bh_data);
__rcu_init_preempt();
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 8bb03cb..df2e0b6 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -34,10 +34,11 @@
* In practice, this has not been tested, so there is probably some
* bug somewhere.
*/
-#define MAX_RCU_LVLS 3
+#define MAX_RCU_LVLS 4
#define RCU_FANOUT (CONFIG_RCU_FANOUT)
#define RCU_FANOUT_SQ (RCU_FANOUT * RCU_FANOUT)
#define RCU_FANOUT_CUBE (RCU_FANOUT_SQ * RCU_FANOUT)
+#define RCU_FANOUT_FOURTH (RCU_FANOUT_CUBE * RCU_FANOUT)
#if NR_CPUS <= RCU_FANOUT
# define NUM_RCU_LVLS 1
@@ -45,23 +46,33 @@
# define NUM_RCU_LVL_1 (NR_CPUS)
# define NUM_RCU_LVL_2 0
# define NUM_RCU_LVL_3 0
+# define NUM_RCU_LVL_4 0
#elif NR_CPUS <= RCU_FANOUT_SQ
# define NUM_RCU_LVLS 2
# define NUM_RCU_LVL_0 1
# define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT)
# define NUM_RCU_LVL_2 (NR_CPUS)
# define NUM_RCU_LVL_3 0
+# define NUM_RCU_LVL_4 0
#elif NR_CPUS <= RCU_FANOUT_CUBE
# define NUM_RCU_LVLS 3
# define NUM_RCU_LVL_0 1
# define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_SQ)
# define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT)
# define NUM_RCU_LVL_3 NR_CPUS
+# define NUM_RCU_LVL_4 0
+#elif NR_CPUS <= RCU_FANOUT_FOURTH
+# define NUM_RCU_LVLS 4
+# define NUM_RCU_LVL_0 1
+# define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_CUBE)
+# define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_SQ)
+# define NUM_RCU_LVL_3 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT)
+# define NUM_RCU_LVL_4 NR_CPUS
#else
# error "CONFIG_RCU_FANOUT insufficient for NR_CPUS"
#endif /* #if (NR_CPUS) <= RCU_FANOUT */
-#define RCU_SUM (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3)
+#define RCU_SUM (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3 + NUM_RCU_LVL_4)
#define NUM_RCU_NODES (RCU_SUM - NR_CPUS)
/*
--
1.5.2.5
* [PATCH tip/core/rcu 3/4] rcu: add expedited grace-period support for preemptible RCU
2009-12-02 20:09 [PATCH tip/core/rcu 0/4] rcu: preemptible expedited grace periods and cleanups Paul E. McKenney
2009-12-02 20:10 ` [PATCH tip/core/rcu 1/4] rcu: rename "quiet" functions Paul E. McKenney
2009-12-02 20:10 ` [PATCH tip/core/rcu 2/4] rcu: enable fourth level of TREE_RCU hierarchy Paul E. McKenney
@ 2009-12-02 20:10 ` Paul E. McKenney
2009-12-03 9:26 ` Lai Jiangshan
2009-12-03 13:22 ` [tip:core/rcu] rcu: Add " tip-bot for Paul E. McKenney
2009-12-02 20:10 ` [PATCH tip/core/rcu 4/4] rcu: make RCU's CPU-stall detector be default Paul E. McKenney
2009-12-03 8:53 ` [PATCH tip/core/rcu 0/4] rcu: preemptible expedited grace periods and cleanups Lai Jiangshan
4 siblings, 2 replies; 15+ messages in thread
From: Paul E. McKenney @ 2009-12-02 20:10 UTC (permalink / raw)
To: linux-kernel
Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, dvhltc,
niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
Paul E. McKenney
From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Implement a synchronize_rcu_expedited() for preemptible RCU that actually
is expedited. This uses synchronize_sched_expedited() to force all
threads currently running in a preemptible-RCU read-side critical section
onto the appropriate ->blocked_tasks[] list, then takes a snapshot of
all of these lists and waits for them to drain.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
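The snapshot step can be illustrated with a small user-space sketch,
assuming a single rcu_node and ignoring locking, the ->expmask
propagation, and the wait queue. The list helpers and toy_* names below
are hypothetical stand-ins for the kernel's list_head API and the
functions in this patch. The point is only that readers queued before
the snapshot hold up the expedited grace period, while readers queued
afterwards do not.

```c
#include <assert.h>

/* Minimal circular doubly linked list, in the style of list_head. */
struct toy_list {
	struct toy_list *next, *prev;
};

static void toy_list_init(struct toy_list *h)
{
	h->next = h;
	h->prev = h;
}

static int toy_list_empty(const struct toy_list *h)
{
	return h->next == h;
}

static void toy_list_add_tail(struct toy_list *n, struct toy_list *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void toy_list_del_init(struct toy_list *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	toy_list_init(n);
}

/* Move all of src to the front of dst and leave src empty. */
static void toy_list_splice_init(struct toy_list *src, struct toy_list *dst)
{
	if (toy_list_empty(src))
		return;
	src->next->prev = dst;
	src->prev->next = dst->next;
	dst->next->prev = src->prev;
	dst->next = src->next;
	toy_list_init(src);
}

/* Toy rcu_node: [0],[1] are the normal lists, [2],[3] the expedited ones. */
struct toy_rnp {
	struct toy_list blocked[4];
};

/* Non-zero if some reader still blocks the expedited grace period. */
static int toy_readers_exp(struct toy_rnp *rnp)
{
	return !toy_list_empty(&rnp->blocked[2]) ||
	       !toy_list_empty(&rnp->blocked[3]);
}

/* Snapshot: readers queued so far now block the expedited GP. */
static int toy_exp_snapshot(struct toy_rnp *rnp)
{
	toy_list_splice_init(&rnp->blocked[0], &rnp->blocked[2]);
	toy_list_splice_init(&rnp->blocked[1], &rnp->blocked[3]);
	return toy_readers_exp(rnp);
}

/* Returns 1 if the expedited GP ends while a later reader still blocks. */
static int toy_exp_demo(void)
{
	struct toy_rnp rnp;
	struct toy_list early, late;
	int i;

	for (i = 0; i < 4; i++)
		toy_list_init(&rnp.blocked[i]);
	toy_list_add_tail(&early, &rnp.blocked[0]);	/* before snapshot */
	if (!toy_exp_snapshot(&rnp))
		return -1;				/* must wait for early */
	toy_list_add_tail(&late, &rnp.blocked[0]);	/* after snapshot */
	toy_list_del_init(&early);			/* early reader exits */
	return !toy_readers_exp(&rnp) && !toy_list_empty(&rnp.blocked[0]);
}
```

Note that in the patch itself, rcu_preempted_readers() also checks the
expedited lists, so tasks moved by the snapshot continue to block the
normal grace period as well.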
kernel/rcutorture.c | 34 ++++++--
kernel/rcutree.c | 10 ++-
kernel/rcutree.h | 35 ++++++++-
kernel/rcutree_plugin.h | 198 ++++++++++++++++++++++++++++++++++++++++++++--
kernel/rcutree_trace.c | 10 ++-
5 files changed, 260 insertions(+), 27 deletions(-)
diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c
index 3dd0ca2..a621a67 100644
--- a/kernel/rcutorture.c
+++ b/kernel/rcutorture.c
@@ -327,6 +327,11 @@ rcu_torture_cb(struct rcu_head *p)
cur_ops->deferred_free(rp);
}
+static int rcu_no_completed(void)
+{
+ return 0;
+}
+
static void rcu_torture_deferred_free(struct rcu_torture *p)
{
call_rcu(&p->rtort_rcu, rcu_torture_cb);
@@ -388,6 +393,21 @@ static struct rcu_torture_ops rcu_sync_ops = {
.name = "rcu_sync"
};
+static struct rcu_torture_ops rcu_expedited_ops = {
+ .init = rcu_sync_torture_init,
+ .cleanup = NULL,
+ .readlock = rcu_torture_read_lock,
+ .read_delay = rcu_read_delay, /* just reuse rcu's version. */
+ .readunlock = rcu_torture_read_unlock,
+ .completed = rcu_no_completed,
+ .deferred_free = rcu_sync_torture_deferred_free,
+ .sync = synchronize_rcu_expedited,
+ .cb_barrier = NULL,
+ .stats = NULL,
+ .irq_capable = 1,
+ .name = "rcu_expedited"
+};
+
/*
* Definitions for rcu_bh torture testing.
*/
@@ -581,11 +601,6 @@ static void sched_torture_read_unlock(int idx)
preempt_enable();
}
-static int sched_torture_completed(void)
-{
- return 0;
-}
-
static void rcu_sched_torture_deferred_free(struct rcu_torture *p)
{
call_rcu_sched(&p->rtort_rcu, rcu_torture_cb);
@@ -602,7 +617,7 @@ static struct rcu_torture_ops sched_ops = {
.readlock = sched_torture_read_lock,
.read_delay = rcu_read_delay, /* just reuse rcu's version. */
.readunlock = sched_torture_read_unlock,
- .completed = sched_torture_completed,
+ .completed = rcu_no_completed,
.deferred_free = rcu_sched_torture_deferred_free,
.sync = sched_torture_synchronize,
.cb_barrier = rcu_barrier_sched,
@@ -617,7 +632,7 @@ static struct rcu_torture_ops sched_sync_ops = {
.readlock = sched_torture_read_lock,
.read_delay = rcu_read_delay, /* just reuse rcu's version. */
.readunlock = sched_torture_read_unlock,
- .completed = sched_torture_completed,
+ .completed = rcu_no_completed,
.deferred_free = rcu_sync_torture_deferred_free,
.sync = sched_torture_synchronize,
.cb_barrier = NULL,
@@ -631,7 +646,7 @@ static struct rcu_torture_ops sched_expedited_ops = {
.readlock = sched_torture_read_lock,
.read_delay = rcu_read_delay, /* just reuse rcu's version. */
.readunlock = sched_torture_read_unlock,
- .completed = sched_torture_completed,
+ .completed = rcu_no_completed,
.deferred_free = rcu_sync_torture_deferred_free,
.sync = synchronize_sched_expedited,
.cb_barrier = NULL,
@@ -1116,7 +1131,8 @@ rcu_torture_init(void)
int cpu;
int firsterr = 0;
static struct rcu_torture_ops *torture_ops[] =
- { &rcu_ops, &rcu_sync_ops, &rcu_bh_ops, &rcu_bh_sync_ops,
+ { &rcu_ops, &rcu_sync_ops, &rcu_expedited_ops,
+ &rcu_bh_ops, &rcu_bh_sync_ops,
&srcu_ops, &srcu_expedited_ops,
&sched_ops, &sched_sync_ops, &sched_expedited_ops, };
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index d47e03e..53ae959 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -948,7 +948,7 @@ static void __rcu_offline_cpu(int cpu, struct rcu_state *rsp)
{
unsigned long flags;
unsigned long mask;
- int need_quiet = 0;
+ int need_report = 0;
struct rcu_data *rdp = rsp->rda[cpu];
struct rcu_node *rnp;
@@ -967,7 +967,7 @@ static void __rcu_offline_cpu(int cpu, struct rcu_state *rsp)
break;
}
if (rnp == rdp->mynode)
- need_quiet = rcu_preempt_offline_tasks(rsp, rnp, rdp);
+ need_report = rcu_preempt_offline_tasks(rsp, rnp, rdp);
else
spin_unlock(&rnp->lock); /* irqs remain disabled. */
mask = rnp->grpmask;
@@ -982,10 +982,12 @@ static void __rcu_offline_cpu(int cpu, struct rcu_state *rsp)
*/
spin_unlock(&rsp->onofflock); /* irqs remain disabled. */
rnp = rdp->mynode;
- if (need_quiet)
+ if (need_report & RCU_OFL_TASKS_NORM_GP)
rcu_report_unblock_qs_rnp(rnp, flags);
else
spin_unlock_irqrestore(&rnp->lock, flags);
+ if (need_report & RCU_OFL_TASKS_EXP_GP)
+ rcu_report_exp_rnp(rsp, rnp);
rcu_adopt_orphan_cbs(rsp);
}
@@ -1843,6 +1845,8 @@ static void __init rcu_init_one(struct rcu_state *rsp)
rnp->level = i;
INIT_LIST_HEAD(&rnp->blocked_tasks[0]);
INIT_LIST_HEAD(&rnp->blocked_tasks[1]);
+ INIT_LIST_HEAD(&rnp->blocked_tasks[2]);
+ INIT_LIST_HEAD(&rnp->blocked_tasks[3]);
}
}
}
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index df2e0b6..d2a0046 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -104,8 +104,12 @@ struct rcu_node {
/* an rcu_data structure, otherwise, each */
/* bit corresponds to a child rcu_node */
/* structure. */
+ unsigned long expmask; /* Groups that have ->blocked_tasks[] */
+ /* elements that need to drain to allow the */
+ /* current expedited grace period to */
+ /* complete (only for TREE_PREEMPT_RCU). */
unsigned long qsmaskinit;
- /* Per-GP initialization for qsmask. */
+ /* Per-GP initial value for qsmask & expmask. */
unsigned long grpmask; /* Mask to apply to parent qsmask. */
/* Only one bit will be set in this mask. */
int grplo; /* lowest-numbered CPU or group here. */
@@ -113,7 +117,7 @@ struct rcu_node {
u8 grpnum; /* CPU/group number for next level up. */
u8 level; /* root is at level 0. */
struct rcu_node *parent;
- struct list_head blocked_tasks[2];
+ struct list_head blocked_tasks[4];
/* Tasks blocked in RCU read-side critsect. */
/* Grace period number (->gpnum) x blocked */
/* by tasks on the (x & 0x1) element of the */
@@ -128,6 +132,21 @@ struct rcu_node {
for ((rnp) = &(rsp)->node[0]; \
(rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
+/*
+ * Do a breadth-first scan of the non-leaf rcu_node structures for the
+ * specified rcu_state structure. Note that if there is a singleton
+ * rcu_node tree with but one rcu_node structure, this loop is a no-op.
+ */
+#define rcu_for_each_nonleaf_node_breadth_first(rsp, rnp) \
+ for ((rnp) = &(rsp)->node[0]; \
+ (rnp) < (rsp)->level[NUM_RCU_LVLS - 1]; (rnp)++)
+
+/*
+ * Scan the leaves of the rcu_node hierarchy for the specified rcu_state
+ * structure. Note that if there is a singleton rcu_node tree with but
+ * one rcu_node structure, this loop -will- visit the rcu_node structure.
+ * It is still a leaf node, even if it is also the root node.
+ */
#define rcu_for_each_leaf_node(rsp, rnp) \
for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \
(rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
@@ -261,7 +280,7 @@ struct rcu_state {
long gpnum; /* Current gp number. */
long completed; /* # of last completed gp. */
- /* End of fields guarded by root rcu_node's lock. */
+ /* End of fields guarded by root rcu_node's lock. */
spinlock_t onofflock; /* exclude on/offline and */
/* starting new GP. Also */
@@ -293,6 +312,13 @@ struct rcu_state {
#endif /* #ifdef CONFIG_RCU_CPU_STALL_DETECTOR */
};
+/* Return values for rcu_preempt_offline_tasks(). */
+
+#define RCU_OFL_TASKS_NORM_GP 0x1 /* Tasks blocking normal */
+ /* GP were moved to root. */
+#define RCU_OFL_TASKS_EXP_GP 0x2 /* Tasks blocking expedited */
+ /* GP were moved to root. */
+
#ifdef RCU_TREE_NONCORE
/*
@@ -333,6 +359,9 @@ static void rcu_preempt_offline_cpu(int cpu);
static void rcu_preempt_check_callbacks(int cpu);
static void rcu_preempt_process_callbacks(void);
void call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu));
+#if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_TREE_PREEMPT_RCU)
+static void rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp);
+#endif /* #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_TREE_PREEMPT_RCU) */
static int rcu_preempt_pending(int cpu);
static int rcu_preempt_needs_cpu(int cpu);
static void __cpuinit rcu_preempt_init_percpu_data(int cpu);
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index c9f0c97..37fbccd 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -24,12 +24,15 @@
* Paul E. McKenney <paulmck@linux.vnet.ibm.com>
*/
+#include <linux/delay.h>
#ifdef CONFIG_TREE_PREEMPT_RCU
struct rcu_state rcu_preempt_state = RCU_STATE_INITIALIZER(rcu_preempt_state);
DEFINE_PER_CPU(struct rcu_data, rcu_preempt_data);
+static int rcu_preempted_readers_exp(struct rcu_node *rnp);
+
/*
* Tell them what RCU they are running.
*/
@@ -157,7 +160,10 @@ EXPORT_SYMBOL_GPL(__rcu_read_lock);
*/
static int rcu_preempted_readers(struct rcu_node *rnp)
{
- return !list_empty(&rnp->blocked_tasks[rnp->gpnum & 0x1]);
+ int phase = rnp->gpnum & 0x1;
+
+ return !list_empty(&rnp->blocked_tasks[phase]) ||
+ !list_empty(&rnp->blocked_tasks[phase + 2]);
}
/*
@@ -204,6 +210,7 @@ static void rcu_report_unblock_qs_rnp(struct rcu_node *rnp, unsigned long flags)
static void rcu_read_unlock_special(struct task_struct *t)
{
int empty;
+ int empty_exp;
unsigned long flags;
struct rcu_node *rnp;
int special;
@@ -247,6 +254,8 @@ static void rcu_read_unlock_special(struct task_struct *t)
spin_unlock(&rnp->lock); /* irqs remain disabled. */
}
empty = !rcu_preempted_readers(rnp);
+ empty_exp = !rcu_preempted_readers_exp(rnp);
+ smp_mb(); /* ensure expedited fastpath sees end of RCU c-s. */
list_del_init(&t->rcu_node_entry);
t->rcu_blocked_node = NULL;
@@ -259,6 +268,13 @@ static void rcu_read_unlock_special(struct task_struct *t)
spin_unlock_irqrestore(&rnp->lock, flags);
else
rcu_report_unblock_qs_rnp(rnp, flags);
+
+ /*
+ * If this was the last task on the expedited lists,
+ * then we need to report up the rcu_node hierarchy.
+ */
+ if (!empty_exp && !rcu_preempted_readers_exp(rnp))
+ rcu_report_exp_rnp(&rcu_preempt_state, rnp);
} else {
local_irq_restore(flags);
}
@@ -343,7 +359,7 @@ static int rcu_preempt_offline_tasks(struct rcu_state *rsp,
int i;
struct list_head *lp;
struct list_head *lp_root;
- int retval;
+ int retval = 0;
struct rcu_node *rnp_root = rcu_get_root(rsp);
struct task_struct *tp;
@@ -353,7 +369,9 @@ static int rcu_preempt_offline_tasks(struct rcu_state *rsp,
}
WARN_ON_ONCE(rnp != rdp->mynode &&
(!list_empty(&rnp->blocked_tasks[0]) ||
- !list_empty(&rnp->blocked_tasks[1])));
+ !list_empty(&rnp->blocked_tasks[1]) ||
+ !list_empty(&rnp->blocked_tasks[2]) ||
+ !list_empty(&rnp->blocked_tasks[3])));
/*
* Move tasks up to root rcu_node. Rely on the fact that the
@@ -361,8 +379,11 @@ static int rcu_preempt_offline_tasks(struct rcu_state *rsp,
* rcu_nodes in terms of gp_num value. This fact allows us to
* move the blocked_tasks[] array directly, element by element.
*/
- retval = rcu_preempted_readers(rnp);
- for (i = 0; i < 2; i++) {
+ if (rcu_preempted_readers(rnp))
+ retval |= RCU_OFL_TASKS_NORM_GP;
+ if (rcu_preempted_readers_exp(rnp))
+ retval |= RCU_OFL_TASKS_EXP_GP;
+ for (i = 0; i < 4; i++) {
lp = &rnp->blocked_tasks[i];
lp_root = &rnp_root->blocked_tasks[i];
while (!list_empty(lp)) {
@@ -449,14 +470,159 @@ void synchronize_rcu(void)
}
EXPORT_SYMBOL_GPL(synchronize_rcu);
+static DECLARE_WAIT_QUEUE_HEAD(sync_rcu_preempt_exp_wq);
+static long sync_rcu_preempt_exp_count;
+static DEFINE_MUTEX(sync_rcu_preempt_exp_mutex);
+
+/*
+ * Return non-zero if there are any tasks in RCU read-side critical
+ * sections blocking the current preemptible-RCU expedited grace period.
+ * If there is no preemptible-RCU expedited grace period currently in
+ * progress, returns zero unconditionally.
+ */
+static int rcu_preempted_readers_exp(struct rcu_node *rnp)
+{
+ return !list_empty(&rnp->blocked_tasks[2]) ||
+ !list_empty(&rnp->blocked_tasks[3]);
+}
+
+/*
+ * return non-zero if there is no RCU expedited grace period in progress
+ * for the specified rcu_node structure, in other words, if all CPUs and
+ * tasks covered by the specified rcu_node structure have done their bit
+ * for the current expedited grace period. Works only for preemptible
+ * RCU -- other RCU implementation use other means.
+ *
+ * Caller must hold sync_rcu_preempt_exp_mutex.
+ */
+static int sync_rcu_preempt_exp_done(struct rcu_node *rnp)
+{
+ return !rcu_preempted_readers_exp(rnp) &&
+ ACCESS_ONCE(rnp->expmask) == 0;
+}
+
+/*
+ * Report the exit from RCU read-side critical section for the last task
+ * that queued itself during or before the current expedited preemptible-RCU
+ * grace period. This event is reported either to the rcu_node structure on
+ * which the task was queued or to one of that rcu_node structure's ancestors,
+ * recursively up the tree. (Calm down, calm down, we do the recursion
+ * iteratively!)
+ *
+ * Caller must hold sync_rcu_preempt_exp_mutex.
+ */
+static void rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp)
+{
+ unsigned long flags;
+ unsigned long mask;
+
+ spin_lock_irqsave(&rnp->lock, flags);
+ for (;;) {
+ if (!sync_rcu_preempt_exp_done(rnp))
+ break;
+ if (rnp->parent == NULL) {
+ wake_up(&sync_rcu_preempt_exp_wq);
+ break;
+ }
+ mask = rnp->grpmask;
+ spin_unlock(&rnp->lock); /* irqs remain disabled */
+ rnp = rnp->parent;
+ spin_lock(&rnp->lock); /* irqs already disabled */
+ rnp->expmask &= ~mask;
+ }
+ spin_unlock_irqrestore(&rnp->lock, flags);
+}
+
+/*
+ * Snapshot the tasks blocking the newly started preemptible-RCU expedited
+ * grace period for the specified rcu_node structure. If there are no such
+ * tasks, report it up the rcu_node hierarchy.
+ *
+ * Caller must hold sync_rcu_preempt_exp_mutex and rsp->onofflock.
+ */
+static void
+sync_rcu_preempt_exp_init(struct rcu_state *rsp, struct rcu_node *rnp)
+{
+ int must_wait;
+
+ spin_lock(&rnp->lock); /* irqs already disabled */
+ list_splice_init(&rnp->blocked_tasks[0], &rnp->blocked_tasks[2]);
+ list_splice_init(&rnp->blocked_tasks[1], &rnp->blocked_tasks[3]);
+ must_wait = rcu_preempted_readers_exp(rnp);
+ spin_unlock(&rnp->lock); /* irqs remain disabled */
+ if (!must_wait)
+ rcu_report_exp_rnp(rsp, rnp);
+}
+
/*
- * Wait for an rcu-preempt grace period. We are supposed to expedite the
- * grace period, but this is the crude slow compatability hack, so just
- * invoke synchronize_rcu().
+ * Wait for an rcu-preempt grace period, but expedite it. The basic idea
+ * is to invoke synchronize_sched_expedited() to push all the tasks to
+ * the ->blocked_tasks[] lists, move all entries from the first set of
+ * ->blocked_tasks[] lists to the second set, and finally wait for this
+ * second set to drain.
*/
void synchronize_rcu_expedited(void)
{
- synchronize_rcu();
+ unsigned long flags;
+ struct rcu_node *rnp;
+ struct rcu_state *rsp = &rcu_preempt_state;
+ long snap;
+ int trycount = 0;
+
+ smp_mb(); /* Caller's modifications seen first by other CPUs. */
+ snap = ACCESS_ONCE(sync_rcu_preempt_exp_count) + 1;
+ smp_mb(); /* Above access cannot bleed into critical section. */
+
+ /*
+ * Acquire lock, falling back to synchronize_rcu() if too many
+ * lock-acquisition failures. Of course, if someone does the
+ * expedited grace period for us, just leave.
+ */
+ while (!mutex_trylock(&sync_rcu_preempt_exp_mutex)) {
+ if (trycount++ < 10)
+ udelay(trycount * num_online_cpus());
+ else {
+ synchronize_rcu();
+ return;
+ }
+ if ((ACCESS_ONCE(sync_rcu_preempt_exp_count) - snap) > 0)
+ goto mb_ret; /* Others did our work for us. */
+ }
+ if ((ACCESS_ONCE(sync_rcu_preempt_exp_count) - snap) > 0)
+ goto unlock_mb_ret; /* Others did our work for us. */
+
+ /* force all RCU readers onto blocked_tasks[]. */
+ synchronize_sched_expedited();
+
+ spin_lock_irqsave(&rsp->onofflock, flags);
+
+ /* Initialize ->expmask for all non-leaf rcu_node structures. */
+ rcu_for_each_nonleaf_node_breadth_first(rsp, rnp) {
+ spin_lock(&rnp->lock); /* irqs already disabled. */
+ rnp->expmask = rnp->qsmaskinit;
+ spin_unlock(&rnp->lock); /* irqs remain disabled. */
+ }
+
+ /* Snapshot current state of ->blocked_tasks[] lists. */
+ rcu_for_each_leaf_node(rsp, rnp)
+ sync_rcu_preempt_exp_init(rsp, rnp);
+ if (NUM_RCU_NODES > 1)
+ sync_rcu_preempt_exp_init(rsp, rcu_get_root(rsp));
+
+ spin_unlock_irqrestore(&rsp->onofflock, flags);
+
+ /* Wait for snapshotted ->blocked_tasks[] lists to drain. */
+ rnp = rcu_get_root(rsp);
+ wait_event(sync_rcu_preempt_exp_wq,
+ sync_rcu_preempt_exp_done(rnp));
+
+ /* Clean up and exit. */
+ smp_mb(); /* ensure expedited GP seen before counter increment. */
+ ACCESS_ONCE(sync_rcu_preempt_exp_count)++;
+unlock_mb_ret:
+ mutex_unlock(&sync_rcu_preempt_exp_mutex);
+mb_ret:
+ smp_mb(); /* ensure subsequent action seen after grace period. */
}
EXPORT_SYMBOL_GPL(synchronize_rcu_expedited);
@@ -655,6 +821,20 @@ void synchronize_rcu_expedited(void)
}
EXPORT_SYMBOL_GPL(synchronize_rcu_expedited);
+#ifdef CONFIG_HOTPLUG_CPU
+
+/*
+ * Because preemptable RCU does not exist, there is never any need to
+ * report on tasks preempted in RCU read-side critical sections during
+ * expedited RCU grace periods.
+ */
+static void rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp)
+{
+ return;
+}
+
+#endif /* #ifdef CONFIG_HOTPLUG_CPU */
+
/*
* Because preemptable RCU does not exist, it never has any work to do.
*/
diff --git a/kernel/rcutree_trace.c b/kernel/rcutree_trace.c
index 1984cdc..9d2c884 100644
--- a/kernel/rcutree_trace.c
+++ b/kernel/rcutree_trace.c
@@ -157,6 +157,7 @@ static void print_one_rcu_state(struct seq_file *m, struct rcu_state *rsp)
{
long gpnum;
int level = 0;
+ int phase;
struct rcu_node *rnp;
gpnum = rsp->gpnum;
@@ -173,10 +174,13 @@ static void print_one_rcu_state(struct seq_file *m, struct rcu_state *rsp)
seq_puts(m, "\n");
level = rnp->level;
}
- seq_printf(m, "%lx/%lx %c>%c %d:%d ^%d ",
+ phase = gpnum & 0x1;
+ seq_printf(m, "%lx/%lx %c%c>%c%c %d:%d ^%d ",
rnp->qsmask, rnp->qsmaskinit,
- "T."[list_empty(&rnp->blocked_tasks[gpnum & 1])],
- "T."[list_empty(&rnp->blocked_tasks[!(gpnum & 1)])],
+ "T."[list_empty(&rnp->blocked_tasks[phase])],
+ "E."[list_empty(&rnp->blocked_tasks[phase + 2])],
+ "T."[list_empty(&rnp->blocked_tasks[!phase])],
+ "E."[list_empty(&rnp->blocked_tasks[!phase + 2])],
rnp->grplo, rnp->grphi, rnp->grpnum);
}
seq_puts(m, "\n");
--
1.5.2.5
* [PATCH tip/core/rcu 4/4] rcu: make RCU's CPU-stall detector be default
2009-12-02 20:09 [PATCH tip/core/rcu 0/4] rcu: preemptible expedited grace periods and cleanups Paul E. McKenney
` (2 preceding siblings ...)
2009-12-02 20:10 ` [PATCH tip/core/rcu 3/4] rcu: add expedited grace-period support for preemptible RCU Paul E. McKenney
@ 2009-12-02 20:10 ` Paul E. McKenney
2009-12-03 13:22 ` [tip:core/rcu] rcu: Make " tip-bot for Paul E. McKenney
2009-12-03 8:53 ` [PATCH tip/core/rcu 0/4] rcu: preemptible expedited grace periods and cleanups Lai Jiangshan
4 siblings, 1 reply; 15+ messages in thread
From: Paul E. McKenney @ 2009-12-02 20:10 UTC (permalink / raw)
To: linux-kernel
Cc: mingo, laijs, dipankar, akpm, mathieu.desnoyers, josh, dvhltc,
niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells,
Paul E. McKenney
From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The RCU_CPU_STALL_DETECTOR costs almost nothing and has located some bugs
that might otherwise have been difficult to track down. Make it the default
for the TREE RCU implementations. (TINY_RCU does not support CPU-stall
detection, but then again it is a uniprocessor...)
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
lib/Kconfig.debug | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 8911558..50e0e78 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -749,7 +749,7 @@ config RCU_TORTURE_TEST_RUNNABLE
config RCU_CPU_STALL_DETECTOR
bool "Check for stalled CPUs delaying RCU grace periods"
depends on TREE_RCU || TREE_PREEMPT_RCU
- default n
+ default y
help
This option causes RCU to printk information on which
CPUs are delaying the current grace period, but only when
--
1.5.2.5
* Re: [PATCH tip/core/rcu 1/4] rcu: rename "quiet" functions
2009-12-02 20:10 ` [PATCH tip/core/rcu 1/4] rcu: rename "quiet" functions Paul E. McKenney
@ 2009-12-02 23:20 ` Josh Triplett
2009-12-03 13:22 ` [tip:core/rcu] rcu: Rename " tip-bot for Paul E. McKenney
1 sibling, 0 replies; 15+ messages in thread
From: Josh Triplett @ 2009-12-02 23:20 UTC (permalink / raw)
To: Paul E. McKenney
Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
dvhltc, niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells
On Wed, Dec 02, 2009 at 12:10:13PM -0800, Paul E. McKenney wrote:
> The number of "quiet" functions has grown recently, and the names are
> no longer very descriptive. The point of all of these functions is to
> do some portion of the task of reporting a quiescent state, so rename
> them accordingly:
>
> o cpu_quiet() becomes rcu_report_qs_rdp(), which reports a
> quiescent state to the per-CPU rcu_data structure. If this
> turns out to be a new quiescent state for this grace period,
> then rcu_report_qs_rnp() will be invoked to propagate the
> quiescent state up the rcu_node hierarchy.
>
> o cpu_quiet_msk() becomes rcu_report_qs_rnp(), which reports
> a quiescent state for a given CPU (or possibly a set of CPUs)
> up the rcu_node hierarchy.
>
> o cpu_quiet_msk_finish() becomes rcu_report_qs_rsp(), which
> reports a full set of quiescent states to the global rcu_state
> structure.
>
> o task_quiet() becomes rcu_report_unblock_qs_rnp(), which reports
> a quiescent state due to a task exiting an RCU read-side critical
> section that had previously blocked in that same critical section.
> As indicated by the new name, this type of quiescent state is
> reported up the rcu_node hierarchy (using rcu_report_qs_rnp()
> to do so).
>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
*Huge* improvement in terminology.
Acked-by: Josh Triplett <josh@joshtriplett.org>
* Re: [PATCH tip/core/rcu 2/4] rcu: enable fourth level of TREE_RCU hierarchy
2009-12-02 20:10 ` [PATCH tip/core/rcu 2/4] rcu: enable fourth level of TREE_RCU hierarchy Paul E. McKenney
@ 2009-12-02 23:25 ` Josh Triplett
2009-12-03 0:20 ` Paul E. McKenney
2009-12-03 13:22 ` [tip:core/rcu] rcu: Enable " tip-bot for Paul E. McKenney
1 sibling, 1 reply; 15+ messages in thread
From: Josh Triplett @ 2009-12-02 23:25 UTC (permalink / raw)
To: Paul E. McKenney
Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
dvhltc, niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells
On Wed, Dec 02, 2009 at 12:10:14PM -0800, Paul E. McKenney wrote:
> From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>
> Enable a fourth level of rcu_node hierarchy for TREE_RCU and
> TREE_PREEMPT_RCU. This is for stress-testing and experimental
> purposes only, although in theory this would enable 16,777,216 CPUs
> on 64-bit systems, though only 1,048,576 CPUs on 32-bit systems.
> Experimental use of this fourth level will normally set
> CONFIG_RCU_FANOUT=2, requiring a 16-CPU system, though the more
> adventurous (and more fortunate) experimenters may wish to choose
> CONFIG_RCU_FANOUT=3 for 81-CPU systems or even CONFIG_RCU_FANOUT=4 for
> 256-CPU systems.
>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
I like this idea in general, but I have one suggestion on your boot-up
message:
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
[...]
> @@ -1877,6 +1878,9 @@ void __init rcu_init(void)
> #ifdef CONFIG_RCU_CPU_STALL_DETECTOR
> printk(KERN_INFO "RCU-based detection of stalled CPUs is enabled.\n");
> #endif /* #ifdef CONFIG_RCU_CPU_STALL_DETECTOR */
> +#if NUM_RCU_LVL_4 != 0
> + printk(KERN_INFO "Experimental four-level hierarchy is enabled.\n");
> +#endif /* #if NUM_RCU_LVL_4 != 0 */
Rather than printing a message when people use the four-level hierarchy,
how about just printing a message any time someone has set
CONFIG_RCU_FANOUT manually rather than automatically, and including
NR_CPUS in that message? That should only occur when testing or when
trying to do NUMA optimization, and either way it seems worth noting.
Either way the change seems fine to me. With or without that suggested
change:
Acked-by: Josh Triplett <josh@joshtriplett.org>
* Re: [PATCH tip/core/rcu 2/4] rcu: enable fourth level of TREE_RCU hierarchy
2009-12-02 23:25 ` Josh Triplett
@ 2009-12-03 0:20 ` Paul E. McKenney
0 siblings, 0 replies; 15+ messages in thread
From: Paul E. McKenney @ 2009-12-03 0:20 UTC (permalink / raw)
To: Josh Triplett
Cc: linux-kernel, mingo, laijs, dipankar, akpm, mathieu.desnoyers,
dvhltc, niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells
On Wed, Dec 02, 2009 at 03:25:42PM -0800, Josh Triplett wrote:
> On Wed, Dec 02, 2009 at 12:10:14PM -0800, Paul E. McKenney wrote:
> > From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >
> > Enable a fourth level of rcu_node hierarchy for TREE_RCU and
> > TREE_PREEMPT_RCU. This is for stress-testing and experimental
> > purposes only, although in theory this would enable 16,777,216 CPUs
> > on 64-bit systems, though only 1,048,576 CPUs on 32-bit systems.
> > Experimental use of this fourth level will normally set
> > CONFIG_RCU_FANOUT=2, requiring a 16-CPU system, though the more
> > adventurous (and more fortunate) experimenters may wish to choose
> > CONFIG_RCU_FANOUT=3 for 81-CPU systems or even CONFIG_RCU_FANOUT=4 for
> > 256-CPU systems.
> >
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>
> I like this idea in general, but I have one suggestion on your boot-up
> message:
>
> > --- a/kernel/rcutree.c
> > +++ b/kernel/rcutree.c
> [...]
> > @@ -1877,6 +1878,9 @@ void __init rcu_init(void)
> > #ifdef CONFIG_RCU_CPU_STALL_DETECTOR
> > printk(KERN_INFO "RCU-based detection of stalled CPUs is enabled.\n");
> > #endif /* #ifdef CONFIG_RCU_CPU_STALL_DETECTOR */
> > +#if NUM_RCU_LVL_4 != 0
> > + printk(KERN_INFO "Experimental four-level hierarchy is enabled.\n");
> > +#endif /* #if NUM_RCU_LVL_4 != 0 */
>
> Rather than printing a message when people use the four-level hierarchy,
> how about just printing a message any time someone has set
> CONFIG_RCU_FANOUT manually rather than automatically, and including
> NR_CPUS in that message? That should only occur when testing or when
> trying to do NUMA optimization, and either way it seems worth noting.
Good point! I will keep this patch as is, but I like the idea of having
RCU note anything unusual at boot time, and so have added this to my
todo list. Of course, if CONFIG_RCU_FANOUT starts getting set low as a
matter of course for any (good) reason, then the definition of "anything
unusual" would change, and hence the code would also change.
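A rough user-space sketch of what such a boot-time note might look like
follows. It is not part of this patch. The helper names are hypothetical,
and it assumes that the Kconfig default for CONFIG_RCU_FANOUT is 64 on
64-bit builds and 32 on 32-bit builds, so that any other value counts as
"set manually":

```c
#include <stdio.h>

/* Assumed default fanout: 64 on 64-bit builds, 32 otherwise (hypothetical helper). */
static int example_default_fanout(int bits_per_long)
{
	return bits_per_long == 64 ? 64 : 32;
}

/*
 * Format a note into buf if the fanout differs from the assumed default.
 * Returns 0 if nothing unusual was found, otherwise the snprintf() result.
 * In the kernel this would be a printk(KERN_INFO ...) in rcu_init().
 */
static int example_fanout_note(char *buf, size_t len, int fanout,
			       int nr_cpus, int bits_per_long)
{
	if (fanout == example_default_fanout(bits_per_long))
		return 0;
	return snprintf(buf, len, "RCU: non-default fanout %d, NR_CPUS=%d.\n",
			fanout, nr_cpus);
}
```

With fanout 2 and 16 CPUs on a 64-bit build the helper produces a note.
With fanout 64 it stays silent.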
> Either way the change seems fine to me. With or without that suggested
> change:
>
> Acked-by: Josh Triplett <josh@joshtriplett.org>
Thank you!
Thanx, Paul
* Re: [PATCH tip/core/rcu 0/4] rcu: preemptible expedited grace periods and cleanups
2009-12-02 20:09 [PATCH tip/core/rcu 0/4] rcu: preemptible expedited grace periods and cleanups Paul E. McKenney
` (3 preceding siblings ...)
2009-12-02 20:10 ` [PATCH tip/core/rcu 4/4] rcu: make RCU's CPU-stall detector be default Paul E. McKenney
@ 2009-12-03 8:53 ` Lai Jiangshan
4 siblings, 0 replies; 15+ messages in thread
From: Lai Jiangshan @ 2009-12-03 8:53 UTC (permalink / raw)
To: paulmck
Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
dvhltc, niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells
Paul E. McKenney wrote:
> This patchset includes some cleanups, improved diagnostics, and an
> implementation of expedited preemptible RCU grace periods that is
> actually expedited:
>
> 1. Rename the "quiet" functions. The name rcu_quiet_cpu() was
> clear enough, but now that we have four flavors of quietness
> with more on the way, we need more meaningful names.
>
> 2. Enable a fourth level of the rcu_node hierarchy. No, we really
> don't need the ability to run on million-CPU SMP systems at the
> moment, but the additional level allows more vigorous
> stress-testing on 16-CPU systems.
>
> 3. Add an implementation of synchronize_rcu_expedited() that
> actually expedites preemptible-RCU grace periods.
>
> 4. Make RCU_CPU_STALL_DETECTOR be on by default. If this works
> well, the #ifdefs will eventually be removed to reduce testing
> load.
>
> This patchset is intended for 2.6.34, but has passed sufficient testing
> that it could safely be included in 2.6.33, if desired.
>
Ack for the whole patchset except [PATCH 3/4]
Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com>
* Re: [PATCH tip/core/rcu 3/4] rcu: add expedited grace-period support for preemptible RCU
2009-12-02 20:10 ` [PATCH tip/core/rcu 3/4] rcu: add expedited grace-period support for preemptible RCU Paul E. McKenney
@ 2009-12-03 9:26 ` Lai Jiangshan
2009-12-03 14:41 ` Paul E. McKenney
2009-12-03 13:22 ` [tip:core/rcu] rcu: Add " tip-bot for Paul E. McKenney
1 sibling, 1 reply; 15+ messages in thread
From: Lai Jiangshan @ 2009-12-03 9:26 UTC (permalink / raw)
To: Paul E. McKenney
Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
dvhltc, niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells
Paul E. McKenney wrote:
> From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>
> Implement a synchronize_rcu_expedited() for preemptible RCU that actually
> is expedited. This uses synchronize_sched_expedited() to force all
> threads currently running in a preemptible-RCU read-side critical section
> onto the appropriate ->blocked_tasks[] list, then takes a snapshot of
> all of these lists and waits for them to drain.
>
> 3. Add an implementation of synchronize_rcu_expedited() that
> actually expedites preemptible-RCU grace periods.
It's very nice.
But I don't understand everything.
1) Why can it be sped up (in theory)?
synchronize_sched_expedited() does speed things up, because the
migration threads are the highest-priority threads.
But for synchronize_rcu_expedited(), some preempted tasks on ->blocked_tasks[]
may wait on the runqueue for a very long time because other
higher-priority threads arrive.
A simple comparison:
synchronize_sched_expedited()
==> wake_up_process(rq->migration_thread) to force a schedule on each CPU,
which forces read-side critical sections to report their end earlier,
or we can say "it forces read-side critical sections to run to the end faster".
synchronize_rcu_expedited()
==> Nothing forces a preempted read-side critical section to run to its end faster.
2) Why introduce an API that no one uses?
I remember that the networking developers requested an expedited synchronize_rcu(),
but currently no one uses it.
Beware, my thinking may be wrong!
Thanks, Lai
* [tip:core/rcu] rcu: Rename "quiet" functions
2009-12-02 20:10 ` [PATCH tip/core/rcu 1/4] rcu: rename "quiet" functions Paul E. McKenney
2009-12-02 23:20 ` Josh Triplett
@ 2009-12-03 13:22 ` tip-bot for Paul E. McKenney
1 sibling, 0 replies; 15+ messages in thread
From: tip-bot for Paul E. McKenney @ 2009-12-03 13:22 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, paulmck, hpa, mingo, josh, tglx, laijs, mingo
Commit-ID: d3f6bad3911736e44ba11f3f3f6ac4e8c837fdfc
Gitweb: http://git.kernel.org/tip/d3f6bad3911736e44ba11f3f3f6ac4e8c837fdfc
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
AuthorDate: Wed, 2 Dec 2009 12:10:13 -0800
Committer: Ingo Molnar <mingo@elte.hu>
CommitDate: Thu, 3 Dec 2009 11:34:26 +0100
rcu: Rename "quiet" functions
The number of "quiet" functions has grown recently, and the
names are no longer very descriptive. The point of all of these
functions is to do some portion of the task of reporting a
quiescent state, so rename them accordingly:
o cpu_quiet() becomes rcu_report_qs_rdp(), which reports a
quiescent state to the per-CPU rcu_data structure. If this
turns out to be a new quiescent state for this grace period,
then rcu_report_qs_rnp() will be invoked to propagate the
quiescent state up the rcu_node hierarchy.
o cpu_quiet_msk() becomes rcu_report_qs_rnp(), which reports
a quiescent state for a given CPU (or possibly a set of CPUs)
up the rcu_node hierarchy.
o cpu_quiet_msk_finish() becomes rcu_report_qs_rsp(), which
reports a full set of quiescent states to the global rcu_state
structure.
o task_quiet() becomes rcu_report_unblock_qs_rnp(), which reports
a quiescent state due to a task exiting an RCU read-side critical
section that had previously blocked in that same critical section.
As indicated by the new name, this type of quiescent state is
reported up the rcu_node hierarchy (using rcu_report_qs_rnp()
to do so).
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12597846163698-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/rcutree.c | 66 ++++++++++++++++++++++++++--------------------
kernel/rcutree.h | 3 +-
kernel/rcutree_plugin.h | 12 ++++----
3 files changed, 45 insertions(+), 36 deletions(-)
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 4ca7e02..a9f5103 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -740,11 +740,13 @@ rcu_start_gp(struct rcu_state *rsp, unsigned long flags)
}
/*
- * Clean up after the prior grace period and let rcu_start_gp() start up
- * the next grace period if one is needed. Note that the caller must
- * hold rnp->lock, as required by rcu_start_gp(), which will release it.
+ * Report a full set of quiescent states to the specified rcu_state
+ * data structure. This involves cleaning up after the prior grace
+ * period and letting rcu_start_gp() start up the next grace period
+ * if one is needed. Note that the caller must hold rnp->lock, as
+ * required by rcu_start_gp(), which will release it.
*/
-static void cpu_quiet_msk_finish(struct rcu_state *rsp, unsigned long flags)
+static void rcu_report_qs_rsp(struct rcu_state *rsp, unsigned long flags)
__releases(rcu_get_root(rsp)->lock)
{
WARN_ON_ONCE(!rcu_gp_in_progress(rsp));
@@ -754,15 +756,16 @@ static void cpu_quiet_msk_finish(struct rcu_state *rsp, unsigned long flags)
}
/*
- * Similar to cpu_quiet(), for which it is a helper function. Allows
- * a group of CPUs to be quieted at one go, though all the CPUs in the
- * group must be represented by the same leaf rcu_node structure.
- * That structure's lock must be held upon entry, and it is released
- * before return.
+ * Similar to rcu_report_qs_rdp(), for which it is a helper function.
+ * Allows quiescent states for a group of CPUs to be reported at one go
+ * to the specified rcu_node structure, though all the CPUs in the group
+ * must be represented by the same rcu_node structure (which need not be
+ * a leaf rcu_node structure, though it often will be). That structure's
+ * lock must be held upon entry, and it is released before return.
*/
static void
-cpu_quiet_msk(unsigned long mask, struct rcu_state *rsp, struct rcu_node *rnp,
- unsigned long flags)
+rcu_report_qs_rnp(unsigned long mask, struct rcu_state *rsp,
+ struct rcu_node *rnp, unsigned long flags)
__releases(rnp->lock)
{
struct rcu_node *rnp_c;
@@ -798,21 +801,23 @@ cpu_quiet_msk(unsigned long mask, struct rcu_state *rsp, struct rcu_node *rnp,
/*
* Get here if we are the last CPU to pass through a quiescent
- * state for this grace period. Invoke cpu_quiet_msk_finish()
+ * state for this grace period. Invoke rcu_report_qs_rsp()
* to clean up and start the next grace period if one is needed.
*/
- cpu_quiet_msk_finish(rsp, flags); /* releases rnp->lock. */
+ rcu_report_qs_rsp(rsp, flags); /* releases rnp->lock. */
}
/*
- * Record a quiescent state for the specified CPU, which must either be
- * the current CPU. The lastcomp argument is used to make sure we are
- * still in the grace period of interest. We don't want to end the current
- * grace period based on quiescent states detected in an earlier grace
- * period!
+ * Record a quiescent state for the specified CPU to that CPU's rcu_data
+ * structure. This must be either called from the specified CPU, or
+ * called when the specified CPU is known to be offline (and when it is
+ * also known that no other CPU is concurrently trying to help the offline
+ * CPU). The lastcomp argument is used to make sure we are still in the
+ * grace period of interest. We don't want to end the current grace period
+ * based on quiescent states detected in an earlier grace period!
*/
static void
-cpu_quiet(int cpu, struct rcu_state *rsp, struct rcu_data *rdp, long lastcomp)
+rcu_report_qs_rdp(int cpu, struct rcu_state *rsp, struct rcu_data *rdp, long lastcomp)
{
unsigned long flags;
unsigned long mask;
@@ -827,8 +832,8 @@ cpu_quiet(int cpu, struct rcu_state *rsp, struct rcu_data *rdp, long lastcomp)
* The race with GP start is resolved by the fact that we
* hold the leaf rcu_node lock, so that the per-CPU bits
* cannot yet be initialized -- so we would simply find our
- * CPU's bit already cleared in cpu_quiet_msk() if this race
- * occurred.
+ * CPU's bit already cleared in rcu_report_qs_rnp() if this
+ * race occurred.
*/
rdp->passed_quiesc = 0; /* try again later! */
spin_unlock_irqrestore(&rnp->lock, flags);
@@ -846,7 +851,7 @@ cpu_quiet(int cpu, struct rcu_state *rsp, struct rcu_data *rdp, long lastcomp)
*/
rdp->nxttail[RCU_NEXT_READY_TAIL] = rdp->nxttail[RCU_NEXT_TAIL];
- cpu_quiet_msk(mask, rsp, rnp, flags); /* releases rnp->lock */
+ rcu_report_qs_rnp(mask, rsp, rnp, flags); /* rlses rnp->lock */
}
}
@@ -877,8 +882,11 @@ rcu_check_quiescent_state(struct rcu_state *rsp, struct rcu_data *rdp)
if (!rdp->passed_quiesc)
return;
- /* Tell RCU we are done (but cpu_quiet() will be the judge of that). */
- cpu_quiet(rdp->cpu, rsp, rdp, rdp->passed_quiesc_completed);
+ /*
+ * Tell RCU we are done (but rcu_report_qs_rdp() will be the
+ * judge of that).
+ */
+ rcu_report_qs_rdp(rdp->cpu, rsp, rdp, rdp->passed_quiesc_completed);
}
#ifdef CONFIG_HOTPLUG_CPU
@@ -968,13 +976,13 @@ static void __rcu_offline_cpu(int cpu, struct rcu_state *rsp)
/*
* We still hold the leaf rcu_node structure lock here, and
* irqs are still disabled. The reason for this subterfuge is
- * because invoking task_quiet() with ->onofflock held leads
- * to deadlock.
+ * because invoking rcu_report_unblock_qs_rnp() with ->onofflock
+ * held leads to deadlock.
*/
spin_unlock(&rsp->onofflock); /* irqs remain disabled. */
rnp = rdp->mynode;
if (need_quiet)
- task_quiet(rnp, flags);
+ rcu_report_unblock_qs_rnp(rnp, flags);
else
spin_unlock_irqrestore(&rnp->lock, flags);
@@ -1164,8 +1172,8 @@ static int rcu_process_dyntick(struct rcu_state *rsp, long lastcomp,
}
if (mask != 0 && rnp->completed == lastcomp) {
- /* cpu_quiet_msk() releases rnp->lock. */
- cpu_quiet_msk(mask, rsp, rnp, flags);
+ /* rcu_report_qs_rnp() releases rnp->lock. */
+ rcu_report_qs_rnp(mask, rsp, rnp, flags);
continue;
}
spin_unlock_irqrestore(&rnp->lock, flags);
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index a81188c..8bb03cb 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -306,7 +306,8 @@ long rcu_batches_completed(void);
static void rcu_preempt_note_context_switch(int cpu);
static int rcu_preempted_readers(struct rcu_node *rnp);
#ifdef CONFIG_HOTPLUG_CPU
-static void task_quiet(struct rcu_node *rnp, unsigned long flags);
+static void rcu_report_unblock_qs_rnp(struct rcu_node *rnp,
+ unsigned long flags);
#endif /* #ifdef CONFIG_HOTPLUG_CPU */
#ifdef CONFIG_RCU_CPU_STALL_DETECTOR
static void rcu_print_task_stall(struct rcu_node *rnp);
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index 1d295c7..c9f0c97 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -167,7 +167,7 @@ static int rcu_preempted_readers(struct rcu_node *rnp)
* irqs disabled, and this lock is released upon return, but irqs remain
* disabled.
*/
-static void task_quiet(struct rcu_node *rnp, unsigned long flags)
+static void rcu_report_unblock_qs_rnp(struct rcu_node *rnp, unsigned long flags)
__releases(rnp->lock)
{
unsigned long mask;
@@ -185,7 +185,7 @@ static void task_quiet(struct rcu_node *rnp, unsigned long flags)
* or tasks were kicked up to root rcu_node due to
* CPUs going offline.
*/
- cpu_quiet_msk_finish(&rcu_preempt_state, flags);
+ rcu_report_qs_rsp(&rcu_preempt_state, flags);
return;
}
@@ -193,7 +193,7 @@ static void task_quiet(struct rcu_node *rnp, unsigned long flags)
mask = rnp->grpmask;
spin_unlock(&rnp->lock); /* irqs remain disabled. */
spin_lock(&rnp_p->lock); /* irqs already disabled. */
- cpu_quiet_msk(mask, &rcu_preempt_state, rnp_p, flags);
+ rcu_report_qs_rnp(mask, &rcu_preempt_state, rnp_p, flags);
}
/*
@@ -253,12 +253,12 @@ static void rcu_read_unlock_special(struct task_struct *t)
/*
* If this was the last task on the current list, and if
* we aren't waiting on any CPUs, report the quiescent state.
- * Note that task_quiet() releases rnp->lock.
+ * Note that rcu_report_unblock_qs_rnp() releases rnp->lock.
*/
if (empty)
spin_unlock_irqrestore(&rnp->lock, flags);
else
- task_quiet(rnp, flags);
+ rcu_report_unblock_qs_rnp(rnp, flags);
} else {
local_irq_restore(flags);
}
@@ -566,7 +566,7 @@ static int rcu_preempted_readers(struct rcu_node *rnp)
#ifdef CONFIG_HOTPLUG_CPU
/* Because preemptible RCU does not exist, no quieting of tasks. */
-static void task_quiet(struct rcu_node *rnp, unsigned long flags)
+static void rcu_report_unblock_qs_rnp(struct rcu_node *rnp, unsigned long flags)
{
spin_unlock_irqrestore(&rnp->lock, flags);
}
* [tip:core/rcu] rcu: Enable fourth level of TREE_RCU hierarchy
2009-12-02 20:10 ` [PATCH tip/core/rcu 2/4] rcu: enable fourth level of TREE_RCU hierarchy Paul E. McKenney
2009-12-02 23:25 ` Josh Triplett
@ 2009-12-03 13:22 ` tip-bot for Paul E. McKenney
1 sibling, 0 replies; 15+ messages in thread
From: tip-bot for Paul E. McKenney @ 2009-12-03 13:22 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, paulmck, hpa, mingo, josh, tglx, laijs, mingo
Commit-ID: cf244dc01bf68e1ad338b82447f8686d24ea4435
Gitweb: http://git.kernel.org/tip/cf244dc01bf68e1ad338b82447f8686d24ea4435
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
AuthorDate: Wed, 2 Dec 2009 12:10:14 -0800
Committer: Ingo Molnar <mingo@elte.hu>
CommitDate: Thu, 3 Dec 2009 11:34:53 +0100
rcu: Enable fourth level of TREE_RCU hierarchy
Enable a fourth level of rcu_node hierarchy for TREE_RCU and
TREE_PREEMPT_RCU. This is for stress-testing and experimental
purposes only, although in theory this would enable 16,777,216
CPUs on 64-bit systems, though only 1,048,576 CPUs on 32-bit
systems. Experimental use of this fourth level will normally
set CONFIG_RCU_FANOUT=2, requiring a 16-CPU system, though the
more adventurous (and more fortunate) experimenters may wish to
choose CONFIG_RCU_FANOUT=3 for 81-CPU systems or even
CONFIG_RCU_FANOUT=4 for 256-CPU systems.
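The CPU counts above are simply the fanout raised to the number of levels.
A small user-space sketch of that arithmetic follows. The function names
are made up for illustration and do not appear in the kernel:

```c
#include <assert.h>

/* Largest CPU count that "levels" levels can cover: fanout^levels. */
static unsigned long example_max_cpus(unsigned long fanout, int levels)
{
	unsigned long n = 1;

	assert(levels >= 0);
	while (levels-- > 0)
		n *= fanout;
	return n;
}

/*
 * Smallest number of levels (1..4) that covers nr_cpus, following the
 * same ladder as the #if/#elif chain in rcutree.h. Returns 0 where the
 * header would hit its #error because the fanout is insufficient.
 */
static int example_levels_needed(unsigned long nr_cpus, unsigned long fanout)
{
	int lvl;

	for (lvl = 1; lvl <= 4; lvl++)
		if (nr_cpus <= example_max_cpus(fanout, lvl))
			return lvl;
	return 0;
}
```

For example, example_levels_needed(16, 2) is 4, which is why a fanout of
2 exercises the fourth level on a 16-CPU system, and
example_max_cpus(64, 4) is 16,777,216.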
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12597846161257-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/rcutree.c | 6 +++++-
kernel/rcutree.h | 15 +++++++++++++--
2 files changed, 18 insertions(+), 3 deletions(-)
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index a9f5103..d47e03e 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -60,7 +60,8 @@ static struct lock_class_key rcu_node_class[NUM_RCU_LVLS];
NUM_RCU_LVL_0, /* root of hierarchy. */ \
NUM_RCU_LVL_1, \
NUM_RCU_LVL_2, \
- NUM_RCU_LVL_3, /* == MAX_RCU_LVLS */ \
+ NUM_RCU_LVL_3, \
+ NUM_RCU_LVL_4, /* == MAX_RCU_LVLS */ \
}, \
.signaled = RCU_GP_IDLE, \
.gpnum = -300, \
@@ -1877,6 +1878,9 @@ void __init rcu_init(void)
#ifdef CONFIG_RCU_CPU_STALL_DETECTOR
printk(KERN_INFO "RCU-based detection of stalled CPUs is enabled.\n");
#endif /* #ifdef CONFIG_RCU_CPU_STALL_DETECTOR */
+#if NUM_RCU_LVL_4 != 0
+ printk(KERN_INFO "Experimental four-level hierarchy is enabled.\n");
+#endif /* #if NUM_RCU_LVL_4 != 0 */
RCU_INIT_FLAVOR(&rcu_sched_state, rcu_sched_data);
RCU_INIT_FLAVOR(&rcu_bh_state, rcu_bh_data);
__rcu_init_preempt();
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 8bb03cb..df2e0b6 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -34,10 +34,11 @@
* In practice, this has not been tested, so there is probably some
* bug somewhere.
*/
-#define MAX_RCU_LVLS 3
+#define MAX_RCU_LVLS 4
#define RCU_FANOUT (CONFIG_RCU_FANOUT)
#define RCU_FANOUT_SQ (RCU_FANOUT * RCU_FANOUT)
#define RCU_FANOUT_CUBE (RCU_FANOUT_SQ * RCU_FANOUT)
+#define RCU_FANOUT_FOURTH (RCU_FANOUT_CUBE * RCU_FANOUT)
#if NR_CPUS <= RCU_FANOUT
# define NUM_RCU_LVLS 1
@@ -45,23 +46,33 @@
# define NUM_RCU_LVL_1 (NR_CPUS)
# define NUM_RCU_LVL_2 0
# define NUM_RCU_LVL_3 0
+# define NUM_RCU_LVL_4 0
#elif NR_CPUS <= RCU_FANOUT_SQ
# define NUM_RCU_LVLS 2
# define NUM_RCU_LVL_0 1
# define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT)
# define NUM_RCU_LVL_2 (NR_CPUS)
# define NUM_RCU_LVL_3 0
+# define NUM_RCU_LVL_4 0
#elif NR_CPUS <= RCU_FANOUT_CUBE
# define NUM_RCU_LVLS 3
# define NUM_RCU_LVL_0 1
# define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_SQ)
# define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT)
# define NUM_RCU_LVL_3 NR_CPUS
+# define NUM_RCU_LVL_4 0
+#elif NR_CPUS <= RCU_FANOUT_FOURTH
+# define NUM_RCU_LVLS 4
+# define NUM_RCU_LVL_0 1
+# define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_CUBE)
+# define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_SQ)
+# define NUM_RCU_LVL_3 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT)
+# define NUM_RCU_LVL_4 NR_CPUS
#else
# error "CONFIG_RCU_FANOUT insufficient for NR_CPUS"
#endif /* #if (NR_CPUS) <= RCU_FANOUT */
-#define RCU_SUM (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3)
+#define RCU_SUM (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3 + NUM_RCU_LVL_4)
#define NUM_RCU_NODES (RCU_SUM - NR_CPUS)
/*
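The level selection in the rcutree.h hunk above can be checked with a small user-space sketch. It mirrors the NUM_RCU_LVLS / NUM_RCU_LVL_n macros for a given CPU count and fanout. The names rcu_geometry, MAX_LVLS and div_round_up are hypothetical helpers introduced here for illustration; they are not part of the patch.

```c
#define MAX_LVLS 4	/* mirrors MAX_RCU_LVLS after this patch */

/* Hypothetical result type: one field per macro in the hunk above. */
struct rcu_geometry {
	int num_lvls;		/* mirrors NUM_RCU_LVLS, 0 if fanout is insufficient */
	long lvl[MAX_LVLS + 1];	/* lvl[i] mirrors NUM_RCU_LVL_i */
	long num_nodes;		/* mirrors NUM_RCU_NODES */
};

static long div_round_up(long n, long d)
{
	return (n + d - 1) / d;
}

static struct rcu_geometry rcu_geometry(long nr_cpus, long fanout)
{
	struct rcu_geometry g = { 0 };
	long cap = fanout;	/* CPUs that the current level count can cover */
	long sum = 0;
	int i;

	if (nr_cpus < 1 || fanout < 2)
		return g;

	/* Pick the smallest level count with fanout^levels >= nr_cpus. */
	for (g.num_lvls = 1; g.num_lvls <= MAX_LVLS; g.num_lvls++) {
		if (nr_cpus <= cap)
			break;
		cap *= fanout;
	}
	if (g.num_lvls > MAX_LVLS) {
		g.num_lvls = 0;	/* the #error case in the header */
		return g;
	}

	/*
	 * Level i holds DIV_ROUND_UP(nr_cpus, fanout^(levels - i)) entries.
	 * For i == 0 this is 1 (the root); for i == levels it is nr_cpus.
	 */
	for (i = 0; i <= g.num_lvls; i++) {
		g.lvl[i] = div_round_up(nr_cpus, cap);
		sum += g.lvl[i];
		cap /= fanout;
	}
	g.num_nodes = sum - nr_cpus;	/* last level counts CPUs, not rcu_nodes */
	return g;
}
```

With 16 CPUs and a fanout of 2, this sketch selects four levels, which is consistent with the cover letter's remark about stress-testing the extra level on 16-CPU systems.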
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [tip:core/rcu] rcu: Add expedited grace-period support for preemptible RCU
2009-12-02 20:10 ` [PATCH tip/core/rcu 3/4] rcu: add expedited grace-period support for preemptible RCU Paul E. McKenney
2009-12-03 9:26 ` Lai Jiangshan
@ 2009-12-03 13:22 ` tip-bot for Paul E. McKenney
1 sibling, 0 replies; 15+ messages in thread
From: tip-bot for Paul E. McKenney @ 2009-12-03 13:22 UTC (permalink / raw)
To: linux-tip-commits; +Cc: linux-kernel, paulmck, hpa, mingo, tglx, mingo
Commit-ID: d9a3da0699b24a589b27a61e1a5b5bd30d9db669
Gitweb: http://git.kernel.org/tip/d9a3da0699b24a589b27a61e1a5b5bd30d9db669
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
AuthorDate: Wed, 2 Dec 2009 12:10:15 -0800
Committer: Ingo Molnar <mingo@elte.hu>
CommitDate: Thu, 3 Dec 2009 11:35:25 +0100
rcu: Add expedited grace-period support for preemptible RCU
Implement a synchronize_rcu_expedited() for preemptible RCU
that actually is expedited. This uses
synchronize_sched_expedited() to force all threads currently
running in a preemptible-RCU read-side critical section onto the
appropriate ->blocked_tasks[] list, then takes a snapshot of all
of these lists and waits for them to drain.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1259784616158-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
kernel/rcutorture.c | 34 ++++++--
kernel/rcutree.c | 10 ++-
kernel/rcutree.h | 35 ++++++++-
kernel/rcutree_plugin.h | 198 ++++++++++++++++++++++++++++++++++++++++++++--
kernel/rcutree_trace.c | 10 ++-
5 files changed, 260 insertions(+), 27 deletions(-)
diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c
index 3dd0ca2..a621a67 100644
--- a/kernel/rcutorture.c
+++ b/kernel/rcutorture.c
@@ -327,6 +327,11 @@ rcu_torture_cb(struct rcu_head *p)
cur_ops->deferred_free(rp);
}
+static int rcu_no_completed(void)
+{
+ return 0;
+}
+
static void rcu_torture_deferred_free(struct rcu_torture *p)
{
call_rcu(&p->rtort_rcu, rcu_torture_cb);
@@ -388,6 +393,21 @@ static struct rcu_torture_ops rcu_sync_ops = {
.name = "rcu_sync"
};
+static struct rcu_torture_ops rcu_expedited_ops = {
+ .init = rcu_sync_torture_init,
+ .cleanup = NULL,
+ .readlock = rcu_torture_read_lock,
+ .read_delay = rcu_read_delay, /* just reuse rcu's version. */
+ .readunlock = rcu_torture_read_unlock,
+ .completed = rcu_no_completed,
+ .deferred_free = rcu_sync_torture_deferred_free,
+ .sync = synchronize_rcu_expedited,
+ .cb_barrier = NULL,
+ .stats = NULL,
+ .irq_capable = 1,
+ .name = "rcu_expedited"
+};
+
/*
* Definitions for rcu_bh torture testing.
*/
@@ -581,11 +601,6 @@ static void sched_torture_read_unlock(int idx)
preempt_enable();
}
-static int sched_torture_completed(void)
-{
- return 0;
-}
-
static void rcu_sched_torture_deferred_free(struct rcu_torture *p)
{
call_rcu_sched(&p->rtort_rcu, rcu_torture_cb);
@@ -602,7 +617,7 @@ static struct rcu_torture_ops sched_ops = {
.readlock = sched_torture_read_lock,
.read_delay = rcu_read_delay, /* just reuse rcu's version. */
.readunlock = sched_torture_read_unlock,
- .completed = sched_torture_completed,
+ .completed = rcu_no_completed,
.deferred_free = rcu_sched_torture_deferred_free,
.sync = sched_torture_synchronize,
.cb_barrier = rcu_barrier_sched,
@@ -617,7 +632,7 @@ static struct rcu_torture_ops sched_sync_ops = {
.readlock = sched_torture_read_lock,
.read_delay = rcu_read_delay, /* just reuse rcu's version. */
.readunlock = sched_torture_read_unlock,
- .completed = sched_torture_completed,
+ .completed = rcu_no_completed,
.deferred_free = rcu_sync_torture_deferred_free,
.sync = sched_torture_synchronize,
.cb_barrier = NULL,
@@ -631,7 +646,7 @@ static struct rcu_torture_ops sched_expedited_ops = {
.readlock = sched_torture_read_lock,
.read_delay = rcu_read_delay, /* just reuse rcu's version. */
.readunlock = sched_torture_read_unlock,
- .completed = sched_torture_completed,
+ .completed = rcu_no_completed,
.deferred_free = rcu_sync_torture_deferred_free,
.sync = synchronize_sched_expedited,
.cb_barrier = NULL,
@@ -1116,7 +1131,8 @@ rcu_torture_init(void)
int cpu;
int firsterr = 0;
static struct rcu_torture_ops *torture_ops[] =
- { &rcu_ops, &rcu_sync_ops, &rcu_bh_ops, &rcu_bh_sync_ops,
+ { &rcu_ops, &rcu_sync_ops, &rcu_expedited_ops,
+ &rcu_bh_ops, &rcu_bh_sync_ops,
&srcu_ops, &srcu_expedited_ops,
&sched_ops, &sched_sync_ops, &sched_expedited_ops, };
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index d47e03e..53ae959 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -948,7 +948,7 @@ static void __rcu_offline_cpu(int cpu, struct rcu_state *rsp)
{
unsigned long flags;
unsigned long mask;
- int need_quiet = 0;
+ int need_report = 0;
struct rcu_data *rdp = rsp->rda[cpu];
struct rcu_node *rnp;
@@ -967,7 +967,7 @@ static void __rcu_offline_cpu(int cpu, struct rcu_state *rsp)
break;
}
if (rnp == rdp->mynode)
- need_quiet = rcu_preempt_offline_tasks(rsp, rnp, rdp);
+ need_report = rcu_preempt_offline_tasks(rsp, rnp, rdp);
else
spin_unlock(&rnp->lock); /* irqs remain disabled. */
mask = rnp->grpmask;
@@ -982,10 +982,12 @@ static void __rcu_offline_cpu(int cpu, struct rcu_state *rsp)
*/
spin_unlock(&rsp->onofflock); /* irqs remain disabled. */
rnp = rdp->mynode;
- if (need_quiet)
+ if (need_report & RCU_OFL_TASKS_NORM_GP)
rcu_report_unblock_qs_rnp(rnp, flags);
else
spin_unlock_irqrestore(&rnp->lock, flags);
+ if (need_report & RCU_OFL_TASKS_EXP_GP)
+ rcu_report_exp_rnp(rsp, rnp);
rcu_adopt_orphan_cbs(rsp);
}
@@ -1843,6 +1845,8 @@ static void __init rcu_init_one(struct rcu_state *rsp)
rnp->level = i;
INIT_LIST_HEAD(&rnp->blocked_tasks[0]);
INIT_LIST_HEAD(&rnp->blocked_tasks[1]);
+ INIT_LIST_HEAD(&rnp->blocked_tasks[2]);
+ INIT_LIST_HEAD(&rnp->blocked_tasks[3]);
}
}
}
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index df2e0b6..d2a0046 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -104,8 +104,12 @@ struct rcu_node {
/* an rcu_data structure, otherwise, each */
/* bit corresponds to a child rcu_node */
/* structure. */
+ unsigned long expmask; /* Groups that have ->blocked_tasks[] */
+ /* elements that need to drain to allow the */
+ /* current expedited grace period to */
+ /* complete (only for TREE_PREEMPT_RCU). */
unsigned long qsmaskinit;
- /* Per-GP initialization for qsmask. */
+ /* Per-GP initial value for qsmask & expmask. */
unsigned long grpmask; /* Mask to apply to parent qsmask. */
/* Only one bit will be set in this mask. */
int grplo; /* lowest-numbered CPU or group here. */
@@ -113,7 +117,7 @@ struct rcu_node {
u8 grpnum; /* CPU/group number for next level up. */
u8 level; /* root is at level 0. */
struct rcu_node *parent;
- struct list_head blocked_tasks[2];
+ struct list_head blocked_tasks[4];
/* Tasks blocked in RCU read-side critsect. */
/* Grace period number (->gpnum) x blocked */
/* by tasks on the (x & 0x1) element of the */
@@ -128,6 +132,21 @@ struct rcu_node {
for ((rnp) = &(rsp)->node[0]; \
(rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
+/*
+ * Do a breadth-first scan of the non-leaf rcu_node structures for the
+ * specified rcu_state structure. Note that if there is a singleton
+ * rcu_node tree with but one rcu_node structure, this loop is a no-op.
+ */
+#define rcu_for_each_nonleaf_node_breadth_first(rsp, rnp) \
+ for ((rnp) = &(rsp)->node[0]; \
+ (rnp) < (rsp)->level[NUM_RCU_LVLS - 1]; (rnp)++)
+
+/*
+ * Scan the leaves of the rcu_node hierarchy for the specified rcu_state
+ * structure. Note that if there is a singleton rcu_node tree with but
+ * one rcu_node structure, this loop -will- visit the rcu_node structure.
+ * It is still a leaf node, even if it is also the root node.
+ */
#define rcu_for_each_leaf_node(rsp, rnp) \
for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \
(rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
@@ -261,7 +280,7 @@ struct rcu_state {
long gpnum; /* Current gp number. */
long completed; /* # of last completed gp. */
- /* End of fields guarded by root rcu_node's lock. */
+ /* End of fields guarded by root rcu_node's lock. */
spinlock_t onofflock; /* exclude on/offline and */
/* starting new GP. Also */
@@ -293,6 +312,13 @@ struct rcu_state {
#endif /* #ifdef CONFIG_RCU_CPU_STALL_DETECTOR */
};
+/* Return values for rcu_preempt_offline_tasks(). */
+
+#define RCU_OFL_TASKS_NORM_GP 0x1 /* Tasks blocking normal */
+ /* GP were moved to root. */
+#define RCU_OFL_TASKS_EXP_GP 0x2 /* Tasks blocking expedited */
+ /* GP were moved to root. */
+
#ifdef RCU_TREE_NONCORE
/*
@@ -333,6 +359,9 @@ static void rcu_preempt_offline_cpu(int cpu);
static void rcu_preempt_check_callbacks(int cpu);
static void rcu_preempt_process_callbacks(void);
void call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu));
+#if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_TREE_PREEMPT_RCU)
+static void rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp);
+#endif /* #if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_TREE_PREEMPT_RCU) */
static int rcu_preempt_pending(int cpu);
static int rcu_preempt_needs_cpu(int cpu);
static void __cpuinit rcu_preempt_init_percpu_data(int cpu);
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index c9f0c97..37fbccd 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -24,12 +24,15 @@
* Paul E. McKenney <paulmck@linux.vnet.ibm.com>
*/
+#include <linux/delay.h>
#ifdef CONFIG_TREE_PREEMPT_RCU
struct rcu_state rcu_preempt_state = RCU_STATE_INITIALIZER(rcu_preempt_state);
DEFINE_PER_CPU(struct rcu_data, rcu_preempt_data);
+static int rcu_preempted_readers_exp(struct rcu_node *rnp);
+
/*
* Tell them what RCU they are running.
*/
@@ -157,7 +160,10 @@ EXPORT_SYMBOL_GPL(__rcu_read_lock);
*/
static int rcu_preempted_readers(struct rcu_node *rnp)
{
- return !list_empty(&rnp->blocked_tasks[rnp->gpnum & 0x1]);
+ int phase = rnp->gpnum & 0x1;
+
+ return !list_empty(&rnp->blocked_tasks[phase]) ||
+ !list_empty(&rnp->blocked_tasks[phase + 2]);
}
/*
@@ -204,6 +210,7 @@ static void rcu_report_unblock_qs_rnp(struct rcu_node *rnp, unsigned long flags)
static void rcu_read_unlock_special(struct task_struct *t)
{
int empty;
+ int empty_exp;
unsigned long flags;
struct rcu_node *rnp;
int special;
@@ -247,6 +254,8 @@ static void rcu_read_unlock_special(struct task_struct *t)
spin_unlock(&rnp->lock); /* irqs remain disabled. */
}
empty = !rcu_preempted_readers(rnp);
+ empty_exp = !rcu_preempted_readers_exp(rnp);
+ smp_mb(); /* ensure expedited fastpath sees end of RCU c-s. */
list_del_init(&t->rcu_node_entry);
t->rcu_blocked_node = NULL;
@@ -259,6 +268,13 @@ static void rcu_read_unlock_special(struct task_struct *t)
spin_unlock_irqrestore(&rnp->lock, flags);
else
rcu_report_unblock_qs_rnp(rnp, flags);
+
+ /*
+ * If this was the last task on the expedited lists,
+ * then we need to report up the rcu_node hierarchy.
+ */
+ if (!empty_exp && !rcu_preempted_readers_exp(rnp))
+ rcu_report_exp_rnp(&rcu_preempt_state, rnp);
} else {
local_irq_restore(flags);
}
@@ -343,7 +359,7 @@ static int rcu_preempt_offline_tasks(struct rcu_state *rsp,
int i;
struct list_head *lp;
struct list_head *lp_root;
- int retval;
+ int retval = 0;
struct rcu_node *rnp_root = rcu_get_root(rsp);
struct task_struct *tp;
@@ -353,7 +369,9 @@ static int rcu_preempt_offline_tasks(struct rcu_state *rsp,
}
WARN_ON_ONCE(rnp != rdp->mynode &&
(!list_empty(&rnp->blocked_tasks[0]) ||
- !list_empty(&rnp->blocked_tasks[1])));
+ !list_empty(&rnp->blocked_tasks[1]) ||
+ !list_empty(&rnp->blocked_tasks[2]) ||
+ !list_empty(&rnp->blocked_tasks[3])));
/*
* Move tasks up to root rcu_node. Rely on the fact that the
@@ -361,8 +379,11 @@ static int rcu_preempt_offline_tasks(struct rcu_state *rsp,
* rcu_nodes in terms of gp_num value. This fact allows us to
* move the blocked_tasks[] array directly, element by element.
*/
- retval = rcu_preempted_readers(rnp);
- for (i = 0; i < 2; i++) {
+ if (rcu_preempted_readers(rnp))
+ retval |= RCU_OFL_TASKS_NORM_GP;
+ if (rcu_preempted_readers_exp(rnp))
+ retval |= RCU_OFL_TASKS_EXP_GP;
+ for (i = 0; i < 4; i++) {
lp = &rnp->blocked_tasks[i];
lp_root = &rnp_root->blocked_tasks[i];
while (!list_empty(lp)) {
@@ -449,14 +470,159 @@ void synchronize_rcu(void)
}
EXPORT_SYMBOL_GPL(synchronize_rcu);
+static DECLARE_WAIT_QUEUE_HEAD(sync_rcu_preempt_exp_wq);
+static long sync_rcu_preempt_exp_count;
+static DEFINE_MUTEX(sync_rcu_preempt_exp_mutex);
+
+/*
+ * Return non-zero if there are any tasks in RCU read-side critical
+ * sections blocking the current preemptible-RCU expedited grace period.
+ * If there is no preemptible-RCU expedited grace period currently in
+ * progress, returns zero unconditionally.
+ */
+static int rcu_preempted_readers_exp(struct rcu_node *rnp)
+{
+ return !list_empty(&rnp->blocked_tasks[2]) ||
+ !list_empty(&rnp->blocked_tasks[3]);
+}
+
+/*
+ * return non-zero if there is no RCU expedited grace period in progress
+ * for the specified rcu_node structure, in other words, if all CPUs and
+ * tasks covered by the specified rcu_node structure have done their bit
+ * for the current expedited grace period. Works only for preemptible
+ * RCU -- other RCU implementation use other means.
+ *
+ * Caller must hold sync_rcu_preempt_exp_mutex.
+ */
+static int sync_rcu_preempt_exp_done(struct rcu_node *rnp)
+{
+ return !rcu_preempted_readers_exp(rnp) &&
+ ACCESS_ONCE(rnp->expmask) == 0;
+}
+
+/*
+ * Report the exit from RCU read-side critical section for the last task
+ * that queued itself during or before the current expedited preemptible-RCU
+ * grace period. This event is reported either to the rcu_node structure on
+ * which the task was queued or to one of that rcu_node structure's ancestors,
+ * recursively up the tree. (Calm down, calm down, we do the recursion
+ * iteratively!)
+ *
+ * Caller must hold sync_rcu_preempt_exp_mutex.
+ */
+static void rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp)
+{
+ unsigned long flags;
+ unsigned long mask;
+
+ spin_lock_irqsave(&rnp->lock, flags);
+ for (;;) {
+ if (!sync_rcu_preempt_exp_done(rnp))
+ break;
+ if (rnp->parent == NULL) {
+ wake_up(&sync_rcu_preempt_exp_wq);
+ break;
+ }
+ mask = rnp->grpmask;
+ spin_unlock(&rnp->lock); /* irqs remain disabled */
+ rnp = rnp->parent;
+ spin_lock(&rnp->lock); /* irqs already disabled */
+ rnp->expmask &= ~mask;
+ }
+ spin_unlock_irqrestore(&rnp->lock, flags);
+}
+
+/*
+ * Snapshot the tasks blocking the newly started preemptible-RCU expedited
+ * grace period for the specified rcu_node structure. If there are no such
+ * tasks, report it up the rcu_node hierarchy.
+ *
+ * Caller must hold sync_rcu_preempt_exp_mutex and rsp->onofflock.
+ */
+static void
+sync_rcu_preempt_exp_init(struct rcu_state *rsp, struct rcu_node *rnp)
+{
+ int must_wait;
+
+ spin_lock(&rnp->lock); /* irqs already disabled */
+ list_splice_init(&rnp->blocked_tasks[0], &rnp->blocked_tasks[2]);
+ list_splice_init(&rnp->blocked_tasks[1], &rnp->blocked_tasks[3]);
+ must_wait = rcu_preempted_readers_exp(rnp);
+ spin_unlock(&rnp->lock); /* irqs remain disabled */
+ if (!must_wait)
+ rcu_report_exp_rnp(rsp, rnp);
+}
+
/*
- * Wait for an rcu-preempt grace period. We are supposed to expedite the
- * grace period, but this is the crude slow compatability hack, so just
- * invoke synchronize_rcu().
+ * Wait for an rcu-preempt grace period, but expedite it. The basic idea
+ * is to invoke synchronize_sched_expedited() to push all the tasks to
+ * the ->blocked_tasks[] lists, move all entries from the first set of
+ * ->blocked_tasks[] lists to the second set, and finally wait for this
+ * second set to drain.
*/
void synchronize_rcu_expedited(void)
{
- synchronize_rcu();
+ unsigned long flags;
+ struct rcu_node *rnp;
+ struct rcu_state *rsp = &rcu_preempt_state;
+ long snap;
+ int trycount = 0;
+
+ smp_mb(); /* Caller's modifications seen first by other CPUs. */
+ snap = ACCESS_ONCE(sync_rcu_preempt_exp_count) + 1;
+ smp_mb(); /* Above access cannot bleed into critical section. */
+
+ /*
+ * Acquire lock, falling back to synchronize_rcu() if too many
+ * lock-acquisition failures. Of course, if someone does the
+ * expedited grace period for us, just leave.
+ */
+ while (!mutex_trylock(&sync_rcu_preempt_exp_mutex)) {
+ if (trycount++ < 10)
+ udelay(trycount * num_online_cpus());
+ else {
+ synchronize_rcu();
+ return;
+ }
+ if ((ACCESS_ONCE(sync_rcu_preempt_exp_count) - snap) > 0)
+ goto mb_ret; /* Others did our work for us. */
+ }
+ if ((ACCESS_ONCE(sync_rcu_preempt_exp_count) - snap) > 0)
+ goto unlock_mb_ret; /* Others did our work for us. */
+
+ /* force all RCU readers onto blocked_tasks[]. */
+ synchronize_sched_expedited();
+
+ spin_lock_irqsave(&rsp->onofflock, flags);
+
+ /* Initialize ->expmask for all non-leaf rcu_node structures. */
+ rcu_for_each_nonleaf_node_breadth_first(rsp, rnp) {
+ spin_lock(&rnp->lock); /* irqs already disabled. */
+ rnp->expmask = rnp->qsmaskinit;
+ spin_unlock(&rnp->lock); /* irqs remain disabled. */
+ }
+
+ /* Snapshot current state of ->blocked_tasks[] lists. */
+ rcu_for_each_leaf_node(rsp, rnp)
+ sync_rcu_preempt_exp_init(rsp, rnp);
+ if (NUM_RCU_NODES > 1)
+ sync_rcu_preempt_exp_init(rsp, rcu_get_root(rsp));
+
+ spin_unlock_irqrestore(&rsp->onofflock, flags);
+
+ /* Wait for snapshotted ->blocked_tasks[] lists to drain. */
+ rnp = rcu_get_root(rsp);
+ wait_event(sync_rcu_preempt_exp_wq,
+ sync_rcu_preempt_exp_done(rnp));
+
+ /* Clean up and exit. */
+ smp_mb(); /* ensure expedited GP seen before counter increment. */
+ ACCESS_ONCE(sync_rcu_preempt_exp_count)++;
+unlock_mb_ret:
+ mutex_unlock(&sync_rcu_preempt_exp_mutex);
+mb_ret:
+ smp_mb(); /* ensure subsequent action seen after grace period. */
}
EXPORT_SYMBOL_GPL(synchronize_rcu_expedited);
@@ -655,6 +821,20 @@ void synchronize_rcu_expedited(void)
}
EXPORT_SYMBOL_GPL(synchronize_rcu_expedited);
+#ifdef CONFIG_HOTPLUG_CPU
+
+/*
+ * Because preemptable RCU does not exist, there is never any need to
+ * report on tasks preempted in RCU read-side critical sections during
+ * expedited RCU grace periods.
+ */
+static void rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp)
+{
+ return;
+}
+
+#endif /* #ifdef CONFIG_HOTPLUG_CPU */
+
/*
* Because preemptable RCU does not exist, it never has any work to do.
*/
diff --git a/kernel/rcutree_trace.c b/kernel/rcutree_trace.c
index 1984cdc..9d2c884 100644
--- a/kernel/rcutree_trace.c
+++ b/kernel/rcutree_trace.c
@@ -157,6 +157,7 @@ static void print_one_rcu_state(struct seq_file *m, struct rcu_state *rsp)
{
long gpnum;
int level = 0;
+ int phase;
struct rcu_node *rnp;
gpnum = rsp->gpnum;
@@ -173,10 +174,13 @@ static void print_one_rcu_state(struct seq_file *m, struct rcu_state *rsp)
seq_puts(m, "\n");
level = rnp->level;
}
- seq_printf(m, "%lx/%lx %c>%c %d:%d ^%d ",
+ phase = gpnum & 0x1;
+ seq_printf(m, "%lx/%lx %c%c>%c%c %d:%d ^%d ",
rnp->qsmask, rnp->qsmaskinit,
- "T."[list_empty(&rnp->blocked_tasks[gpnum & 1])],
- "T."[list_empty(&rnp->blocked_tasks[!(gpnum & 1)])],
+ "T."[list_empty(&rnp->blocked_tasks[phase])],
+ "E."[list_empty(&rnp->blocked_tasks[phase + 2])],
+ "T."[list_empty(&rnp->blocked_tasks[!phase])],
+ "E."[list_empty(&rnp->blocked_tasks[!phase + 2])],
rnp->grplo, rnp->grphi, rnp->grpnum);
}
seq_puts(m, "\n");
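The "others did our work for us" test in synchronize_rcu_expedited() above can be modeled in user space. This is a minimal sketch, not the kernel code: exp_count stands in for sync_rcu_preempt_exp_count, and exp_snapshot, exp_done_since and exp_gp_complete are hypothetical names. Unsigned arithmetic is used here so that counter wraparound is well defined in plain C.

```c
#include <stdbool.h>

/* Hypothetical stand-in for sync_rcu_preempt_exp_count. */
static unsigned long exp_count;

/* Snapshot taken on entry: current count plus one, as in the patch. */
static unsigned long exp_snapshot(void)
{
	return exp_count + 1;
}

/*
 * True once the counter has moved strictly past the snapshot, that is,
 * at least two expedited grace periods have completed since the snapshot
 * was taken. The signed difference keeps the test valid across wraparound.
 */
static bool exp_done_since(unsigned long snap)
{
	return (long)(exp_count - snap) > 0;
}

/* Called by whoever finishes an expedited grace period. */
static void exp_gp_complete(void)
{
	exp_count++;
}
```

A single completion after the snapshot is not enough in this model. A plausible reading is that a grace period already in flight at snapshot time may have started before the caller's updates, so only the one after it is guaranteed to cover them.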
^ permalink raw reply related [flat|nested] 15+ messages in thread
* [tip:core/rcu] rcu: Make RCU's CPU-stall detector be default
2009-12-02 20:10 ` [PATCH tip/core/rcu 4/4] rcu: make RCU's CPU-stall detector be default Paul E. McKenney
@ 2009-12-03 13:22 ` tip-bot for Paul E. McKenney
0 siblings, 0 replies; 15+ messages in thread
From: tip-bot for Paul E. McKenney @ 2009-12-03 13:22 UTC (permalink / raw)
To: linux-tip-commits; +Cc: linux-kernel, paulmck, hpa, mingo, tglx, laijs, mingo
Commit-ID: 8bfb2f8e655b9d0c45fde679fcd5fd97e34513db
Gitweb: http://git.kernel.org/tip/8bfb2f8e655b9d0c45fde679fcd5fd97e34513db
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
AuthorDate: Wed, 2 Dec 2009 12:10:16 -0800
Committer: Ingo Molnar <mingo@elte.hu>
CommitDate: Thu, 3 Dec 2009 11:35:27 +0100
rcu: Make RCU's CPU-stall detector be default
The RCU_CPU_STALL_DETECTOR costs almost nothing and has located
some bugs that might otherwise have been difficult to track
down. Make it be default for the TREE RCU implementations.
The vmlinux size impact is limited (on 64-bit x86 defconfig):
text data bss dec hex filename
8440248 1260076 995588 10695912 a334e8 vmlinux.before
8440774 1260060 995588 10696422 a336e6 vmlinux.after
+526 bytes - acceptable default cost.
For RAM starved systems, TINY_RCU does not support CPU-stall detection
and is much smaller, but then again it is a uniprocessor...
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12597846162906-git-send-email->
[ v2: added image size calculations to the changelog ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
lib/Kconfig.debug | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 8911558..50e0e78 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -749,7 +749,7 @@ config RCU_TORTURE_TEST_RUNNABLE
config RCU_CPU_STALL_DETECTOR
bool "Check for stalled CPUs delaying RCU grace periods"
depends on TREE_RCU || TREE_PREEMPT_RCU
- default n
+ default y
help
This option causes RCU to printk information on which
CPUs are delaying the current grace period, but only when
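The size figures in the changelog above can be re-derived with a few lines of C. This is only arithmetic on the two rows of the table; struct size_row and size_delta are hypothetical names introduced for the sketch.

```c
/* One row of size(1) output, using the columns shown in the changelog. */
struct size_row {
	long text;
	long data;
	long bss;
	long dec;
};

/* Column-by-column difference between two builds. */
static struct size_row size_delta(struct size_row before, struct size_row after)
{
	struct size_row d;

	d.text = after.text - before.text;
	d.data = after.data - before.data;
	d.bss = after.bss - before.bss;
	d.dec = after.dec - before.dec;
	return d;
}

/* Values copied from the vmlinux.before and vmlinux.after rows. */
static const struct size_row vmlinux_before = {
	8440248, 1260076, 995588, 10695912
};
static const struct size_row vmlinux_after = {
	8440774, 1260060, 995588, 10696422
};
```

Applied to the table, the text column grows by 526 bytes, which matches the figure quoted in the changelog. The data column shrinks by 16 bytes, so the total (dec) grows by 510 bytes.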
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH tip/core/rcu 3/4] rcu: add expedited grace-period support for preemptible RCU
2009-12-03 9:26 ` Lai Jiangshan
@ 2009-12-03 14:41 ` Paul E. McKenney
0 siblings, 0 replies; 15+ messages in thread
From: Paul E. McKenney @ 2009-12-03 14:41 UTC (permalink / raw)
To: Lai Jiangshan
Cc: linux-kernel, mingo, dipankar, akpm, mathieu.desnoyers, josh,
dvhltc, niv, tglx, peterz, rostedt, Valdis.Kletnieks, dhowells
On Thu, Dec 03, 2009 at 05:26:24PM +0800, Lai Jiangshan wrote:
> Paul E. McKenney wrote:
> > From: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >
> > Implement an synchronize_rcu_expedited() for preemptible RCU that actually
> > is expedited. This uses synchronize_sched_expedited() to force all
> > threads currently running in a preemptible-RCU read-side critical section
> > onto the appropriate ->blocked_tasks[] list, then takes a snapshot of
> > all of these lists and waits for them to drain.
> >
>
> > 3. Add an implementation of synchronize_rcu_expedited() that
> > actually expedites preemptible-RCU grace periods.
>
> It's very nice.
I am glad you like it!
> But I don't understand everything.
>
> 1) Why can it be sped up (in theory)?
> synchronize_sched_expedited() does speed things up, because the
> migration threads are the highest-priority threads.
>
> But for synchronize_rcu_expedited(), some preempted tasks in ->blocked_tasks[]
> may wait on the runqueue for a long time because other
> higher-priority threads arrive.
>
> A simple comparison:
> synchronize_sched_expedited()
> ==> wake_up_process(rq->migration_thread) to force a schedule on each CPU,
> which forces read-side critical sections to report their end earlier,
> or we can say "it forces readers to run to the end faster".
>
> synchronize_rcu_expedited()
> ==> Nothing forces a preempted reader to run to the end faster.
You are quite right, and this is one reason why one of the items on my
todo list is "RCU priority boosting". I am currently doing some work
to prepare for this by simplifying force_quiescent_state(). The general
idea will be to traverse the ->blocked_tasks[] lists to raise priorities
if the grace period goes too long.
That said, the purpose of synchronize_rcu_expedited() is not to provide
real-time response, but rather to provide short -average- grace-period
durations. In the common case, RCU read-side critical sections do not
get preempted to begin with, so in the common case, this implementation
of synchronize_rcu_expedited() should provide short average grace-period
durations.
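To make the common-case argument concrete, here is a toy user-space model of the snapshot-and-drain step for a single rcu_node. It is a sketch under simplifying assumptions: the four ->blocked_tasks[] lists are reduced to counters, there is no locking, and struct toy_node and the toy_* functions are hypothetical names that do not exist in the kernel.

```c
#include <stdbool.h>

/* Counters stand in for the four ->blocked_tasks[] lists of one rcu_node. */
struct toy_node {
	int blocked[4];	/* [0],[1]: normal GP phases; [2],[3]: expedited */
};

/* A reader is preempted inside its critical section during GP phase 0 or 1. */
static void toy_block(struct toy_node *n, int phase)
{
	n->blocked[phase & 1]++;
}

/* Model of sync_rcu_preempt_exp_init(): splice [0]->[2] and [1]->[3]. */
static void toy_exp_snapshot(struct toy_node *n)
{
	n->blocked[2] += n->blocked[0];
	n->blocked[0] = 0;
	n->blocked[3] += n->blocked[1];
	n->blocked[1] = 0;
}

/* Model of rcu_preempted_readers_exp(). */
static bool toy_exp_pending(const struct toy_node *n)
{
	return n->blocked[2] != 0 || n->blocked[3] != 0;
}

/*
 * A snapshotted reader on expedited list idx (2 or 3) leaves its critical
 * section. Returns true if it was the last one, which is the point where
 * the real code would report up the tree via rcu_report_exp_rnp().
 */
static bool toy_exp_reader_exit(struct toy_node *n, int idx)
{
	bool was_pending = toy_exp_pending(n);

	if (n->blocked[idx] > 0)
		n->blocked[idx]--;
	return was_pending && !toy_exp_pending(n);
}
```

In this model, a reader that blocks after the snapshot lands on one of the first two counters and does not delay the expedited grace period. If no reader was preempted at snapshot time, nothing is pending and the wait ends immediately.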
> 2) Why do we introduce an API that no one uses?
> I remember that the net guys requested an expedited synchronize_rcu(),
> but currently no one uses it.
There have been several requests for it over the years, so I feel
justified providing it. If it is still unused some years hence, it
is really easy to remove it, but it is really hard to provide it on a
moment's notice.
> Beware my thinking may be wrong!
That would apply to both of us! ;-)
Thanx, Paul
^ permalink raw reply [flat|nested] 15+ messages in thread
Thread overview: 15+ messages
2009-12-02 20:09 [PATCH tip/core/rcu 0/4] rcu: preemptible expedited grace periods and cleanups Paul E. McKenney
2009-12-02 20:10 ` [PATCH tip/core/rcu 1/4] rcu: rename "quiet" functions Paul E. McKenney
2009-12-02 23:20 ` Josh Triplett
2009-12-03 13:22 ` [tip:core/rcu] rcu: Rename " tip-bot for Paul E. McKenney
2009-12-02 20:10 ` [PATCH tip/core/rcu 2/4] rcu: enable fourth level of TREE_RCU hierarchy Paul E. McKenney
2009-12-02 23:25 ` Josh Triplett
2009-12-03 0:20 ` Paul E. McKenney
2009-12-03 13:22 ` [tip:core/rcu] rcu: Enable " tip-bot for Paul E. McKenney
2009-12-02 20:10 ` [PATCH tip/core/rcu 3/4] rcu: add expedited grace-period support for preemptible RCU Paul E. McKenney
2009-12-03 9:26 ` Lai Jiangshan
2009-12-03 14:41 ` Paul E. McKenney
2009-12-03 13:22 ` [tip:core/rcu] rcu: Add " tip-bot for Paul E. McKenney
2009-12-02 20:10 ` [PATCH tip/core/rcu 4/4] rcu: make RCU's CPU-stall detector be default Paul E. McKenney
2009-12-03 13:22 ` [tip:core/rcu] rcu: Make " tip-bot for Paul E. McKenney
2009-12-03 8:53 ` [PATCH tip/core/rcu 0/4] rcu: preemptible expedited grace periods and cleanups Lai Jiangshan