* [PATCH v2] rcu: sysctl: Panic on RCU Stall
From: Daniel Bristot de Oliveira @ 2016-06-02 16:51 UTC (permalink / raw)
To: linux-kernel
Cc: Jonathan Corbet, Paul E. McKenney, Josh Triplett, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, Christian Borntraeger,
Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves

It is not always easy to determine the cause of an RCU stall just by
analysing the RCU stall messages, especially when the problem is caused
by the indirect starvation of RCU threads, for example when preempt_rcu
is not awakened due to the starvation of a timer softirq.

We have been hard-coding panic() in the RCU stall functions for
some time while testing kernel-rt. But this is not possible in
some scenarios, such as when supporting customers.

This patch implements the sysctl kernel.panic_on_rcu_stall. If
set to 1, the system will panic() when an RCU stall takes place,
enabling the capture of a vmcore. The vmcore provides a way to analyze
all kernel/task states, helping to identify the culprit and the
solution for the stall.

The kernel.panic_on_rcu_stall sysctl is disabled by default.

Changes from v1:
- Fixed a typo in the git log
- The if (sysctl_panic_on_rcu_stall) panic() check is now in a static function
- Fixed the CONFIG_TINY_RCU compilation issue
- The variable sysctl_panic_on_rcu_stall is now __read_mostly

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Arnaldo Carvalho de Melo <acme@kernel.org>
Tested-by: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
---
 Documentation/sysctl/kernel.txt | 12 ++++++++++++
 include/linux/kernel.h          |  1 +
 kernel/rcu/tree.c               | 12 ++++++++++++
 kernel/sysctl.c                 | 11 +++++++++++
 4 files changed, 36 insertions(+)
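
As an illustration of how the knob described in the changelog could be driven from userspace, here is a minimal sketch. It assumes the sysctl appears at /proc/sys/kernel/panic_on_rcu_stall once the patch is applied (the usual location for kern_table entries). The macro PANIC_ON_RCU_STALL_PATH and the helpers sysctl_read_int() and sysctl_write_int() are hypothetical names introduced only for this example; they are not part of the patch.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed location of the knob once this patch is applied. */
#define PANIC_ON_RCU_STALL_PATH "/proc/sys/kernel/panic_on_rcu_stall"

/* Read an integer sysctl value from path; returns 0 on success, -1 on error. */
static int sysctl_read_int(const char *path, int *val)
{
	FILE *f;
	int ok;

	assert(path != NULL && val != NULL);
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = (fscanf(f, "%d", val) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/* Write an integer sysctl value to path; returns 0 on success, -1 on error. */
static int sysctl_write_int(const char *path, int val)
{
	FILE *f;
	int ok;

	assert(path != NULL);
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = (fprintf(f, "%d\n", val) > 0);
	/* stdio buffers the write, so a rejected value surfaces at fclose(). */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

With root privileges, sysctl_write_int(PANIC_ON_RCU_STALL_PATH, 1) would arm the panic and sysctl_read_int(PANIC_ON_RCU_STALL_PATH, &v) would read the current setting back. Both helpers take the path as a parameter so they can be tried against an ordinary file first.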
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index a3683ce..3320460 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -58,6 +58,7 @@ show up in /proc/sys/kernel:
 - panic_on_stackoverflow
 - panic_on_unrecovered_nmi
 - panic_on_warn
+- panic_on_rcu_stall
 - perf_cpu_time_max_percent
 - perf_event_paranoid
 - perf_event_max_stack
@@ -618,6 +619,17 @@ a kernel rebuild when attempting to kdump at the location of a WARN().
 
 ==============================================================
 
+panic_on_rcu_stall:
+
+When set to 1, calls panic() after RCU stall detection messages. This
+is useful to define the root cause of RCU stalls using a vmcore.
+
+0: do not panic() when RCU stall takes place, default behavior.
+
+1: panic() after printing RCU stall messages.
+
+==============================================================
+
 perf_cpu_time_max_percent:
 
 Hints to the kernel how much CPU time it should be allowed to
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 94aa10f..c420821 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -451,6 +451,7 @@ extern int panic_on_oops;
 extern int panic_on_unrecovered_nmi;
 extern int panic_on_io_nmi;
 extern int panic_on_warn;
+extern int sysctl_panic_on_rcu_stall;
 extern int sysctl_panic_on_stackoverflow;
 
 extern bool crash_kexec_post_notifiers;
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index c7f1bc4..d531988 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -125,6 +125,8 @@ int rcu_num_lvls __read_mostly = RCU_NUM_LVLS;
 /* Number of rcu_nodes at specified level. */
 static int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
 int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
+/* panic() on RCU Stall sysctl. */
+int sysctl_panic_on_rcu_stall __read_mostly;
 
 /*
  * The rcu_scheduler_active variable transitions from zero to one just
@@ -1311,6 +1313,12 @@ static void rcu_stall_kick_kthreads(struct rcu_state *rsp)
 	}
 }
 
+static inline void panic_on_rcu_stall(void)
+{
+	if (sysctl_panic_on_rcu_stall)
+		panic("RCU Stall\n");
+}
+
 static void print_other_cpu_stall(struct rcu_state *rsp, unsigned long gpnum)
 {
 	int cpu;
@@ -1390,6 +1398,8 @@ static void print_other_cpu_stall(struct rcu_state *rsp, unsigned long gpnum)
 
 	rcu_check_gp_kthread_starvation(rsp);
 
+	panic_on_rcu_stall();
+
 	force_quiescent_state(rsp); /* Kick them all. */
 }
 
@@ -1430,6 +1440,8 @@ static void print_cpu_stall(struct rcu_state *rsp)
 			   jiffies + 3 * rcu_jiffies_till_stall_check() + 3);
 	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
 
+	panic_on_rcu_stall();
+
 	/*
 	 * Attempt to revive the RCU machinery by forcing a context switch.
 	 *
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 87b2fc3..35f0dcb 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1205,6 +1205,17 @@ static struct ctl_table kern_table[] = {
 		.extra2		= &one,
 	},
 #endif
+#if defined(CONFIG_TREE_RCU) || defined(CONFIG_PREEMPT_RCU)
+	{
+		.procname	= "panic_on_rcu_stall",
+		.data		= &sysctl_panic_on_rcu_stall,
+		.maxlen		= sizeof(sysctl_panic_on_rcu_stall),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one,
+	},
+#endif
 	{ }
 };
--
2.5.5
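
The interaction between the sysctl table entry and the panic_on_rcu_stall() helper in the patch above can be modelled in userspace. The sketch below is not kernel code: model_store_minmax() and model_would_panic() are hypothetical names. It assumes that proc_dointvec_minmax, given .extra1 = &zero and .extra2 = &one, rejects out-of-range writes with -EINVAL and leaves the stored value unchanged.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Userspace stand-in for the kernel variable added by the patch. */
static int sysctl_panic_on_rcu_stall;

/*
 * Hypothetical model of the range check applied on a sysctl write:
 * values outside [min, max] are rejected with -EINVAL and the stored
 * value is left untouched; in-range values are stored.
 */
static int model_store_minmax(int *data, int newval, int min, int max)
{
	assert(data != NULL);
	if (newval < min || newval > max)
		return -EINVAL;
	*data = newval;
	return 0;
}

/*
 * Model of the gate in panic_on_rcu_stall(): returns 1 at the point
 * where the kernel would call panic("RCU Stall\n"), 0 otherwise.
 */
static int model_would_panic(void)
{
	return sysctl_panic_on_rcu_stall != 0;
}
```

Under these assumptions the knob can only ever hold 0 or 1, so the if (sysctl_panic_on_rcu_stall) test in the patch behaves as a plain on/off switch that is off by default.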
* Re: [PATCH v2] rcu: sysctl: Panic on RCU Stall
From: Paul E. McKenney @ 2016-06-03 3:39 UTC (permalink / raw)
To: Daniel Bristot de Oliveira
Cc: linux-kernel, Jonathan Corbet, Josh Triplett, Steven Rostedt,
Mathieu Desnoyers, Lai Jiangshan, Christian Borntraeger,
Arnaldo Carvalho de Melo, Luis Claudio R. Goncalves
On Thu, Jun 02, 2016 at 01:51:41PM -0300, Daniel Bristot de Oliveira wrote:
> It is not always easy to determine the cause of an RCU stall just by
> analysing the RCU stall messages, especially when the problem is caused
> by the indirect starvation of RCU threads, for example when preempt_rcu
> is not awakened due to the starvation of a timer softirq.
>
> We have been hard-coding panic() in the RCU stall functions for
> some time while testing kernel-rt. But this is not possible in
> some scenarios, such as when supporting customers.
>
> This patch implements the sysctl kernel.panic_on_rcu_stall. If
> set to 1, the system will panic() when an RCU stall takes place,
> enabling the capture of a vmcore. The vmcore provides a way to analyze
> all kernel/task states, helping to identify the culprit and the
> solution for the stall.
>
> The kernel.panic_on_rcu_stall sysctl is disabled by default.
>
> Changes from v1:
> - Fixed a typo in the git log
> - The if(sysctl_panic_on_rcu_stall) panic() is in a static function
> - Fixed the CONFIG_TINY_RCU compilation issue
> - The var sysctl_panic_on_rcu_stall is now __read_mostly
>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Cc: Josh Triplett <josh@joshtriplett.org>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Lai Jiangshan <jiangshanlai@gmail.com>
> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
> Reviewed-by: Arnaldo Carvalho de Melo <acme@kernel.org>
> Tested-by: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>

Queued for testing and further review.

Thanx, Paul

> ---
> Documentation/sysctl/kernel.txt | 12 ++++++++++++
> include/linux/kernel.h | 1 +
> kernel/rcu/tree.c | 12 ++++++++++++
> kernel/sysctl.c | 11 +++++++++++
> 4 files changed, 36 insertions(+)
>
> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> index a3683ce..3320460 100644
> --- a/Documentation/sysctl/kernel.txt
> +++ b/Documentation/sysctl/kernel.txt
> @@ -58,6 +58,7 @@ show up in /proc/sys/kernel:
> - panic_on_stackoverflow
> - panic_on_unrecovered_nmi
> - panic_on_warn
> +- panic_on_rcu_stall
> - perf_cpu_time_max_percent
> - perf_event_paranoid
> - perf_event_max_stack
> @@ -618,6 +619,17 @@ a kernel rebuild when attempting to kdump at the location of a WARN().
>
> ==============================================================
>
> +panic_on_rcu_stall:
> +
> +When set to 1, calls panic() after RCU stall detection messages. This
> +is useful to define the root cause of RCU stalls using a vmcore.
> +
> +0: do not panic() when RCU stall takes place, default behavior.
> +
> +1: panic() after printing RCU stall messages.
> +
> +==============================================================
> +
> perf_cpu_time_max_percent:
>
> Hints to the kernel how much CPU time it should be allowed to
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index 94aa10f..c420821 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -451,6 +451,7 @@ extern int panic_on_oops;
> extern int panic_on_unrecovered_nmi;
> extern int panic_on_io_nmi;
> extern int panic_on_warn;
> +extern int sysctl_panic_on_rcu_stall;
> extern int sysctl_panic_on_stackoverflow;
>
> extern bool crash_kexec_post_notifiers;
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index c7f1bc4..d531988 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -125,6 +125,8 @@ int rcu_num_lvls __read_mostly = RCU_NUM_LVLS;
> /* Number of rcu_nodes at specified level. */
> static int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
> int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
> +/* panic() on RCU Stall sysctl. */
> +int sysctl_panic_on_rcu_stall __read_mostly;
>
> /*
> * The rcu_scheduler_active variable transitions from zero to one just
> @@ -1311,6 +1313,12 @@ static void rcu_stall_kick_kthreads(struct rcu_state *rsp)
> }
> }
>
> +static inline void panic_on_rcu_stall(void)
> +{
> + if (sysctl_panic_on_rcu_stall)
> + panic("RCU Stall\n");
> +}
> +
> static void print_other_cpu_stall(struct rcu_state *rsp, unsigned long gpnum)
> {
> int cpu;
> @@ -1390,6 +1398,8 @@ static void print_other_cpu_stall(struct rcu_state *rsp, unsigned long gpnum)
>
> rcu_check_gp_kthread_starvation(rsp);
>
> + panic_on_rcu_stall();
> +
> force_quiescent_state(rsp); /* Kick them all. */
> }
>
> @@ -1430,6 +1440,8 @@ static void print_cpu_stall(struct rcu_state *rsp)
> jiffies + 3 * rcu_jiffies_till_stall_check() + 3);
> raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>
> + panic_on_rcu_stall();
> +
> /*
> * Attempt to revive the RCU machinery by forcing a context switch.
> *
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 87b2fc3..35f0dcb 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1205,6 +1205,17 @@ static struct ctl_table kern_table[] = {
> .extra2 = &one,
> },
> #endif
> +#if defined(CONFIG_TREE_RCU) || defined(CONFIG_PREEMPT_RCU)
> + {
> + .procname = "panic_on_rcu_stall",
> + .data = &sysctl_panic_on_rcu_stall,
> + .maxlen = sizeof(sysctl_panic_on_rcu_stall),
> + .mode = 0644,
> + .proc_handler = proc_dointvec_minmax,
> + .extra1 = &zero,
> + .extra2 = &one,
> + },
> +#endif
> { }
> };
>
> --
> 2.5.5
>