All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Paul E. McKenney" <paulmck@linux.ibm.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: rcu@vger.kernel.org, linux-kernel@vger.kernel.org,
	mingo@kernel.org, jiangshanlai@gmail.com, dipankar@in.ibm.com,
	akpm@linux-foundation.org, mathieu.desnoyers@efficios.com,
	josh@joshtriplett.org, tglx@linutronix.de, rostedt@goodmis.org,
	dhowells@redhat.com, edumazet@google.com, fweisbec@gmail.com,
	oleg@redhat.com, joel@joelfernandes.org
Subject: Re: [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs
Date: Thu, 8 Aug 2019 14:30:12 -0700	[thread overview]
Message-ID: <20190808213012.GA28773@linux.ibm.com> (raw)
In-Reply-To: <20190808203541.GA8160@linux.ibm.com>

On Thu, Aug 08, 2019 at 01:35:41PM -0700, Paul E. McKenney wrote:
> On Wed, Aug 07, 2019 at 02:41:31PM -0700, Paul E. McKenney wrote:
> > On Tue, Aug 06, 2019 at 11:08:24AM -0700, Paul E. McKenney wrote:
> > > On Mon, Aug 05, 2019 at 10:48:00AM -0700, Paul E. McKenney wrote:
> > > > On Mon, Aug 05, 2019 at 05:50:24PM +0200, Peter Zijlstra wrote:
> > > > > On Mon, Aug 05, 2019 at 07:54:48AM -0700, Paul E. McKenney wrote:
> > > > > 
> > > > > > > Right; so clearly we're not understanding what's happening. That seems
> > > > > > > like a requirement for actually doing a patch.
> > > > > > 
> > > > > > Almost but not quite.  It is a requirement for a patch *that* *is*
> > > > > > *supposed* *to* *be* *a* *fix*.  If you are trying to prohibit me from
> > > > > > writing experimental patches, please feel free to take a long walk on
> > > > > > a short pier.
> > > > > > 
> > > > > > Understood???
> > > > > 
> > > > > Ah, my bad, I thought you were actually proposing this as an actual
> > > > > patch. I now see that is my bad, I'd overlooked the RFC part.
> > > > 
> > > > No problem!
> > > > 
> > > > And of course adding tracing decreases the frequency and duration of
> > > > the multi_cpu_stop().  Re-running with shorter-duration triggering.  ;-)
> > > 
> > > And I did eventually get a good trace.  If I am interpreting this trace
> > > correctly, the torture_-135 task didn't get around to attempting to wake
> > > up all of the CPUs.  I will try again, but this time with the sched_switch
> > > trace event enabled.
> > > 
> > > As a side note, enabling ftrace from the command line seems to interact
> > > badly with turning tracing off and on in the kernel, so I eventually
> > > resorted to trace_printk() in the functions of interest.  The trace
> > > output is below, followed by the current diagnostic patch.  Please note
> > > that I am -not- using the desperation hammer-the-scheduler patches.
> > > 
> > > More as I learn more!
> > 
> > And of course I forgot to dump out the online CPUs, so I really had no
> > idea whether or not all the CPUs were accounted for.  I added tracing
> > to dump out the online CPUs at the beginning of __stop_cpus() and then
> > reworked it a few times to get the problem to happen in reasonable time.
> > Please see below for the resulting annotated trace.
> > 
> > I was primed to expect a lost IPI, perhaps due to yet another qemu bug,
> > but all the migration threads are running within about 2 milliseconds.
> > It is then almost two minutes(!) until the next trace message.
> > 
> > Looks like time to (very carefully!) instrument multi_cpu_stop().
> > 
> > Of course, if you have any ideas, please do not keep them a secret!
> 
> Functionally, multi_cpu_stop() is working fine, according to the trace
> below (search for a line beginning with TAB).  But somehow CPU 2 took
> almost three -minutes- to do one iteration of the loop.  The prime suspect
> in that loop is cpu_relax() due to the hypervisor having an opportunity
> to do something at that point.  The commentary below (again, search for
> a line beginning with TAB) gives my analysis.
> 
> Of course, if I am correct, it should be possible to catch cpu_relax()
> in the act.  That is the next step, give or take the Heisenbuggy nature
> of this beast.
> 
> Another thing for me to try is to run longer with !NO_HZ_FULL, just in
> case the earlier runs just got lucky.
> 
> Thoughts?

And it really can happen:

[ 1881.467922] migratio-33      4...1 1879530317us : stop_machine_yield: cpu_relax() took 756140 ms

The previous timestamp was 1123391100us, so the cpu_relax() is almost
exactly the full delay.

But another instance stalled for many minutes without a ten-second
cpu_relax().  So it is not just cpu_relax() causing trouble.  I could
rationalize that vCPU preemption being at fault...

And my diagnostic patch is below, just in case I am doing something
stupid with that.

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index ce00b442ced0..1a50ed258ef0 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3569,6 +3569,7 @@ void __init rcu_init(void)
 	rcu_par_gp_wq = alloc_workqueue("rcu_par_gp", WQ_MEM_RECLAIM, 0);
 	WARN_ON(!rcu_par_gp_wq);
 	srcu_init();
+	tracing_off();
 }
 
 #include "tree_stall.h"
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0b22e55cebe8..a5a879a49051 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -479,6 +479,7 @@ void wake_up_q(struct wake_q_head *head)
 {
 	struct wake_q_node *node = head->first;
 
+	trace_printk("entered\n");
 	while (node != WAKE_Q_TAIL) {
 		struct task_struct *task;
 
@@ -509,6 +510,7 @@ void resched_curr(struct rq *rq)
 	struct task_struct *curr = rq->curr;
 	int cpu;
 
+	trace_printk("entered\n");
 	lockdep_assert_held(&rq->lock);
 
 	if (test_tsk_need_resched(curr))
@@ -1197,6 +1199,7 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 
 void activate_task(struct rq *rq, struct task_struct *p, int flags)
 {
+	trace_printk("entered\n");
 	if (task_contributes_to_load(p))
 		rq->nr_uninterruptible--;
 
@@ -1298,6 +1301,7 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
 {
 	const struct sched_class *class;
 
+	trace_printk("entered\n");
 	if (p->sched_class == rq->curr->sched_class) {
 		rq->curr->sched_class->check_preempt_curr(rq, p, flags);
 	} else {
@@ -2097,6 +2101,7 @@ ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
 static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,
 			   struct rq_flags *rf)
 {
+	trace_printk("entered\n");
 	check_preempt_curr(rq, p, wake_flags);
 	p->state = TASK_RUNNING;
 	trace_sched_wakeup(p);
@@ -2132,6 +2137,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 {
 	int en_flags = ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK;
 
+	trace_printk("entered\n");
 	lockdep_assert_held(&rq->lock);
 
 #ifdef CONFIG_SMP
@@ -2178,9 +2184,11 @@ void sched_ttwu_pending(void)
 	struct task_struct *p, *t;
 	struct rq_flags rf;
 
+	trace_printk("entered\n");
 	if (!llist)
 		return;
 
+	trace_printk("non-NULL llist\n");
 	rq_lock_irqsave(rq, &rf);
 	update_rq_clock(rq);
 
@@ -2192,6 +2200,7 @@ void sched_ttwu_pending(void)
 
 void scheduler_ipi(void)
 {
+	trace_printk("entered\n");
 	/*
 	 * Fold TIF_NEED_RESCHED into the preempt_count; anybody setting
 	 * TIF_NEED_RESCHED remotely (for the first time) will also send
@@ -2232,6 +2241,7 @@ static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
 {
 	struct rq *rq = cpu_rq(cpu);
 
+	trace_printk("%s entered, CPU %d\n", __func__, cpu);
 	p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
 
 	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list)) {
@@ -2277,6 +2287,7 @@ static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
 	struct rq *rq = cpu_rq(cpu);
 	struct rq_flags rf;
 
+	trace_printk("%s entered\n", __func__);
 #if defined(CONFIG_SMP)
 	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
 		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
@@ -2399,6 +2410,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	unsigned long flags;
 	int cpu, success = 0;
 
+	trace_printk("entered\n");
 	preempt_disable();
 	if (p == current) {
 		/*
@@ -2545,6 +2557,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
  */
 int wake_up_process(struct task_struct *p)
 {
+	trace_printk("entered\n");
 	return try_to_wake_up(p, TASK_NORMAL, 0);
 }
 EXPORT_SYMBOL(wake_up_process);
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 5c2b2f90fae1..a07f77b9c1f2 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -21,6 +21,7 @@
 #include <linux/atomic.h>
 #include <linux/nmi.h>
 #include <linux/sched/wake_q.h>
+#include <linux/sched/clock.h>
 
 /*
  * Structure to determine completion condition and record errors.  May
@@ -80,6 +81,7 @@ static bool cpu_stop_queue_work(unsigned int cpu, struct cpu_stop_work *work)
 	unsigned long flags;
 	bool enabled;
 
+	trace_printk("entered for CPU %u\n", cpu);
 	preempt_disable();
 	raw_spin_lock_irqsave(&stopper->lock, flags);
 	enabled = stopper->enabled;
@@ -167,7 +169,7 @@ static void set_state(struct multi_stop_data *msdata,
 	/* Reset ack counter. */
 	atomic_set(&msdata->thread_ack, msdata->num_threads);
 	smp_wmb();
-	msdata->state = newstate;
+	WRITE_ONCE(msdata->state, newstate);
 }
 
 /* Last one to ack a state moves to the next state. */
@@ -179,7 +181,15 @@ static void ack_state(struct multi_stop_data *msdata)
 
 void __weak stop_machine_yield(const struct cpumask *cpumask)
 {
+	u64 starttime = local_clock();
+	u64 endtime;
+	const u64 delta = 100ULL * 1000ULL * 1000ULL * 1000ULL;
+
 	cpu_relax();
+	endtime = local_clock();
+	if (time_after64(endtime, starttime + delta))
+		trace_printk("cpu_relax() took %llu ms\n",
+			     (endtime - starttime) / (1000ULL * 1000ULL));
 }
 
 /* This is the cpu_stop function which stops the CPU. */
@@ -210,8 +220,9 @@ static int multi_cpu_stop(void *data)
 	do {
 		/* Chill out and ensure we re-read multi_stop_state. */
 		stop_machine_yield(cpumask);
-		if (msdata->state != curstate) {
-			curstate = msdata->state;
+		if (READ_ONCE(msdata->state) != curstate) {
+			curstate = READ_ONCE(msdata->state);
+			trace_printk("curstate = %d, ack = %d\n", curstate, atomic_read(&msdata->thread_ack));
 			switch (curstate) {
 			case MULTI_STOP_DISABLE_IRQ:
 				local_irq_disable();
@@ -382,6 +393,7 @@ static bool queue_stop_cpus_work(const struct cpumask *cpumask,
 	 * preempted by a stopper which might wait for other stoppers
 	 * to enter @fn which can lead to deadlock.
 	 */
+	trace_printk("entered\n");
 	preempt_disable();
 	stop_cpus_in_progress = true;
 	for_each_cpu(cpu, cpumask) {
@@ -402,11 +414,18 @@ static int __stop_cpus(const struct cpumask *cpumask,
 		       cpu_stop_fn_t fn, void *arg)
 {
 	struct cpu_stop_done done;
+	unsigned long j = jiffies;
 
+	tracing_on();
+	trace_printk("entered\n");
+	trace_printk("CPUs %*pbl online\n", cpumask_pr_args(cpu_online_mask));
 	cpu_stop_init_done(&done, cpumask_weight(cpumask));
 	if (!queue_stop_cpus_work(cpumask, fn, arg, &done))
 		return -ENOENT;
 	wait_for_completion(&done.completion);
+	tracing_off();
+	if (time_after(jiffies, j + HZ * 20))
+		ftrace_dump(DUMP_ALL);
 	return done.ret;
 }
 
@@ -442,6 +461,7 @@ int stop_cpus(const struct cpumask *cpumask, cpu_stop_fn_t fn, void *arg)
 {
 	int ret;
 
+	trace_printk("entered\n");
 	/* static works are used, process one request at a time */
 	mutex_lock(&stop_cpus_mutex);
 	ret = __stop_cpus(cpumask, fn, arg);
@@ -599,6 +619,7 @@ int stop_machine_cpuslocked(cpu_stop_fn_t fn, void *data,
 		.active_cpus = cpus,
 	};
 
+	trace_printk("entered\n");
 	lockdep_assert_cpus_held();
 
 	if (!stop_machine_initialized) {


  reply	other threads:[~2019-08-08 21:30 UTC|newest]

Thread overview: 57+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-08-02 15:14 [PATCH tip/core/rcu 0/14] No-CBs bypass addition for v5.4 Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 01/14] rcu/nocb: Atomic ->len field in rcu_segcblist structure Paul E. McKenney
2019-08-04 14:50   ` Peter Zijlstra
2019-08-04 14:52     ` Peter Zijlstra
2019-08-04 18:45       ` Paul E. McKenney
2019-08-04 18:42     ` Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 02/14] rcu/nocb: Add bypass callback queueing Paul E. McKenney
2019-08-07  0:03   ` Joel Fernandes
2019-08-07  0:16     ` Joel Fernandes
2019-08-07  0:35     ` Paul E. McKenney
2019-08-07  0:40       ` Steven Rostedt
2019-08-07  1:17         ` Paul E. McKenney
2019-08-07  1:24           ` Steven Rostedt
2019-08-07  3:47             ` Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 03/14] rcu/nocb: EXP Check use and usefulness of ->nocb_lock_contended Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 04/14] rcu/nocb: Print no-CBs diagnostics when rcutorture writer unduly delayed Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 05/14] rcu/nocb: Avoid synchronous wakeup in __call_rcu_nocb_wake() Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 06/14] rcu/nocb: Advance CBs after merge in rcutree_migrate_callbacks() Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 07/14] rcu/nocb: Reduce nocb_cb_wait() leaf rcu_node ->lock contention Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 08/14] rcu/nocb: Reduce __call_rcu_nocb_wake() " Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 09/14] rcu/nocb: Don't wake no-CBs GP kthread if timer posted under overload Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 10/14] rcu: Allow rcu_do_batch() to dynamically adjust batch sizes Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 11/14] EXP nohz: Add TICK_DEP_BIT_RCU Paul E. McKenney
2019-08-02 15:14 ` [PATCH RFC tip/core/rcu 12/14] rcu/nohz: Force on tick when invoking lots of callbacks Paul E. McKenney
2019-08-02 15:15 ` [PATCH RFC tip/core/rcu 13/14] rcutorture: Force on tick for readers and callback flooders Paul E. McKenney
2019-08-02 15:15 ` [PATCH RFC tip/core/rcu 14/14] rcu/nohz: Make multi_cpu_stop() enable tick on all online CPUs Paul E. McKenney
2019-08-04 14:43   ` Peter Zijlstra
2019-08-04 14:48     ` Peter Zijlstra
2019-08-04 18:41       ` Paul E. McKenney
2019-08-04 20:24         ` Paul E. McKenney
2019-08-05  4:19           ` Paul E. McKenney
2019-08-05  8:07             ` Peter Zijlstra
2019-08-05 14:47               ` Paul E. McKenney
2019-08-05  8:05         ` Peter Zijlstra
2019-08-05 14:54           ` Paul E. McKenney
2019-08-05 15:50             ` Peter Zijlstra
2019-08-05 17:48               ` Paul E. McKenney
2019-08-06 18:08                 ` Paul E. McKenney
2019-08-07 21:41                   ` Paul E. McKenney
2019-08-08 20:35                     ` Paul E. McKenney
2019-08-08 21:30                       ` Paul E. McKenney [this message]
2019-08-09 16:51                         ` Paul E. McKenney
2019-08-09 18:07                           ` Joel Fernandes
2019-08-09 18:39                             ` Paul E. McKenney
2019-08-12 21:02   ` Frederic Weisbecker
2019-08-12 23:23     ` Paul E. McKenney
2019-08-13  1:33       ` Joel Fernandes
2019-08-13 12:30       ` Frederic Weisbecker
2019-08-13 14:48         ` Paul E. McKenney
2019-08-14 17:55           ` Joel Fernandes
2019-08-14 22:05             ` Paul E. McKenney
2019-08-15 15:07               ` Joel Fernandes
2019-08-15 17:23                 ` Paul E. McKenney
2019-08-15 18:15                   ` Joel Fernandes
2019-08-15 18:39                     ` Paul E. McKenney
2019-08-15 19:42                       ` Joel Fernandes
2019-08-13 21:06       ` Paul E. McKenney

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190808213012.GA28773@linux.ibm.com \
    --to=paulmck@linux.ibm.com \
    --cc=akpm@linux-foundation.org \
    --cc=dhowells@redhat.com \
    --cc=dipankar@in.ibm.com \
    --cc=edumazet@google.com \
    --cc=fweisbec@gmail.com \
    --cc=jiangshanlai@gmail.com \
    --cc=joel@joelfernandes.org \
    --cc=josh@joshtriplett.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mingo@kernel.org \
    --cc=oleg@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rcu@vger.kernel.org \
    --cc=rostedt@goodmis.org \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.