Re: [PATCH 2/3] rcu: Defer RCU kthreads wakeup when CPU is dying

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Joel Fernandes <joel@joelfernandes.org>
To: Frederic Weisbecker <frederic@kernel.org>
Cc: LKML <linux-kernel@vger.kernel.org>,
	Boqun Feng <boqun.feng@gmail.com>,
	Neeraj Upadhyay <neeraj.upadhyay@amd.com>,
	Uladzislau Rezki <urezki@gmail.com>,
	Zqiang <qiang.zhang1211@gmail.com>, rcu <rcu@vger.kernel.org>,
	"Paul E . McKenney" <paulmck@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: [PATCH 2/3] rcu: Defer RCU kthreads wakeup when CPU is dying
Date: Mon, 18 Dec 2023 22:38:52 -0500	[thread overview]
Message-ID: <65811051.d40a0220.75c79.66cf@mx.google.com> (raw)
In-Reply-To: <20231218231916.11719-3-frederic@kernel.org>

On Tue, Dec 19, 2023 at 12:19:15AM +0100, Frederic Weisbecker wrote:
> When the CPU goes idle for the last time during the CPU down hotplug
> process, RCU reports a final quiescent state for the current CPU. If
> this quiescent state propagates up to the top, some tasks may then be
> woken up to complete the grace period: the main grace period kthread
> and/or the expedited main workqueue (or kworker).
> 
> If those kthreads have a SCHED_FIFO policy, the wake up can indirectly
> arm the RT bandwith timer to the local offline CPU. Since this happens
> after hrtimers have been migrated at CPUHP_AP_HRTIMERS_DYING stage, the
> timer gets ignored. Therefore if the RCU kthreads are waiting for RT
> bandwidth to be available, they may never be actually scheduled.
> 
> This triggers TREE03 rcutorture hangs:
> 
> 	 rcu: INFO: rcu_preempt self-detected stall on CPU
> 	 rcu:     4-...!: (1 GPs behind) idle=9874/1/0x4000000000000000 softirq=0/0 fqs=20 rcuc=21071 jiffies(starved)
> 	 rcu:     (t=21035 jiffies g=938281 q=40787 ncpus=6)
> 	 rcu: rcu_preempt kthread starved for 20964 jiffies! g938281 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
> 	 rcu:     Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
> 	 rcu: RCU grace-period kthread stack dump:
> 	 task:rcu_preempt     state:R  running task     stack:14896 pid:14    tgid:14    ppid:2      flags:0x00004000
> 	 Call Trace:
> 	  <TASK>
> 	  __schedule+0x2eb/0xa80
> 	  schedule+0x1f/0x90
> 	  schedule_timeout+0x163/0x270
> 	  ? __pfx_process_timeout+0x10/0x10
> 	  rcu_gp_fqs_loop+0x37c/0x5b0
> 	  ? __pfx_rcu_gp_kthread+0x10/0x10
> 	  rcu_gp_kthread+0x17c/0x200
> 	  kthread+0xde/0x110
> 	  ? __pfx_kthread+0x10/0x10
> 	  ret_from_fork+0x2b/0x40
> 	  ? __pfx_kthread+0x10/0x10
> 	  ret_from_fork_asm+0x1b/0x30
> 	  </TASK>
> 
> The situation can't be solved with just unpinning the timer. The hrtimer
> infrastructure and the nohz heuristics involved in finding the best
> remote target for an unpinned timer would then also need to handle
> enqueues from an offline CPU in the most horrendous way.
> 
> So fix this on the RCU side instead and defer the wake up to an online
> CPU if it's too late for the local one.

Ah, ideally we'd not run into this if sched_feat(TTWU_QUEUE) was enabled
but then in any case there is also the ttwu_queue_cond() also shutting down
the remote queueing..

> Reported-by: Paul E. McKenney <paulmck@kernel.org>
> Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> ---
>  kernel/rcu/tree.c     | 34 +++++++++++++++++++++++++++++++++-
>  kernel/rcu/tree_exp.h |  3 +--
>  2 files changed, 34 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 3ac3c846105f..157f3ca2a9b5 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1013,6 +1013,38 @@ static bool rcu_future_gp_cleanup(struct rcu_node *rnp)
>  	return needmore;
>  }
>  
> +static void swake_up_one_online_ipi(void *arg)
> +{
> +	struct swait_queue_head *wqh = arg;
> +
> +	swake_up_one(wqh);
> +}

Speaking of, the scheduler refuses to do remote-IPI-style wakeups
(TTWU_QUEUE) whenever the destination CPU is in a hotplug state.

static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
{
	/*
	 * Do not complicate things with the async wake_list while the CPU is
	 * in hotplug state.
	 */
	if (!cpu_active(cpu))
		return false;
	...
}

Along these lines, I wonder if, it is safe to do a wakeup in this fashion (as
done by this patch) if the destination CPU was also going down.

Also the same ttwu_queue_cond() checks for CPU affinities before deciding to
not do the IPI-style queue.

	/* Ensure the task will still be allowed to run on the CPU. */
	if (!cpumask_test_cpu(cpu, p->cpus_ptr))
		return false;

Not that anyone should be changing RCU thread priorities around while the IPI
is in flight, but...

I wonder if the reason TTWU is excessively paranoid is that the IPI can be
delayed for example, leading to race conditions.

Anyway, just my 2 cents.

Happy holidays! thanks,

 - Joel


> +
> +static void swake_up_one_online(struct swait_queue_head *wqh)
> +{
> +	int cpu = get_cpu();
> +
> +	/*
> +	 * If called from rcutree_report_cpu_starting(), wake up
> +	 * is dangerous that late in the CPU-down hotplug process. The
> +	 * scheduler might queue an ignored hrtimer. Defer the wake up
> +	 * to an online CPU instead.
> +	 */
> +	if (unlikely(cpu_is_offline(cpu))) {
> +		int target;
> +
> +		target = cpumask_any_and(housekeeping_cpumask(HK_TYPE_RCU),
> +					 cpu_online_mask);
> +
> +		smp_call_function_single(target, swake_up_one_online_ipi,
> +					 wqh, 0);
> +		put_cpu();
> +	} else {
> +		put_cpu();
> +		swake_up_one(wqh);
> +	}
> +}
> +
>  /*
>   * Awaken the grace-period kthread.  Don't do a self-awaken (unless in an
>   * interrupt or softirq handler, in which case we just might immediately
> @@ -1037,7 +1069,7 @@ static void rcu_gp_kthread_wake(void)
>  		return;
>  	WRITE_ONCE(rcu_state.gp_wake_time, jiffies);
>  	WRITE_ONCE(rcu_state.gp_wake_seq, READ_ONCE(rcu_state.gp_seq));
> -	swake_up_one(&rcu_state.gp_wq);
> +	swake_up_one_online(&rcu_state.gp_wq);
>  }
>  
>  /*
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index 6d7cea5d591f..2ac440bc7e10 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -173,7 +173,6 @@ static bool sync_rcu_exp_done_unlocked(struct rcu_node *rnp)
>  	return ret;
>  }
>  
> -
>  /*
>   * Report the exit from RCU read-side critical section for the last task
>   * that queued itself during or before the current expedited preemptible-RCU
> @@ -201,7 +200,7 @@ static void __rcu_report_exp_rnp(struct rcu_node *rnp,
>  			raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
>  			if (wake) {
>  				smp_mb(); /* EGP done before wake_up(). */
> -				swake_up_one(&rcu_state.expedited_wq);
> +				swake_up_one_online(&rcu_state.expedited_wq);
>  			}
>  			break;
>  		}
> -- 
> 2.42.1
>

next prev parent reply	other threads:[~2023-12-19  3:38 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-12-18 23:19 [PATCH 0/3] timers & RCU: Fix TREE03 stalls Frederic Weisbecker
2023-12-18 23:19 ` [PATCH 1/3] hrtimer: Report offline hrtimer enqueue Frederic Weisbecker
2023-12-18 23:19 ` [PATCH 2/3] rcu: Defer RCU kthreads wakeup when CPU is dying Frederic Weisbecker
2023-12-19  3:38   ` Joel Fernandes [this message]
2023-12-19 11:56     ` Frederic Weisbecker
2023-12-20  3:01       ` Joel Fernandes
2023-12-20 15:50         ` Frederic Weisbecker
2023-12-21  0:57           ` Paul E. McKenney
2023-12-19  4:42   ` Hillf Danton
2023-12-19 23:42     ` Frederic Weisbecker
2023-12-19 15:29   ` Paul E. McKenney
2023-12-19 15:47     ` Frederic Weisbecker
2023-12-20  8:24   ` Z qiang
2023-12-20 15:13     ` Frederic Weisbecker
2024-08-12  9:10       ` Cheng-Jui Wang (王正睿)
2024-08-12 11:13         ` Frederic Weisbecker
2024-08-12 11:53           ` Cheng-Jui Wang (王正睿)
2023-12-18 23:19 ` [PATCH 3/3] rcu/exp: Remove full barrier upon main thread wakeup Frederic Weisbecker
2023-12-19  1:40 ` [PATCH 0/3] timers & RCU: Fix TREE03 stalls Paul E. McKenney

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=65811051.d40a0220.75c79.66cf@mx.google.com \
    --to=joel@joelfernandes.org \
    --cc=boqun.feng@gmail.com \
    --cc=frederic@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=neeraj.upadhyay@amd.com \
    --cc=paulmck@kernel.org \
    --cc=peterz@infradead.org \
    --cc=qiang.zhang1211@gmail.com \
    --cc=rcu@vger.kernel.org \
    --cc=tglx@linutronix.de \
    --cc=urezki@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.