Re: need_heavy_qs flag for PREEMPT=y kernels

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Paul E. McKenney" <paulmck@linux.ibm.com>
To: Joel Fernandes <joel@joelfernandes.org>
Cc: rcu@vger.kernel.org
Subject: Re: need_heavy_qs flag for PREEMPT=y kernels
Date: Mon, 12 Aug 2019 16:01:38 -0700	[thread overview]
Message-ID: <20190812230138.GS28441@linux.ibm.com> (raw)
In-Reply-To: <20190812212013.GB48751@google.com>

On Mon, Aug 12, 2019 at 05:20:13PM -0400, Joel Fernandes wrote:
> On Sun, Aug 11, 2019 at 08:53:06PM -0700, Paul E. McKenney wrote:
> > On Sun, Aug 11, 2019 at 11:21:42PM -0400, Joel Fernandes wrote:
> > > On Sun, Aug 11, 2019 at 02:13:18PM -0700, Paul E. McKenney wrote:
> > > [snip]
> > > > This leaves NO_HZ_FULL=y&&PREEMPT=y kernels.  In that case, RCU is
> > > > more aggressive about using resched_cpu() on CPUs that have not yet
> > > > reported a quiescent state for the current grace period.
> > > 
> > > Just wanted to ask something - how does resched_cpu() help for this case?
> > > 
> > > Consider a nohz_full CPU and a PREEMPT=y kernel. Say a single task is running
> > > in kernel mode with scheduler tick off. As we discussed, we have no help from
> > > cond_resched() (since its a PREEMPT=y kernel).  Because enough time has
> > > passed (jtsq*3), we send the CPU a re-scheduling IPI.
> > > 
> > > This seems not that useful. Even if we enter the scheduler due to the
> > > rescheduling flags set on that CPU, nothing will do the rcu_report_qs_rdp()
> > > or rcu_report_qs_rnp() on those CPUs, which are needed to propagate the
> > > quiescent state to the leaf node. Neither will anything to do a
> > > rcu_momentary_dyntick_idle() for that CPU. Without this, the grace period
> > > will still end up getting blocked.
> > > 
> > > Could you clarify which code paths on a nohz_full CPU running PREEMPT=y
> > > kernel actually helps to end the grace period when we call resched_cpu() on
> > > it?  Don't we need atleast do a rcu_momentary_dyntick_idle() from the
> > > scheduler IPI handler or from resched_cpu() for the benefit of a nohz_full
> > > CPU? Maybe I should do an experiment to see this all play out.
> > 
> > An experiment would be good!
> 
> Hi Paul,
> Some very interesting findings!
> 
> Experiment: a tight loop rcuperf thread bound to a nohz_full CPU with
> CONFIG_PREEMPT=y and CONFIG_NO_HZ_FULL=y. Executes for 5000 jiffies and
> exits. Diff for test is appended.
> 
> Inference: I see that the tick is off on the nohz_full CPU 3 (the looping
> thread is affined to CPU 3). The FQS loop does resched_cpu on 3, but the
> grace period is unable to come to an end with the hold up seemingly due to
> CPU 3.

Good catch!

> I see that the scheduler tick is off mostly, but occasionally is turned back
> on during the test loop. However it has no effect and the grace period is
> stuck on the same rcu_state.gp_seq value for the duration of the test. I
> think the scheduler-tick ineffectiveness could be because of this patch?
> https://lore.kernel.org/patchwork/patch/446044/

Unlikely, given that __rcu_pending(), which is now just rcu_pending(),
is only invoked from the scheduler-clock interrupt.

But from your traces below, clearly something is in need of repair.

> Relevant traces, sorry I did not wrap it for better readability:
> 
> Here I marked the start of the test:
> 
> [   24.852639] rcu_perf-163    13.... 1828278us : rcu_perf_writer: Start of rcuperf test
> [   24.853651] rcu_perf-163    13dN.1 1828284us : rcu_grace_period: rcu_preempt 18446744073709550677 cpuqs
> [   24.971709] rcu_pree-10      0d..1 1833228us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [   26.493607] rcu_pree-10      0d..1 2137286us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [   28.090618] rcu_pree-10      0d..1 2441290us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> 
> Scheduler tick came in but almost 6 seconds of start of test, probably because migrate thread increase number of tasks queued on RQ:
> 
> [   28.355513] rcu_perf-163     3d.h. 2487447us : rcu_sched_clock_irq: sched-tick
> [   29.592912] rcu_pree-10      0d..1 2745293us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [   30.238041] rcu_perf-163     3d..2 2879562us : sched_switch: prev_comm=rcu_perf_writer prev_pid=163 prev_prio=98 prev_state=R+ ==> next_comm=migration/3 next_pid=26 next_prio=0
> [   30.269203] migratio-26      3d..2 2879745us : sched_switch: prev_comm=migration/3 prev_pid=26 prev_prio=0 prev_state=S ==> next_comm=rcu_perf_writer next_pid=163 next_prio=98
> [   31.109635] rcu_pree-10      0d..1 3049301us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [   32.627686] rcu_pree-10      0d..1 3353298us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [   34.071163] rcu_pree-10      0d..1 3657299us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> 
> Several context-switches on CPU 3, but grace-period is still extended:
> 
> [   34.634814] rcu_perf-163     3d..2 3775310us : sched_switch: prev_comm=rcu_perf_writer prev_pid=163 prev_prio=98 prev_state=R+ ==> next_comm=kworker/3:0 next_pid=28 next_prio=120
> [   34.638532] kworker/-28      3d..2 3775343us : sched_switch: prev_comm=kworker/3:0 prev_pid=28 prev_prio=120 prev_state=D ==> next_comm=cpuhp/3 next_pid=25 next_prio=120
> [   34.640338]  cpuhp/3-25      3d..2 3775348us : sched_switch: prev_comm=cpuhp/3 prev_pid=25 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
> [   34.653831]   <idle>-0       3d..2 3775522us : sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/3:0 next_pid=28 next_prio=120
> [   34.657770] kworker/-28      3d..2 3775536us : sched_switch: prev_comm=kworker/3:0 prev_pid=28 prev_prio=120 prev_state=I ==> next_comm=kworker/3:1 next_pid=177 next_prio=120
> [   34.661758] kworker/-177     3d..2 3775543us : sched_switch: prev_comm=kworker/3:1 prev_pid=177 prev_prio=120 prev_state=I ==> next_comm=swapper/3 next_pid=0 next_prio=120
> [   34.866007]   <idle>-0       3d..2 3789967us : sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/3:0H next_pid=29 next_prio=100
> [   34.874200] kworker/-29      3d..2 3789988us : sched_switch: prev_comm=kworker/3:0H prev_pid=29 prev_prio=100 prev_state=I ==> next_comm=swapper/3 next_pid=0 next_prio=120
> [   35.128506]   <idle>-0       3d..2 3816391us : sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=cpuhp/3 next_pid=25 next_prio=120
> [   35.130481]  cpuhp/3-25      3d..2 3816503us : sched_switch: prev_comm=cpuhp/3 prev_pid=25 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
> [   35.509469]   <idle>-0       3d..2 3828462us : sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_perf_writer next_pid=163 next_prio=98
> 
> Scheduler tick doesn't do much to help:
> 
> [   35.512436] rcu_perf-163     3d.h. 3829210us : rcu_sched_clock_irq: sched-tick
> 
> Now scheduler tick is completely off for the next 29 seconds. FQS loops keeps
> doing resched_cpu on CPU 3 at 300 jiffy intervals (my HZ=1000):
> 
> [   36.160145] rcu_pree-10      0d..1 3958214us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [   38.778574] rcu_pree-10      0d..1 4262288us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [   40.604108] rcu_pree-10      0d..1 4566246us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [   42.565345] rcu_pree-10      0d..1 4870294us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [   44.322041] rcu_pree-10      0d..1 5174293us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [   46.083710] rcu_pree-10      0d..1 5478290us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [   47.851449] rcu_pree-10      0d..1 5782289us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [   49.693127] rcu_pree-10      0d..1 6086293us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [   51.432018] rcu_pree-10      0d..1 6390283us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [   53.187991] rcu_pree-10      0d..1 6694294us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [   53.970850] rcu_perf-163     3.... 6828237us : rcu_perf_writer: End of rcuperf test
> 
> I feel we could do one of:
> 1. Call rcu_momentary_dyntick_idle() from the re-schedule IPI handler

This would not be so good in the case where said handler interrupted
an RCU read-side critical section.

> 2. Raise the RCU softirq for NOHZ_FULL CPUs from re-schedule IPI handler or timer tick.

This could work, but we normally need multiple softirq invocations to
get to a quiescent state.  So we really need to turn the scheduler-clock
interrupt on.

We do have a hook into the interrupt handlers in the guise of the
irq==true case in rcu_nmi_exit_common().  When switching to an extended
quiescent state, there is nothing to do because FQS will detect the
quiescent state on the next pass.  Otherwise, the scheduler-clock
tick could be turned on.  But only if this is a nohz_full CPU and the
rdp->rcu_urgent_qs flag is set and the rdp->dynticks_nmi_nesting value
is 2.

We also would need to turn the scheduler-clock interrupt off at some
point, presumably when a CPU reports its quiescent state to RCU core.

Interestingly enough, rcu_torture_fwd_prog_nr() detected this, but
my response was to add this to the beginning and end of it:

	if (IS_ENABLED(CONFIG_NO_HZ_FULL))
		tick_dep_set_task(current, TICK_DEP_MASK_RCU);

	...

	if (IS_ENABLED(CONFIG_NO_HZ_FULL))
		tick_dep_clear_task(current, TICK_DEP_MASK_RCU);

Thus, one could argue that this functionality should instead be in the
nohz_full code, but one has to start somewhere.  The goal would be to
be able to remove these statements from rcu_torture_fwd_prog_nr().

> What do you think?  Also test diff is below.

Looks reasonable, though the long-term approach is to remove the above
lines from rcu_torture_fwd_prog_nr().  :-/

Untested patch below that includes some inadvertent whitespace fixes,
probably does not even build.  Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 8c494a692728..ad906d6a74fb 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -651,6 +651,12 @@ static __always_inline void rcu_nmi_exit_common(bool irq)
 	 */
 	if (rdp->dynticks_nmi_nesting != 1) {
 		trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2, rdp->dynticks);
+		if (tick_nohz_full_cpu(rdp->cpu) &&
+		    rdp->dynticks_nmi_nesting == 2 &&
+		    rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) {
+			rdp->rcu_forced_tick = true;
+			tick_dep_set_cpu(rdp->cpu, TICK_DEP_MASK_RCU);
+		}
 		WRITE_ONCE(rdp->dynticks_nmi_nesting, /* No store tearing. */
 			   rdp->dynticks_nmi_nesting - 2);
 		return;
@@ -886,6 +892,16 @@ void rcu_irq_enter_irqson(void)
 	local_irq_restore(flags);
 }
 
+/*
+ * If the scheduler-clock interrupt was enabled on a nohz_full CPU
+ * in order to get to a quiescent state, disable it.
+ */
+void rcu_disable_tick_upon_qs(struct rcu_data *rdp)
+{
+	if (tick_nohz_full_cpu(rdp->cpu) && rdp->rcu_forced_tick)
+		tick_dep_clear_cpu(rdp->cpu, TICK_DEP_MASK_RCU);
+}
+
 /**
  * rcu_is_watching - see if RCU thinks that the current CPU is not idle
  *
@@ -1980,6 +1996,7 @@ rcu_report_qs_rdp(int cpu, struct rcu_data *rdp)
 		if (!offloaded)
 			needwake = rcu_accelerate_cbs(rnp, rdp);
 
+		rcu_disable_tick_upon_qs(rdp);
 		rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
 		/* ^^^ Released rnp->lock */
 		if (needwake)
@@ -2269,6 +2286,7 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp))
 	int cpu;
 	unsigned long flags;
 	unsigned long mask;
+	struct rcu_data *rdp;
 	struct rcu_node *rnp;
 
 	rcu_for_each_leaf_node(rnp) {
@@ -2293,8 +2311,10 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp))
 		for_each_leaf_node_possible_cpu(rnp, cpu) {
 			unsigned long bit = leaf_node_cpu_bit(rnp, cpu);
 			if ((rnp->qsmask & bit) != 0) {
-				if (f(per_cpu_ptr(&rcu_data, cpu)))
-					mask |= bit;
+				rdp = per_cpu_ptr(&rcu_data, cpu);
+				if (f(rdp))
+					rcu_disable_tick_upon_qs(rdp);
+				mask |= bit;
 			}
 		}
 		if (mask != 0) {
@@ -2322,7 +2342,7 @@ void rcu_force_quiescent_state(void)
 	rnp = __this_cpu_read(rcu_data.mynode);
 	for (; rnp != NULL; rnp = rnp->parent) {
 		ret = (READ_ONCE(rcu_state.gp_flags) & RCU_GP_FLAG_FQS) ||
-		      !raw_spin_trylock(&rnp->fqslock);
+		       !raw_spin_trylock(&rnp->fqslock);
 		if (rnp_old != NULL)
 			raw_spin_unlock(&rnp_old->fqslock);
 		if (ret)
@@ -2855,7 +2875,7 @@ static void rcu_barrier_callback(struct rcu_head *rhp)
 {
 	if (atomic_dec_and_test(&rcu_state.barrier_cpu_count)) {
 		rcu_barrier_trace(TPS("LastCB"), -1,
-				   rcu_state.barrier_sequence);
+				  rcu_state.barrier_sequence);
 		complete(&rcu_state.barrier_completion);
 	} else {
 		rcu_barrier_trace(TPS("CB"), -1, rcu_state.barrier_sequence);
@@ -2879,7 +2899,7 @@ static void rcu_barrier_func(void *unused)
 	} else {
 		debug_rcu_head_unqueue(&rdp->barrier_head);
 		rcu_barrier_trace(TPS("IRQNQ"), -1,
-				   rcu_state.barrier_sequence);
+				  rcu_state.barrier_sequence);
 	}
 	rcu_nocb_unlock(rdp);
 }
@@ -2906,7 +2926,7 @@ void rcu_barrier(void)
 	/* Did someone else do our work for us? */
 	if (rcu_seq_done(&rcu_state.barrier_sequence, s)) {
 		rcu_barrier_trace(TPS("EarlyExit"), -1,
-				   rcu_state.barrier_sequence);
+				  rcu_state.barrier_sequence);
 		smp_mb(); /* caller's subsequent code after above check. */
 		mutex_unlock(&rcu_state.barrier_mutex);
 		return;
@@ -2938,11 +2958,11 @@ void rcu_barrier(void)
 			continue;
 		if (rcu_segcblist_n_cbs(&rdp->cblist)) {
 			rcu_barrier_trace(TPS("OnlineQ"), cpu,
-					   rcu_state.barrier_sequence);
+					  rcu_state.barrier_sequence);
 			smp_call_function_single(cpu, rcu_barrier_func, NULL, 1);
 		} else {
 			rcu_barrier_trace(TPS("OnlineNQ"), cpu,
-					   rcu_state.barrier_sequence);
+					  rcu_state.barrier_sequence);
 		}
 	}
 	put_online_cpus();
@@ -3168,6 +3188,7 @@ void rcu_cpu_starting(unsigned int cpu)
 	rdp->rcu_onl_gp_seq = READ_ONCE(rcu_state.gp_seq);
 	rdp->rcu_onl_gp_flags = READ_ONCE(rcu_state.gp_flags);
 	if (rnp->qsmask & mask) { /* RCU waiting on incoming CPU? */
+		rcu_disable_tick_upon_qs(rdp);
 		/* Report QS -after- changing ->qsmaskinitnext! */
 		rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
 	} else {
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index c612f306fe89..055c31781d3a 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -181,6 +181,7 @@ struct rcu_data {
 	atomic_t dynticks;		/* Even value for idle, else odd. */
 	bool rcu_need_heavy_qs;		/* GP old, so heavy quiescent state! */
 	bool rcu_urgent_qs;		/* GP old need light quiescent state. */
+	bool rcu_forced_tick;		/* Forced tick to provide QS. */
 #ifdef CONFIG_RCU_FAST_NO_HZ
 	bool all_lazy;			/* All CPU's CBs lazy at idle start? */
 	unsigned long last_accelerate;	/* Last jiffy CBs were accelerated. */

next prev parent reply	other threads:[~2019-08-12 23:01 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-08-11 18:08 need_heavy_qs flag for PREEMPT=y kernels Joel Fernandes
2019-08-11 18:34 ` Joel Fernandes
2019-08-11 21:16   ` Paul E. McKenney
2019-08-11 21:25     ` Joel Fernandes
2019-08-11 23:30       ` Paul E. McKenney
2019-08-12  1:24         ` Joel Fernandes
2019-08-12  1:40           ` Joel Fernandes
2019-08-12  3:57             ` Paul E. McKenney
2019-08-11 21:13 ` Paul E. McKenney
2019-08-12  3:21   ` Joel Fernandes
2019-08-12  3:53     ` Paul E. McKenney
2019-08-12 21:20       ` Joel Fernandes
2019-08-12 23:01         ` Paul E. McKenney [this message]
2019-08-13  1:02           ` Joel Fernandes
2019-08-13  1:05             ` Joel Fernandes
2019-08-13  2:28               ` Paul E. McKenney
2019-08-13  2:27             ` Paul E. McKenney
2019-08-13  2:50               ` Paul E. McKenney
2019-08-15 17:17             ` Paul E. McKenney
2019-08-15 20:04               ` Joel Fernandes
2019-08-15 20:31                 ` Paul E. McKenney
2019-08-15 21:22                   ` Joel Fernandes
2019-08-15 21:27                     ` Joel Fernandes
2019-08-15 21:34                       ` Joel Fernandes
2019-08-15 21:57                         ` Paul E. McKenney
2019-08-15 21:45                     ` Paul E. McKenney
2019-08-16  0:02                       ` Joel Fernandes
2019-08-19 12:34                         ` Frederic Weisbecker
2019-08-19 12:09                   ` Frederic Weisbecker
2019-08-19 16:57                   ` Frederic Weisbecker
2019-08-19 22:31                     ` Paul E. McKenney

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:8c494a69272 dfblob:ad906d6a74f dfblob:c612f306fe8
dfblob:055c31781d3 )
 OR (
bs:"Re: need_heavy_qs flag for PREEMPT=y kernels" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190812230138.GS28441@linux.ibm.com \
    --to=paulmck@linux.ibm.com \
    --cc=joel@joelfernandes.org \
    --cc=rcu@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.