All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
Cc: Josh Boyer <jwboyer@redhat.com>,
	linux-kernel@vger.kernel.org, kernel-team@fedoraproject.org
Subject: Re: RCU related performance regression in 3.3
Date: Tue, 1 May 2012 08:45:13 -0700	[thread overview]
Message-ID: <20120501154513.GD2441@linux.vnet.ibm.com> (raw)
In-Reply-To: <4F9FA50C.9070600@wanadoo.fr>

On Tue, May 01, 2012 at 10:55:40AM +0200, Pascal Chapperon wrote:
> Le 01/05/2012 02:02, Paul E. McKenney a écrit :
> >On Fri, Apr 27, 2012 at 08:42:58PM -0700, Paul E. McKenney wrote:
> >>On Fri, Apr 27, 2012 at 02:15:20PM +0200, Pascal Chapperon wrote:
> >>>Le 18/04/2012 17:23, Paul E. McKenney a écrit :
> >>>>On Wed, Apr 18, 2012 at 05:00:14PM +0200, Pascal Chapperon wrote:
> >>>>>Le 18/04/2012 16:01, Paul E. McKenney a écrit :
> >>>>>>On Wed, Apr 18, 2012 at 11:37:28AM +0200, Pascal Chapperon wrote:
> >>>>>>>Mount and umount operations are not slower with RCU_FAST_NO_HZ during
> >>>>>>>runtime; systemctl start and stop operations are also not slower. In
> >>>>>>>fact, i couldn't find a single operation slower during runtime with
> >>>>>>>RCU_FAST_NO_HZ.
> >>>>>>
> >>>>>>Your boot-time setup is such that all CPUs are online before the
> >>>>>>boot-time mount operations take place, right?
> >>>>>Yes :
> >>>>>[ 0.242697] Brought up 8 CPUs
> >>>>>[ 0.242699] Total of 8 processors activated (35118.33 BogoMIPS).
> >>>>>
> >>>>>>Struggling to understand
> >>>>>>how RCU can tell the difference between post-CPU-bringup boot time
> >>>>>>and run time...
> >>>>>>
> >>>>>systemd is controlling the whole boot process including mount
> >>>>>operation (apart root filesystem) and as I can see, uses heavily
> >>>>>sockets to do it (not to mention cpu-affinity). It controls also the
> >>>>>major part of umount operations. Is it possible that your patch hits
> >>>>>a systemd bug ?
> >>>>
> >>>>Is it possible that systemd is using network operations that include
> >>>>synchronize_rcu()? Then if you did the same operation from the
> >>>>command line at runtime, you might not see the slowdown.
> >>>>
> >>>>Is it possible for you to convince systemd to collect RCU event tracing
> >>>>during the slow operation? RCU event tracing is available under
> >>>>/sys/kernel/debug/tracing/rcu.
> >>>>
> >>>.
> >>>I have collected the RCU event tracing during a slow boot with
> >>>FAST_NO_HZ (and the same without FAST_NO_HZ, same kernel config).
> >>>The full logs and associated "systemd-analyze plot" can be found
> >>>(in comment 32) at :
> >>>
> >>>https://bugzilla.redhat.com/show_bug.cgi?id=806548
> >>>
> >>>With FAST_NO_HZ, almost each rcu_prep_idle is followed by ksoftirqd
> >>>(75000 ksoftirqd lines with FAST_NO_HZ, 4000 without).
> >>>
> >>>Sorry, the logs are very huge, but I can't figure where are the
> >>>plots of some interest.
> >>
> >>Thank you for collecting them!  I clearly will need to do some scripting.  ;-)
> >
> >And it appears that your system is migrating timers without waking up
> >the CPU on which the timer was posted.  This explains the slowdowns:
> >RCU assumes that the timer will either fire on the CPU that it was posted
> >on or that that CPU will be awakened when it goes offline.  If the timer
> >does not fire on that CPU and that CPU is not otherwise awakened, then
> >that CPU's RCU callbacks can be indefinitely postponed, which could account
> >for the slowdowns that you were seeing.
> >
> >Please see below for a lightly tested patch that should address this
> >problem, and thank you again for your patient testing efforts!
> >
> >							Thanx, Paul
> >
> >------------------------------------------------------------------------
> >
> >rcu: Make RCU_FAST_NO_HZ handle timer migration
> >
> >The current RCU_FAST_NO_HZ assumes that timers do not migrate unless a
> >CPU goes offline, in which case it assumes that the CPU will have to come
> >out of dyntick-idle mode (cancelling the timer) in order to go offline.
> >This is important because when RCU_FAST_NO_HZ permits a CPU to enter
> >dyntick-idle mode despite having RCU callbacks pending, it posts a timer
> >on that CPU to force a wakeup on that CPU.  This wakeup ensures that the
> >CPU will eventually handle the end of the grace period, including invoking
> >its RCU callbacks.
> >
> >However, Pascal Chapperon's test setup shows that the timer handler
> >rcu_idle_gp_timer_func() really does get invoked in some cases.  This is
> >problematic because this can cause the CPU that entered dyntick-idle
> >mode despite still having RCU callbacks pending to remain in
> >dyntick-idle mode indefinitely, which means that its RCU callbacks might
> >never be invoked.  This situation can result in grace-period delays or
> >even system hangs, which matches Pascal's observations of slow boot-up
> >and shutdown (https://lkml.org/lkml/2012/4/5/142).  See also the bugzilla:
> >
> >https://bugzilla.redhat.com/show_bug.cgi?id=806548
> >
> >This commit therefore causes the "should never be invoked" timer handler
> >rcu_idle_gp_timer_func() to use smp_call_function_single() to wake up
> >the CPU for which the timer was intended, allowing that CPU to invoke
> >its RCU callbacks in a timely manner.
> >
> >Reported-by: Pascal Chapperon<pascal.chapperon@wanadoo.fr>
> >Signed-off-by: Paul E. McKenney<paul.mckenney@linaro.org>
> >Signed-off-by: Paul E. McKenney<paulmck@linux.vnet.ibm.com>
> >---
> >
> >  include/trace/events/rcu.h |    1 +
> >  kernel/rcutree_plugin.h    |   23 ++++++++++++++++++++---
> >  2 files changed, 21 insertions(+), 3 deletions(-)
> >
> >diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
> >index aaa55e1..1480900 100644
> >--- a/include/trace/events/rcu.h
> >+++ b/include/trace/events/rcu.h
> >@@ -292,6 +292,7 @@ TRACE_EVENT(rcu_dyntick,
> >   *	"More callbacks": Still more callbacks, try again to clear them out.
> >   *	"Callbacks drained": All callbacks processed, off to dyntick idle!
> >   *	"Timer": Timer fired to cause CPU to continue processing callbacks.
> >+ *	"Demigrate": Timer fired on wrong CPU, woke up correct CPU.
> >   *	"Cleanup after idle": Idle exited, timer canceled.
> >   */
> >  TRACE_EVENT(rcu_prep_idle,
> >diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> >index dc12efc..bbd064a 100644
> >--- a/kernel/rcutree_plugin.h
> >+++ b/kernel/rcutree_plugin.h
> >@@ -1994,16 +1994,33 @@ static bool rcu_cpu_has_nonlazy_callbacks(int cpu)
> >  }
> >
> >  /*
> >+ *
> >+ */
> >+void rcu_idle_demigrate(void *unused)
> >+{
> >+	trace_rcu_prep_idle("Demigrate");
> >+}
> >+
> >+/*
> >   * Timer handler used to force CPU to start pushing its remaining RCU
> >   * callbacks in the case where it entered dyntick-idle mode with callbacks
> >   * pending.  The hander doesn't really need to do anything because the
> >   * real work is done upon re-entry to idle, or by the next scheduling-clock
> >   * interrupt should idle not be re-entered.
> >+ *
> >+ * One special case: the timer gets migrated without awakening the CPU
> >+ * on which the timer was scheduled on.  In this case, we must wake up
> >+ * that CPU.  We do so with smp_call_function_single().
> >   */
> >-static void rcu_idle_gp_timer_func(unsigned long unused)
> >+static void rcu_idle_gp_timer_func(unsigned long cpu_in)
> >  {
> >-	WARN_ON_ONCE(1); /* Getting here can hang the system... */
> >+	int cpu = (int)cpu_in;
> >+
> >  	trace_rcu_prep_idle("Timer");
> >+	if (cpu != smp_processor_id())
> >+		smp_call_function_single(cpu, rcu_idle_demigrate, NULL, 0);
> >+	else
> >+		WARN_ON_ONCE(1); /* Getting here can hang the system... */
> >  }
> >
> >  /*
> >@@ -2012,7 +2029,7 @@ static void rcu_idle_gp_timer_func(unsigned long unused)
> >  static void rcu_prepare_for_idle_init(int cpu)
> >  {
> >  	setup_timer(&per_cpu(rcu_idle_gp_timer, cpu),
> >-		    rcu_idle_gp_timer_func, 0);
> >+		    rcu_idle_gp_timer_func, cpu);
> >  }
> >
> >  /*
> >
> >
> Paul, I can't apply your patch on top of master branch; perhaps I
> need to pull your own git repository ?
> Among other things, you have:
> static void rcu_idle_gp_timer_func(unsigned long unused)
> and I have:
> static enum hrtimer_restart rcu_idle_gp_timer_func(struct hrtimer *hrtp)

Here is my RCU_FAST_NO_HZ patch stack on top of v3.4-rc4.

Or you can pull branch fnh.2012.05.01a from:

	git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
index 3370997..1480900 100644
--- a/include/trace/events/rcu.h
+++ b/include/trace/events/rcu.h
@@ -292,6 +292,8 @@ TRACE_EVENT(rcu_dyntick,
  *	"More callbacks": Still more callbacks, try again to clear them out.
  *	"Callbacks drained": All callbacks processed, off to dyntick idle!
  *	"Timer": Timer fired to cause CPU to continue processing callbacks.
+ *	"Demigrate": Timer fired on wrong CPU, woke up correct CPU.
+ *	"Cleanup after idle": Idle exited, timer canceled.
  */
 TRACE_EVENT(rcu_prep_idle,
 
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index 1050d6d..403306b 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1829,6 +1829,8 @@ __call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu),
 	rdp->qlen++;
 	if (lazy)
 		rdp->qlen_lazy++;
+	else
+		rcu_idle_count_callbacks_posted();
 
 	if (__is_kfree_rcu_offset((unsigned long)func))
 		trace_rcu_kfree_callback(rsp->name, head, (unsigned long)func,
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index cdd1be0..36ca28e 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -471,6 +471,7 @@ static void __cpuinit rcu_prepare_kthreads(int cpu);
 static void rcu_prepare_for_idle_init(int cpu);
 static void rcu_cleanup_after_idle(int cpu);
 static void rcu_prepare_for_idle(int cpu);
+static void rcu_idle_count_callbacks_posted(void);
 static void print_cpu_stall_info_begin(void);
 static void print_cpu_stall_info(struct rcu_state *rsp, int cpu);
 static void print_cpu_stall_info_end(void);
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index c023464..cadef05 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -1938,6 +1938,14 @@ static void rcu_prepare_for_idle(int cpu)
 {
 }
 
+/*
+ * Don't bother keeping a running count of the number of RCU callbacks
+ * posted because CONFIG_RCU_FAST_NO_HZ=n.
+ */
+static void rcu_idle_count_callbacks_posted(void)
+{
+}
+
 #else /* #if !defined(CONFIG_RCU_FAST_NO_HZ) */
 
 /*
@@ -1980,9 +1988,11 @@ static void rcu_prepare_for_idle(int cpu)
 
 static DEFINE_PER_CPU(int, rcu_dyntick_drain);
 static DEFINE_PER_CPU(unsigned long, rcu_dyntick_holdoff);
-static DEFINE_PER_CPU(struct hrtimer, rcu_idle_gp_timer);
-static ktime_t rcu_idle_gp_wait;	/* If some non-lazy callbacks. */
-static ktime_t rcu_idle_lazy_gp_wait;	/* If only lazy callbacks. */
+static DEFINE_PER_CPU(struct timer_list, rcu_idle_gp_timer);
+static DEFINE_PER_CPU(unsigned long, rcu_idle_gp_timer_expires);
+static DEFINE_PER_CPU(bool, rcu_idle_first_pass);
+static DEFINE_PER_CPU(unsigned long, rcu_nonlazy_posted);
+static DEFINE_PER_CPU(unsigned long, rcu_nonlazy_posted_snap);
 
 /*
  * Allow the CPU to enter dyntick-idle mode if either: (1) There are no
@@ -1995,6 +2005,8 @@ static ktime_t rcu_idle_lazy_gp_wait;	/* If only lazy callbacks. */
  */
 int rcu_needs_cpu(int cpu)
 {
+	/* Flag a new idle sojourn to the idle-entry state machine. */
+	per_cpu(rcu_idle_first_pass, cpu) = 1;
 	/* If no callbacks, RCU doesn't need the CPU. */
 	if (!rcu_cpu_has_callbacks(cpu))
 		return 0;
@@ -2045,16 +2057,33 @@ static bool rcu_cpu_has_nonlazy_callbacks(int cpu)
 }
 
 /*
+ *
+ */
+void rcu_idle_demigrate(void *unused)
+{
+	trace_rcu_prep_idle("Demigrate");
+}
+
+/*
  * Timer handler used to force CPU to start pushing its remaining RCU
  * callbacks in the case where it entered dyntick-idle mode with callbacks
  * pending.  The hander doesn't really need to do anything because the
  * real work is done upon re-entry to idle, or by the next scheduling-clock
  * interrupt should idle not be re-entered.
+ *
+ * One special case: the timer gets migrated without awakening the CPU
+ * on which the timer was scheduled on.  In this case, we must wake up
+ * that CPU.  We do so with smp_call_function_single().
  */
-static enum hrtimer_restart rcu_idle_gp_timer_func(struct hrtimer *hrtp)
+static void rcu_idle_gp_timer_func(unsigned long cpu_in)
 {
+	int cpu = (int)cpu_in;
+
 	trace_rcu_prep_idle("Timer");
-	return HRTIMER_NORESTART;
+	if (cpu != smp_processor_id())
+		smp_call_function_single(cpu, rcu_idle_demigrate, NULL, 0);
+	else
+		WARN_ON_ONCE(1); /* Getting here can hang the system... */
 }
 
 /*
@@ -2062,19 +2091,8 @@ static enum hrtimer_restart rcu_idle_gp_timer_func(struct hrtimer *hrtp)
  */
 static void rcu_prepare_for_idle_init(int cpu)
 {
-	static int firsttime = 1;
-	struct hrtimer *hrtp = &per_cpu(rcu_idle_gp_timer, cpu);
-
-	hrtimer_init(hrtp, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
-	hrtp->function = rcu_idle_gp_timer_func;
-	if (firsttime) {
-		unsigned int upj = jiffies_to_usecs(RCU_IDLE_GP_DELAY);
-
-		rcu_idle_gp_wait = ns_to_ktime(upj * (u64)1000);
-		upj = jiffies_to_usecs(RCU_IDLE_LAZY_GP_DELAY);
-		rcu_idle_lazy_gp_wait = ns_to_ktime(upj * (u64)1000);
-		firsttime = 0;
-	}
+	setup_timer(&per_cpu(rcu_idle_gp_timer, cpu),
+		    rcu_idle_gp_timer_func, cpu);
 }
 
 /*
@@ -2084,7 +2102,8 @@ static void rcu_prepare_for_idle_init(int cpu)
  */
 static void rcu_cleanup_after_idle(int cpu)
 {
-	hrtimer_cancel(&per_cpu(rcu_idle_gp_timer, cpu));
+	del_timer(&per_cpu(rcu_idle_gp_timer, cpu));
+	trace_rcu_prep_idle("Cleanup after idle");
 }
 
 /*
@@ -2108,6 +2127,29 @@ static void rcu_cleanup_after_idle(int cpu)
  */
 static void rcu_prepare_for_idle(int cpu)
 {
+	struct timer_list *tp;
+
+	/*
+	 * If this is an idle re-entry, for example, due to use of
+	 * RCU_NONIDLE() or the new idle-loop tracing API within the idle
+	 * loop, then don't take any state-machine actions, unless the
+	 * momentary exit from idle queued additional non-lazy callbacks.
+	 * Instead, repost the rcu_idle_gp_timer if this CPU has callbacks
+	 * pending.
+	 */
+	if (!per_cpu(rcu_idle_first_pass, cpu) &&
+	    (per_cpu(rcu_nonlazy_posted, cpu) ==
+	     per_cpu(rcu_nonlazy_posted_snap, cpu))) {
+		if (rcu_cpu_has_callbacks(cpu)) {
+			tp = &per_cpu(rcu_idle_gp_timer, cpu);
+			mod_timer_pinned(tp, per_cpu(rcu_idle_gp_timer_expires, cpu));
+		}
+		return;
+	}
+	per_cpu(rcu_idle_first_pass, cpu) = 0;
+	per_cpu(rcu_nonlazy_posted_snap, cpu) =
+		per_cpu(rcu_nonlazy_posted, cpu) - 1;
+
 	/*
 	 * If there are no callbacks on this CPU, enter dyntick-idle mode.
 	 * Also reset state to avoid prejudicing later attempts.
@@ -2140,11 +2182,15 @@ static void rcu_prepare_for_idle(int cpu)
 		per_cpu(rcu_dyntick_drain, cpu) = 0;
 		per_cpu(rcu_dyntick_holdoff, cpu) = jiffies;
 		if (rcu_cpu_has_nonlazy_callbacks(cpu))
-			hrtimer_start(&per_cpu(rcu_idle_gp_timer, cpu),
-				      rcu_idle_gp_wait, HRTIMER_MODE_REL);
+			per_cpu(rcu_idle_gp_timer_expires, cpu) =
+					   jiffies + RCU_IDLE_GP_DELAY;
 		else
-			hrtimer_start(&per_cpu(rcu_idle_gp_timer, cpu),
-				      rcu_idle_lazy_gp_wait, HRTIMER_MODE_REL);
+			per_cpu(rcu_idle_gp_timer_expires, cpu) =
+					   jiffies + RCU_IDLE_LAZY_GP_DELAY;
+		tp = &per_cpu(rcu_idle_gp_timer, cpu);
+		mod_timer_pinned(tp, per_cpu(rcu_idle_gp_timer_expires, cpu));
+		per_cpu(rcu_nonlazy_posted_snap, cpu) =
+			per_cpu(rcu_nonlazy_posted, cpu);
 		return; /* Nothing more to do immediately. */
 	} else if (--per_cpu(rcu_dyntick_drain, cpu) <= 0) {
 		/* We have hit the limit, so time to give up. */
@@ -2184,6 +2230,17 @@ static void rcu_prepare_for_idle(int cpu)
 		trace_rcu_prep_idle("Callbacks drained");
 }
 
+/*
+ * Keep a running count of callbacks posted so that rcu_prepare_for_idle()
+ * can detect when something out of the idle loop posts a callback.
+ * Of course, it had better do so either from a trace event designed to
+ * be called from idle or from within RCU_NONIDLE().
+ */
+static void rcu_idle_count_callbacks_posted(void)
+{
+	__this_cpu_add(rcu_nonlazy_posted, 1);
+}
+
 #endif /* #else #if !defined(CONFIG_RCU_FAST_NO_HZ) */
 
 #ifdef CONFIG_RCU_CPU_STALL_INFO
@@ -2192,14 +2249,12 @@ static void rcu_prepare_for_idle(int cpu)
 
 static void print_cpu_stall_fast_no_hz(char *cp, int cpu)
 {
-	struct hrtimer *hrtp = &per_cpu(rcu_idle_gp_timer, cpu);
+	struct timer_list *tltp = &per_cpu(rcu_idle_gp_timer, cpu);
 
-	sprintf(cp, "drain=%d %c timer=%lld",
+	sprintf(cp, "drain=%d %c timer=%lu",
 		per_cpu(rcu_dyntick_drain, cpu),
 		per_cpu(rcu_dyntick_holdoff, cpu) == jiffies ? 'H' : '.',
-		hrtimer_active(hrtp)
-			? ktime_to_us(hrtimer_get_remaining(hrtp))
-			: -1);
+		timer_pending(tltp) ? tltp->expires - jiffies : -1);
 }
 
 #else /* #ifdef CONFIG_RCU_FAST_NO_HZ */


  reply	other threads:[~2012-05-01 15:45 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-04-27 12:15 RCU related performance regression in 3.3 Pascal Chapperon
2012-04-28  3:42 ` Paul E. McKenney
2012-05-01  0:02   ` Paul E. McKenney
2012-05-01  8:55     ` Pascal Chapperon
2012-05-01 15:45       ` Paul E. McKenney [this message]
2012-05-04 14:42         ` Pascal Chapperon
2012-05-04 15:04           ` Paul E. McKenney
2012-05-04 21:41             ` Pascal Chapperon
2012-05-04 23:14               ` Paul E. McKenney
2012-05-10  8:40                 ` Pascal Chapperon
2012-05-14 22:32                 ` Paul E. McKenney
2012-05-18 11:01                   ` Pascal Chapperon
2012-05-18 12:14                     ` Paul E. McKenney
2012-05-18 14:48                       ` Pascal Chapperon
  -- strict thread matches above, loose matches on Subject: below --
2012-04-04 15:27 Josh Boyer
2012-04-04 21:36 ` Paul E. McKenney
2012-04-05 12:37   ` Josh Boyer
2012-04-05 14:00     ` Paul E. McKenney
2012-04-05 14:15       ` Pascal CHAPPERON
2012-04-05 14:39         ` Paul E. McKenney
2012-04-06  9:18           ` Pascal Chapperon
2012-04-10 16:07             ` Paul E. McKenney
2012-04-11 15:06               ` Pascal
2012-04-12 18:04                 ` Paul E. McKenney
2012-04-16 21:02                   ` Paul E. McKenney
2012-04-18  9:37                     ` Pascal Chapperon
2012-04-18 14:01                       ` Paul E. McKenney
2012-04-18 15:00                         ` Pascal Chapperon
2012-04-18 15:23                           ` Paul E. McKenney
2012-04-20 14:45                             ` Pascal Chapperon

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120501154513.GD2441@linux.vnet.ibm.com \
    --to=paulmck@linux.vnet.ibm.com \
    --cc=jwboyer@redhat.com \
    --cc=kernel-team@fedoraproject.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=pascal.chapperon@wanadoo.fr \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.