From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Rafael J. Wysocki"
To: Qais Yousef, Christian Loehle
Cc: Thomas Gleixner, LKML, Peter Zijlstra, Frederic Weisbecker, Linux PM
Subject: Re: [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware
Date: Fri, 06 Mar 2026 22:21:17 +0100
Message-ID: <6250711.lOV4Wx5bFT@rafael.j.wysocki>
Organization: Linux Kernel Development
In-Reply-To: <20260304030306.uk5c63xw4oqvjffb@airbuntu>
References: <20260301191959.406218221@kernel.org> <20260304030306.uk5c63xw4oqvjffb@airbuntu>
X-Mailing-List: linux-pm@vger.kernel.org

On Wednesday, March 4, 2026 4:03:06 AM CET Qais Yousef wrote:
> On 03/02/26 22:25, Rafael J. Wysocki wrote:
> > On Mon, Mar 2, 2026 at 12:04 PM Christian Loehle wrote:
> > >
> > > On 3/1/26 19:30, Thomas Gleixner wrote:
> > > > Guests fall back to default_idle_call() as there is no cpuidle driver
> > > > available to them by default. That causes a problem in fully loaded
> > > > scenarios where CPUs go briefly idle for a couple of microseconds:
> > > >
> > > > tick_nohz_idle_stop_tick() is invoked unconditionally, which means that
> > > > unless there is a timer pending in the next tick, the tick is stopped and,
> > > > a couple of microseconds later when the idle condition goes away, restarted.
> > > > That requires programming the clockevent device twice, which implies a VM
> > > > exit for each reprogramming.
> > > >
> > > > It was suggested to remove the tick_nohz_idle_stop_tick() invocation from
> > > > the default idle code, but that would be counterproductive. It would not
> > > > allow the host to go into deeper idle states when the guest CPU is fully
> > > > idle, as it has to maintain the periodic tick.
> > > >
> > > > Cure this by implementing a trivial moving average filter which keeps track
> > > > of the recent idle residency time and stops the tick only when the average
> > > > is larger than a tick.
> > > >
> > > > Signed-off-by: Thomas Gleixner
> > > > ---
> > > >  kernel/sched/idle.c | 65 +++++++++++++++++++++++++++++++++++++++++++-------
> > > >  1 file changed, 57 insertions(+), 8 deletions(-)
> > > >
> > > > --- a/kernel/sched/idle.c
> > > > +++ b/kernel/sched/idle.c
> > > > @@ -105,12 +105,7 @@ static inline void cond_tick_broadcast_e
> > > >  static inline void cond_tick_broadcast_exit(void) { }
> > > >  #endif /* !CONFIG_GENERIC_CLOCKEVENTS_BROADCAST_IDLE */
> > > >
> > > > -/**
> > > > - * default_idle_call - Default CPU idle routine.
> > > > - *
> > > > - * To use when the cpuidle framework cannot be used.
> > > > - */
> > > > -static void __cpuidle default_idle_call(void)
> > > > +static void __cpuidle __default_idle_call(void)
> > > >  {
> > > >  	instrumentation_begin();
> > > >  	if (!current_clr_polling_and_test()) {
> > > > @@ -130,6 +125,61 @@ static void __cpuidle default_idle_call(
> > > >  	instrumentation_end();
> > > >  }
> > > >
> > > > +#ifdef CONFIG_NO_HZ_COMMON
> > > > +
> > > > +/* Limit to 4 entries so it fits in a cache line */
> > > > +#define IDLE_DUR_ENTRIES	4
> > > > +#define IDLE_DUR_MASK		(IDLE_DUR_ENTRIES - 1)
> > > > +
> > > > +struct idle_nohz_data {
> > > > +	u64		duration[IDLE_DUR_ENTRIES];
> > > > +	u64		entry_time;
> > > > +	u64		sum;
> > > > +	unsigned int	idx;
> > > > +};
> > > > +
> > > > +static DEFINE_PER_CPU_ALIGNED(struct idle_nohz_data, nohz_data);
> > > > +
> > > > +/**
> > > > + * default_idle_call - Default CPU idle routine.
> > > > + *
> > > > + * To use when the cpuidle framework cannot be used.
> > > > + */
> > > > +static void default_idle_call(void)
> > > > +{
> > > > +	struct idle_nohz_data *nd = this_cpu_ptr(&nohz_data);
> > > > +	unsigned int idx = nd->idx;
> > > > +	s64 delta;
> > > > +
> > > > +	/*
> > > > +	 * If the CPU spends more than a tick on average in idle, try to stop
> > > > +	 * the tick.
> > > > +	 */
> > > > +	if (nd->sum > TICK_NSEC * IDLE_DUR_ENTRIES)
> > > > +		tick_nohz_idle_stop_tick();
> > > > +
> > > > +	__default_idle_call();
> > > > +
> > > > +	/*
> > > > +	 * Build a moving average of the time spent in idle to prevent stopping
> > > > +	 * the tick on a loaded system which only goes idle briefly.
> > > > +	 */
> > > > +	delta = max(sched_clock() - nd->entry_time, 0);
> > > > +	nd->sum += delta - nd->duration[idx];
> > > > +	nd->duration[idx] = delta;
> > > > +	nd->idx = (idx + 1) & IDLE_DUR_MASK;
> > > > +}
> > > > +
> > > > +static void default_idle_enter(void)
> > > > +{
> > > > +	this_cpu_write(nohz_data.entry_time, sched_clock());
> > > > +}
> > > > +
> > > > +#else /* CONFIG_NO_HZ_COMMON */
> > > > +static inline void default_idle_call(void) { __default_idle_call(); }
> > > > +static inline void default_idle_enter(void) { }
> > > > +#endif /* !CONFIG_NO_HZ_COMMON */
> > > > +
> > > >  static int call_cpuidle_s2idle(struct cpuidle_driver *drv,
> > > >  			       struct cpuidle_device *dev,
> > > >  			       u64 max_latency_ns)
> > > > @@ -186,8 +236,6 @@ static void cpuidle_idle_call(void)
> > > >  	}
> > > >
> > > >  	if (cpuidle_not_available(drv, dev)) {
> > > > -		tick_nohz_idle_stop_tick();
> > > > -
> > > >  		default_idle_call();
> > > >  		goto exit_idle;
> > > >  	}
> > > > @@ -276,6 +324,7 @@ static void do_idle(void)
> > > >
> > > >  	__current_set_polling();
> > > >  	tick_nohz_idle_enter();
> > > > +	default_idle_enter();
> > > >
> > > >  	while (!need_resched()) {
> > > >
> > >
> > > How does this work? We don't stop the tick until the average idle time is
> > > larger, but if we don't stop the tick how is that possible?
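For reference, the averaging bookkeeping in the patch above boils down to the following standalone sketch (plain userspace C; the TICK_NSEC value is an assumed stand-in, and the per-CPU placement, sched_clock() plumbing, and entry_time handling are omitted):

```c
#include <stdint.h>
#include <assert.h>

/* Stand-in for the kernel's TICK_NSEC; 1 ms tick assumed here. */
#define TICK_NSEC		1000000ULL

#define IDLE_DUR_ENTRIES	4
#define IDLE_DUR_MASK		(IDLE_DUR_ENTRIES - 1)

struct idle_nohz_data {
	uint64_t duration[IDLE_DUR_ENTRIES];
	uint64_t sum;
	unsigned int idx;
};

/*
 * Nonzero when the average of the last four idle residencies exceeds
 * one tick, i.e. when the patch would call tick_nohz_idle_stop_tick().
 * Comparing sum against TICK_NSEC * IDLE_DUR_ENTRIES avoids a division.
 */
static int should_stop_tick(const struct idle_nohz_data *nd)
{
	return nd->sum > TICK_NSEC * IDLE_DUR_ENTRIES;
}

/*
 * Record one idle duration: sum always holds the total of the last
 * four samples because the oldest ring-buffer entry is subtracted
 * before being overwritten.
 */
static void record_idle_duration(struct idle_nohz_data *nd, uint64_t delta)
{
	unsigned int idx = nd->idx;

	nd->sum += delta - nd->duration[idx];
	nd->duration[idx] = delta;
	nd->idx = (idx + 1) & IDLE_DUR_MASK;
}
```

With this scheme a burst of sub-tick idle periods drags the average down quickly, so a loaded system keeps its tick after at most four long samples age out of the window.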
> > >
> > > Why don't we just require one or two consecutive tick wakeups before stopping?
> >
> > Exactly my thought and I think one should be sufficient.
>
> I concur. From our experience with TEO util threshold these averages can
> backfire. I think one tick is sufficient delay to not be obviously broken.

So if I'm not mistaken, it would be something like the appended prototype
(completely untested, but it builds for me).

---
 drivers/cpuidle/cpuidle.c | 10 ----------
 kernel/sched/idle.c       | 32 ++++++++++++++++++++++++--------
 2 files changed, 24 insertions(+), 18 deletions(-)

--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -359,16 +359,6 @@ noinstr int cpuidle_enter_state(struct c
 int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
 		   bool *stop_tick)
 {
-	/*
-	 * If there is only a single idle state (or none), there is nothing
-	 * meaningful for the governor to choose. Skip the governor and
-	 * always use state 0 with the tick running.
-	 */
-	if (drv->state_count <= 1) {
-		*stop_tick = false;
-		return 0;
-	}
-
 	return cpuidle_curr_governor->select(drv, dev, stop_tick);
 }

--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -161,6 +161,14 @@ static int call_cpuidle(struct cpuidle_d
 	return cpuidle_enter(drv, dev, next_state);
 }

+static void idle_call_stop_or_retain_tick(bool stop_tick)
+{
+	if (stop_tick || tick_nohz_tick_stopped())
+		tick_nohz_idle_stop_tick();
+	else
+		tick_nohz_idle_retain_tick();
+}
+
 /**
  * cpuidle_idle_call - the main idle function
  *
@@ -170,7 +178,7 @@ static int call_cpuidle(struct cpuidle_d
  * set, and it returns with polling set.  If it ever stops polling, it
  * must clear the polling bit.
 */
-static void cpuidle_idle_call(void)
+static void cpuidle_idle_call(bool got_tick)
 {
 	struct cpuidle_device *dev = cpuidle_get_device();
 	struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
@@ -186,7 +194,7 @@ static void cpuidle_idle_call(void)
 	}

 	if (cpuidle_not_available(drv, dev)) {
-		tick_nohz_idle_stop_tick();
+		idle_call_stop_or_retain_tick(!got_tick);

 		default_idle_call();
 		goto exit_idle;
@@ -221,7 +229,7 @@ static void cpuidle_idle_call(void)

 		next_state = cpuidle_find_deepest_state(drv, dev, max_latency_ns);
 		call_cpuidle(drv, dev, next_state);
-	} else {
+	} else if (drv->state_count > 1) {
 		bool stop_tick = true;

 		/*
@@ -229,16 +237,22 @@
 		 */
 		next_state = cpuidle_select(drv, dev, &stop_tick);

-		if (stop_tick || tick_nohz_tick_stopped())
-			tick_nohz_idle_stop_tick();
-		else
-			tick_nohz_idle_retain_tick();
+		idle_call_stop_or_retain_tick(stop_tick);

 		entered_state = call_cpuidle(drv, dev, next_state);
 		/*
 		 * Give the governor an opportunity to reflect on the outcome
 		 */
 		cpuidle_reflect(dev, entered_state);
+	} else {
+		/*
+		 * If there is only a single idle state (or none), there is
+		 * nothing meaningful for the governor to choose. Skip the
+		 * governor and always use state 0.
+		 */
+		idle_call_stop_or_retain_tick(!got_tick);
+
+		call_cpuidle(drv, dev, 0);
 	}

 exit_idle:
@@ -259,6 +273,7 @@ exit_idle:
 static void do_idle(void)
 {
 	int cpu = smp_processor_id();
+	bool got_tick = false;

 	/*
 	 * Check if we need to update blocked load
@@ -329,8 +344,9 @@ static void do_idle(void)
 			tick_nohz_idle_restart_tick();
 			cpu_idle_poll();
 		} else {
-			cpuidle_idle_call();
+			cpuidle_idle_call(got_tick);
 		}
+		got_tick = tick_nohz_idle_got_tick();
 		arch_cpu_idle_exit();
 	}
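The "require a consecutive tick wakeup before stopping" alternative discussed above could be prototyped with a trivial counter. This is a hypothetical userspace sketch, not code from either patch; the `got_tick` flag stands in for the result of tick_nohz_idle_got_tick(), and the threshold of one wakeup follows the suggestion in the thread:

```c
#include <stdbool.h>
#include <assert.h>

/* Assumed threshold: one tick wakeup in a row, per the discussion. */
#define TICK_WAKEUPS_BEFORE_STOP	1

struct tick_wakeup_state {
	unsigned int consecutive_tick_wakeups;
};

/*
 * True when the CPU has recently been woken only by the tick, i.e.
 * when stopping the tick is likely to pay off.
 */
static bool want_stop_tick(const struct tick_wakeup_state *st)
{
	return st->consecutive_tick_wakeups >= TICK_WAKEUPS_BEFORE_STOP;
}

/*
 * Called on idle exit: a tick wakeup extends the streak, any other
 * wakeup source resets it, so brief non-tick idle periods never lead
 * to stopping the tick.
 */
static void idle_exit(struct tick_wakeup_state *st, bool got_tick)
{
	if (got_tick)
		st->consecutive_tick_wakeups++;
	else
		st->consecutive_tick_wakeups = 0;
}
```

Compared with the moving average, this reacts within a single tick in both directions, which is the property Christian and Qais argue for.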