Date: Sun, 01 Mar 2026 20:30:51 +0100
Message-ID: <20260301192915.171574741@kernel.org>
User-Agent: quilt/0.68
From: Thomas Gleixner
To: LKML
Cc: Peter Zijlstra, "Rafael J. Wysocki", Frederic Weisbecker, Christian Loehle
Subject: [patch 2/2] sched/idle: Make default_idle_call() NOHZ aware
References: <20260301191959.406218221@kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8

Guests fall back to default_idle_call() as there is no cpuidle driver
available to them by default.

That causes a problem in fully loaded scenarios where CPUs go idle for
only a couple of microseconds: tick_nohz_idle_stop_tick() is invoked
unconditionally, which means that unless a timer is pending in the next
tick, the tick is stopped and then, a couple of microseconds later when
the idle condition goes away, restarted. That requires programming the
clockevent device twice, which implies a VM exit for each reprogramming.

It was suggested to remove the tick_nohz_idle_stop_tick() invocation
from the default idle code, but that would be counterproductive: it
would prevent the host from entering deeper idle states when the guest
CPU is fully idle, as the guest would have to maintain the periodic
tick.

Cure this by implementing a trivial moving average filter which keeps
track of the recent idle residency times and only stops the tick when
the average is larger than a tick period.

Signed-off-by: Thomas Gleixner
---
 kernel/sched/idle.c |   65 +++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 57 insertions(+), 8 deletions(-)

--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -105,12 +105,7 @@ static inline void cond_tick_broadcast_e
 static inline void cond_tick_broadcast_exit(void) { }
 #endif /* !CONFIG_GENERIC_CLOCKEVENTS_BROADCAST_IDLE */
 
-/**
- * default_idle_call - Default CPU idle routine.
- *
- * To use when the cpuidle framework cannot be used.
- */
-static void __cpuidle default_idle_call(void)
+static void __cpuidle __default_idle_call(void)
 {
 	instrumentation_begin();
 	if (!current_clr_polling_and_test()) {
@@ -130,6 +125,61 @@ static void __cpuidle default_idle_call(
 	instrumentation_end();
 }
 
+#ifdef CONFIG_NO_HZ_COMMON
+
+/* Limit to 4 entries so it fits in a cache line */
+#define IDLE_DUR_ENTRIES	4
+#define IDLE_DUR_MASK		(IDLE_DUR_ENTRIES - 1)
+
+struct idle_nohz_data {
+	u64		duration[IDLE_DUR_ENTRIES];
+	u64		entry_time;
+	u64		sum;
+	unsigned int	idx;
+};
+
+static DEFINE_PER_CPU_ALIGNED(struct idle_nohz_data, nohz_data);
+
+/**
+ * default_idle_call - Default CPU idle routine.
+ *
+ * To use when the cpuidle framework cannot be used.
+ */
+static void default_idle_call(void)
+{
+	struct idle_nohz_data *nd = this_cpu_ptr(&nohz_data);
+	unsigned int idx = nd->idx;
+	s64 delta;
+
+	/*
+	 * If the CPU spends more than a tick on average in idle, try to stop
+	 * the tick.
+	 */
+	if (nd->sum > TICK_NSEC * IDLE_DUR_ENTRIES)
+		tick_nohz_idle_stop_tick();
+
+	__default_idle_call();
+
+	/*
+	 * Build a moving average of the time spent in idle to prevent stopping
+	 * the tick on a loaded system which only goes idle briefly.
+	 */
+	delta = max_t(s64, sched_clock() - nd->entry_time, 0);
+	nd->sum += delta - nd->duration[idx];
+	nd->duration[idx] = delta;
+	nd->idx = (idx + 1) & IDLE_DUR_MASK;
+}
+
+static void default_idle_enter(void)
+{
+	this_cpu_write(nohz_data.entry_time, sched_clock());
+}
+
+#else /* CONFIG_NO_HZ_COMMON */
+static inline void default_idle_call(void) { __default_idle_call(); }
+static inline void default_idle_enter(void) { }
+#endif /* !CONFIG_NO_HZ_COMMON */
+
 static int call_cpuidle_s2idle(struct cpuidle_driver *drv,
 			       struct cpuidle_device *dev,
 			       u64 max_latency_ns)
@@ -186,8 +236,6 @@ static void cpuidle_idle_call(void)
 	}
 
 	if (cpuidle_not_available(drv, dev)) {
-		tick_nohz_idle_stop_tick();
-
 		default_idle_call();
 		goto exit_idle;
 	}
@@ -276,6 +324,7 @@ static void do_idle(void)
 	__current_set_polling();
 
 	tick_nohz_idle_enter();
+	default_idle_enter();
 
 	while (!need_resched()) {