From: Frederic Weisbecker <frederic@kernel.org>
To: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: linux-kernel@vger.kernel.org,
Peter Zijlstra <peterz@infradead.org>,
John Stultz <jstultz@google.com>,
Thomas Gleixner <tglx@linutronix.de>,
Eric Dumazet <edumazet@google.com>,
"Rafael J . Wysocki" <rafael.j.wysocki@intel.com>,
Arjan van de Ven <arjan@infradead.org>,
"Paul E . McKenney" <paulmck@kernel.org>,
Rik van Riel <riel@surriel.com>,
Steven Rostedt <rostedt@goodmis.org>,
Sebastian Siewior <bigeasy@linutronix.de>,
Giovanni Gherdovich <ggherdovich@suse.cz>,
Lukasz Luba <lukasz.luba@arm.com>,
"Gautham R . Shenoy" <gautham.shenoy@amd.com>,
Srinivas Pandruvada <srinivas.pandruvada@intel.com>,
K Prateek Nayak <kprateek.nayak@amd.com>
Subject: Re: [PATCH v10 18/20] timers: Implement the hierarchical pull model
Date: Tue, 30 Jan 2024 22:13:15 +0100 [thread overview]
Message-ID: <Zblma8NNZOftJ5fb@pavilion.home> (raw)
In-Reply-To: <87v87a9033.fsf@somnus>
Le Tue, Jan 30, 2024 at 06:56:32PM +0100, Anna-Maria Behnsen a écrit :
> Frederic Weisbecker <frederic@kernel.org> writes:
> > CPU 1 went idle, CPU 3 will take care of CPU 1's timer. Then come two
> > things happening at the same time: CPU 0 has a timer interrupt, due to
> > RCU callbacks handling for example, and CPU 3 goes offline:
> >
> > CPU 0 CPU 3
> > ----- -----
> > // On top level [GRP1:0], just set migrator = TMIGR_NONE
> > tmigr_inactive_up() {
> > cpu = cpumask_any_but(cpu_online_mask, cpu);
> > //cpu == 0
> > tmc_resched = per_cpu_ptr(&tmigr_cpu, CPU 0);
> > raw_spin_lock(&tmc_resched->lock);
> > tmc_resched->wakeup_recalc = true;
> > raw_spin_unlock(&tmc_resched->lock);
> > // timer interrupt
> > run_local_timers() {
> > tmigr_requires_handle_remote() {
> > data.firstexp = KTIME_MAX;
> > // CPU 0 sees the tmc_resched->wakeup_recalc
> > // latest update
> > if (tmc->wakeup_recalc) {
> > tmigr_requires_handle_remote_up() {
> > // CPU 0 doesn't see GRP0:0
> > // latest update from CPU 1,
> > // because it has no locking
> > // and does a racy check.
> > if (!tmigr_check_migrator(group, childmask))
> > return true;
> > }
> > raw_spin_lock(&tmc->lock);
> > WRITE_ONCE(tmc->wakeup, data.firstexp);
> > tmc->wakeup_recalc = false;
> > raw_spin_unlock(&tmc->lock)
> > return 0;
> > }
> > // IPI is sent only now
> > smp_send_reschedule(cpu);
> > }
> >
> >
> > There is nothing that prevents CPU 0 from not seeing the hierarchy updates from
> > other CPUs since it checks the migrators in a racy way. As a result the timer of
> > CPU 1 may be ignored by CPU 0.
> >
> > You'd need to lock the tmc while calling tmigr_requires_handle_remote_up(), so
> > that CPU 0 "inherits" everything that CPU 3 has seen, and that includes changes
> > from CPU 1.
> >
>
> puhh. ok. But for the !idle case the lockless walk of
> tmigr_requires_handle_remote_up() is ok?
Looks ok to me. It's racy but if the !idle migrator doesn't notice in the
current tick, it will in the next one.
> It's also possible, that the
> CPU misses an update of the state - another CPU goes idle and selects
> this CPU as the new migrator. And this CPU reads a stale value where the
> other CPU is migrator. But this will be revisited on the next
> tick. hmm...
Exactly, and I'm not worried. There has to be strong ordering with atomics
or locking in the idle case because the CPU goes to sleep and it must make
sure not to miss a timer. But in the !idle case the check is periodic, so you
don't need any of that. We can live with an unnoticed timer for a tick or two.
>
> >
> > But I see that tmigr_cpu_new_timer() does it right. Wouldn't it be possible to
> > exlusively let tmigr_cpu_new_timer() handle the wakeup_recalc thing? This is
> > going to be called after the end of the IRQ (whether timer interrupt or sched
> > IPI) in any case.
>
> Should work, yes. But when a timer has to be handled right away and it
> is checked after the end of the IRQ, then the tick might be reprogrammed
> so that CPU comes out of idle, or am I wrong?
If there is a pending timer, it can wait a tick. That's what happens if
we wait for tmigr_cpu_new_timer() to handle it.
But you know what, let's make it more simple. CPU down hotplug is not a
fast path and it doesn't deserve so many optimizations. Just remove ->wakeup_recalc
entirely and if the offlining CPU detects it's the last active CPU in the
hierarchy, just queue an empty work to the first online CPU. It will briefly
force that CPU out of idle and trigger an activate. Then either the CPU
periodically checks remote timers or it will go back idle and notice.
Something like this (untested):
diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index de1905b0bae7..0f15215ef257 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -548,7 +548,6 @@ static void __tmigr_cpu_activate(struct tmigr_cpu *tmc)
tmc->cpuevt.ignore = true;
WRITE_ONCE(tmc->wakeup, KTIME_MAX);
- tmc->wakeup_recalc = false;
walk_groups(&tmigr_active_up, &data, tmc);
}
@@ -1041,41 +1040,11 @@ int tmigr_requires_handle_remote(void)
}
/*
- * If the CPU is idle, check whether the recalculation of @tmc->wakeup
- * is required. @tmc->wakeup_recalc is set, when the last active CPU
- * went offline. The last active CPU delegated the handling of the timer
- * migration hierarchy to another (this) CPU by updating this flag and
- * sending a reschedule.
- *
- * Racy lockless check is valid:
- * - @tmc->wakeup_recalc is set by the remote CPU before it issues
- * reschedule IPI.
- * - As interrupts are disabled here this CPU will either observe
- * @tmc->wakeup_recalc set before the reschedule IPI can be handled or
- * it will observe it when this function is called again on return
- * from handling the reschedule IPI.
- */
- if (tmc->wakeup_recalc) {
- __walk_groups(&tmigr_requires_handle_remote_up, &data, tmc);
-
- if (data.firstexp != KTIME_MAX)
- ret = 1;
-
- raw_spin_lock(&tmc->lock);
- WRITE_ONCE(tmc->wakeup, data.firstexp);
- tmc->wakeup_recalc = false;
- raw_spin_unlock(&tmc->lock);
-
- return ret;
- }
-
- /*
- * When the CPU is idle and @tmc->wakeup is reliable as
- * @tmc->wakeup_recalc is not set, compare it with @data.now. The lock
- * is required on 32bit architectures to read the variable consistently
- * with a concurrent writer. On 64bit the lock is not required because
- * the read operation is not split and so it is always consistent.
-
+ * When the CPU is idle and @tmc->wakeup is reliable, compare it with
+ * @data.now. The lock is required on 32bit architectures to read the
+ * variable consistently with a concurrent writer. On 64bit the lock
+ * is not required because the read operation is not split and so it is
+ * always consistent.
*/
if (IS_ENABLED(CONFIG_64BIT)) {
if (data.now >= READ_ONCE(tmc->wakeup))
@@ -1119,21 +1088,7 @@ u64 tmigr_cpu_new_timer(u64 nextexp)
tmc->cpuevt.ignore) {
ret = tmigr_new_timer(tmc, nextexp);
}
- } else if (tmc->wakeup_recalc) {
- struct tmigr_remote_data data;
-
- data.now = KTIME_MAX;
- data.childmask = tmc->childmask;
- data.firstexp = KTIME_MAX;
- data.tmc_active = false;
- data.check = false;
-
- __walk_groups(&tmigr_requires_handle_remote_up, &data, tmc);
-
- ret = data.firstexp;
}
- tmc->wakeup_recalc = false;
-
/*
* Make sure the reevaluation of timers in idle path will not miss an
* event.
@@ -1212,36 +1167,7 @@ static bool tmigr_inactive_up(struct tmigr_group *group,
* hierarchy) and
* - there is a pending event in the hierarchy
*/
- if (data->firstexp != KTIME_MAX) {
- WARN_ON_ONCE(group->parent);
- /*
- * Top level path: If this CPU is about going offline and was
- * the last active CPU, wake up some random other CPU so it will
- * take over the migrator duty and program its timer
- * properly. Ideally wake the CPU with the closest expiry time,
- * but that's overkill to figure out.
- *
- * Set wakeup_recalc of remote CPU, to make sure the complete
- * idle hierarchy with enqueued timers is reevaluated.
- */
- if (!(this_cpu_ptr(&tmigr_cpu)->online)) {
- struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
- unsigned int cpu = smp_processor_id();
- struct tmigr_cpu *tmc_resched;
-
- cpu = cpumask_any_but(cpu_online_mask, cpu);
- tmc_resched = per_cpu_ptr(&tmigr_cpu, cpu);
-
- raw_spin_unlock(&tmc->lock);
-
- raw_spin_lock(&tmc_resched->lock);
- tmc_resched->wakeup_recalc = true;
- raw_spin_unlock(&tmc_resched->lock);
-
- raw_spin_lock(&tmc->lock);
- smp_send_reschedule(cpu);
- }
- }
+ WARN_ON_ONCE(data->firstexp != KTIME_MAX && group->parent);
return walk_done;
}
@@ -1579,9 +1505,20 @@ static int tmigr_cpu_online(unsigned int cpu)
return 0;
}
+long tmigr_trigger_active(void *unused)
+{
+ struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
+
+ WARN_ON_ONCE(!tmc->online || tmc->idle);
+
+ return 0;
+}
+
static int tmigr_cpu_offline(unsigned int cpu)
{
struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
+ int migrator;
+ u64 firstexp;
raw_spin_lock_irq(&tmc->lock);
tmc->online = false;
@@ -1591,9 +1528,14 @@ static int tmigr_cpu_offline(unsigned int cpu)
* CPU has to handle the local events on his own, when on the way to
* offline; Therefore nextevt value is set to KTIME_MAX
*/
- __tmigr_cpu_deactivate(tmc, KTIME_MAX);
+ firstexp = __tmigr_cpu_deactivate(tmc, KTIME_MAX);
raw_spin_unlock_irq(&tmc->lock);
+ if (firstexp != KTIME_MAX) {
+ migrator = cpumask_any_but(cpu_online_mask, cpu);
+ work_on_cpu(migrator, tmigr_trigger_active, NULL);
+ }
+
return 0;
}
diff --git a/kernel/time/timer_migration.h b/kernel/time/timer_migration.h
index c32947cf429b..c556d5824792 100644
--- a/kernel/time/timer_migration.h
+++ b/kernel/time/timer_migration.h
@@ -78,18 +78,12 @@ struct tmigr_group {
* @idle: Indicates whether the CPU is idle in the timer migration
* hierarchy
* @remote: Is set when timers of the CPU are expired remotely
- * @wakeup_recalc: Indicates, whether a recalculation of the @wakeup value
- * is required. @wakeup_recalc is only used by this CPU
- * when it is marked idle in the timer migration
- * hierarchy. It is set by a remote CPU which was the last
- * active CPU and is on the way to idle.
* @tmgroup: Pointer to the parent group
* @childmask: childmask of tmigr_cpu in the parent group
* @wakeup: Stores the first timer when the timer migration
* hierarchy is completely idle and remote expiry was done;
* is returned to timer code in the idle path and is only
- * used in idle path; it is only valid, when @wakeup_recalc
- * is not set.
+ * used in idle path.
* @cpuevt: CPU event which could be enqueued into the parent group
*/
struct tmigr_cpu {
@@ -97,7 +91,6 @@ struct tmigr_cpu {
bool online;
bool idle;
bool remote;
- bool wakeup_recalc;
struct tmigr_group *tmgroup;
u8 childmask;
u64 wakeup;
next prev parent reply other threads:[~2024-01-30 21:13 UTC|newest]
Thread overview: 89+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-01-15 14:37 [PATCH v10 00/20] timers: Move from a push remote at enqueue to a pull at expiry model Anna-Maria Behnsen
2024-01-15 14:37 ` [PATCH v10 01/20] timers: Restructure get_next_timer_interrupt() Anna-Maria Behnsen
2024-01-17 15:01 ` Frederic Weisbecker
2024-01-15 14:37 ` [PATCH v10 02/20] timers: Split out get next timer interrupt Anna-Maria Behnsen
2024-01-17 15:06 ` Frederic Weisbecker
2024-01-15 14:37 ` [PATCH v10 03/20] timers: Move marking timer bases idle into tick_nohz_stop_tick() Anna-Maria Behnsen
2024-01-17 16:02 ` Frederic Weisbecker
2024-01-22 11:45 ` Anna-Maria Behnsen
2024-01-22 21:49 ` Frederic Weisbecker
2024-02-19 8:52 ` [PATCH v10a] " Anna-Maria Behnsen
2024-02-19 22:37 ` Frederic Weisbecker
2024-02-20 10:48 ` Anna-Maria Behnsen
2024-02-20 11:41 ` Frederic Weisbecker
2024-02-20 12:02 ` Anna-Maria Behnsen
2024-02-20 12:34 ` Frederic Weisbecker
2024-02-20 14:00 ` Anna-Maria Behnsen
2024-02-20 15:10 ` Frederic Weisbecker
2024-02-20 15:23 ` Anna-Maria Behnsen
2024-02-20 15:25 ` Frederic Weisbecker
2024-01-15 14:37 ` [PATCH v10 04/20] timers: Optimization for timer_base_try_to_set_idle() Anna-Maria Behnsen
2024-01-17 16:45 ` Frederic Weisbecker
2024-01-22 11:48 ` Anna-Maria Behnsen
2024-01-22 22:22 ` Frederic Weisbecker
2024-01-15 14:37 ` [PATCH v10 05/20] timers: Introduce add_timer() variants which modify timer flags Anna-Maria Behnsen
2024-01-17 17:01 ` Frederic Weisbecker
2024-01-22 11:50 ` Anna-Maria Behnsen
2024-01-15 14:37 ` [PATCH v10 06/20] workqueue: Use global variant for add_timer() Anna-Maria Behnsen
2024-01-15 14:37 ` [PATCH v10 07/20] timers: add_timer_on(): Make sure TIMER_PINNED flag is set Anna-Maria Behnsen
2024-01-15 14:37 ` [PATCH v10 08/20] timers: Ease code in run_local_timers() Anna-Maria Behnsen
2024-01-15 14:37 ` [PATCH v10 09/20] timers: Split next timer interrupt logic Anna-Maria Behnsen
2024-01-23 14:28 ` Frederic Weisbecker
2024-01-15 14:37 ` [PATCH v10 10/20] timers: Keep the pinned timers separate from the others Anna-Maria Behnsen
2024-01-15 14:37 ` [PATCH v10 11/20] timers: Retrieve next expiry of pinned/non-pinned timers separately Anna-Maria Behnsen
2024-01-15 14:37 ` [PATCH v10 12/20] timers: Split out "get next timer interrupt" functionality Anna-Maria Behnsen
2024-01-15 14:37 ` [PATCH v10 13/20] timers: Add get next timer interrupt functionality for remote CPUs Anna-Maria Behnsen
2024-02-19 16:04 ` Frederic Weisbecker
2024-02-19 16:57 ` Anna-Maria Behnsen
2024-01-15 14:37 ` [PATCH v10 14/20] timers: Restructure internal locking Anna-Maria Behnsen
2024-01-24 13:56 ` Frederic Weisbecker
2024-01-15 14:37 ` [PATCH v10 15/20] timers: Check if timers base is handled already Anna-Maria Behnsen
2024-01-24 14:22 ` Frederic Weisbecker
2024-01-15 14:37 ` [PATCH v10 16/20] tick/sched: Split out jiffies update helper function Anna-Maria Behnsen
2024-01-24 14:42 ` Frederic Weisbecker
2024-01-15 14:37 ` [PATCH v10 17/20] timers: Introduce function to check timer base is_idle flag Anna-Maria Behnsen
2024-01-24 14:52 ` Frederic Weisbecker
2024-01-15 14:37 ` [PATCH v10 18/20] timers: Implement the hierarchical pull model Anna-Maria Behnsen
2024-01-25 14:30 ` Frederic Weisbecker
2024-01-28 15:58 ` Anna-Maria Behnsen
2024-01-30 15:29 ` Frederic Weisbecker
2024-01-30 16:45 ` Anna-Maria Behnsen
2024-01-26 12:53 ` Frederic Weisbecker
2024-01-27 22:54 ` Frederic Weisbecker
2024-01-29 10:50 ` Anna-Maria Behnsen
2024-01-29 22:21 ` Frederic Weisbecker
2024-01-30 13:32 ` Anna-Maria Behnsen
2024-01-29 13:50 ` Paul E. McKenney
2024-01-29 1:04 ` Frederic Weisbecker
2024-01-30 17:56 ` Anna-Maria Behnsen
2024-01-30 21:13 ` Frederic Weisbecker [this message]
2024-01-31 11:19 ` Anna-Maria Behnsen
2024-01-30 15:37 ` Frederic Weisbecker
2024-02-01 14:59 ` Anna-Maria Behnsen
2024-02-01 15:05 ` Frederic Weisbecker
2024-02-01 16:15 ` Anna-Maria Behnsen
2024-02-01 17:43 ` Frederic Weisbecker
2024-02-01 20:52 ` Anna-Maria Behnsen
2024-02-05 13:29 ` Anna-Maria Behnsen
2024-02-05 20:30 ` Frederic Weisbecker
2024-02-06 10:06 ` Anna-Maria Behnsen
2024-02-06 10:29 ` Frederic Weisbecker
2024-02-01 16:33 ` Frederic Weisbecker
2024-02-05 15:59 ` Anna-Maria Behnsen
2024-02-05 20:28 ` Frederic Weisbecker
2024-02-04 22:02 ` Frederic Weisbecker
2024-02-06 11:03 ` Anna-Maria Behnsen
2024-02-06 11:11 ` Frederic Weisbecker
2024-02-04 22:32 ` Frederic Weisbecker
2024-02-06 11:36 ` Anna-Maria Behnsen
2024-02-06 13:21 ` Frederic Weisbecker
2024-02-06 14:13 ` Anna-Maria Behnsen
2024-02-06 14:21 ` Frederic Weisbecker
2024-01-15 14:37 ` [PATCH v10 19/20] timer_migration: Add tracepoints Anna-Maria Behnsen
2024-02-01 16:47 ` Frederic Weisbecker
2024-01-15 14:37 ` [PATCH v10 20/20] timers: Always queue timers on the local CPU Anna-Maria Behnsen
2024-02-01 17:36 ` Frederic Weisbecker
2024-02-01 20:58 ` Anna-Maria Behnsen
2024-02-02 11:57 ` Frederic Weisbecker
2024-01-30 22:07 ` [PATCH v10 00/20] timers: Move from a push remote at enqueue to a pull at expiry model Christian Loehle
2024-02-01 15:03 ` Anna-Maria Behnsen
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Zblma8NNZOftJ5fb@pavilion.home \
--to=frederic@kernel.org \
--cc=anna-maria@linutronix.de \
--cc=arjan@infradead.org \
--cc=bigeasy@linutronix.de \
--cc=edumazet@google.com \
--cc=gautham.shenoy@amd.com \
--cc=ggherdovich@suse.cz \
--cc=jstultz@google.com \
--cc=kprateek.nayak@amd.com \
--cc=linux-kernel@vger.kernel.org \
--cc=lukasz.luba@arm.com \
--cc=paulmck@kernel.org \
--cc=peterz@infradead.org \
--cc=rafael.j.wysocki@intel.com \
--cc=riel@surriel.com \
--cc=rostedt@goodmis.org \
--cc=srinivas.pandruvada@intel.com \
--cc=tglx@linutronix.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.