public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 0/8] timers/migration: Fix three possible races and some improvements
@ 2024-07-01 10:18 Anna-Maria Behnsen
  2024-07-01 10:18 ` [PATCH v3 1/8] timers/migration: Do not rely always on group->parent Anna-Maria Behnsen
                   ` (8 more replies)
  0 siblings, 9 replies; 32+ messages in thread
From: Anna-Maria Behnsen @ 2024-07-01 10:18 UTC (permalink / raw)
  To: Frederic Weisbecker, Thomas Gleixner, linux-kernel
  Cc: Borislav Petkov, Anna-Maria Behnsen

Borislav reported a warning in timer migration deactive path

  https://lore.kernel.org/r/20240612090347.GBZmlkc5PwlVpOG6vT@fat_crate.local

Sadly it doesn't reproduce directly. But with the change of timing (by
adding a trace prinkt before the warning), it is possible to trigger the
warning reliable at least in my test setup. The problem here is a racy
check agains group->parent pointer. This is also used in other places in
the code and fixing this racy usage is adressed by the first patch.

There were two other races reported by Frederic in setup path:

  https://lore.kernel.org/r/ZnWOswTMML6ShzYO@localhost.localdomain

  https://lore.kernel.org/r/ZnoIlO22habOyQRe@lothringen

Those races are both is addressed by the change of patch 2.

Some updates/cleanups are provided by patch 3-8. ("timers/migration:
Improve tracing" and "timers/migration: Spare write when nothing changed"
are the same as provided by v2).

Patches are available here:

  https://git.kernel.org/pub/scm/linux/kernel/git/anna-maria/linux-devel.git timers/misc

---
Changes in v3:
- Address the new reported possible race (childmask and parent pointer)
  together with the existing race (both reported by Frederic).
- New cleanup: Two patches to access childmask and parent pointer only in
  one place
- New cleanup: Rename childmask to parentmask as during discussions there
  was some kind of confusion because of the naming
- New cleanup: Fix typo
- Fix prefix in all patches (s$timer_migration$timers/migration$)
- Link to v2: https://lore.kernel.org/r/20240624-tmigr-fixes-v2-0-3eb4c0604790@linutronix.de

Changes in v2:
- Address another possible race in setup code (reported by Frederic) and
  recycle therefore one improvement patch
- Change order and move the already existing improvement patch to the end
  of the queue
- Existing patches didn't change
- Link to v1: https://lore.kernel.org/r/20240621-tmigr-fixes-v1-0-8c8a2d8e8d77@linutronix.de

Thanks,

        Anna-Maria

---
Anna-Maria Behnsen (8):
      timers/migration: Do not rely always on group->parent
      timers/migration: Move hierarchy setup into cpuhotplug prepare callback
      timers/migration: Improve tracing
      timers/migration: Use a single struct for hierarchy walk data
      timers/migration: Read childmask and parent pointer in a single place
      timers/migration: Rename childmask by parentmask to make naming more obvious
      timers/migration: Spare write when nothing changed
      timers/migration: Fix grammar in comment

 include/linux/cpuhotplug.h             |   1 +
 include/trace/events/timer_migration.h |   4 +-
 kernel/time/timer_migration.c          | 366 ++++++++++++++++-----------------
 kernel/time/timer_migration.h          |  27 ++-
 4 files changed, 197 insertions(+), 201 deletions(-)


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v3 1/8] timers/migration: Do not rely always on group->parent
  2024-07-01 10:18 [PATCH v3 0/8] timers/migration: Fix three possible races and some improvements Anna-Maria Behnsen
@ 2024-07-01 10:18 ` Anna-Maria Behnsen
  2024-07-04 18:32   ` [tip: timers/core] " tip-bot2 for Anna-Maria Behnsen
  2024-07-01 10:18 ` [PATCH v3 2/8] timers/migration: Move hierarchy setup into cpuhotplug prepare callback Anna-Maria Behnsen
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Anna-Maria Behnsen @ 2024-07-01 10:18 UTC (permalink / raw)
  To: Frederic Weisbecker, Thomas Gleixner, linux-kernel
  Cc: Borislav Petkov, Anna-Maria Behnsen

When reading group->parent without holding the group lock it is racy
against CPUs coming online the first time and thereby creating another
level of the hierarchy. This is not a problem when this value is read once
to decide whether to abort a propagation or not. The worst outcome is an
unnecessary/early CPU wake up. But it is racy when reading it several times
during a single 'action' (like activation, deactivation, checking for
remote timer expiry,...) and relying on the consitency of this value
without holding the lock. This happens at the moment e.g. in
tmigr_inactive_up() which is also calling tmigr_udpate_events(). Code relys
on group->parent not to change during this 'action'.

Update parent struct member description to explain the above only
once. Remove parent pointer checks when they are not mandatory (like update
of data->childmask). Remove a warning, which would be nice but the trigger
of this warning is not reliable and add expand the data structure member
description instead. Expand a comment, why it is safe to rely on parent
pointer here (inside hierarchy update).

Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model")
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/time/timer_migration.c | 33 +++++++++++++++------------------
 kernel/time/timer_migration.h | 12 +++++++++++-
 2 files changed, 26 insertions(+), 19 deletions(-)

diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 84413114db5c..d91efe1dc3bf 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -507,7 +507,14 @@ static void walk_groups(up_f up, void *data, struct tmigr_cpu *tmc)
  *			(get_next_timer_interrupt())
  * @firstexp:		Contains the first event expiry information when last
  *			active CPU of hierarchy is on the way to idle to make
- *			sure CPU will be back in time.
+ *			sure CPU will be back in time. It is updated in top
+ *			level group only. Be aware, there could occur a new top
+ *			level of the hierarchy between the 'top level call' in
+ *			tmigr_update_events() and the check for the parent group
+ *			in walk_groups(). Then @firstexp might contain a value
+ *			!= KTIME_MAX even if it was not the final top
+ *			level. This is not a problem, as the worst outcome is a
+ *			CPU which might wake up a little early.
  * @evt:		Pointer to tmigr_event which needs to be queued (of idle
  *			child group)
  * @childmask:		childmask of child group
@@ -649,7 +656,7 @@ static bool tmigr_active_up(struct tmigr_group *group,
 
 	} while (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state));
 
-	if ((walk_done == false) && group->parent)
+	if (walk_done == false)
 		data->childmask = group->childmask;
 
 	/*
@@ -1317,20 +1324,9 @@ static bool tmigr_inactive_up(struct tmigr_group *group,
 	/* Event Handling */
 	tmigr_update_events(group, child, data);
 
-	if (group->parent && (walk_done == false))
+	if (walk_done == false)
 		data->childmask = group->childmask;
 
-	/*
-	 * data->firstexp was set by tmigr_update_events() and contains the
-	 * expiry of the first global event which needs to be handled. It
-	 * differs from KTIME_MAX if:
-	 * - group is the top level group and
-	 * - group is idle (which means CPU was the last active CPU in the
-	 *   hierarchy) and
-	 * - there is a pending event in the hierarchy
-	 */
-	WARN_ON_ONCE(data->firstexp != KTIME_MAX && group->parent);
-
 	trace_tmigr_group_set_cpu_inactive(group, newstate, childmask);
 
 	return walk_done;
@@ -1552,10 +1548,11 @@ static void tmigr_connect_child_parent(struct tmigr_group *child,
 		data.childmask = child->childmask;
 
 		/*
-		 * There is only one new level per time. When connecting the
-		 * child and the parent and set the child active when the parent
-		 * is inactive, the parent needs to be the uppermost
-		 * level. Otherwise there went something wrong!
+		 * There is only one new level per time (which is protected by
+		 * tmigr_mutex). When connecting the child and the parent and
+		 * set the child active when the parent is inactive, the parent
+		 * needs to be the uppermost level. Otherwise there went
+		 * something wrong!
 		 */
 		WARN_ON(!tmigr_active_up(parent, child, &data) && parent->parent);
 	}
diff --git a/kernel/time/timer_migration.h b/kernel/time/timer_migration.h
index 6c37d94a37d9..494f68cc13f4 100644
--- a/kernel/time/timer_migration.h
+++ b/kernel/time/timer_migration.h
@@ -22,7 +22,17 @@ struct tmigr_event {
  * struct tmigr_group - timer migration hierarchy group
  * @lock:		Lock protecting the event information and group hierarchy
  *			information during setup
- * @parent:		Pointer to the parent group
+ * @parent:		Pointer to the parent group. Pointer is updated when a
+ *			new hierarchy level is added because of a CPU coming
+ *			online the first time. Once it is set, the pointer will
+ *			not be removed or updated. When accessing parent pointer
+ *			lock less to decide whether to abort a propagation or
+ *			not, it is not a problem. The worst outcome is an
+ *			unnecessary/early CPU wake up. But do not access parent
+ *			pointer several times in the same 'action' (like
+ *			activation, deactivation, check for remote expiry,...)
+ *			without holding the lock as it is not ensured that value
+ *			will not change.
  * @groupevt:		Next event of the group which is only used when the
  *			group is !active. The group event is then queued into
  *			the parent timer queue.

-- 
2.39.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v3 2/8] timers/migration: Move hierarchy setup into cpuhotplug prepare callback
  2024-07-01 10:18 [PATCH v3 0/8] timers/migration: Fix three possible races and some improvements Anna-Maria Behnsen
  2024-07-01 10:18 ` [PATCH v3 1/8] timers/migration: Do not rely always on group->parent Anna-Maria Behnsen
@ 2024-07-01 10:18 ` Anna-Maria Behnsen
  2024-07-01 21:49   ` Frederic Weisbecker
  2024-07-03 20:28   ` [PATCH v4 " Anna-Maria Behnsen
  2024-07-01 10:18 ` [PATCH v3 3/8] timers/migration: Improve tracing Anna-Maria Behnsen
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 32+ messages in thread
From: Anna-Maria Behnsen @ 2024-07-01 10:18 UTC (permalink / raw)
  To: Frederic Weisbecker, Thomas Gleixner, linux-kernel; +Cc: Anna-Maria Behnsen

When a CPU comes online the first time, it is possible that a new top level
group will be created. In general all propagation is done from the bottom
to top. This minimizes complexity and prevents possible races. But when a
new top level group is created, the formely top level group needs to be
connected to the new level. This is the only time, when the direction to
propagate changes is changed: the changes are propagated from top (new top
level group) to bottom (formerly top level group).

This introduces two races (see (A) and (B)) as reported by Frederic:


(A) This race happens, when marking the formely top level group as active,
but the last active CPU of the formerly top level group goes idle. Then
it's likely that formerly group is no longer active, but marked
nevertheless as active in new top level group:


		  [GRP0:0]
	       migrator = 0
	       active   = 0
	       nextevt  = KTIME_MAX
	       /         \
	      0         1 .. 7
	  active         idle

0) Hierarchy has for now only 8 CPUs and CPU 0 is the only active CPU.


			     [GRP1:0]
			migrator = TMIGR_NONE
			active   = NONE
			nextevt  = KTIME_MAX
					 \
		 [GRP0:0]                  [GRP0:1]
	      migrator = 0              migrator = TMIGR_NONE
	      active   = 0              active   = NONE
	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
		/         \
	      0          1 .. 7                8
	  active         idle                !online

1) CPU 8 is booting and creates a new group in first level GRP0:1 and
   therefore also a new top group GRP1:0. For now the setup code proceeded
   only until the connected between GRP0:1 to the new top group. The
   connection between CPU8 and GRP0:1 is not yet established and CPU 8 is
   still !online.


			     [GRP1:0]
			migrator = TMIGR_NONE
			active   = NONE
			nextevt  = KTIME_MAX
		       /                  \
		 [GRP0:0]                  [GRP0:1]
	      migrator = 0              migrator = TMIGR_NONE
	      active   = 0              active   = NONE
	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
		/         \
	      0          1 .. 7                8
	  active         idle                !online

2) Setup code now connects GRP0:0 to GRP1:0 and observes while in
   tmigr_connect_child_parent() that GRP0:0 is not TMIGR_NONE. So it
   prepares to call tmigr_active_up() on it. It hasn't done it yet.


			     [GRP1:0]
			migrator = TMIGR_NONE
			active   = NONE
			nextevt  = KTIME_MAX
		       /                  \
		 [GRP0:0]                  [GRP0:1]
	      migrator = TMIGR_NONE        migrator = TMIGR_NONE
	      active   = NONE              active   = NONE
	      nextevt  = KTIME_MAX         nextevt  = KTIME_MAX
		/         \
	      0          1 .. 7                8
	    idle         idle                !online

3) CPU 0 goes idle. Since GRP0:0->parent has been updated by CPU 8 with
   GRP0:0->lock held, CPU 0 observes GRP1:0 after calling
   tmigr_update_events() and it propagates the change to the top (no change
   there and no wakeup programmed since there is no timer).


			     [GRP1:0]
			migrator = GRP0:0
			active   = GRP0:0
			nextevt  = KTIME_MAX
		       /                  \
		 [GRP0:0]                  [GRP0:1]
	      migrator = TMIGR_NONE       migrator = TMIGR_NONE
	      active   = NONE             active   = NONE
	      nextevt  = KTIME_MAX        nextevt  = KTIME_MAX
		/         \
	      0          1 .. 7                8
	    idle         idle                !online

4) Now the setup code finally calls tmigr_active_up() to and sets GRP0:0
   active in GRP1:0


			     [GRP1:0]
			migrator = GRP0:0
			active   = GRP0:0, GRP0:1
			nextevt  = KTIME_MAX
		       /                  \
		 [GRP0:0]                  [GRP0:1]
	      migrator = TMIGR_NONE       migrator = 8
	      active   = NONE             active   = 8
	      nextevt  = KTIME_MAX        nextevt  = KTIME_MAX
		/         \                    |
	      0          1 .. 7                8
	    idle         idle                active

5) Now CPU 8 is connected with GRP0:1 and CPU 8 calls tmigr_active_up() out
   of tmigr_cpu_online().


			     [GRP1:0]
			migrator = GRP0:0
			active   = GRP0:0
			nextevt  = T8
		       /                  \
		 [GRP0:0]                  [GRP0:1]
	      migrator = TMIGR_NONE         migrator = TMIGR_NONE
	      active   = NONE               active   = NONE
	      nextevt  = KTIME_MAX          nextevt  = T8
		/         \                    |
	      0          1 .. 7                8
	    idle         idle                  idle

5) CPU 8 goes idle with a timer T8 and relies on GRP0:0 as the migrator.
   But it's not really active, so T8 gets ignored.

--> The update which is done in third step is not noticed by setup code. So
    a wrong migrator is set to top level group and a timer could get
    ignored.


(B) Reading group->parent and group->childmask when an hierarchy update is
ongoing and reaches the formerly top level group is racy as those values
could be inconsistent. (The notation of migrator and active now slightly
changes in contrast to the above example, as now the childmasks are used.)



			     [GRP1:0]
			migrator = TMIGR_NONE
			active   = 0x00
			nextevt  = KTIME_MAX
					 \
		 [GRP0:0]                  [GRP0:1]
	      migrator = TMIGR_NONE     migrator = TMIGR_NONE
	      active   = 0x00           active   = 0x00
	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
	      childmask= 0		childmask= 1
	      parent   = NULL		parent   = GRP1:0
		/         \
	      0          1 .. 7                8
	  idle           idle                !online
	  childmask=1

1) Hierarchy has 8 CPUs. CPU 8 is at the moment in the process of onlining
   but did not yet connect GRP0:0 to GRP1:0.


			     [GRP1:0]
			migrator = TMIGR_NONE
			active   = 0x00
			nextevt  = KTIME_MAX
		       /                  \
		 [GRP0:0]                  [GRP0:1]
	      migrator = TMIGR_NONE     migrator = TMIGR_NONE
	      active   = 0x00           active   = 0x00
	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
	      childmask= 0		childmask= 1
	      parent   = GRP1:0		parent   = GRP1:0
		/         \
	      0          1 .. 7                8
	  idle           idle                !online
	  childmask=1

2) Setup code (running on CPU 8) now connects GRP0:0 to GRP1:0, updates
   parent pointer of GRP0:0 and ...


			     [GRP1:0]
			migrator = TMIGR_NONE
			active   = 0x00
			nextevt  = KTIME_MAX
		       /                  \
		 [GRP0:0]                  [GRP0:1]
	      migrator = 0x01           migrator = TMIGR_NONE
	      active   = 0x01           active   = 0x00
	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
	      childmask= 0		childmask= 1
	      parent   = GRP1:0		parent   = GRP1:0
		/         \
	      0          1 .. 7                8
	  active          idle                !online
	  childmask=1

	  tmigr_walk.childmask = 0

3) ... CPU 0 comes active in the same time. As migrator in GRP0:0 was
   TMIGR_NONE, childmask of GRP0:0 is stored in update propagation data
   structure tmigr_walk (as update of childmask is not yet
   visible/updated). And now ...


			     [GRP1:0]
			migrator = TMIGR_NONE
			active   = 0x00
			nextevt  = KTIME_MAX
		       /                  \
		 [GRP0:0]                  [GRP0:1]
	      migrator = 0x01           migrator = TMIGR_NONE
	      active   = 0x01           active   = 0x00
	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
	      childmask= 2		childmask= 1
	      parent   = GRP1:0		parent   = GRP1:0
		/         \
	      0          1 .. 7                8
	  active          idle                !online
	  childmask=1

	  tmigr_walk.childmask = 0

4) ... childmask of GRP0:0 is updated by CPU 8 (still part of setup
   code).


			     [GRP1:0]
			migrator = 0x00
			active   = 0x00
			nextevt  = KTIME_MAX
		       /                  \
		 [GRP0:0]                  [GRP0:1]
	      migrator = 0x01           migrator = TMIGR_NONE
	      active   = 0x01           active   = 0x00
	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
	      childmask= 2		childmask= 1
	      parent   = GRP1:0		parent   = GRP1:0
		/         \
	      0          1 .. 7                8
	  active          idle                !online
	  childmask=1

	  tmigr_walk.childmask = 0

5) CPU 0 sees the connection to GRP1:0 and now propagates active state to
   GRP1:0 but with childmask = 0 as stored in propagation data structure.

--> Now GRP1:0 always has a migrator as 0x00 != TMIGR_NONE and for all CPUs
    it looks like GRP1:0 is always active.


To prevent those races, the setup of the hierarchy is moved into the
cpuhotplug prepare callback. The prepare callback is not executed by the
CPU which will come online, it is executed by the CPU which prepares
onlining of the other CPU. This CPU will not go idle and so it is ensured,
that the formerly top level group cannot go idle and change top level group
state and the propagation could be done without a risk. The direction for
the updates is now forced to look like "from bottom to top".

Split setup functionality of online callback into the cpuhotplug prepare
callback and setup hotplug state. Reorder the code, that all prepare
related functions are close to each other and online and offline callbacks
are also close together.

Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model")
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
---
 include/linux/cpuhotplug.h    |   1 +
 kernel/time/timer_migration.c | 179 +++++++++++++++++++++++-------------------
 2 files changed, 99 insertions(+), 81 deletions(-)

diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 7a5785f405b6..df59666a2a66 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -122,6 +122,7 @@ enum cpuhp_state {
 	CPUHP_KVM_PPC_BOOK3S_PREPARE,
 	CPUHP_ZCOMP_PREPARE,
 	CPUHP_TIMERS_PREPARE,
+	CPUHP_TMIGR_PREPARE,
 	CPUHP_MIPS_SOC_PREPARE,
 	CPUHP_BP_PREPARE_DYN,
 	CPUHP_BP_PREPARE_DYN_END		= CPUHP_BP_PREPARE_DYN + 20,
diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index d91efe1dc3bf..5c030b30ed0a 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -1438,6 +1438,66 @@ u64 tmigr_quick_check(u64 nextevt)
 	return KTIME_MAX;
 }
 
+/*
+ * tmigr_trigger_active() - trigger a CPU to become active again
+ *
+ * This function is executed on a CPU which is part of cpu_online_mask, when the
+ * last active CPU in the hierarchy is offlining. With this, it is ensured that
+ * the other CPU is active and takes over the migrator duty.
+ */
+static long tmigr_trigger_active(void *unused)
+{
+	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
+
+	WARN_ON_ONCE(!tmc->online || tmc->idle);
+
+	return 0;
+}
+
+static int tmigr_cpu_offline(unsigned int cpu)
+{
+	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
+	int migrator;
+	u64 firstexp;
+
+	raw_spin_lock_irq(&tmc->lock);
+	tmc->online = false;
+	WRITE_ONCE(tmc->wakeup, KTIME_MAX);
+
+	/*
+	 * CPU has to handle the local events on his own, when on the way to
+	 * offline; Therefore nextevt value is set to KTIME_MAX
+	 */
+	firstexp = __tmigr_cpu_deactivate(tmc, KTIME_MAX);
+	trace_tmigr_cpu_offline(tmc);
+	raw_spin_unlock_irq(&tmc->lock);
+
+	if (firstexp != KTIME_MAX) {
+		migrator = cpumask_any_but(cpu_online_mask, cpu);
+		work_on_cpu(migrator, tmigr_trigger_active, NULL);
+	}
+
+	return 0;
+}
+
+static int tmigr_cpu_online(unsigned int cpu)
+{
+	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
+
+	/* Check whether CPU data was successfully initialized */
+	if (WARN_ON_ONCE(!tmc->tmgroup))
+		return -EINVAL;
+
+	raw_spin_lock_irq(&tmc->lock);
+	trace_tmigr_cpu_online(tmc);
+	tmc->idle = timer_base_is_idle();
+	if (!tmc->idle)
+		__tmigr_cpu_activate(tmc);
+	tmc->online = true;
+	raw_spin_unlock_irq(&tmc->lock);
+	return 0;
+}
+
 static void tmigr_init_group(struct tmigr_group *group, unsigned int lvl,
 			     int node)
 {
@@ -1512,7 +1572,7 @@ static struct tmigr_group *tmigr_get_group(unsigned int cpu, int node,
 static void tmigr_connect_child_parent(struct tmigr_group *child,
 				       struct tmigr_group *parent)
 {
-	union tmigr_state childstate;
+	struct tmigr_walk data;
 
 	raw_spin_lock_irq(&child->lock);
 	raw_spin_lock_nested(&parent->lock, SINGLE_DEPTH_NESTING);
@@ -1540,22 +1600,22 @@ static void tmigr_connect_child_parent(struct tmigr_group *child,
 	 *   child to the new parent. So tmigr_connect_child_parent() is
 	 *   executed with the formerly top level group (child) and the newly
 	 *   created group (parent).
+	 *
+	 * * It is ensured, that the child is active, as this setup path is
+	 *   executed in hotplug prepare callback. This is exectued by an
+	 *   already connected and !idle CPU. Even if all other CPUs go idle,
+	 *   the CPU executing the setup will be responsible up to current top
+	 *   level group. Therefore propagate active state unconditionally.
 	 */
-	childstate.state = atomic_read(&child->migr_state);
-	if (childstate.migrator != TMIGR_NONE) {
-		struct tmigr_walk data;
-
-		data.childmask = child->childmask;
+	data.childmask = child->childmask;
 
-		/*
-		 * There is only one new level per time (which is protected by
-		 * tmigr_mutex). When connecting the child and the parent and
-		 * set the child active when the parent is inactive, the parent
-		 * needs to be the uppermost level. Otherwise there went
-		 * something wrong!
-		 */
-		WARN_ON(!tmigr_active_up(parent, child, &data) && parent->parent);
-	}
+	/*
+	 * There is only one new level per time (which is protected by
+	 * tmigr_mutex). When connecting the child and the parent and set the
+	 * child active when the parent is inactive, the parent needs to be the
+	 * uppermost level. Otherwise there went something wrong!
+	 */
+	WARN_ON(!tmigr_active_up(parent, child, &data) && parent->parent);
 }
 
 static int tmigr_setup_groups(unsigned int cpu, unsigned int node)
@@ -1661,80 +1721,32 @@ static int tmigr_add_cpu(unsigned int cpu)
 	return ret;
 }
 
-static int tmigr_cpu_online(unsigned int cpu)
+static int tmigr_cpu_prepare(unsigned int cpu)
 {
-	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
-	int ret;
-
-	/* First online attempt? Initialize CPU data */
-	if (!tmc->tmgroup) {
-		raw_spin_lock_init(&tmc->lock);
-
-		ret = tmigr_add_cpu(cpu);
-		if (ret < 0)
-			return ret;
-
-		if (tmc->childmask == 0)
-			return -EINVAL;
+	struct tmigr_cpu *tmc = per_cpu_ptr(&tmigr_cpu, cpu);
+	int ret = 0;
 
-		timerqueue_init(&tmc->cpuevt.nextevt);
-		tmc->cpuevt.nextevt.expires = KTIME_MAX;
-		tmc->cpuevt.ignore = true;
-		tmc->cpuevt.cpu = cpu;
-
-		tmc->remote = false;
-		WRITE_ONCE(tmc->wakeup, KTIME_MAX);
-	}
-	raw_spin_lock_irq(&tmc->lock);
-	trace_tmigr_cpu_online(tmc);
-	tmc->idle = timer_base_is_idle();
-	if (!tmc->idle)
-		__tmigr_cpu_activate(tmc);
-	tmc->online = true;
-	raw_spin_unlock_irq(&tmc->lock);
-	return 0;
-}
-
-/*
- * tmigr_trigger_active() - trigger a CPU to become active again
- *
- * This function is executed on a CPU which is part of cpu_online_mask, when the
- * last active CPU in the hierarchy is offlining. With this, it is ensured that
- * the other CPU is active and takes over the migrator duty.
- */
-static long tmigr_trigger_active(void *unused)
-{
-	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
+	/* Not first online attempt? */
+	if (tmc->tmgroup)
+		return ret;
 
-	WARN_ON_ONCE(!tmc->online || tmc->idle);
+	raw_spin_lock_init(&tmc->lock);
 
-	return 0;
-}
+	ret = tmigr_add_cpu(cpu);
+	if (ret < 0)
+		return ret;
 
-static int tmigr_cpu_offline(unsigned int cpu)
-{
-	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
-	int migrator;
-	u64 firstexp;
+	if (tmc->childmask == 0)
+		return -EINVAL;
 
-	raw_spin_lock_irq(&tmc->lock);
-	tmc->online = false;
+	timerqueue_init(&tmc->cpuevt.nextevt);
+	tmc->cpuevt.nextevt.expires = KTIME_MAX;
+	tmc->cpuevt.ignore = true;
+	tmc->cpuevt.cpu = cpu;
+	tmc->remote = false;
 	WRITE_ONCE(tmc->wakeup, KTIME_MAX);
 
-	/*
-	 * CPU has to handle the local events on his own, when on the way to
-	 * offline; Therefore nextevt value is set to KTIME_MAX
-	 */
-	firstexp = __tmigr_cpu_deactivate(tmc, KTIME_MAX);
-	trace_tmigr_cpu_offline(tmc);
-	raw_spin_unlock_irq(&tmc->lock);
-
-	if (firstexp != KTIME_MAX) {
-		migrator = cpumask_any_but(cpu_online_mask, cpu);
-		work_on_cpu(migrator, tmigr_trigger_active, NULL);
-	}
-
-	return 0;
+	return ret;
 }
 
 static int __init tmigr_init(void)
@@ -1793,6 +1805,11 @@ static int __init tmigr_init(void)
 		tmigr_hierarchy_levels, TMIGR_CHILDREN_PER_GROUP,
 		tmigr_crossnode_level);
 
+	ret = cpuhp_setup_state(CPUHP_AP_TMIGR_ONLINE, "tmigr:prepare",
+				tmigr_cpu_prepare, NULL);
+	if (ret)
+		goto err;
+
 	ret = cpuhp_setup_state(CPUHP_AP_TMIGR_ONLINE, "tmigr:online",
 				tmigr_cpu_online, tmigr_cpu_offline);
 	if (ret)

-- 
2.39.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v3 3/8] timers/migration: Improve tracing
  2024-07-01 10:18 [PATCH v3 0/8] timers/migration: Fix three possible races and some improvements Anna-Maria Behnsen
  2024-07-01 10:18 ` [PATCH v3 1/8] timers/migration: Do not rely always on group->parent Anna-Maria Behnsen
  2024-07-01 10:18 ` [PATCH v3 2/8] timers/migration: Move hierarchy setup into cpuhotplug prepare callback Anna-Maria Behnsen
@ 2024-07-01 10:18 ` Anna-Maria Behnsen
  2024-07-04 18:32   ` [tip: timers/core] " tip-bot2 for Anna-Maria Behnsen
  2024-07-01 10:18 ` [PATCH v3 4/8] timers/migration: Use a single struct for hierarchy walk data Anna-Maria Behnsen
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Anna-Maria Behnsen @ 2024-07-01 10:18 UTC (permalink / raw)
  To: Frederic Weisbecker, Thomas Gleixner, linux-kernel; +Cc: Anna-Maria Behnsen

Trace points of inactive and active propagation are located at the end of
the related functions. The interesting information of those trace points is
the updated group state. When trace points are not located directly at the
place where group state changed, order of trace points in traces could be
confusing.

Move inactive and active propagation trace points directly after update of
group state values.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/time/timer_migration.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 5c030b30ed0a..0ae7f2084d27 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -656,6 +656,8 @@ static bool tmigr_active_up(struct tmigr_group *group,
 
 	} while (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state));
 
+	trace_tmigr_group_set_cpu_active(group, newstate, childmask);
+
 	if (walk_done == false)
 		data->childmask = group->childmask;
 
@@ -673,8 +675,6 @@ static bool tmigr_active_up(struct tmigr_group *group,
 	 */
 	group->groupevt.ignore = true;
 
-	trace_tmigr_group_set_cpu_active(group, newstate, childmask);
-
 	return walk_done;
 }
 
@@ -1306,9 +1306,10 @@ static bool tmigr_inactive_up(struct tmigr_group *group,
 
 		WARN_ON_ONCE((newstate.migrator != TMIGR_NONE) && !(newstate.active));
 
-		if (atomic_try_cmpxchg(&group->migr_state, &curstate.state,
-				       newstate.state))
+		if (atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state)) {
+			trace_tmigr_group_set_cpu_inactive(group, newstate, childmask);
 			break;
+		}
 
 		/*
 		 * The memory barrier is paired with the cmpxchg() in
@@ -1327,8 +1328,6 @@ static bool tmigr_inactive_up(struct tmigr_group *group,
 	if (walk_done == false)
 		data->childmask = group->childmask;
 
-	trace_tmigr_group_set_cpu_inactive(group, newstate, childmask);
-
 	return walk_done;
 }
 

-- 
2.39.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v3 4/8] timers/migration: Use a single struct for hierarchy walk data
  2024-07-01 10:18 [PATCH v3 0/8] timers/migration: Fix three possible races and some improvements Anna-Maria Behnsen
                   ` (2 preceding siblings ...)
  2024-07-01 10:18 ` [PATCH v3 3/8] timers/migration: Improve tracing Anna-Maria Behnsen
@ 2024-07-01 10:18 ` Anna-Maria Behnsen
  2024-07-02 11:43   ` Frederic Weisbecker
  2024-07-04 18:32   ` [tip: timers/core] " tip-bot2 for Anna-Maria Behnsen
  2024-07-01 10:18 ` [PATCH v3 5/8] timers/migration: Read childmask and parent pointer in a single place Anna-Maria Behnsen
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 32+ messages in thread
From: Anna-Maria Behnsen @ 2024-07-01 10:18 UTC (permalink / raw)
  To: Frederic Weisbecker, Thomas Gleixner, linux-kernel; +Cc: Anna-Maria Behnsen

Two different structs are defined for propagating data from one to another
level when walking the hierarchy. Several struct members exist in both
structs which makes generalization harder.

Merge those two structs into a single one and use it directly in
walk_groups() and the corresponding function pointers instead of
introducing pointer casting all over the place.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
---
 kernel/time/timer_migration.c | 126 ++++++++++++++++++------------------------
 1 file changed, 55 insertions(+), 71 deletions(-)

diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 0ae7f2084d27..b4391abfb4a9 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -475,69 +475,31 @@ static bool tmigr_check_lonely(struct tmigr_group *group)
 	return bitmap_weight(&active, BIT_CNT) <= 1;
 }
 
-typedef bool (*up_f)(struct tmigr_group *, struct tmigr_group *, void *);
-
-static void __walk_groups(up_f up, void *data,
-			  struct tmigr_cpu *tmc)
-{
-	struct tmigr_group *child = NULL, *group = tmc->tmgroup;
-
-	do {
-		WARN_ON_ONCE(group->level >= tmigr_hierarchy_levels);
-
-		if (up(group, child, data))
-			break;
-
-		child = group;
-		group = group->parent;
-	} while (group);
-}
-
-static void walk_groups(up_f up, void *data, struct tmigr_cpu *tmc)
-{
-	lockdep_assert_held(&tmc->lock);
-
-	__walk_groups(up, data, tmc);
-}
-
 /**
  * struct tmigr_walk - data required for walking the hierarchy
  * @nextexp:		Next CPU event expiry information which is handed into
  *			the timer migration code by the timer code
  *			(get_next_timer_interrupt())
- * @firstexp:		Contains the first event expiry information when last
- *			active CPU of hierarchy is on the way to idle to make
- *			sure CPU will be back in time. It is updated in top
- *			level group only. Be aware, there could occur a new top
- *			level of the hierarchy between the 'top level call' in
- *			tmigr_update_events() and the check for the parent group
- *			in walk_groups(). Then @firstexp might contain a value
- *			!= KTIME_MAX even if it was not the final top
- *			level. This is not a problem, as the worst outcome is a
- *			CPU which might wake up a little early.
+ * @firstexp:		Contains the first event expiry information when
+ *			hierarchy is completely idle.  When CPU itself was the
+ *			last going idle, information makes sure, that CPU will
+ *			be back in time. When using this value in the remote
+ *			expiry case, firstexp is stored in the per CPU tmigr_cpu
+ *			struct of CPU which expires remote timers. It is updated
+ *			in top level group only. Be aware, there could occur a
+ *			new top level of the hierarchy between the 'top level
+ *			call' in tmigr_update_events() and the check for the
+ *			parent group in walk_groups(). Then @firstexp might
+ *			contain a value != KTIME_MAX even if it was not the
+ *			final top level. This is not a problem, as the worst
+ *			outcome is a CPU which might wake up a little early.
  * @evt:		Pointer to tmigr_event which needs to be queued (of idle
  *			child group)
  * @childmask:		childmask of child group
  * @remote:		Is set, when the new timer path is executed in
  *			tmigr_handle_remote_cpu()
- */
-struct tmigr_walk {
-	u64			nextexp;
-	u64			firstexp;
-	struct tmigr_event	*evt;
-	u8			childmask;
-	bool			remote;
-};
-
-/**
- * struct tmigr_remote_data - data required for remote expiry hierarchy walk
  * @basej:		timer base in jiffies
  * @now:		timer base monotonic
- * @firstexp:		returns expiry of the first timer in the idle timer
- *			migration hierarchy to make sure the timer is handled in
- *			time; it is stored in the per CPU tmigr_cpu struct of
- *			CPU which expires remote timers
- * @childmask:		childmask of child group
  * @check:		is set if there is the need to handle remote timers;
  *			required in tmigr_requires_handle_remote() only
  * @tmc_active:		this flag indicates, whether the CPU which triggers
@@ -546,15 +508,43 @@ struct tmigr_walk {
  *			idle, only the first event of the top level has to be
  *			considered.
  */
-struct tmigr_remote_data {
-	unsigned long	basej;
-	u64		now;
-	u64		firstexp;
-	u8		childmask;
-	bool		check;
-	bool		tmc_active;
+struct tmigr_walk {
+	u64			nextexp;
+	u64			firstexp;
+	struct tmigr_event	*evt;
+	u8			childmask;
+	bool			remote;
+	unsigned long		basej;
+	u64			now;
+	bool			check;
+	bool			tmc_active;
 };
 
+typedef bool (*up_f)(struct tmigr_group *, struct tmigr_group *, struct tmigr_walk *);
+
+static void __walk_groups(up_f up, struct tmigr_walk *data,
+			  struct tmigr_cpu *tmc)
+{
+	struct tmigr_group *child = NULL, *group = tmc->tmgroup;
+
+	do {
+		WARN_ON_ONCE(group->level >= tmigr_hierarchy_levels);
+
+		if (up(group, child, data))
+			break;
+
+		child = group;
+		group = group->parent;
+	} while (group);
+}
+
+static void walk_groups(up_f up, struct tmigr_walk *data, struct tmigr_cpu *tmc)
+{
+	lockdep_assert_held(&tmc->lock);
+
+	__walk_groups(up, data, tmc);
+}
+
 /*
  * Returns the next event of the timerqueue @group->events
  *
@@ -625,10 +615,9 @@ static u64 tmigr_next_groupevt_expires(struct tmigr_group *group)
 
 static bool tmigr_active_up(struct tmigr_group *group,
 			    struct tmigr_group *child,
-			    void *ptr)
+			    struct tmigr_walk *data)
 {
 	union tmigr_state curstate, newstate;
-	struct tmigr_walk *data = ptr;
 	bool walk_done;
 	u8 childmask;
 
@@ -867,10 +856,8 @@ bool tmigr_update_events(struct tmigr_group *group, struct tmigr_group *child,
 
 static bool tmigr_new_timer_up(struct tmigr_group *group,
 			       struct tmigr_group *child,
-			       void *ptr)
+			       struct tmigr_walk *data)
 {
-	struct tmigr_walk *data = ptr;
-
 	return tmigr_update_events(group, child, data);
 }
 
@@ -1002,9 +989,8 @@ static void tmigr_handle_remote_cpu(unsigned int cpu, u64 now,
 
 static bool tmigr_handle_remote_up(struct tmigr_group *group,
 				   struct tmigr_group *child,
-				   void *ptr)
+				   struct tmigr_walk *data)
 {
-	struct tmigr_remote_data *data = ptr;
 	struct tmigr_event *evt;
 	unsigned long jif;
 	u8 childmask;
@@ -1062,7 +1048,7 @@ static bool tmigr_handle_remote_up(struct tmigr_group *group,
 void tmigr_handle_remote(void)
 {
 	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
-	struct tmigr_remote_data data;
+	struct tmigr_walk data;
 
 	if (tmigr_is_not_available(tmc))
 		return;
@@ -1104,9 +1090,8 @@ void tmigr_handle_remote(void)
 
 static bool tmigr_requires_handle_remote_up(struct tmigr_group *group,
 					    struct tmigr_group *child,
-					    void *ptr)
+					    struct tmigr_walk *data)
 {
-	struct tmigr_remote_data *data = ptr;
 	u8 childmask;
 
 	childmask = data->childmask;
@@ -1164,7 +1149,7 @@ static bool tmigr_requires_handle_remote_up(struct tmigr_group *group,
 bool tmigr_requires_handle_remote(void)
 {
 	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
-	struct tmigr_remote_data data;
+	struct tmigr_walk data;
 	unsigned long jif;
 	bool ret = false;
 
@@ -1252,10 +1237,9 @@ u64 tmigr_cpu_new_timer(u64 nextexp)
 
 static bool tmigr_inactive_up(struct tmigr_group *group,
 			      struct tmigr_group *child,
-			      void *ptr)
+			      struct tmigr_walk *data)
 {
 	union tmigr_state curstate, newstate, childstate;
-	struct tmigr_walk *data = ptr;
 	bool walk_done;
 	u8 childmask;
 

-- 
2.39.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v3 5/8] timers/migration: Read childmask and parent pointer in a single place
  2024-07-01 10:18 [PATCH v3 0/8] timers/migration: Fix three possible races and some improvements Anna-Maria Behnsen
                   ` (3 preceding siblings ...)
  2024-07-01 10:18 ` [PATCH v3 4/8] timers/migration: Use a single struct for hierarchy walk data Anna-Maria Behnsen
@ 2024-07-01 10:18 ` Anna-Maria Behnsen
  2024-07-02 12:04   ` Frederic Weisbecker
                     ` (2 more replies)
  2024-07-01 10:18 ` [PATCH v3 6/8] timers/migration: Rename childmask by parentmask to make naming more obvious Anna-Maria Behnsen
                   ` (3 subsequent siblings)
  8 siblings, 3 replies; 32+ messages in thread
From: Anna-Maria Behnsen @ 2024-07-01 10:18 UTC (permalink / raw)
  To: Frederic Weisbecker, Thomas Gleixner, linux-kernel; +Cc: Anna-Maria Behnsen

Reading the childmask and parent pointer is required when propagating
changes through the hierarchy. At the moment this reads are spread all over
the place which makes it harder to follow.

Move those reads to a single place to keep code clean.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
---
 kernel/time/timer_migration.c | 20 +++++---------------
 1 file changed, 5 insertions(+), 15 deletions(-)

diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index b4391abfb4a9..a681cf89910e 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -535,6 +535,7 @@ static void __walk_groups(up_f up, struct tmigr_walk *data,
 
 		child = group;
 		group = group->parent;
+		data->childmask = group->childmask;
 	} while (group);
 }
 
@@ -647,9 +648,6 @@ static bool tmigr_active_up(struct tmigr_group *group,
 
 	trace_tmigr_group_set_cpu_active(group, newstate, childmask);
 
-	if (walk_done == false)
-		data->childmask = group->childmask;
-
 	/*
 	 * The group is active (again). The group event might be still queued
 	 * into the parent group's timerqueue but can now be handled by the
@@ -1027,12 +1025,10 @@ static bool tmigr_handle_remote_up(struct tmigr_group *group,
 	}
 
 	/*
-	 * Update of childmask for the next level and keep track of the expiry
-	 * of the first event that needs to be handled (group->next_expiry was
-	 * updated by tmigr_next_expired_groupevt(), next was set by
-	 * tmigr_handle_remote_cpu()).
+	 * Keep track of the expiry of the first event that needs to be handled
+	 * (group->next_expiry was updated by tmigr_next_expired_groupevt(),
+	 * next was set by tmigr_handle_remote_cpu()).
 	 */
-	data->childmask = group->childmask;
 	data->firstexp = group->next_expiry;
 
 	raw_spin_unlock_irq(&group->lock);
@@ -1110,7 +1106,7 @@ static bool tmigr_requires_handle_remote_up(struct tmigr_group *group,
 	 * group before reading the next_expiry value.
 	 */
 	if (group->parent && !data->tmc_active)
-		goto out;
+		return false;
 
 	/*
 	 * The lock is required on 32bit architectures to read the variable
@@ -1135,9 +1131,6 @@ static bool tmigr_requires_handle_remote_up(struct tmigr_group *group,
 		raw_spin_unlock(&group->lock);
 	}
 
-out:
-	/* Update of childmask for the next level */
-	data->childmask = group->childmask;
 	return false;
 }
 
@@ -1309,9 +1302,6 @@ static bool tmigr_inactive_up(struct tmigr_group *group,
 	/* Event Handling */
 	tmigr_update_events(group, child, data);
 
-	if (walk_done == false)
-		data->childmask = group->childmask;
-
 	return walk_done;
 }
 

-- 
2.39.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v3 6/8] timers/migration: Rename childmask by parentmask to make naming more obvious
  2024-07-01 10:18 [PATCH v3 0/8] timers/migration: Fix three possible races and some improvements Anna-Maria Behnsen
                   ` (4 preceding siblings ...)
  2024-07-01 10:18 ` [PATCH v3 5/8] timers/migration: Read childmask and parent pointer in a single place Anna-Maria Behnsen
@ 2024-07-01 10:18 ` Anna-Maria Behnsen
  2024-07-02 12:45   ` Frederic Weisbecker
  2024-07-03 20:31   ` [PATCH v4 6/8] timers/migration: Rename childmask by groupmask " Anna-Maria Behnsen
  2024-07-01 10:18 ` [PATCH v3 7/8] timers/migration: Spare write when nothing changed Anna-Maria Behnsen
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 32+ messages in thread
From: Anna-Maria Behnsen @ 2024-07-01 10:18 UTC (permalink / raw)
  To: Frederic Weisbecker, Thomas Gleixner, linux-kernel; +Cc: Anna-Maria Behnsen

childmask in the group reflects the mask that is required to 'reference'
this group in the parent. When reading childmask, this might be confusing,
as this suggests, that this is the mask of the child of the group.

Clarify this by renaming childmask in the tmigr_group and tmc_group by
parentmask.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
---
 include/trace/events/timer_migration.h |  4 ++--
 kernel/time/timer_migration.c          | 24 ++++++++++++------------
 kernel/time/timer_migration.h          | 15 +++++++--------
 3 files changed, 21 insertions(+), 22 deletions(-)

diff --git a/include/trace/events/timer_migration.h b/include/trace/events/timer_migration.h
index 79f19e76a80b..e3c5afd1638d 100644
--- a/include/trace/events/timer_migration.h
+++ b/include/trace/events/timer_migration.h
@@ -52,7 +52,7 @@ TRACE_EVENT(tmigr_connect_child_parent,
 		__entry->lvl		= child->parent->level;
 		__entry->numa_node	= child->parent->numa_node;
 		__entry->num_children	= child->parent->num_children;
-		__entry->childmask	= child->childmask;
+		__entry->childmask	= child->parentmask;
 	),
 
 	TP_printk("group=%p childmask=%0x parent=%p lvl=%d numa=%d num_children=%d",
@@ -81,7 +81,7 @@ TRACE_EVENT(tmigr_connect_cpu_parent,
 		__entry->lvl		= tmc->tmgroup->level;
 		__entry->numa_node	= tmc->tmgroup->numa_node;
 		__entry->num_children	= tmc->tmgroup->num_children;
-		__entry->childmask	= tmc->childmask;
+		__entry->childmask	= tmc->parentmask;
 	),
 
 	TP_printk("cpu=%d childmask=%0x parent=%p lvl=%d numa=%d num_children=%d",
diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index a681cf89910e..b73d89e78163 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -495,7 +495,7 @@ static bool tmigr_check_lonely(struct tmigr_group *group)
  *			outcome is a CPU which might wake up a little early.
  * @evt:		Pointer to tmigr_event which needs to be queued (of idle
  *			child group)
- * @childmask:		childmask of child group
+ * @childmask:		parentmask of child group
  * @remote:		Is set, when the new timer path is executed in
  *			tmigr_handle_remote_cpu()
  * @basej:		timer base in jiffies
@@ -535,7 +535,7 @@ static void __walk_groups(up_f up, struct tmigr_walk *data,
 
 		child = group;
 		group = group->parent;
-		data->childmask = group->childmask;
+		data->childmask = group->parentmask;
 	} while (group);
 }
 
@@ -669,7 +669,7 @@ static void __tmigr_cpu_activate(struct tmigr_cpu *tmc)
 {
 	struct tmigr_walk data;
 
-	data.childmask = tmc->childmask;
+	data.childmask = tmc->parentmask;
 
 	trace_tmigr_cpu_active(tmc);
 
@@ -1049,7 +1049,7 @@ void tmigr_handle_remote(void)
 	if (tmigr_is_not_available(tmc))
 		return;
 
-	data.childmask = tmc->childmask;
+	data.childmask = tmc->parentmask;
 	data.firstexp = KTIME_MAX;
 
 	/*
@@ -1057,7 +1057,7 @@ void tmigr_handle_remote(void)
 	 * in tmigr_handle_remote_up() anyway. Keep this check to speed up the
 	 * return when nothing has to be done.
 	 */
-	if (!tmigr_check_migrator(tmc->tmgroup, tmc->childmask)) {
+	if (!tmigr_check_migrator(tmc->tmgroup, tmc->parentmask)) {
 		/*
 		 * If this CPU was an idle migrator, make sure to clear its wakeup
 		 * value so it won't chase timers that have already expired elsewhere.
@@ -1150,7 +1150,7 @@ bool tmigr_requires_handle_remote(void)
 		return ret;
 
 	data.now = get_jiffies_update(&jif);
-	data.childmask = tmc->childmask;
+	data.childmask = tmc->parentmask;
 	data.firstexp = KTIME_MAX;
 	data.tmc_active = !tmc->idle;
 	data.check = false;
@@ -1310,7 +1310,7 @@ static u64 __tmigr_cpu_deactivate(struct tmigr_cpu *tmc, u64 nextexp)
 	struct tmigr_walk data = { .nextexp = nextexp,
 				   .firstexp = KTIME_MAX,
 				   .evt = &tmc->cpuevt,
-				   .childmask = tmc->childmask };
+				   .childmask = tmc->parentmask };
 
 	/*
 	 * If nextexp is KTIME_MAX, the CPU event will be ignored because the
@@ -1388,7 +1388,7 @@ u64 tmigr_quick_check(u64 nextevt)
 	if (WARN_ON_ONCE(tmc->idle))
 		return nextevt;
 
-	if (!tmigr_check_migrator_and_lonely(tmc->tmgroup, tmc->childmask))
+	if (!tmigr_check_migrator_and_lonely(tmc->tmgroup, tmc->parentmask))
 		return KTIME_MAX;
 
 	do {
@@ -1551,7 +1551,7 @@ static void tmigr_connect_child_parent(struct tmigr_group *child,
 	raw_spin_lock_nested(&parent->lock, SINGLE_DEPTH_NESTING);
 
 	child->parent = parent;
-	child->childmask = BIT(parent->num_children++);
+	child->parentmask = BIT(parent->num_children++);
 
 	raw_spin_unlock(&parent->lock);
 	raw_spin_unlock_irq(&child->lock);
@@ -1580,7 +1580,7 @@ static void tmigr_connect_child_parent(struct tmigr_group *child,
 	 *   the CPU executing the setup will be responsible up to current top
 	 *   level group. Therefore propagate active state unconditionally.
 	 */
-	data.childmask = child->childmask;
+	data.childmask = child->parentmask;
 
 	/*
 	 * There is only one new level per time (which is protected by
@@ -1646,7 +1646,7 @@ static int tmigr_setup_groups(unsigned int cpu, unsigned int node)
 			raw_spin_lock_irq(&group->lock);
 
 			tmc->tmgroup = group;
-			tmc->childmask = BIT(group->num_children++);
+			tmc->parentmask = BIT(group->num_children++);
 
 			raw_spin_unlock_irq(&group->lock);
 
@@ -1709,7 +1709,7 @@ static int tmigr_cpu_prepare(unsigned int cpu)
 	if (ret < 0)
 		return ret;
 
-	if (tmc->childmask == 0)
+	if (tmc->parentmask == 0)
 		return -EINVAL;
 
 	timerqueue_init(&tmc->cpuevt.nextevt);
diff --git a/kernel/time/timer_migration.h b/kernel/time/timer_migration.h
index 494f68cc13f4..3d08e611c3d6 100644
--- a/kernel/time/timer_migration.h
+++ b/kernel/time/timer_migration.h
@@ -51,9 +51,8 @@ struct tmigr_event {
  * @num_children:	Counter of group children to make sure the group is only
  *			filled with TMIGR_CHILDREN_PER_GROUP; Required for setup
  *			only
- * @childmask:		childmask of the group in the parent group; is set
- *			during setup and will never change; can be read
- *			lockless
+ * @parentmask:		mask of the group in the parent group; is set during
+ *			setup and will never change; can be read lockless
  * @list:		List head that is added to the per level
  *			tmigr_level_list; is required during setup when a
  *			new group needs to be connected to the existing
@@ -69,7 +68,7 @@ struct tmigr_group {
 	unsigned int		level;
 	int			numa_node;
 	unsigned int		num_children;
-	u8			childmask;
+	u8			parentmask;
 	struct list_head	list;
 };
 
@@ -89,7 +88,7 @@ struct tmigr_group {
  *			hierarchy
  * @remote:		Is set when timers of the CPU are expired remotely
  * @tmgroup:		Pointer to the parent group
- * @childmask:		childmask of tmigr_cpu in the parent group
+ * @parentmask:		mask of tmigr_cpu in the parent group
  * @wakeup:		Stores the first timer when the timer migration
  *			hierarchy is completely idle and remote expiry was done;
  *			is returned to timer code in the idle path and is only
@@ -102,7 +101,7 @@ struct tmigr_cpu {
 	bool			idle;
 	bool			remote;
 	struct tmigr_group	*tmgroup;
-	u8			childmask;
+	u8			parentmask;
 	u64			wakeup;
 	struct tmigr_event	cpuevt;
 };
@@ -118,8 +117,8 @@ union tmigr_state {
 	u32 state;
 	/**
 	 * struct - split state of tmigr_group
-	 * @active:	Contains each childmask bit of the active children
-	 * @migrator:	Contains childmask of the child which is migrator
+	 * @active:	Contains each mask bit of the active children
+	 * @migrator:	Contains mask of the child which is migrator
 	 * @seq:	Sequence counter needs to be increased when an update
 	 *		to the tmigr_state is done. It prevents a race when
 	 *		updates in the child groups are propagated in changed

-- 
2.39.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v3 7/8] timers/migration: Spare write when nothing changed
  2024-07-01 10:18 [PATCH v3 0/8] timers/migration: Fix three possible races and some improvements Anna-Maria Behnsen
                   ` (5 preceding siblings ...)
  2024-07-01 10:18 ` [PATCH v3 6/8] timers/migration: Rename childmask by parentmask to make naming more obvious Anna-Maria Behnsen
@ 2024-07-01 10:18 ` Anna-Maria Behnsen
  2024-07-04 18:32   ` [tip: timers/core] " tip-bot2 for Anna-Maria Behnsen
  2024-07-01 10:18 ` [PATCH v3 8/8] timers/migration: Fix grammar in comment Anna-Maria Behnsen
  2024-07-11 15:44 ` [PATCH v3 0/8] timers/migration: Fix three possible races and some improvements Anna-Maria Behnsen
  8 siblings, 1 reply; 32+ messages in thread
From: Anna-Maria Behnsen @ 2024-07-01 10:18 UTC (permalink / raw)
  To: Frederic Weisbecker, Thomas Gleixner, linux-kernel; +Cc: Anna-Maria Behnsen

The wakeup value is written unconditionally in tmigr_cpu_new_timer(). When
there was no new next timer expiry that needs to be propagated, then the
value that was read before is written. This is not required.

Move write to the place where wakeup value could have changed.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/time/timer_migration.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index b73d89e78163..bbc849539dd2 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -1215,14 +1215,13 @@ u64 tmigr_cpu_new_timer(u64 nextexp)
 		if (nextexp != tmc->cpuevt.nextevt.expires ||
 		    tmc->cpuevt.ignore) {
 			ret = tmigr_new_timer(tmc, nextexp);
+			/*
+			 * Make sure the reevaluation of timers in idle path
+			 * will not miss an event.
+			 */
+			WRITE_ONCE(tmc->wakeup, ret);
 		}
 	}
-	/*
-	 * Make sure the reevaluation of timers in idle path will not miss an
-	 * event.
-	 */
-	WRITE_ONCE(tmc->wakeup, ret);
-
 	trace_tmigr_cpu_new_timer_idle(tmc, nextexp);
 	raw_spin_unlock(&tmc->lock);
 	return ret;

-- 
2.39.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v3 8/8] timers/migration: Fix grammar in comment
  2024-07-01 10:18 [PATCH v3 0/8] timers/migration: Fix three possible races and some improvements Anna-Maria Behnsen
                   ` (6 preceding siblings ...)
  2024-07-01 10:18 ` [PATCH v3 7/8] timers/migration: Spare write when nothing changed Anna-Maria Behnsen
@ 2024-07-01 10:18 ` Anna-Maria Behnsen
  2024-07-02 12:51   ` Frederic Weisbecker
  2024-07-04 18:32   ` [tip: timers/core] " tip-bot2 for Anna-Maria Behnsen
  2024-07-11 15:44 ` [PATCH v3 0/8] timers/migration: Fix three possible races and some improvements Anna-Maria Behnsen
  8 siblings, 2 replies; 32+ messages in thread
From: Anna-Maria Behnsen @ 2024-07-01 10:18 UTC (permalink / raw)
  To: Frederic Weisbecker, Thomas Gleixner, linux-kernel; +Cc: Anna-Maria Behnsen

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
---
 kernel/time/timer_migration.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index bbc849539dd2..bc40cac6b58d 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -1368,7 +1368,7 @@ u64 tmigr_cpu_deactivate(u64 nextexp)
  *			  the only one in the level 0 group; and if it is the
  *			  only one in level 0 group, but there are more than a
  *			  single group active on the way to top level)
- * * nextevt		- when CPU is offline and has to handle timer on his own
+ * * nextevt		- when CPU is offline and has to handle timer on its own
  *			  or when on the way to top in every group only a single
  *			  child is active but @nextevt is before the lowest
  *			  next_expiry encountered while walking up to top level.

-- 
2.39.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 2/8] timers/migration: Move hierarchy setup into cpuhotplug prepare callback
  2024-07-01 10:18 ` [PATCH v3 2/8] timers/migration: Move hierarchy setup into cpuhotplug prepare callback Anna-Maria Behnsen
@ 2024-07-01 21:49   ` Frederic Weisbecker
  2024-07-03 20:28   ` [PATCH v4 " Anna-Maria Behnsen
  1 sibling, 0 replies; 32+ messages in thread
From: Frederic Weisbecker @ 2024-07-01 21:49 UTC (permalink / raw)
  To: Anna-Maria Behnsen; +Cc: Thomas Gleixner, linux-kernel

Le Mon, Jul 01, 2024 at 12:18:38PM +0200, Anna-Maria Behnsen a écrit :
> To prevent those races, the setup of the hierarchy is moved into the
> cpuhotplug prepare callback. The prepare callback is not executed by the
> CPU which will come online, it is executed by the CPU which prepares
> onlining of the other CPU. This CPU will not go idle and so it is ensured,
> that the formerly top level group cannot go idle and change top level group
> state and the propagation could be done without a risk. The direction for
> the updates is now forced to look like "from bottom to top".

You might want to elaborate a bit on those last two sentences. For example:

"""
This CPU is active while it is connecting the formerly top level to the new
one. This prevents from (A) to happen and it also prevents from any further walk
above the formerly top level until that active CPU becomes inactive, releasing
the new ->parent and ->childmask updates to be visible by any subsequent walk
up above the formerly top level hierarchy. This prevents from (B) to happen.
"""

However if the active CPU prevents from tmigr_cpu_(in)active() to walk
up with the update not-or-half visible, nothing prevents walking up to
the new top with a 0 childmask in tmigr_handle_remote_up() or
tmigr_requires_handle_remote_up() if the active CPU doing the prepare is not
the migrator. But then it looks fine because:

* tmigr_check_migrator() should just return false
* The migrator is active and should eventually observe the new childmask
  at some point in a future tick.

But still it can be a good idea to mention that somewhere.

> @@ -1540,22 +1600,22 @@ static void tmigr_connect_child_parent(struct tmigr_group *child,
>  	 *   child to the new parent. So tmigr_connect_child_parent() is
>  	 *   executed with the formerly top level group (child) and the newly
>  	 *   created group (parent).
> +	 *
> +	 * * It is ensured, that the child is active, as this setup path is

First comma is not needed.

> +	 *   executed in hotplug prepare callback. This is exectued by an
> +	 *   already connected and !idle CPU. Even if all other CPUs go idle,
> +	 *   the CPU executing the setup will be responsible up to current top
> +	 *   level group.

And the next time it goes inactive, it will release the new childmask and parent
to subsequent walkers through this @child.

> +       * Therefore propagate active state unconditionally.

As for the change itself:

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

Thanks.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 4/8] timers/migration: Use a single struct for hierarchy walk data
  2024-07-01 10:18 ` [PATCH v3 4/8] timers/migration: Use a single struct for hierarchy walk data Anna-Maria Behnsen
@ 2024-07-02 11:43   ` Frederic Weisbecker
  2024-07-04 18:32   ` [tip: timers/core] " tip-bot2 for Anna-Maria Behnsen
  1 sibling, 0 replies; 32+ messages in thread
From: Frederic Weisbecker @ 2024-07-02 11:43 UTC (permalink / raw)
  To: Anna-Maria Behnsen; +Cc: Thomas Gleixner, linux-kernel

Le Mon, Jul 01, 2024 at 12:18:40PM +0200, Anna-Maria Behnsen a écrit :
> Two different structs are defined for propagating data from one to another
> level when walking the hierarchy. Several struct members exist in both
> structs which makes generalization harder.
> 
> Merge those two structs into a single one and use it directly in
> walk_groups() and the corresponding function pointers instead of
> introducing pointer casting all over the place.
> 
> Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
> ---
>  kernel/time/timer_migration.c | 126 ++++++++++++++++++------------------------
>  1 file changed, 55 insertions(+), 71 deletions(-)
> 
> diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
> index 0ae7f2084d27..b4391abfb4a9 100644
> --- a/kernel/time/timer_migration.c
> +++ b/kernel/time/timer_migration.c
> @@ -475,69 +475,31 @@ static bool tmigr_check_lonely(struct tmigr_group *group)
>  	return bitmap_weight(&active, BIT_CNT) <= 1;
>  }
>  
> -typedef bool (*up_f)(struct tmigr_group *, struct tmigr_group *, void *);
> -
> -static void __walk_groups(up_f up, void *data,
> -			  struct tmigr_cpu *tmc)
> -{
> -	struct tmigr_group *child = NULL, *group = tmc->tmgroup;
> -
> -	do {
> -		WARN_ON_ONCE(group->level >= tmigr_hierarchy_levels);
> -
> -		if (up(group, child, data))
> -			break;
> -
> -		child = group;
> -		group = group->parent;
> -	} while (group);
> -}
> -
> -static void walk_groups(up_f up, void *data, struct tmigr_cpu *tmc)
> -{
> -	lockdep_assert_held(&tmc->lock);
> -
> -	__walk_groups(up, data, tmc);
> -}
> -
>  /**
>   * struct tmigr_walk - data required for walking the hierarchy
>   * @nextexp:		Next CPU event expiry information which is handed into
>   *			the timer migration code by the timer code
>   *			(get_next_timer_interrupt())
> - * @firstexp:		Contains the first event expiry information when last
> - *			active CPU of hierarchy is on the way to idle to make
> - *			sure CPU will be back in time. It is updated in top
> - *			level group only. Be aware, there could occur a new top
> - *			level of the hierarchy between the 'top level call' in
> - *			tmigr_update_events() and the check for the parent group
> - *			in walk_groups(). Then @firstexp might contain a value
> - *			!= KTIME_MAX even if it was not the final top
> - *			level. This is not a problem, as the worst outcome is a
> - *			CPU which might wake up a little early.
> + * @firstexp:		Contains the first event expiry information when
> + *			hierarchy is completely idle.  When CPU itself was the
> + *			last going idle, information makes sure, that CPU will
> + *			be back in time. When using this value in the remote
> + *			expiry case, firstexp is stored in the per CPU tmigr_cpu
> + *			struct of CPU which expires remote timers. It is updated
> + *			in top level group only. Be aware, there could occur a
> + *			new top level of the hierarchy between the 'top level
> + *			call' in tmigr_update_events() and the check for the
> + *			parent group in walk_groups(). Then @firstexp might
> + *			contain a value != KTIME_MAX even if it was not the
> + *			final top level. This is not a problem, as the worst
> + *			outcome is a CPU which might wake up a little early.
>   * @evt:		Pointer to tmigr_event which needs to be queued (of idle
>   *			child group)
>   * @childmask:		childmask of child group
>   * @remote:		Is set, when the new timer path is executed in
>   *			tmigr_handle_remote_cpu()
> - */
> -struct tmigr_walk {
> -	u64			nextexp;
> -	u64			firstexp;
> -	struct tmigr_event	*evt;
> -	u8			childmask;
> -	bool			remote;
> -};
> -
> -/**
> - * struct tmigr_remote_data - data required for remote expiry hierarchy walk
>   * @basej:		timer base in jiffies
>   * @now:		timer base monotonic
> - * @firstexp:		returns expiry of the first timer in the idle timer
> - *			migration hierarchy to make sure the timer is handled in
> - *			time; it is stored in the per CPU tmigr_cpu struct of
> - *			CPU which expires remote timers
> - * @childmask:		childmask of child group
>   * @check:		is set if there is the need to handle remote timers;
>   *			required in tmigr_requires_handle_remote() only
>   * @tmc_active:		this flag indicates, whether the CPU which triggers
> @@ -546,15 +508,43 @@ struct tmigr_walk {
>   *			idle, only the first event of the top level has to be
>   *			considered.
>   */
> -struct tmigr_remote_data {
> -	unsigned long	basej;
> -	u64		now;
> -	u64		firstexp;
> -	u8		childmask;
> -	bool		check;
> -	bool		tmc_active;
> +struct tmigr_walk {
> +	u64			nextexp;
> +	u64			firstexp;
> +	struct tmigr_event	*evt;
> +	u8			childmask;
> +	bool			remote;
> +	unsigned long		basej;
> +	u64			now;
> +	bool			check;
> +	bool			tmc_active;


Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

Ideas for a subsequent patch:

Is tmc_active actually useful? It must always be true in
tmigr_requires_handle_remote_up() since that function is only called
on active CPUs.

In fact the following condition is dead code:

	/*
	 * When there is a parent group and the CPU which triggered the
	 * hierarchy walk is not active, proceed the walk to reach the top level
	 * group before reading the next_expiry value.
	 */
	if (group->parent && !data->tmc_active)
		goto out;

Thanks.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 5/8] timers/migration: Read childmask and parent pointer in a single place
  2024-07-01 10:18 ` [PATCH v3 5/8] timers/migration: Read childmask and parent pointer in a single place Anna-Maria Behnsen
@ 2024-07-02 12:04   ` Frederic Weisbecker
  2024-07-04 18:32   ` [tip: timers/core] " tip-bot2 for Anna-Maria Behnsen
  2024-07-15 19:28   ` [PATCH v3 5/8] " Frederic Weisbecker
  2 siblings, 0 replies; 32+ messages in thread
From: Frederic Weisbecker @ 2024-07-02 12:04 UTC (permalink / raw)
  To: Anna-Maria Behnsen; +Cc: Thomas Gleixner, linux-kernel

Le Mon, Jul 01, 2024 at 12:18:41PM +0200, Anna-Maria Behnsen a écrit :
> Reading the childmask and parent pointer is required when propagating
> changes through the hierarchy. At the moment this reads are spread all over

*these

> the place which makes it harder to follow.
> 
> Move those reads to a single place to keep code clean.
> 
> Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 6/8] timers/migration: Rename childmask by parentmask to make naming more obvious
  2024-07-01 10:18 ` [PATCH v3 6/8] timers/migration: Rename childmask by parentmask to make naming more obvious Anna-Maria Behnsen
@ 2024-07-02 12:45   ` Frederic Weisbecker
  2024-07-03 20:31   ` [PATCH v4 6/8] timers/migration: Rename childmask by groupmask " Anna-Maria Behnsen
  1 sibling, 0 replies; 32+ messages in thread
From: Frederic Weisbecker @ 2024-07-02 12:45 UTC (permalink / raw)
  To: Anna-Maria Behnsen; +Cc: Thomas Gleixner, linux-kernel

Le Mon, Jul 01, 2024 at 12:18:42PM +0200, Anna-Maria Behnsen a écrit :
> childmask in the group reflects the mask that is required to 'reference'
> this group in the parent. When reading childmask, this might be confusing,
> as this suggests, that this is the mask of the child of the group.

OTOH ->parentmask could suggest this is the mask of the parent of the
group. I'm not sure which one is better. ->childmask never sounded ambiguous
to me so far.

How about ->grpmask ? (deliberately stolen from RCU's node tree).

Thanks.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 8/8] timers/migration: Fix grammar in comment
  2024-07-01 10:18 ` [PATCH v3 8/8] timers/migration: Fix grammar in comment Anna-Maria Behnsen
@ 2024-07-02 12:51   ` Frederic Weisbecker
  2024-07-04 18:32   ` [tip: timers/core] " tip-bot2 for Anna-Maria Behnsen
  1 sibling, 0 replies; 32+ messages in thread
From: Frederic Weisbecker @ 2024-07-02 12:51 UTC (permalink / raw)
  To: Anna-Maria Behnsen; +Cc: Thomas Gleixner, linux-kernel

Le Mon, Jul 01, 2024 at 12:18:44PM +0200, Anna-Maria Behnsen a écrit :
> Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v4 2/8] timers/migration: Move hierarchy setup into cpuhotplug prepare callback
  2024-07-01 10:18 ` [PATCH v3 2/8] timers/migration: Move hierarchy setup into cpuhotplug prepare callback Anna-Maria Behnsen
  2024-07-01 21:49   ` Frederic Weisbecker
@ 2024-07-03 20:28   ` Anna-Maria Behnsen
  2024-07-03 21:24     ` Frederic Weisbecker
                       ` (2 more replies)
  1 sibling, 3 replies; 32+ messages in thread
From: Anna-Maria Behnsen @ 2024-07-03 20:28 UTC (permalink / raw)
  To: Frederic Weisbecker, Thomas Gleixner, linux-kernel; +Cc: Anna-Maria Behnsen

When a CPU comes online the first time, it is possible that a new top level
group will be created. In general all propagation is done from the bottom
to top. This minimizes complexity and prevents possible races. But when a
new top level group is created, the formely top level group needs to be
connected to the new level. This is the only time, when the direction to
propagate changes is changed: the changes are propagated from top (new top
level group) to bottom (formerly top level group).

This introduces two races (see (A) and (B)) as reported by Frederic:

(A) This race happens, when marking the formely top level group as active,
but the last active CPU of the formerly top level group goes idle. Then
it's likely that formerly group is no longer active, but marked
nevertheless as active in new top level group:

		  [GRP0:0]
	       migrator = 0
	       active   = 0
	       nextevt  = KTIME_MAX
	       /         \
	      0         1 .. 7
	  active         idle

0) Hierarchy has for now only 8 CPUs and CPU 0 is the only active CPU.

			     [GRP1:0]
			migrator = TMIGR_NONE
			active   = NONE
			nextevt  = KTIME_MAX
					 \
		 [GRP0:0]                  [GRP0:1]
	      migrator = 0              migrator = TMIGR_NONE
	      active   = 0              active   = NONE
	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
		/         \
	      0          1 .. 7                8
	  active         idle                !online

1) CPU 8 is booting and creates a new group in first level GRP0:1 and
   therefore also a new top group GRP1:0. For now the setup code proceeded
   only until the connected between GRP0:1 to the new top group. The
   connection between CPU8 and GRP0:1 is not yet established and CPU 8 is
   still !online.

			     [GRP1:0]
			migrator = TMIGR_NONE
			active   = NONE
			nextevt  = KTIME_MAX
		       /                  \
		 [GRP0:0]                  [GRP0:1]
	      migrator = 0              migrator = TMIGR_NONE
	      active   = 0              active   = NONE
	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
		/         \
	      0          1 .. 7                8
	  active         idle                !online

2) Setup code now connects GRP0:0 to GRP1:0 and observes while in
   tmigr_connect_child_parent() that GRP0:0 is not TMIGR_NONE. So it
   prepares to call tmigr_active_up() on it. It hasn't done it yet.

			     [GRP1:0]
			migrator = TMIGR_NONE
			active   = NONE
			nextevt  = KTIME_MAX
		       /                  \
		 [GRP0:0]                  [GRP0:1]
	      migrator = TMIGR_NONE        migrator = TMIGR_NONE
	      active   = NONE              active   = NONE
	      nextevt  = KTIME_MAX         nextevt  = KTIME_MAX
		/         \
	      0          1 .. 7                8
	    idle         idle                !online

3) CPU 0 goes idle. Since GRP0:0->parent has been updated by CPU 8 with
   GRP0:0->lock held, CPU 0 observes GRP1:0 after calling
   tmigr_update_events() and it propagates the change to the top (no change
   there and no wakeup programmed since there is no timer).

			     [GRP1:0]
			migrator = GRP0:0
			active   = GRP0:0
			nextevt  = KTIME_MAX
		       /                  \
		 [GRP0:0]                  [GRP0:1]
	      migrator = TMIGR_NONE       migrator = TMIGR_NONE
	      active   = NONE             active   = NONE
	      nextevt  = KTIME_MAX        nextevt  = KTIME_MAX
		/         \
	      0          1 .. 7                8
	    idle         idle                !online

4) Now the setup code finally calls tmigr_active_up() to and sets GRP0:0
   active in GRP1:0

			     [GRP1:0]
			migrator = GRP0:0
			active   = GRP0:0, GRP0:1
			nextevt  = KTIME_MAX
		       /                  \
		 [GRP0:0]                  [GRP0:1]
	      migrator = TMIGR_NONE       migrator = 8
	      active   = NONE             active   = 8
	      nextevt  = KTIME_MAX        nextevt  = KTIME_MAX
		/         \                    |
	      0          1 .. 7                8
	    idle         idle                active

5) Now CPU 8 is connected with GRP0:1 and CPU 8 calls tmigr_active_up() out
   of tmigr_cpu_online().

			     [GRP1:0]
			migrator = GRP0:0
			active   = GRP0:0
			nextevt  = T8
		       /                  \
		 [GRP0:0]                  [GRP0:1]
	      migrator = TMIGR_NONE         migrator = TMIGR_NONE
	      active   = NONE               active   = NONE
	      nextevt  = KTIME_MAX          nextevt  = T8
		/         \                    |
	      0          1 .. 7                8
	    idle         idle                  idle

5) CPU 8 goes idle with a timer T8 and relies on GRP0:0 as the migrator.
   But it's not really active, so T8 gets ignored.

--> The update which is done in third step is not noticed by setup code. So
    a wrong migrator is set to top level group and a timer could get
    ignored.

(B) Reading group->parent and group->childmask when an hierarchy update is
ongoing and reaches the formerly top level group is racy as those values
could be inconsistent. (The notation of migrator and active now slightly
changes in contrast to the above example, as now the childmasks are used.)

			     [GRP1:0]
			migrator = TMIGR_NONE
			active   = 0x00
			nextevt  = KTIME_MAX
					 \
		 [GRP0:0]                  [GRP0:1]
	      migrator = TMIGR_NONE     migrator = TMIGR_NONE
	      active   = 0x00           active   = 0x00
	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
	      childmask= 0		childmask= 1
	      parent   = NULL		parent   = GRP1:0
		/         \
	      0          1 .. 7                8
	  idle           idle                !online
	  childmask=1

1) Hierarchy has 8 CPUs. CPU 8 is at the moment in the process of onlining
   but did not yet connect GRP0:0 to GRP1:0.

			     [GRP1:0]
			migrator = TMIGR_NONE
			active   = 0x00
			nextevt  = KTIME_MAX
		       /                  \
		 [GRP0:0]                  [GRP0:1]
	      migrator = TMIGR_NONE     migrator = TMIGR_NONE
	      active   = 0x00           active   = 0x00
	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
	      childmask= 0		childmask= 1
	      parent   = GRP1:0		parent   = GRP1:0
		/         \
	      0          1 .. 7                8
	  idle           idle                !online
	  childmask=1

2) Setup code (running on CPU 8) now connects GRP0:0 to GRP1:0, updates
   parent pointer of GRP0:0 and ...

			     [GRP1:0]
			migrator = TMIGR_NONE
			active   = 0x00
			nextevt  = KTIME_MAX
		       /                  \
		 [GRP0:0]                  [GRP0:1]
	      migrator = 0x01           migrator = TMIGR_NONE
	      active   = 0x01           active   = 0x00
	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
	      childmask= 0		childmask= 1
	      parent   = GRP1:0		parent   = GRP1:0
		/         \
	      0          1 .. 7                8
	  active          idle                !online
	  childmask=1

	  tmigr_walk.childmask = 0

3) ... CPU 0 comes active in the same time. As migrator in GRP0:0 was
   TMIGR_NONE, childmask of GRP0:0 is stored in update propagation data
   structure tmigr_walk (as update of childmask is not yet
   visible/updated). And now ...

			     [GRP1:0]
			migrator = TMIGR_NONE
			active   = 0x00
			nextevt  = KTIME_MAX
		       /                  \
		 [GRP0:0]                  [GRP0:1]
	      migrator = 0x01           migrator = TMIGR_NONE
	      active   = 0x01           active   = 0x00
	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
	      childmask= 2		childmask= 1
	      parent   = GRP1:0		parent   = GRP1:0
		/         \
	      0          1 .. 7                8
	  active          idle                !online
	  childmask=1

	  tmigr_walk.childmask = 0

4) ... childmask of GRP0:0 is updated by CPU 8 (still part of setup
   code).

			     [GRP1:0]
			migrator = 0x00
			active   = 0x00
			nextevt  = KTIME_MAX
		       /                  \
		 [GRP0:0]                  [GRP0:1]
	      migrator = 0x01           migrator = TMIGR_NONE
	      active   = 0x01           active   = 0x00
	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
	      childmask= 2		childmask= 1
	      parent   = GRP1:0		parent   = GRP1:0
		/         \
	      0          1 .. 7                8
	  active          idle                !online
	  childmask=1

	  tmigr_walk.childmask = 0

5) CPU 0 sees the connection to GRP1:0 and now propagates active state to
   GRP1:0 but with childmask = 0 as stored in propagation data structure.

--> Now GRP1:0 always has a migrator as 0x00 != TMIGR_NONE and for all CPUs
    it looks like GRP1:0 is always active.

To prevent those races, the setup of the hierarchy is moved into the
cpuhotplug prepare callback. The prepare callback is not executed by the
CPU which will come online, it is executed by the CPU which prepares
onlining of the other CPU. This CPU is active while it is connecting the
formerly top level to the new one. This prevents from (A) to happen and it
also prevents from any further walk above the formerly top level until that
active CPU becomes inactive, releasing the new ->parent and ->childmask
updates to be visible by any subsequent walk up above the formerly top
level hierarchy. This prevents from (B) to happen. The direction for the
updates is now forced to look like "from bottom to top".

However if the active CPU prevents from tmigr_cpu_(in)active() to walk up
with the update not-or-half visible, nothing prevents walking up to the new
top with a 0 childmask in tmigr_handle_remote_up() or
tmigr_requires_handle_remote_up() if the active CPU doing the prepare is
not the migrator. But then it looks fine because:

  * tmigr_check_migrator() should just return false
  * The migrator is active and should eventually observe the new childmask
    at some point in a future tick.

Split setup functionality of online callback into the cpuhotplug prepare
callback and setup hotplug state. Reorder the code, that all prepare
related functions are close to each other and online and offline callbacks
are also close together.

Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model")
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
---
 include/linux/cpuhotplug.h    |   1 +
 kernel/time/timer_migration.c | 181 +++++++++++++++++++---------------
 2 files changed, 101 insertions(+), 81 deletions(-)

diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 7a5785f405b6..df59666a2a66 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -122,6 +122,7 @@ enum cpuhp_state {
 	CPUHP_KVM_PPC_BOOK3S_PREPARE,
 	CPUHP_ZCOMP_PREPARE,
 	CPUHP_TIMERS_PREPARE,
+	CPUHP_TMIGR_PREPARE,
 	CPUHP_MIPS_SOC_PREPARE,
 	CPUHP_BP_PREPARE_DYN,
 	CPUHP_BP_PREPARE_DYN_END		= CPUHP_BP_PREPARE_DYN + 20,
diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index d91efe1dc3bf..9b86efded4d5 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -1438,6 +1438,66 @@ u64 tmigr_quick_check(u64 nextevt)
 	return KTIME_MAX;
 }
 
+/*
+ * tmigr_trigger_active() - trigger a CPU to become active again
+ *
+ * This function is executed on a CPU which is part of cpu_online_mask, when the
+ * last active CPU in the hierarchy is offlining. With this, it is ensured that
+ * the other CPU is active and takes over the migrator duty.
+ */
+static long tmigr_trigger_active(void *unused)
+{
+	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
+
+	WARN_ON_ONCE(!tmc->online || tmc->idle);
+
+	return 0;
+}
+
+static int tmigr_cpu_offline(unsigned int cpu)
+{
+	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
+	int migrator;
+	u64 firstexp;
+
+	raw_spin_lock_irq(&tmc->lock);
+	tmc->online = false;
+	WRITE_ONCE(tmc->wakeup, KTIME_MAX);
+
+	/*
+	 * CPU has to handle the local events on his own, when on the way to
+	 * offline; Therefore nextevt value is set to KTIME_MAX
+	 */
+	firstexp = __tmigr_cpu_deactivate(tmc, KTIME_MAX);
+	trace_tmigr_cpu_offline(tmc);
+	raw_spin_unlock_irq(&tmc->lock);
+
+	if (firstexp != KTIME_MAX) {
+		migrator = cpumask_any_but(cpu_online_mask, cpu);
+		work_on_cpu(migrator, tmigr_trigger_active, NULL);
+	}
+
+	return 0;
+}
+
+static int tmigr_cpu_online(unsigned int cpu)
+{
+	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
+
+	/* Check whether CPU data was successfully initialized */
+	if (WARN_ON_ONCE(!tmc->tmgroup))
+		return -EINVAL;
+
+	raw_spin_lock_irq(&tmc->lock);
+	trace_tmigr_cpu_online(tmc);
+	tmc->idle = timer_base_is_idle();
+	if (!tmc->idle)
+		__tmigr_cpu_activate(tmc);
+	tmc->online = true;
+	raw_spin_unlock_irq(&tmc->lock);
+	return 0;
+}
+
 static void tmigr_init_group(struct tmigr_group *group, unsigned int lvl,
 			     int node)
 {
@@ -1512,7 +1572,7 @@ static struct tmigr_group *tmigr_get_group(unsigned int cpu, int node,
 static void tmigr_connect_child_parent(struct tmigr_group *child,
 				       struct tmigr_group *parent)
 {
-	union tmigr_state childstate;
+	struct tmigr_walk data;
 
 	raw_spin_lock_irq(&child->lock);
 	raw_spin_lock_nested(&parent->lock, SINGLE_DEPTH_NESTING);
@@ -1540,22 +1600,24 @@ static void tmigr_connect_child_parent(struct tmigr_group *child,
 	 *   child to the new parent. So tmigr_connect_child_parent() is
 	 *   executed with the formerly top level group (child) and the newly
 	 *   created group (parent).
+	 *
+	 * * It is ensured that the child is active, as this setup path is
+	 *   executed in hotplug prepare callback. This is exectued by an
+	 *   already connected and !idle CPU. Even if all other CPUs go idle,
+	 *   the CPU executing the setup will be responsible up to current top
+	 *   level group. And the next time it goes inactive, it will release
+	 *   the new childmask and parent to subsequent walkers through this
+	 *   @child. Therefore propagate active state unconditionally.
 	 */
-	childstate.state = atomic_read(&child->migr_state);
-	if (childstate.migrator != TMIGR_NONE) {
-		struct tmigr_walk data;
-
-		data.childmask = child->childmask;
+	data.childmask = child->childmask;
 
-		/*
-		 * There is only one new level per time (which is protected by
-		 * tmigr_mutex). When connecting the child and the parent and
-		 * set the child active when the parent is inactive, the parent
-		 * needs to be the uppermost level. Otherwise there went
-		 * something wrong!
-		 */
-		WARN_ON(!tmigr_active_up(parent, child, &data) && parent->parent);
-	}
+	/*
+	 * There is only one new level per time (which is protected by
+	 * tmigr_mutex). When connecting the child and the parent and set the
+	 * child active when the parent is inactive, the parent needs to be the
+	 * uppermost level. Otherwise there went something wrong!
+	 */
+	WARN_ON(!tmigr_active_up(parent, child, &data) && parent->parent);
 }
 
 static int tmigr_setup_groups(unsigned int cpu, unsigned int node)
@@ -1661,80 +1723,32 @@ static int tmigr_add_cpu(unsigned int cpu)
 	return ret;
 }
 
-static int tmigr_cpu_online(unsigned int cpu)
+static int tmigr_cpu_prepare(unsigned int cpu)
 {
-	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
-	int ret;
-
-	/* First online attempt? Initialize CPU data */
-	if (!tmc->tmgroup) {
-		raw_spin_lock_init(&tmc->lock);
-
-		ret = tmigr_add_cpu(cpu);
-		if (ret < 0)
-			return ret;
-
-		if (tmc->childmask == 0)
-			return -EINVAL;
+	struct tmigr_cpu *tmc = per_cpu_ptr(&tmigr_cpu, cpu);
+	int ret = 0;
 
-		timerqueue_init(&tmc->cpuevt.nextevt);
-		tmc->cpuevt.nextevt.expires = KTIME_MAX;
-		tmc->cpuevt.ignore = true;
-		tmc->cpuevt.cpu = cpu;
-
-		tmc->remote = false;
-		WRITE_ONCE(tmc->wakeup, KTIME_MAX);
-	}
-	raw_spin_lock_irq(&tmc->lock);
-	trace_tmigr_cpu_online(tmc);
-	tmc->idle = timer_base_is_idle();
-	if (!tmc->idle)
-		__tmigr_cpu_activate(tmc);
-	tmc->online = true;
-	raw_spin_unlock_irq(&tmc->lock);
-	return 0;
-}
-
-/*
- * tmigr_trigger_active() - trigger a CPU to become active again
- *
- * This function is executed on a CPU which is part of cpu_online_mask, when the
- * last active CPU in the hierarchy is offlining. With this, it is ensured that
- * the other CPU is active and takes over the migrator duty.
- */
-static long tmigr_trigger_active(void *unused)
-{
-	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
+	/* Not first online attempt? */
+	if (tmc->tmgroup)
+		return ret;
 
-	WARN_ON_ONCE(!tmc->online || tmc->idle);
+	raw_spin_lock_init(&tmc->lock);
 
-	return 0;
-}
+	ret = tmigr_add_cpu(cpu);
+	if (ret < 0)
+		return ret;
 
-static int tmigr_cpu_offline(unsigned int cpu)
-{
-	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
-	int migrator;
-	u64 firstexp;
+	if (tmc->childmask == 0)
+		return -EINVAL;
 
-	raw_spin_lock_irq(&tmc->lock);
-	tmc->online = false;
+	timerqueue_init(&tmc->cpuevt.nextevt);
+	tmc->cpuevt.nextevt.expires = KTIME_MAX;
+	tmc->cpuevt.ignore = true;
+	tmc->cpuevt.cpu = cpu;
+	tmc->remote = false;
 	WRITE_ONCE(tmc->wakeup, KTIME_MAX);
 
-	/*
-	 * CPU has to handle the local events on his own, when on the way to
-	 * offline; Therefore nextevt value is set to KTIME_MAX
-	 */
-	firstexp = __tmigr_cpu_deactivate(tmc, KTIME_MAX);
-	trace_tmigr_cpu_offline(tmc);
-	raw_spin_unlock_irq(&tmc->lock);
-
-	if (firstexp != KTIME_MAX) {
-		migrator = cpumask_any_but(cpu_online_mask, cpu);
-		work_on_cpu(migrator, tmigr_trigger_active, NULL);
-	}
-
-	return 0;
+	return ret;
 }
 
 static int __init tmigr_init(void)
@@ -1793,6 +1807,11 @@ static int __init tmigr_init(void)
 		tmigr_hierarchy_levels, TMIGR_CHILDREN_PER_GROUP,
 		tmigr_crossnode_level);
 
+	ret = cpuhp_setup_state(CPUHP_AP_TMIGR_ONLINE, "tmigr:prepare",
+				tmigr_cpu_prepare, NULL);
+	if (ret)
+		goto err;
+
 	ret = cpuhp_setup_state(CPUHP_AP_TMIGR_ONLINE, "tmigr:online",
 				tmigr_cpu_online, tmigr_cpu_offline);
 	if (ret)
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v4 6/8] timers/migration: Rename childmask by groupmask to make naming more obvious
  2024-07-01 10:18 ` [PATCH v3 6/8] timers/migration: Rename childmask by parentmask to make naming more obvious Anna-Maria Behnsen
  2024-07-02 12:45   ` Frederic Weisbecker
@ 2024-07-03 20:31   ` Anna-Maria Behnsen
  2024-07-03 21:33     ` Frederic Weisbecker
  2024-07-04 18:32     ` [tip: timers/core] " tip-bot2 for Anna-Maria Behnsen
  1 sibling, 2 replies; 32+ messages in thread
From: Anna-Maria Behnsen @ 2024-07-03 20:31 UTC (permalink / raw)
  To: Frederic Weisbecker, Thomas Gleixner, linux-kernel; +Cc: Anna-Maria Behnsen

childmask in the group reflects the mask that is required to 'reference'
this group in the parent. When reading childmask, this might be confusing,
as this suggests, that this is the mask of the child of the group.

Clarify this by renaming childmask in the tmigr_group and tmc_group by
groupmask.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
---
 include/trace/events/timer_migration.h |  4 ++--
 kernel/time/timer_migration.c          | 24 ++++++++++++------------
 kernel/time/timer_migration.h          | 15 +++++++--------
 3 files changed, 21 insertions(+), 22 deletions(-)

diff --git a/include/trace/events/timer_migration.h b/include/trace/events/timer_migration.h
index 79f19e76a80b..f1a447f43f72 100644
--- a/include/trace/events/timer_migration.h
+++ b/include/trace/events/timer_migration.h
@@ -52,7 +52,7 @@ TRACE_EVENT(tmigr_connect_child_parent,
 		__entry->lvl		= child->parent->level;
 		__entry->numa_node	= child->parent->numa_node;
 		__entry->num_children	= child->parent->num_children;
-		__entry->childmask	= child->childmask;
+		__entry->childmask	= child->groupmask;
 	),
 
 	TP_printk("group=%p childmask=%0x parent=%p lvl=%d numa=%d num_children=%d",
@@ -81,7 +81,7 @@ TRACE_EVENT(tmigr_connect_cpu_parent,
 		__entry->lvl		= tmc->tmgroup->level;
 		__entry->numa_node	= tmc->tmgroup->numa_node;
 		__entry->num_children	= tmc->tmgroup->num_children;
-		__entry->childmask	= tmc->childmask;
+		__entry->childmask	= tmc->groupmask;
 	),
 
 	TP_printk("cpu=%d childmask=%0x parent=%p lvl=%d numa=%d num_children=%d",
diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 1fb930537f33..9ff61f9912da 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -495,7 +495,7 @@ static bool tmigr_check_lonely(struct tmigr_group *group)
  *			outcome is a CPU which might wake up a little early.
  * @evt:		Pointer to tmigr_event which needs to be queued (of idle
  *			child group)
- * @childmask:		childmask of child group
+ * @childmask:		groupmask of child group
  * @remote:		Is set, when the new timer path is executed in
  *			tmigr_handle_remote_cpu()
  * @basej:		timer base in jiffies
@@ -535,7 +535,7 @@ static void __walk_groups(up_f up, struct tmigr_walk *data,
 
 		child = group;
 		group = group->parent;
-		data->childmask = group->childmask;
+		data->childmask = group->groupmask;
 	} while (group);
 }
 
@@ -669,7 +669,7 @@ static void __tmigr_cpu_activate(struct tmigr_cpu *tmc)
 {
 	struct tmigr_walk data;
 
-	data.childmask = tmc->childmask;
+	data.childmask = tmc->groupmask;
 
 	trace_tmigr_cpu_active(tmc);
 
@@ -1049,7 +1049,7 @@ void tmigr_handle_remote(void)
 	if (tmigr_is_not_available(tmc))
 		return;
 
-	data.childmask = tmc->childmask;
+	data.childmask = tmc->groupmask;
 	data.firstexp = KTIME_MAX;
 
 	/*
@@ -1057,7 +1057,7 @@ void tmigr_handle_remote(void)
 	 * in tmigr_handle_remote_up() anyway. Keep this check to speed up the
 	 * return when nothing has to be done.
 	 */
-	if (!tmigr_check_migrator(tmc->tmgroup, tmc->childmask)) {
+	if (!tmigr_check_migrator(tmc->tmgroup, tmc->groupmask)) {
 		/*
 		 * If this CPU was an idle migrator, make sure to clear its wakeup
 		 * value so it won't chase timers that have already expired elsewhere.
@@ -1150,7 +1150,7 @@ bool tmigr_requires_handle_remote(void)
 		return ret;
 
 	data.now = get_jiffies_update(&jif);
-	data.childmask = tmc->childmask;
+	data.childmask = tmc->groupmask;
 	data.firstexp = KTIME_MAX;
 	data.tmc_active = !tmc->idle;
 	data.check = false;
@@ -1310,7 +1310,7 @@ static u64 __tmigr_cpu_deactivate(struct tmigr_cpu *tmc, u64 nextexp)
 	struct tmigr_walk data = { .nextexp = nextexp,
 				   .firstexp = KTIME_MAX,
 				   .evt = &tmc->cpuevt,
-				   .childmask = tmc->childmask };
+				   .childmask = tmc->groupmask };
 
 	/*
 	 * If nextexp is KTIME_MAX, the CPU event will be ignored because the
@@ -1388,7 +1388,7 @@ u64 tmigr_quick_check(u64 nextevt)
 	if (WARN_ON_ONCE(tmc->idle))
 		return nextevt;
 
-	if (!tmigr_check_migrator_and_lonely(tmc->tmgroup, tmc->childmask))
+	if (!tmigr_check_migrator_and_lonely(tmc->tmgroup, tmc->groupmask))
 		return KTIME_MAX;
 
 	do {
@@ -1551,7 +1551,7 @@ static void tmigr_connect_child_parent(struct tmigr_group *child,
 	raw_spin_lock_nested(&parent->lock, SINGLE_DEPTH_NESTING);
 
 	child->parent = parent;
-	child->childmask = BIT(parent->num_children++);
+	child->groupmask = BIT(parent->num_children++);
 
 	raw_spin_unlock(&parent->lock);
 	raw_spin_unlock_irq(&child->lock);
@@ -1582,7 +1582,7 @@ static void tmigr_connect_child_parent(struct tmigr_group *child,
 	 *   the new childmask and parent to subsequent walkers through this
 	 *   @child. Therefore propagate active state unconditionally.
 	 */
-	data.childmask = child->childmask;
+	data.childmask = child->groupmask;
 
 	/*
 	 * There is only one new level per time (which is protected by
@@ -1648,7 +1648,7 @@ static int tmigr_setup_groups(unsigned int cpu, unsigned int node)
 			raw_spin_lock_irq(&group->lock);
 
 			tmc->tmgroup = group;
-			tmc->childmask = BIT(group->num_children++);
+			tmc->groupmask = BIT(group->num_children++);
 
 			raw_spin_unlock_irq(&group->lock);
 
@@ -1711,7 +1711,7 @@ static int tmigr_cpu_prepare(unsigned int cpu)
 	if (ret < 0)
 		return ret;
 
-	if (tmc->childmask == 0)
+	if (tmc->groupmask == 0)
 		return -EINVAL;
 
 	timerqueue_init(&tmc->cpuevt.nextevt);
diff --git a/kernel/time/timer_migration.h b/kernel/time/timer_migration.h
index 494f68cc13f4..154accc7a543 100644
--- a/kernel/time/timer_migration.h
+++ b/kernel/time/timer_migration.h
@@ -51,9 +51,8 @@ struct tmigr_event {
  * @num_children:	Counter of group children to make sure the group is only
  *			filled with TMIGR_CHILDREN_PER_GROUP; Required for setup
  *			only
- * @childmask:		childmask of the group in the parent group; is set
- *			during setup and will never change; can be read
- *			lockless
+ * @groupmask:		mask of the group in the parent group; is set during
+ *			setup and will never change; can be read lockless
  * @list:		List head that is added to the per level
  *			tmigr_level_list; is required during setup when a
  *			new group needs to be connected to the existing
@@ -69,7 +68,7 @@ struct tmigr_group {
 	unsigned int		level;
 	int			numa_node;
 	unsigned int		num_children;
-	u8			childmask;
+	u8			groupmask;
 	struct list_head	list;
 };
 
@@ -89,7 +88,7 @@ struct tmigr_group {
  *			hierarchy
  * @remote:		Is set when timers of the CPU are expired remotely
  * @tmgroup:		Pointer to the parent group
- * @childmask:		childmask of tmigr_cpu in the parent group
+ * @groupmask:		mask of tmigr_cpu in the parent group
  * @wakeup:		Stores the first timer when the timer migration
  *			hierarchy is completely idle and remote expiry was done;
  *			is returned to timer code in the idle path and is only
@@ -102,7 +101,7 @@ struct tmigr_cpu {
 	bool			idle;
 	bool			remote;
 	struct tmigr_group	*tmgroup;
-	u8			childmask;
+	u8			groupmask;
 	u64			wakeup;
 	struct tmigr_event	cpuevt;
 };
@@ -118,8 +117,8 @@ union tmigr_state {
 	u32 state;
 	/**
 	 * struct - split state of tmigr_group
-	 * @active:	Contains each childmask bit of the active children
-	 * @migrator:	Contains childmask of the child which is migrator
+	 * @active:	Contains each mask bit of the active children
+	 * @migrator:	Contains mask of the child which is migrator
 	 * @seq:	Sequence counter needs to be increased when an update
 	 *		to the tmigr_state is done. It prevents a race when
 	 *		updates in the child groups are propagated in changed
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 2/8] timers/migration: Move hierarchy setup into cpuhotplug prepare callback
  2024-07-03 20:28   ` [PATCH v4 " Anna-Maria Behnsen
@ 2024-07-03 21:24     ` Frederic Weisbecker
  2024-07-04 18:32     ` [tip: timers/core] " tip-bot2 for Anna-Maria Behnsen
  2024-07-11  8:56     ` [PATCH v4 2/8] " Alexander Stein
  2 siblings, 0 replies; 32+ messages in thread
From: Frederic Weisbecker @ 2024-07-03 21:24 UTC (permalink / raw)
  To: Anna-Maria Behnsen; +Cc: Thomas Gleixner, linux-kernel

Le Wed, Jul 03, 2024 at 10:28:39PM +0200, Anna-Maria Behnsen a écrit :
> When a CPU comes online the first time, it is possible that a new top level
> group will be created. In general all propagation is done from the bottom
> to top. This minimizes complexity and prevents possible races. But when a
> new top level group is created, the formely top level group needs to be
> connected to the new level. This is the only time, when the direction to
> propagate changes is changed: the changes are propagated from top (new top
> level group) to bottom (formerly top level group).
> 
> This introduces two races (see (A) and (B)) as reported by Frederic:
> 
> (A) This race happens, when marking the formely top level group as active,
> but the last active CPU of the formerly top level group goes idle. Then
> it's likely that formerly group is no longer active, but marked
> nevertheless as active in new top level group:
> 
> 		  [GRP0:0]
> 	       migrator = 0
> 	       active   = 0
> 	       nextevt  = KTIME_MAX
> 	       /         \
> 	      0         1 .. 7
> 	  active         idle
> 
> 0) Hierarchy has for now only 8 CPUs and CPU 0 is the only active CPU.
> 
> 			     [GRP1:0]
> 			migrator = TMIGR_NONE
> 			active   = NONE
> 			nextevt  = KTIME_MAX
> 					 \
> 		 [GRP0:0]                  [GRP0:1]
> 	      migrator = 0              migrator = TMIGR_NONE
> 	      active   = 0              active   = NONE
> 	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
> 		/         \
> 	      0          1 .. 7                8
> 	  active         idle                !online
> 
> 1) CPU 8 is booting and creates a new group in first level GRP0:1 and
>    therefore also a new top group GRP1:0. For now the setup code proceeded
>    only until the connected between GRP0:1 to the new top group. The
>    connection between CPU8 and GRP0:1 is not yet established and CPU 8 is
>    still !online.
> 
> 			     [GRP1:0]
> 			migrator = TMIGR_NONE
> 			active   = NONE
> 			nextevt  = KTIME_MAX
> 		       /                  \
> 		 [GRP0:0]                  [GRP0:1]
> 	      migrator = 0              migrator = TMIGR_NONE
> 	      active   = 0              active   = NONE
> 	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
> 		/         \
> 	      0          1 .. 7                8
> 	  active         idle                !online
> 
> 2) Setup code now connects GRP0:0 to GRP1:0 and observes while in
>    tmigr_connect_child_parent() that GRP0:0 is not TMIGR_NONE. So it
>    prepares to call tmigr_active_up() on it. It hasn't done it yet.
> 
> 			     [GRP1:0]
> 			migrator = TMIGR_NONE
> 			active   = NONE
> 			nextevt  = KTIME_MAX
> 		       /                  \
> 		 [GRP0:0]                  [GRP0:1]
> 	      migrator = TMIGR_NONE        migrator = TMIGR_NONE
> 	      active   = NONE              active   = NONE
> 	      nextevt  = KTIME_MAX         nextevt  = KTIME_MAX
> 		/         \
> 	      0          1 .. 7                8
> 	    idle         idle                !online
> 
> 3) CPU 0 goes idle. Since GRP0:0->parent has been updated by CPU 8 with
>    GRP0:0->lock held, CPU 0 observes GRP1:0 after calling
>    tmigr_update_events() and it propagates the change to the top (no change
>    there and no wakeup programmed since there is no timer).
> 
> 			     [GRP1:0]
> 			migrator = GRP0:0
> 			active   = GRP0:0
> 			nextevt  = KTIME_MAX
> 		       /                  \
> 		 [GRP0:0]                  [GRP0:1]
> 	      migrator = TMIGR_NONE       migrator = TMIGR_NONE
> 	      active   = NONE             active   = NONE
> 	      nextevt  = KTIME_MAX        nextevt  = KTIME_MAX
> 		/         \
> 	      0          1 .. 7                8
> 	    idle         idle                !online
> 
> 4) Now the setup code finally calls tmigr_active_up() to and sets GRP0:0
>    active in GRP1:0
> 
> 			     [GRP1:0]
> 			migrator = GRP0:0
> 			active   = GRP0:0, GRP0:1
> 			nextevt  = KTIME_MAX
> 		       /                  \
> 		 [GRP0:0]                  [GRP0:1]
> 	      migrator = TMIGR_NONE       migrator = 8
> 	      active   = NONE             active   = 8
> 	      nextevt  = KTIME_MAX        nextevt  = KTIME_MAX
> 		/         \                    |
> 	      0          1 .. 7                8
> 	    idle         idle                active
> 
> 5) Now CPU 8 is connected with GRP0:1 and CPU 8 calls tmigr_active_up() out
>    of tmigr_cpu_online().
> 
> 			     [GRP1:0]
> 			migrator = GRP0:0
> 			active   = GRP0:0
> 			nextevt  = T8
> 		       /                  \
> 		 [GRP0:0]                  [GRP0:1]
> 	      migrator = TMIGR_NONE         migrator = TMIGR_NONE
> 	      active   = NONE               active   = NONE
> 	      nextevt  = KTIME_MAX          nextevt  = T8
> 		/         \                    |
> 	      0          1 .. 7                8
> 	    idle         idle                  idle
> 
> 5) CPU 8 goes idle with a timer T8 and relies on GRP0:0 as the migrator.
>    But it's not really active, so T8 gets ignored.
> 
> --> The update which is done in third step is not noticed by setup code. So
>     a wrong migrator is set to top level group and a timer could get
>     ignored.
> 
> (B) Reading group->parent and group->childmask when an hierarchy update is
> ongoing and reaches the formerly top level group is racy as those values
> could be inconsistent. (The notation of migrator and active now slightly
> changes in contrast to the above example, as now the childmasks are used.)
> 
> 			     [GRP1:0]
> 			migrator = TMIGR_NONE
> 			active   = 0x00
> 			nextevt  = KTIME_MAX
> 					 \
> 		 [GRP0:0]                  [GRP0:1]
> 	      migrator = TMIGR_NONE     migrator = TMIGR_NONE
> 	      active   = 0x00           active   = 0x00
> 	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
> 	      childmask= 0		childmask= 1
> 	      parent   = NULL		parent   = GRP1:0
> 		/         \
> 	      0          1 .. 7                8
> 	  idle           idle                !online
> 	  childmask=1
> 
> 1) Hierarchy has 8 CPUs. CPU 8 is at the moment in the process of onlining
>    but did not yet connect GRP0:0 to GRP1:0.
> 
> 			     [GRP1:0]
> 			migrator = TMIGR_NONE
> 			active   = 0x00
> 			nextevt  = KTIME_MAX
> 		       /                  \
> 		 [GRP0:0]                  [GRP0:1]
> 	      migrator = TMIGR_NONE     migrator = TMIGR_NONE
> 	      active   = 0x00           active   = 0x00
> 	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
> 	      childmask= 0		childmask= 1
> 	      parent   = GRP1:0		parent   = GRP1:0
> 		/         \
> 	      0          1 .. 7                8
> 	  idle           idle                !online
> 	  childmask=1
> 
> 2) Setup code (running on CPU 8) now connects GRP0:0 to GRP1:0, updates
>    parent pointer of GRP0:0 and ...
> 
> 			     [GRP1:0]
> 			migrator = TMIGR_NONE
> 			active   = 0x00
> 			nextevt  = KTIME_MAX
> 		       /                  \
> 		 [GRP0:0]                  [GRP0:1]
> 	      migrator = 0x01           migrator = TMIGR_NONE
> 	      active   = 0x01           active   = 0x00
> 	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
> 	      childmask= 0		childmask= 1
> 	      parent   = GRP1:0		parent   = GRP1:0
> 		/         \
> 	      0          1 .. 7                8
> 	  active          idle                !online
> 	  childmask=1
> 
> 	  tmigr_walk.childmask = 0
> 
> 3) ... CPU 0 comes active in the same time. As migrator in GRP0:0 was
>    TMIGR_NONE, childmask of GRP0:0 is stored in update propagation data
>    structure tmigr_walk (as update of childmask is not yet
>    visible/updated). And now ...
> 
> 			     [GRP1:0]
> 			migrator = TMIGR_NONE
> 			active   = 0x00
> 			nextevt  = KTIME_MAX
> 		       /                  \
> 		 [GRP0:0]                  [GRP0:1]
> 	      migrator = 0x01           migrator = TMIGR_NONE
> 	      active   = 0x01           active   = 0x00
> 	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
> 	      childmask= 2		childmask= 1
> 	      parent   = GRP1:0		parent   = GRP1:0
> 		/         \
> 	      0          1 .. 7                8
> 	  active          idle                !online
> 	  childmask=1
> 
> 	  tmigr_walk.childmask = 0
> 
> 4) ... childmask of GRP0:0 is updated by CPU 8 (still part of setup
>    code).
> 
> 			     [GRP1:0]
> 			migrator = 0x00
> 			active   = 0x00
> 			nextevt  = KTIME_MAX
> 		       /                  \
> 		 [GRP0:0]                  [GRP0:1]
> 	      migrator = 0x01           migrator = TMIGR_NONE
> 	      active   = 0x01           active   = 0x00
> 	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
> 	      childmask= 2		childmask= 1
> 	      parent   = GRP1:0		parent   = GRP1:0
> 		/         \
> 	      0          1 .. 7                8
> 	  active          idle                !online
> 	  childmask=1
> 
> 	  tmigr_walk.childmask = 0
> 
> 5) CPU 0 sees the connection to GRP1:0 and now propagates active state to
>    GRP1:0 but with childmask = 0 as stored in propagation data structure.
> 
> --> Now GRP1:0 always has a migrator as 0x00 != TMIGR_NONE and for all CPUs
>     it looks like GRP1:0 is always active.
> 
> To prevent those races, the setup of the hierarchy is moved into the
> cpuhotplug prepare callback. The prepare callback is not executed by the
> CPU which will come online, it is executed by the CPU which prepares
> onlining of the other CPU. This CPU is active while it is connecting the
> formerly top level to the new one. This prevents from (A) to happen and it
> also prevents from any further walk above the formerly top level until that
> active CPU becomes inactive, releasing the new ->parent and ->childmask
> updates to be visible by any subsequent walk up above the formerly top
> level hierarchy. This prevents from (B) to happen. The direction for the
> updates is now forced to look like "from bottom to top".
> 
> However if the active CPU prevents from tmigr_cpu_(in)active() to walk up
> with the update not-or-half visible, nothing prevents walking up to the new
> top with a 0 childmask in tmigr_handle_remote_up() or
> tmigr_requires_handle_remote_up() if the active CPU doing the prepare is
> not the migrator. But then it looks fine because:
> 
>   * tmigr_check_migrator() should just return false
>   * The migrator is active and should eventually observe the new childmask
>     at some point in a future tick.
> 
> Split setup functionality of online callback into the cpuhotplug prepare
> callback and setup hotplug state. Reorder the code, that all prepare
> related functions are close to each other and online and offline callbacks
> are also close together.
> 
> Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model")
> Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

Looks very good, thanks!

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 6/8] timers/migration: Rename childmask by groupmask to make naming more obvious
  2024-07-03 20:31   ` [PATCH v4 6/8] timers/migration: Rename childmask by groupmask " Anna-Maria Behnsen
@ 2024-07-03 21:33     ` Frederic Weisbecker
  2024-07-04 18:32     ` [tip: timers/core] " tip-bot2 for Anna-Maria Behnsen
  1 sibling, 0 replies; 32+ messages in thread
From: Frederic Weisbecker @ 2024-07-03 21:33 UTC (permalink / raw)
  To: Anna-Maria Behnsen; +Cc: Thomas Gleixner, linux-kernel

Le Wed, Jul 03, 2024 at 10:31:15PM +0200, Anna-Maria Behnsen a écrit :
> childmask in the group reflects the mask that is required to 'reference'
> this group in the parent. When reading childmask, this might be confusing,
> as this suggests, that this is the mask of the child of the group.
> 
> Clarify this by renaming childmask in the tmigr_group and tmc_group by
> groupmask.
> 
> Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [tip: timers/core] timers/migration: Fix grammar in comment
  2024-07-01 10:18 ` [PATCH v3 8/8] timers/migration: Fix grammar in comment Anna-Maria Behnsen
  2024-07-02 12:51   ` Frederic Weisbecker
@ 2024-07-04 18:32   ` tip-bot2 for Anna-Maria Behnsen
  1 sibling, 0 replies; 32+ messages in thread
From: tip-bot2 for Anna-Maria Behnsen @ 2024-07-04 18:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Anna-Maria Behnsen, Thomas Gleixner, Frederic Weisbecker, x86,
	linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     746770499be55cf375a108a321a818b238182446
Gitweb:        https://git.kernel.org/tip/746770499be55cf375a108a321a818b238182446
Author:        Anna-Maria Behnsen <anna-maria@linutronix.de>
AuthorDate:    Mon, 01 Jul 2024 12:18:44 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 04 Jul 2024 20:24:57 +02:00

timers/migration: Fix grammar in comment



Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20240701-tmigr-fixes-v3-8-25cd5de318fb@linutronix.de

---
 kernel/time/timer_migration.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index a2d156b..23c5bbc 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -1368,7 +1368,7 @@ u64 tmigr_cpu_deactivate(u64 nextexp)
  *			  the only one in the level 0 group; and if it is the
  *			  only one in level 0 group, but there are more than a
  *			  single group active on the way to top level)
- * * nextevt		- when CPU is offline and has to handle timer on his own
+ * * nextevt		- when CPU is offline and has to handle timer on its own
  *			  or when on the way to top in every group only a single
  *			  child is active but @nextevt is before the lowest
  *			  next_expiry encountered while walking up to top level.

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [tip: timers/core] timers/migration: Spare write when nothing changed
  2024-07-01 10:18 ` [PATCH v3 7/8] timers/migration: Spare write when nothing changed Anna-Maria Behnsen
@ 2024-07-04 18:32   ` tip-bot2 for Anna-Maria Behnsen
  0 siblings, 0 replies; 32+ messages in thread
From: tip-bot2 for Anna-Maria Behnsen @ 2024-07-04 18:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Anna-Maria Behnsen, Thomas Gleixner, Frederic Weisbecker, x86,
	linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     2e0bd37f7b395173f879225b9d8b1811af4a8a35
Gitweb:        https://git.kernel.org/tip/2e0bd37f7b395173f879225b9d8b1811af4a8a35
Author:        Anna-Maria Behnsen <anna-maria@linutronix.de>
AuthorDate:    Mon, 01 Jul 2024 12:18:43 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 04 Jul 2024 20:24:57 +02:00

timers/migration: Spare write when nothing changed

The wakeup value is written unconditionally in tmigr_cpu_new_timer(). When
there was no new next timer expiry that needs to be propagated, then the
value that was read before is written. This is not required.

Move write to the place where wakeup value could have changed.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20240701-tmigr-fixes-v3-7-25cd5de318fb@linutronix.de

---
 kernel/time/timer_migration.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 9ff61f9..a2d156b 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -1215,14 +1215,13 @@ u64 tmigr_cpu_new_timer(u64 nextexp)
 		if (nextexp != tmc->cpuevt.nextevt.expires ||
 		    tmc->cpuevt.ignore) {
 			ret = tmigr_new_timer(tmc, nextexp);
+			/*
+			 * Make sure the reevaluation of timers in idle path
+			 * will not miss an event.
+			 */
+			WRITE_ONCE(tmc->wakeup, ret);
 		}
 	}
-	/*
-	 * Make sure the reevaluation of timers in idle path will not miss an
-	 * event.
-	 */
-	WRITE_ONCE(tmc->wakeup, ret);
-
 	trace_tmigr_cpu_new_timer_idle(tmc, nextexp);
 	raw_spin_unlock(&tmc->lock);
 	return ret;

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [tip: timers/core] timers/migration: Read childmask and parent pointer in a single place
  2024-07-01 10:18 ` [PATCH v3 5/8] timers/migration: Read childmask and parent pointer in a single place Anna-Maria Behnsen
  2024-07-02 12:04   ` Frederic Weisbecker
@ 2024-07-04 18:32   ` tip-bot2 for Anna-Maria Behnsen
  2024-07-15 19:28   ` [PATCH v3 5/8] " Frederic Weisbecker
  2 siblings, 0 replies; 32+ messages in thread
From: tip-bot2 for Anna-Maria Behnsen @ 2024-07-04 18:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Anna-Maria Behnsen, Thomas Gleixner, Frederic Weisbecker, x86,
	linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     56f9a5fd69eae92d7051a228ee192452248a6bdc
Gitweb:        https://git.kernel.org/tip/56f9a5fd69eae92d7051a228ee192452248a6bdc
Author:        Anna-Maria Behnsen <anna-maria@linutronix.de>
AuthorDate:    Mon, 01 Jul 2024 12:18:41 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 04 Jul 2024 20:24:57 +02:00

timers/migration: Read childmask and parent pointer in a single place

Reading the childmask and parent pointer is required when propagating
changes through the hierarchy. At the moment this reads are spread all over
the place which makes it harder to follow.

Move those reads to a single place to keep code clean.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20240701-tmigr-fixes-v3-5-25cd5de318fb@linutronix.de

---
 kernel/time/timer_migration.c | 20 +++++---------------
 1 file changed, 5 insertions(+), 15 deletions(-)

diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 39eb77e..1fb9305 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -535,6 +535,7 @@ static void __walk_groups(up_f up, struct tmigr_walk *data,
 
 		child = group;
 		group = group->parent;
+		data->childmask = group->childmask;
 	} while (group);
 }
 
@@ -647,9 +648,6 @@ static bool tmigr_active_up(struct tmigr_group *group,
 
 	trace_tmigr_group_set_cpu_active(group, newstate, childmask);
 
-	if (walk_done == false)
-		data->childmask = group->childmask;
-
 	/*
 	 * The group is active (again). The group event might be still queued
 	 * into the parent group's timerqueue but can now be handled by the
@@ -1027,12 +1025,10 @@ again:
 	}
 
 	/*
-	 * Update of childmask for the next level and keep track of the expiry
-	 * of the first event that needs to be handled (group->next_expiry was
-	 * updated by tmigr_next_expired_groupevt(), next was set by
-	 * tmigr_handle_remote_cpu()).
+	 * Keep track of the expiry of the first event that needs to be handled
+	 * (group->next_expiry was updated by tmigr_next_expired_groupevt(),
+	 * next was set by tmigr_handle_remote_cpu()).
 	 */
-	data->childmask = group->childmask;
 	data->firstexp = group->next_expiry;
 
 	raw_spin_unlock_irq(&group->lock);
@@ -1110,7 +1106,7 @@ static bool tmigr_requires_handle_remote_up(struct tmigr_group *group,
 	 * group before reading the next_expiry value.
 	 */
 	if (group->parent && !data->tmc_active)
-		goto out;
+		return false;
 
 	/*
 	 * The lock is required on 32bit architectures to read the variable
@@ -1135,9 +1131,6 @@ static bool tmigr_requires_handle_remote_up(struct tmigr_group *group,
 		raw_spin_unlock(&group->lock);
 	}
 
-out:
-	/* Update of childmask for the next level */
-	data->childmask = group->childmask;
 	return false;
 }
 
@@ -1309,9 +1302,6 @@ static bool tmigr_inactive_up(struct tmigr_group *group,
 	/* Event Handling */
 	tmigr_update_events(group, child, data);
 
-	if (walk_done == false)
-		data->childmask = group->childmask;
-
 	return walk_done;
 }
 

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [tip: timers/core] timers/migration: Rename childmask by groupmask to make naming more obvious
  2024-07-03 20:31   ` [PATCH v4 6/8] timers/migration: Rename childmask by groupmask " Anna-Maria Behnsen
  2024-07-03 21:33     ` Frederic Weisbecker
@ 2024-07-04 18:32     ` tip-bot2 for Anna-Maria Behnsen
  1 sibling, 0 replies; 32+ messages in thread
From: tip-bot2 for Anna-Maria Behnsen @ 2024-07-04 18:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Anna-Maria Behnsen, Thomas Gleixner, Frederic Weisbecker, x86,
	linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     633c77727d32ab3487a10ec8f125c26b416236ad
Gitweb:        https://git.kernel.org/tip/633c77727d32ab3487a10ec8f125c26b416236ad
Author:        Anna-Maria Behnsen <anna-maria@linutronix.de>
AuthorDate:    Wed, 03 Jul 2024 22:31:15 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 04 Jul 2024 20:24:57 +02:00

timers/migration: Rename childmask by groupmask to make naming more obvious

childmask in the group reflects the mask that is required to 'reference'
this group in the parent. When reading childmask, this might be confusing,
as this suggests, that this is the mask of the child of the group.

Clarify this by renaming childmask in the tmigr_group and tmc_group by
groupmask.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20240703203115.9028-1-anna-maria@linutronix.de

---
 include/trace/events/timer_migration.h |  4 ++--
 kernel/time/timer_migration.c          | 24 ++++++++++++------------
 kernel/time/timer_migration.h          | 15 +++++++--------
 3 files changed, 21 insertions(+), 22 deletions(-)

diff --git a/include/trace/events/timer_migration.h b/include/trace/events/timer_migration.h
index 79f19e7..f1a447f 100644
--- a/include/trace/events/timer_migration.h
+++ b/include/trace/events/timer_migration.h
@@ -52,7 +52,7 @@ TRACE_EVENT(tmigr_connect_child_parent,
 		__entry->lvl		= child->parent->level;
 		__entry->numa_node	= child->parent->numa_node;
 		__entry->num_children	= child->parent->num_children;
-		__entry->childmask	= child->childmask;
+		__entry->childmask	= child->groupmask;
 	),
 
 	TP_printk("group=%p childmask=%0x parent=%p lvl=%d numa=%d num_children=%d",
@@ -81,7 +81,7 @@ TRACE_EVENT(tmigr_connect_cpu_parent,
 		__entry->lvl		= tmc->tmgroup->level;
 		__entry->numa_node	= tmc->tmgroup->numa_node;
 		__entry->num_children	= tmc->tmgroup->num_children;
-		__entry->childmask	= tmc->childmask;
+		__entry->childmask	= tmc->groupmask;
 	),
 
 	TP_printk("cpu=%d childmask=%0x parent=%p lvl=%d numa=%d num_children=%d",
diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 1fb9305..9ff61f9 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -495,7 +495,7 @@ static bool tmigr_check_lonely(struct tmigr_group *group)
  *			outcome is a CPU which might wake up a little early.
  * @evt:		Pointer to tmigr_event which needs to be queued (of idle
  *			child group)
- * @childmask:		childmask of child group
+ * @childmask:		groupmask of child group
  * @remote:		Is set, when the new timer path is executed in
  *			tmigr_handle_remote_cpu()
  * @basej:		timer base in jiffies
@@ -535,7 +535,7 @@ static void __walk_groups(up_f up, struct tmigr_walk *data,
 
 		child = group;
 		group = group->parent;
-		data->childmask = group->childmask;
+		data->childmask = group->groupmask;
 	} while (group);
 }
 
@@ -669,7 +669,7 @@ static void __tmigr_cpu_activate(struct tmigr_cpu *tmc)
 {
 	struct tmigr_walk data;
 
-	data.childmask = tmc->childmask;
+	data.childmask = tmc->groupmask;
 
 	trace_tmigr_cpu_active(tmc);
 
@@ -1049,7 +1049,7 @@ void tmigr_handle_remote(void)
 	if (tmigr_is_not_available(tmc))
 		return;
 
-	data.childmask = tmc->childmask;
+	data.childmask = tmc->groupmask;
 	data.firstexp = KTIME_MAX;
 
 	/*
@@ -1057,7 +1057,7 @@ void tmigr_handle_remote(void)
 	 * in tmigr_handle_remote_up() anyway. Keep this check to speed up the
 	 * return when nothing has to be done.
 	 */
-	if (!tmigr_check_migrator(tmc->tmgroup, tmc->childmask)) {
+	if (!tmigr_check_migrator(tmc->tmgroup, tmc->groupmask)) {
 		/*
 		 * If this CPU was an idle migrator, make sure to clear its wakeup
 		 * value so it won't chase timers that have already expired elsewhere.
@@ -1150,7 +1150,7 @@ bool tmigr_requires_handle_remote(void)
 		return ret;
 
 	data.now = get_jiffies_update(&jif);
-	data.childmask = tmc->childmask;
+	data.childmask = tmc->groupmask;
 	data.firstexp = KTIME_MAX;
 	data.tmc_active = !tmc->idle;
 	data.check = false;
@@ -1310,7 +1310,7 @@ static u64 __tmigr_cpu_deactivate(struct tmigr_cpu *tmc, u64 nextexp)
 	struct tmigr_walk data = { .nextexp = nextexp,
 				   .firstexp = KTIME_MAX,
 				   .evt = &tmc->cpuevt,
-				   .childmask = tmc->childmask };
+				   .childmask = tmc->groupmask };
 
 	/*
 	 * If nextexp is KTIME_MAX, the CPU event will be ignored because the
@@ -1388,7 +1388,7 @@ u64 tmigr_quick_check(u64 nextevt)
 	if (WARN_ON_ONCE(tmc->idle))
 		return nextevt;
 
-	if (!tmigr_check_migrator_and_lonely(tmc->tmgroup, tmc->childmask))
+	if (!tmigr_check_migrator_and_lonely(tmc->tmgroup, tmc->groupmask))
 		return KTIME_MAX;
 
 	do {
@@ -1551,7 +1551,7 @@ static void tmigr_connect_child_parent(struct tmigr_group *child,
 	raw_spin_lock_nested(&parent->lock, SINGLE_DEPTH_NESTING);
 
 	child->parent = parent;
-	child->childmask = BIT(parent->num_children++);
+	child->groupmask = BIT(parent->num_children++);
 
 	raw_spin_unlock(&parent->lock);
 	raw_spin_unlock_irq(&child->lock);
@@ -1582,7 +1582,7 @@ static void tmigr_connect_child_parent(struct tmigr_group *child,
 	 *   the new childmask and parent to subsequent walkers through this
 	 *   @child. Therefore propagate active state unconditionally.
 	 */
-	data.childmask = child->childmask;
+	data.childmask = child->groupmask;
 
 	/*
 	 * There is only one new level per time (which is protected by
@@ -1648,7 +1648,7 @@ static int tmigr_setup_groups(unsigned int cpu, unsigned int node)
 			raw_spin_lock_irq(&group->lock);
 
 			tmc->tmgroup = group;
-			tmc->childmask = BIT(group->num_children++);
+			tmc->groupmask = BIT(group->num_children++);
 
 			raw_spin_unlock_irq(&group->lock);
 
@@ -1711,7 +1711,7 @@ static int tmigr_cpu_prepare(unsigned int cpu)
 	if (ret < 0)
 		return ret;
 
-	if (tmc->childmask == 0)
+	if (tmc->groupmask == 0)
 		return -EINVAL;
 
 	timerqueue_init(&tmc->cpuevt.nextevt);
diff --git a/kernel/time/timer_migration.h b/kernel/time/timer_migration.h
index 494f68c..154accc 100644
--- a/kernel/time/timer_migration.h
+++ b/kernel/time/timer_migration.h
@@ -51,9 +51,8 @@ struct tmigr_event {
  * @num_children:	Counter of group children to make sure the group is only
  *			filled with TMIGR_CHILDREN_PER_GROUP; Required for setup
  *			only
- * @childmask:		childmask of the group in the parent group; is set
- *			during setup and will never change; can be read
- *			lockless
+ * @groupmask:		mask of the group in the parent group; is set during
+ *			setup and will never change; can be read lockless
  * @list:		List head that is added to the per level
  *			tmigr_level_list; is required during setup when a
  *			new group needs to be connected to the existing
@@ -69,7 +68,7 @@ struct tmigr_group {
 	unsigned int		level;
 	int			numa_node;
 	unsigned int		num_children;
-	u8			childmask;
+	u8			groupmask;
 	struct list_head	list;
 };
 
@@ -89,7 +88,7 @@ struct tmigr_group {
  *			hierarchy
  * @remote:		Is set when timers of the CPU are expired remotely
  * @tmgroup:		Pointer to the parent group
- * @childmask:		childmask of tmigr_cpu in the parent group
+ * @groupmask:		mask of tmigr_cpu in the parent group
  * @wakeup:		Stores the first timer when the timer migration
  *			hierarchy is completely idle and remote expiry was done;
  *			is returned to timer code in the idle path and is only
@@ -102,7 +101,7 @@ struct tmigr_cpu {
 	bool			idle;
 	bool			remote;
 	struct tmigr_group	*tmgroup;
-	u8			childmask;
+	u8			groupmask;
 	u64			wakeup;
 	struct tmigr_event	cpuevt;
 };
@@ -118,8 +117,8 @@ union tmigr_state {
 	u32 state;
 	/**
 	 * struct - split state of tmigr_group
-	 * @active:	Contains each childmask bit of the active children
-	 * @migrator:	Contains childmask of the child which is migrator
+	 * @active:	Contains each mask bit of the active children
+	 * @migrator:	Contains mask of the child which is migrator
 	 * @seq:	Sequence counter needs to be increased when an update
 	 *		to the tmigr_state is done. It prevents a race when
 	 *		updates in the child groups are propagated in changed

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [tip: timers/core] timers/migration: Use a single struct for hierarchy walk data
  2024-07-01 10:18 ` [PATCH v3 4/8] timers/migration: Use a single struct for hierarchy walk data Anna-Maria Behnsen
  2024-07-02 11:43   ` Frederic Weisbecker
@ 2024-07-04 18:32   ` tip-bot2 for Anna-Maria Behnsen
  1 sibling, 0 replies; 32+ messages in thread
From: tip-bot2 for Anna-Maria Behnsen @ 2024-07-04 18:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Anna-Maria Behnsen, Thomas Gleixner, Frederic Weisbecker, x86,
	linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     6dbb59418c5c7b014e542db76595417c9b95ccde
Gitweb:        https://git.kernel.org/tip/6dbb59418c5c7b014e542db76595417c9b95ccde
Author:        Anna-Maria Behnsen <anna-maria@linutronix.de>
AuthorDate:    Mon, 01 Jul 2024 12:18:40 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 04 Jul 2024 20:24:57 +02:00

timers/migration: Use a single struct for hierarchy walk data

Two different structs are defined for propagating data from one to another
level when walking the hierarchy. Several struct members exist in both
structs which makes generalization harder.

Merge those two structs into a single one and use it directly in
walk_groups() and the corresponding function pointers instead of
introducing pointer casting all over the place.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20240701-tmigr-fixes-v3-4-25cd5de318fb@linutronix.de

---
 kernel/time/timer_migration.c | 126 ++++++++++++++-------------------
 1 file changed, 55 insertions(+), 71 deletions(-)

diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index f78258a..39eb77e 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -475,69 +475,31 @@ static bool tmigr_check_lonely(struct tmigr_group *group)
 	return bitmap_weight(&active, BIT_CNT) <= 1;
 }
 
-typedef bool (*up_f)(struct tmigr_group *, struct tmigr_group *, void *);
-
-static void __walk_groups(up_f up, void *data,
-			  struct tmigr_cpu *tmc)
-{
-	struct tmigr_group *child = NULL, *group = tmc->tmgroup;
-
-	do {
-		WARN_ON_ONCE(group->level >= tmigr_hierarchy_levels);
-
-		if (up(group, child, data))
-			break;
-
-		child = group;
-		group = group->parent;
-	} while (group);
-}
-
-static void walk_groups(up_f up, void *data, struct tmigr_cpu *tmc)
-{
-	lockdep_assert_held(&tmc->lock);
-
-	__walk_groups(up, data, tmc);
-}
-
 /**
  * struct tmigr_walk - data required for walking the hierarchy
  * @nextexp:		Next CPU event expiry information which is handed into
  *			the timer migration code by the timer code
  *			(get_next_timer_interrupt())
- * @firstexp:		Contains the first event expiry information when last
- *			active CPU of hierarchy is on the way to idle to make
- *			sure CPU will be back in time. It is updated in top
- *			level group only. Be aware, there could occur a new top
- *			level of the hierarchy between the 'top level call' in
- *			tmigr_update_events() and the check for the parent group
- *			in walk_groups(). Then @firstexp might contain a value
- *			!= KTIME_MAX even if it was not the final top
- *			level. This is not a problem, as the worst outcome is a
- *			CPU which might wake up a little early.
+ * @firstexp:		Contains the first event expiry information when
+ *			hierarchy is completely idle.  When CPU itself was the
+ *			last going idle, information makes sure, that CPU will
+ *			be back in time. When using this value in the remote
+ *			expiry case, firstexp is stored in the per CPU tmigr_cpu
+ *			struct of CPU which expires remote timers. It is updated
+ *			in top level group only. Be aware, there could occur a
+ *			new top level of the hierarchy between the 'top level
+ *			call' in tmigr_update_events() and the check for the
+ *			parent group in walk_groups(). Then @firstexp might
+ *			contain a value != KTIME_MAX even if it was not the
+ *			final top level. This is not a problem, as the worst
+ *			outcome is a CPU which might wake up a little early.
  * @evt:		Pointer to tmigr_event which needs to be queued (of idle
  *			child group)
  * @childmask:		childmask of child group
  * @remote:		Is set, when the new timer path is executed in
  *			tmigr_handle_remote_cpu()
- */
-struct tmigr_walk {
-	u64			nextexp;
-	u64			firstexp;
-	struct tmigr_event	*evt;
-	u8			childmask;
-	bool			remote;
-};
-
-/**
- * struct tmigr_remote_data - data required for remote expiry hierarchy walk
  * @basej:		timer base in jiffies
  * @now:		timer base monotonic
- * @firstexp:		returns expiry of the first timer in the idle timer
- *			migration hierarchy to make sure the timer is handled in
- *			time; it is stored in the per CPU tmigr_cpu struct of
- *			CPU which expires remote timers
- * @childmask:		childmask of child group
  * @check:		is set if there is the need to handle remote timers;
  *			required in tmigr_requires_handle_remote() only
  * @tmc_active:		this flag indicates, whether the CPU which triggers
@@ -546,15 +508,43 @@ struct tmigr_walk {
  *			idle, only the first event of the top level has to be
  *			considered.
  */
-struct tmigr_remote_data {
-	unsigned long	basej;
-	u64		now;
-	u64		firstexp;
-	u8		childmask;
-	bool		check;
-	bool		tmc_active;
+struct tmigr_walk {
+	u64			nextexp;
+	u64			firstexp;
+	struct tmigr_event	*evt;
+	u8			childmask;
+	bool			remote;
+	unsigned long		basej;
+	u64			now;
+	bool			check;
+	bool			tmc_active;
 };
 
+typedef bool (*up_f)(struct tmigr_group *, struct tmigr_group *, struct tmigr_walk *);
+
+static void __walk_groups(up_f up, struct tmigr_walk *data,
+			  struct tmigr_cpu *tmc)
+{
+	struct tmigr_group *child = NULL, *group = tmc->tmgroup;
+
+	do {
+		WARN_ON_ONCE(group->level >= tmigr_hierarchy_levels);
+
+		if (up(group, child, data))
+			break;
+
+		child = group;
+		group = group->parent;
+	} while (group);
+}
+
+static void walk_groups(up_f up, struct tmigr_walk *data, struct tmigr_cpu *tmc)
+{
+	lockdep_assert_held(&tmc->lock);
+
+	__walk_groups(up, data, tmc);
+}
+
 /*
  * Returns the next event of the timerqueue @group->events
  *
@@ -625,10 +615,9 @@ static u64 tmigr_next_groupevt_expires(struct tmigr_group *group)
 
 static bool tmigr_active_up(struct tmigr_group *group,
 			    struct tmigr_group *child,
-			    void *ptr)
+			    struct tmigr_walk *data)
 {
 	union tmigr_state curstate, newstate;
-	struct tmigr_walk *data = ptr;
 	bool walk_done;
 	u8 childmask;
 
@@ -867,10 +856,8 @@ unlock:
 
 static bool tmigr_new_timer_up(struct tmigr_group *group,
 			       struct tmigr_group *child,
-			       void *ptr)
+			       struct tmigr_walk *data)
 {
-	struct tmigr_walk *data = ptr;
-
 	return tmigr_update_events(group, child, data);
 }
 
@@ -1002,9 +989,8 @@ unlock:
 
 static bool tmigr_handle_remote_up(struct tmigr_group *group,
 				   struct tmigr_group *child,
-				   void *ptr)
+				   struct tmigr_walk *data)
 {
-	struct tmigr_remote_data *data = ptr;
 	struct tmigr_event *evt;
 	unsigned long jif;
 	u8 childmask;
@@ -1062,7 +1048,7 @@ again:
 void tmigr_handle_remote(void)
 {
 	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
-	struct tmigr_remote_data data;
+	struct tmigr_walk data;
 
 	if (tmigr_is_not_available(tmc))
 		return;
@@ -1104,9 +1090,8 @@ void tmigr_handle_remote(void)
 
 static bool tmigr_requires_handle_remote_up(struct tmigr_group *group,
 					    struct tmigr_group *child,
-					    void *ptr)
+					    struct tmigr_walk *data)
 {
-	struct tmigr_remote_data *data = ptr;
 	u8 childmask;
 
 	childmask = data->childmask;
@@ -1164,7 +1149,7 @@ out:
 bool tmigr_requires_handle_remote(void)
 {
 	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
-	struct tmigr_remote_data data;
+	struct tmigr_walk data;
 	unsigned long jif;
 	bool ret = false;
 
@@ -1252,10 +1237,9 @@ u64 tmigr_cpu_new_timer(u64 nextexp)
 
 static bool tmigr_inactive_up(struct tmigr_group *group,
 			      struct tmigr_group *child,
-			      void *ptr)
+			      struct tmigr_walk *data)
 {
 	union tmigr_state curstate, newstate, childstate;
-	struct tmigr_walk *data = ptr;
 	bool walk_done;
 	u8 childmask;
 

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [tip: timers/core] timers/migration: Improve tracing
  2024-07-01 10:18 ` [PATCH v3 3/8] timers/migration: Improve tracing Anna-Maria Behnsen
@ 2024-07-04 18:32   ` tip-bot2 for Anna-Maria Behnsen
  0 siblings, 0 replies; 32+ messages in thread
From: tip-bot2 for Anna-Maria Behnsen @ 2024-07-04 18:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Anna-Maria Behnsen, Thomas Gleixner, Frederic Weisbecker, x86,
	linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     6d8a8f54e045e2030eebb53b5ce859c80d9425f6
Gitweb:        https://git.kernel.org/tip/6d8a8f54e045e2030eebb53b5ce859c80d9425f6
Author:        Anna-Maria Behnsen <anna-maria@linutronix.de>
AuthorDate:    Mon, 01 Jul 2024 12:18:39 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 04 Jul 2024 20:24:57 +02:00

timers/migration: Improve tracing

Trace points of inactive and active propagation are located at the end of
the related functions. The interesting information of those trace points is
the updated group state. When trace points are not located directly at the
place where group state changed, order of trace points in traces could be
confusing.

Move inactive and active propagation trace points directly after update of
group state values.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20240701-tmigr-fixes-v3-3-25cd5de318fb@linutronix.de

---
 kernel/time/timer_migration.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 9b86efd..f78258a 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -656,6 +656,8 @@ static bool tmigr_active_up(struct tmigr_group *group,
 
 	} while (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state));
 
+	trace_tmigr_group_set_cpu_active(group, newstate, childmask);
+
 	if (walk_done == false)
 		data->childmask = group->childmask;
 
@@ -673,8 +675,6 @@ static bool tmigr_active_up(struct tmigr_group *group,
 	 */
 	group->groupevt.ignore = true;
 
-	trace_tmigr_group_set_cpu_active(group, newstate, childmask);
-
 	return walk_done;
 }
 
@@ -1306,9 +1306,10 @@ static bool tmigr_inactive_up(struct tmigr_group *group,
 
 		WARN_ON_ONCE((newstate.migrator != TMIGR_NONE) && !(newstate.active));
 
-		if (atomic_try_cmpxchg(&group->migr_state, &curstate.state,
-				       newstate.state))
+		if (atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state)) {
+			trace_tmigr_group_set_cpu_inactive(group, newstate, childmask);
 			break;
+		}
 
 		/*
 		 * The memory barrier is paired with the cmpxchg() in
@@ -1327,8 +1328,6 @@ static bool tmigr_inactive_up(struct tmigr_group *group,
 	if (walk_done == false)
 		data->childmask = group->childmask;
 
-	trace_tmigr_group_set_cpu_inactive(group, newstate, childmask);
-
 	return walk_done;
 }
 

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [tip: timers/core] timers/migration: Move hierarchy setup into cpuhotplug prepare callback
  2024-07-03 20:28   ` [PATCH v4 " Anna-Maria Behnsen
  2024-07-03 21:24     ` Frederic Weisbecker
@ 2024-07-04 18:32     ` tip-bot2 for Anna-Maria Behnsen
  2024-07-11  8:56     ` [PATCH v4 2/8] " Alexander Stein
  2 siblings, 0 replies; 32+ messages in thread
From: tip-bot2 for Anna-Maria Behnsen @ 2024-07-04 18:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Anna-Maria Behnsen, Thomas Gleixner, Frederic Weisbecker, stable,
	x86, linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     8cdb61838ee5c63556773ea2eed24deab6b15257
Gitweb:        https://git.kernel.org/tip/8cdb61838ee5c63556773ea2eed24deab6b15257
Author:        Anna-Maria Behnsen <anna-maria@linutronix.de>
AuthorDate:    Wed, 03 Jul 2024 22:28:39 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 04 Jul 2024 20:22:22 +02:00

timers/migration: Move hierarchy setup into cpuhotplug prepare callback

When a CPU comes online the first time, it is possible that a new top level
group will be created. In general all propagation is done from the bottom
to top. This minimizes complexity and prevents possible races. But when a
new top level group is created, the formely top level group needs to be
connected to the new level. This is the only time, when the direction to
propagate changes is changed: the changes are propagated from top (new top
level group) to bottom (formerly top level group).

This introduces two races (see (A) and (B)) as reported by Frederic:

(A) This race happens, when marking the formely top level group as active,
but the last active CPU of the formerly top level group goes idle. Then
it's likely that formerly group is no longer active, but marked
nevertheless as active in new top level group:

		  [GRP0:0]
	       migrator = 0
	       active   = 0
	       nextevt  = KTIME_MAX
	       /         \
	      0         1 .. 7
	  active         idle

0) Hierarchy has for now only 8 CPUs and CPU 0 is the only active CPU.

			     [GRP1:0]
			migrator = TMIGR_NONE
			active   = NONE
			nextevt  = KTIME_MAX
					 \
		 [GRP0:0]                  [GRP0:1]
	      migrator = 0              migrator = TMIGR_NONE
	      active   = 0              active   = NONE
	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
		/         \
	      0          1 .. 7                8
	  active         idle                !online

1) CPU 8 is booting and creates a new group in first level GRP0:1 and
   therefore also a new top group GRP1:0. For now the setup code proceeded
   only until the connected between GRP0:1 to the new top group. The
   connection between CPU8 and GRP0:1 is not yet established and CPU 8 is
   still !online.

			     [GRP1:0]
			migrator = TMIGR_NONE
			active   = NONE
			nextevt  = KTIME_MAX
		       /                  \
		 [GRP0:0]                  [GRP0:1]
	      migrator = 0              migrator = TMIGR_NONE
	      active   = 0              active   = NONE
	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
		/         \
	      0          1 .. 7                8
	  active         idle                !online

2) Setup code now connects GRP0:0 to GRP1:0 and observes while in
   tmigr_connect_child_parent() that GRP0:0 is not TMIGR_NONE. So it
   prepares to call tmigr_active_up() on it. It hasn't done it yet.

			     [GRP1:0]
			migrator = TMIGR_NONE
			active   = NONE
			nextevt  = KTIME_MAX
		       /                  \
		 [GRP0:0]                  [GRP0:1]
	      migrator = TMIGR_NONE        migrator = TMIGR_NONE
	      active   = NONE              active   = NONE
	      nextevt  = KTIME_MAX         nextevt  = KTIME_MAX
		/         \
	      0          1 .. 7                8
	    idle         idle                !online

3) CPU 0 goes idle. Since GRP0:0->parent has been updated by CPU 8 with
   GRP0:0->lock held, CPU 0 observes GRP1:0 after calling
   tmigr_update_events() and it propagates the change to the top (no change
   there and no wakeup programmed since there is no timer).

			     [GRP1:0]
			migrator = GRP0:0
			active   = GRP0:0
			nextevt  = KTIME_MAX
		       /                  \
		 [GRP0:0]                  [GRP0:1]
	      migrator = TMIGR_NONE       migrator = TMIGR_NONE
	      active   = NONE             active   = NONE
	      nextevt  = KTIME_MAX        nextevt  = KTIME_MAX
		/         \
	      0          1 .. 7                8
	    idle         idle                !online

4) Now the setup code finally calls tmigr_active_up() to and sets GRP0:0
   active in GRP1:0

			     [GRP1:0]
			migrator = GRP0:0
			active   = GRP0:0, GRP0:1
			nextevt  = KTIME_MAX
		       /                  \
		 [GRP0:0]                  [GRP0:1]
	      migrator = TMIGR_NONE       migrator = 8
	      active   = NONE             active   = 8
	      nextevt  = KTIME_MAX        nextevt  = KTIME_MAX
		/         \                    |
	      0          1 .. 7                8
	    idle         idle                active

5) Now CPU 8 is connected with GRP0:1 and CPU 8 calls tmigr_active_up() out
   of tmigr_cpu_online().

			     [GRP1:0]
			migrator = GRP0:0
			active   = GRP0:0
			nextevt  = T8
		       /                  \
		 [GRP0:0]                  [GRP0:1]
	      migrator = TMIGR_NONE         migrator = TMIGR_NONE
	      active   = NONE               active   = NONE
	      nextevt  = KTIME_MAX          nextevt  = T8
		/         \                    |
	      0          1 .. 7                8
	    idle         idle                  idle

5) CPU 8 goes idle with a timer T8 and relies on GRP0:0 as the migrator.
   But it's not really active, so T8 gets ignored.

--> The update which is done in third step is not noticed by setup code. So
    a wrong migrator is set to top level group and a timer could get
    ignored.

(B) Reading group->parent and group->childmask when an hierarchy update is
ongoing and reaches the formerly top level group is racy as those values
could be inconsistent. (The notation of migrator and active now slightly
changes in contrast to the above example, as now the childmasks are used.)

			     [GRP1:0]
			migrator = TMIGR_NONE
			active   = 0x00
			nextevt  = KTIME_MAX
					 \
		 [GRP0:0]                  [GRP0:1]
	      migrator = TMIGR_NONE     migrator = TMIGR_NONE
	      active   = 0x00           active   = 0x00
	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
	      childmask= 0		childmask= 1
	      parent   = NULL		parent   = GRP1:0
		/         \
	      0          1 .. 7                8
	  idle           idle                !online
	  childmask=1

1) Hierarchy has 8 CPUs. CPU 8 is at the moment in the process of onlining
   but did not yet connect GRP0:0 to GRP1:0.

			     [GRP1:0]
			migrator = TMIGR_NONE
			active   = 0x00
			nextevt  = KTIME_MAX
		       /                  \
		 [GRP0:0]                  [GRP0:1]
	      migrator = TMIGR_NONE     migrator = TMIGR_NONE
	      active   = 0x00           active   = 0x00
	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
	      childmask= 0		childmask= 1
	      parent   = GRP1:0		parent   = GRP1:0
		/         \
	      0          1 .. 7                8
	  idle           idle                !online
	  childmask=1

2) Setup code (running on CPU 8) now connects GRP0:0 to GRP1:0, updates
   parent pointer of GRP0:0 and ...

			     [GRP1:0]
			migrator = TMIGR_NONE
			active   = 0x00
			nextevt  = KTIME_MAX
		       /                  \
		 [GRP0:0]                  [GRP0:1]
	      migrator = 0x01           migrator = TMIGR_NONE
	      active   = 0x01           active   = 0x00
	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
	      childmask= 0		childmask= 1
	      parent   = GRP1:0		parent   = GRP1:0
		/         \
	      0          1 .. 7                8
	  active          idle                !online
	  childmask=1

	  tmigr_walk.childmask = 0

3) ... CPU 0 comes active in the same time. As migrator in GRP0:0 was
   TMIGR_NONE, childmask of GRP0:0 is stored in update propagation data
   structure tmigr_walk (as update of childmask is not yet
   visible/updated). And now ...

			     [GRP1:0]
			migrator = TMIGR_NONE
			active   = 0x00
			nextevt  = KTIME_MAX
		       /                  \
		 [GRP0:0]                  [GRP0:1]
	      migrator = 0x01           migrator = TMIGR_NONE
	      active   = 0x01           active   = 0x00
	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
	      childmask= 2		childmask= 1
	      parent   = GRP1:0		parent   = GRP1:0
		/         \
	      0          1 .. 7                8
	  active          idle                !online
	  childmask=1

	  tmigr_walk.childmask = 0

4) ... childmask of GRP0:0 is updated by CPU 8 (still part of setup
   code).

			     [GRP1:0]
			migrator = 0x00
			active   = 0x00
			nextevt  = KTIME_MAX
		       /                  \
		 [GRP0:0]                  [GRP0:1]
	      migrator = 0x01           migrator = TMIGR_NONE
	      active   = 0x01           active   = 0x00
	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
	      childmask= 2		childmask= 1
	      parent   = GRP1:0		parent   = GRP1:0
		/         \
	      0          1 .. 7                8
	  active          idle                !online
	  childmask=1

	  tmigr_walk.childmask = 0

5) CPU 0 sees the connection to GRP1:0 and now propagates active state to
   GRP1:0 but with childmask = 0 as stored in propagation data structure.

--> Now GRP1:0 always has a migrator as 0x00 != TMIGR_NONE and for all CPUs
    it looks like GRP1:0 is always active.

To prevent those races, the setup of the hierarchy is moved into the
cpuhotplug prepare callback. The prepare callback is not executed by the
CPU which will come online, it is executed by the CPU which prepares
onlining of the other CPU. This CPU is active while it is connecting the
formerly top level to the new one. This prevents from (A) to happen and it
also prevents from any further walk above the formerly top level until that
active CPU becomes inactive, releasing the new ->parent and ->childmask
updates to be visible by any subsequent walk up above the formerly top
level hierarchy. This prevents from (B) to happen. The direction for the
updates is now forced to look like "from bottom to top".

However if the active CPU prevents from tmigr_cpu_(in)active() to walk up
with the update not-or-half visible, nothing prevents walking up to the new
top with a 0 childmask in tmigr_handle_remote_up() or
tmigr_requires_handle_remote_up() if the active CPU doing the prepare is
not the migrator. But then it looks fine because:

  * tmigr_check_migrator() should just return false
  * The migrator is active and should eventually observe the new childmask
    at some point in a future tick.

Split setup functionality of online callback into the cpuhotplug prepare
callback and setup hotplug state. Reorder the code, that all prepare
related functions are close to each other and online and offline callbacks
are also close together.

Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model")
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240703202839.8921-1-anna-maria@linutronix.de

---
 include/linux/cpuhotplug.h    |   1 +-
 kernel/time/timer_migration.c | 181 ++++++++++++++++++---------------
 2 files changed, 101 insertions(+), 81 deletions(-)

diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 7a5785f..df59666 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -122,6 +122,7 @@ enum cpuhp_state {
 	CPUHP_KVM_PPC_BOOK3S_PREPARE,
 	CPUHP_ZCOMP_PREPARE,
 	CPUHP_TIMERS_PREPARE,
+	CPUHP_TMIGR_PREPARE,
 	CPUHP_MIPS_SOC_PREPARE,
 	CPUHP_BP_PREPARE_DYN,
 	CPUHP_BP_PREPARE_DYN_END		= CPUHP_BP_PREPARE_DYN + 20,
diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index d91efe1..9b86efd 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -1438,6 +1438,66 @@ u64 tmigr_quick_check(u64 nextevt)
 	return KTIME_MAX;
 }
 
+/*
+ * tmigr_trigger_active() - trigger a CPU to become active again
+ *
+ * This function is executed on a CPU which is part of cpu_online_mask, when the
+ * last active CPU in the hierarchy is offlining. With this, it is ensured that
+ * the other CPU is active and takes over the migrator duty.
+ */
+static long tmigr_trigger_active(void *unused)
+{
+	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
+
+	WARN_ON_ONCE(!tmc->online || tmc->idle);
+
+	return 0;
+}
+
+static int tmigr_cpu_offline(unsigned int cpu)
+{
+	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
+	int migrator;
+	u64 firstexp;
+
+	raw_spin_lock_irq(&tmc->lock);
+	tmc->online = false;
+	WRITE_ONCE(tmc->wakeup, KTIME_MAX);
+
+	/*
+	 * CPU has to handle the local events on his own, when on the way to
+	 * offline; Therefore nextevt value is set to KTIME_MAX
+	 */
+	firstexp = __tmigr_cpu_deactivate(tmc, KTIME_MAX);
+	trace_tmigr_cpu_offline(tmc);
+	raw_spin_unlock_irq(&tmc->lock);
+
+	if (firstexp != KTIME_MAX) {
+		migrator = cpumask_any_but(cpu_online_mask, cpu);
+		work_on_cpu(migrator, tmigr_trigger_active, NULL);
+	}
+
+	return 0;
+}
+
+static int tmigr_cpu_online(unsigned int cpu)
+{
+	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
+
+	/* Check whether CPU data was successfully initialized */
+	if (WARN_ON_ONCE(!tmc->tmgroup))
+		return -EINVAL;
+
+	raw_spin_lock_irq(&tmc->lock);
+	trace_tmigr_cpu_online(tmc);
+	tmc->idle = timer_base_is_idle();
+	if (!tmc->idle)
+		__tmigr_cpu_activate(tmc);
+	tmc->online = true;
+	raw_spin_unlock_irq(&tmc->lock);
+	return 0;
+}
+
 static void tmigr_init_group(struct tmigr_group *group, unsigned int lvl,
 			     int node)
 {
@@ -1512,7 +1572,7 @@ static struct tmigr_group *tmigr_get_group(unsigned int cpu, int node,
 static void tmigr_connect_child_parent(struct tmigr_group *child,
 				       struct tmigr_group *parent)
 {
-	union tmigr_state childstate;
+	struct tmigr_walk data;
 
 	raw_spin_lock_irq(&child->lock);
 	raw_spin_lock_nested(&parent->lock, SINGLE_DEPTH_NESTING);
@@ -1540,22 +1600,24 @@ static void tmigr_connect_child_parent(struct tmigr_group *child,
 	 *   child to the new parent. So tmigr_connect_child_parent() is
 	 *   executed with the formerly top level group (child) and the newly
 	 *   created group (parent).
+	 *
+	 * * It is ensured that the child is active, as this setup path is
+	 *   executed in hotplug prepare callback. This is exectued by an
+	 *   already connected and !idle CPU. Even if all other CPUs go idle,
+	 *   the CPU executing the setup will be responsible up to current top
+	 *   level group. And the next time it goes inactive, it will release
+	 *   the new childmask and parent to subsequent walkers through this
+	 *   @child. Therefore propagate active state unconditionally.
 	 */
-	childstate.state = atomic_read(&child->migr_state);
-	if (childstate.migrator != TMIGR_NONE) {
-		struct tmigr_walk data;
-
-		data.childmask = child->childmask;
+	data.childmask = child->childmask;
 
-		/*
-		 * There is only one new level per time (which is protected by
-		 * tmigr_mutex). When connecting the child and the parent and
-		 * set the child active when the parent is inactive, the parent
-		 * needs to be the uppermost level. Otherwise there went
-		 * something wrong!
-		 */
-		WARN_ON(!tmigr_active_up(parent, child, &data) && parent->parent);
-	}
+	/*
+	 * There is only one new level per time (which is protected by
+	 * tmigr_mutex). When connecting the child and the parent and set the
+	 * child active when the parent is inactive, the parent needs to be the
+	 * uppermost level. Otherwise there went something wrong!
+	 */
+	WARN_ON(!tmigr_active_up(parent, child, &data) && parent->parent);
 }
 
 static int tmigr_setup_groups(unsigned int cpu, unsigned int node)
@@ -1661,80 +1723,32 @@ static int tmigr_add_cpu(unsigned int cpu)
 	return ret;
 }
 
-static int tmigr_cpu_online(unsigned int cpu)
+static int tmigr_cpu_prepare(unsigned int cpu)
 {
-	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
-	int ret;
-
-	/* First online attempt? Initialize CPU data */
-	if (!tmc->tmgroup) {
-		raw_spin_lock_init(&tmc->lock);
-
-		ret = tmigr_add_cpu(cpu);
-		if (ret < 0)
-			return ret;
-
-		if (tmc->childmask == 0)
-			return -EINVAL;
+	struct tmigr_cpu *tmc = per_cpu_ptr(&tmigr_cpu, cpu);
+	int ret = 0;
 
-		timerqueue_init(&tmc->cpuevt.nextevt);
-		tmc->cpuevt.nextevt.expires = KTIME_MAX;
-		tmc->cpuevt.ignore = true;
-		tmc->cpuevt.cpu = cpu;
-
-		tmc->remote = false;
-		WRITE_ONCE(tmc->wakeup, KTIME_MAX);
-	}
-	raw_spin_lock_irq(&tmc->lock);
-	trace_tmigr_cpu_online(tmc);
-	tmc->idle = timer_base_is_idle();
-	if (!tmc->idle)
-		__tmigr_cpu_activate(tmc);
-	tmc->online = true;
-	raw_spin_unlock_irq(&tmc->lock);
-	return 0;
-}
-
-/*
- * tmigr_trigger_active() - trigger a CPU to become active again
- *
- * This function is executed on a CPU which is part of cpu_online_mask, when the
- * last active CPU in the hierarchy is offlining. With this, it is ensured that
- * the other CPU is active and takes over the migrator duty.
- */
-static long tmigr_trigger_active(void *unused)
-{
-	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
+	/* Not first online attempt? */
+	if (tmc->tmgroup)
+		return ret;
 
-	WARN_ON_ONCE(!tmc->online || tmc->idle);
+	raw_spin_lock_init(&tmc->lock);
 
-	return 0;
-}
+	ret = tmigr_add_cpu(cpu);
+	if (ret < 0)
+		return ret;
 
-static int tmigr_cpu_offline(unsigned int cpu)
-{
-	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
-	int migrator;
-	u64 firstexp;
+	if (tmc->childmask == 0)
+		return -EINVAL;
 
-	raw_spin_lock_irq(&tmc->lock);
-	tmc->online = false;
+	timerqueue_init(&tmc->cpuevt.nextevt);
+	tmc->cpuevt.nextevt.expires = KTIME_MAX;
+	tmc->cpuevt.ignore = true;
+	tmc->cpuevt.cpu = cpu;
+	tmc->remote = false;
 	WRITE_ONCE(tmc->wakeup, KTIME_MAX);
 
-	/*
-	 * CPU has to handle the local events on his own, when on the way to
-	 * offline; Therefore nextevt value is set to KTIME_MAX
-	 */
-	firstexp = __tmigr_cpu_deactivate(tmc, KTIME_MAX);
-	trace_tmigr_cpu_offline(tmc);
-	raw_spin_unlock_irq(&tmc->lock);
-
-	if (firstexp != KTIME_MAX) {
-		migrator = cpumask_any_but(cpu_online_mask, cpu);
-		work_on_cpu(migrator, tmigr_trigger_active, NULL);
-	}
-
-	return 0;
+	return ret;
 }
 
 static int __init tmigr_init(void)
@@ -1793,6 +1807,11 @@ static int __init tmigr_init(void)
 		tmigr_hierarchy_levels, TMIGR_CHILDREN_PER_GROUP,
 		tmigr_crossnode_level);
 
+	ret = cpuhp_setup_state(CPUHP_AP_TMIGR_ONLINE, "tmigr:prepare",
+				tmigr_cpu_prepare, NULL);
+	if (ret)
+		goto err;
+
 	ret = cpuhp_setup_state(CPUHP_AP_TMIGR_ONLINE, "tmigr:online",
 				tmigr_cpu_online, tmigr_cpu_offline);
 	if (ret)

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [tip: timers/core] timers/migration: Do not rely always on group->parent
  2024-07-01 10:18 ` [PATCH v3 1/8] timers/migration: Do not rely always on group->parent Anna-Maria Behnsen
@ 2024-07-04 18:32   ` tip-bot2 for Anna-Maria Behnsen
  0 siblings, 0 replies; 32+ messages in thread
From: tip-bot2 for Anna-Maria Behnsen @ 2024-07-04 18:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Borislav Petkov, Anna-Maria Behnsen, Thomas Gleixner,
	Frederic Weisbecker, stable, x86, linux-kernel

The following commit has been merged into the timers/core branch of tip:

Commit-ID:     cabe7ebd57a0bfd0ef8e74f0c70423ae8d4d9171
Gitweb:        https://git.kernel.org/tip/cabe7ebd57a0bfd0ef8e74f0c70423ae8d4d9171
Author:        Anna-Maria Behnsen <anna-maria@linutronix.de>
AuthorDate:    Mon, 01 Jul 2024 12:18:37 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 04 Jul 2024 20:22:21 +02:00

timers/migration: Do not rely always on group->parent

When reading group->parent without holding the group lock it is racy
against CPUs coming online the first time and thereby creating another
level of the hierarchy. This is not a problem when this value is read once
to decide whether to abort a propagation or not. The worst outcome is an
unnecessary/early CPU wake up. But it is racy when reading it several times
during a single 'action' (like activation, deactivation, checking for
remote timer expiry,...) and relying on the consitency of this value
without holding the lock. This happens at the moment e.g. in
tmigr_inactive_up() which is also calling tmigr_udpate_events(). Code relys
on group->parent not to change during this 'action'.

Update parent struct member description to explain the above only
once. Remove parent pointer checks when they are not mandatory (like update
of data->childmask). Remove a warning, which would be nice but the trigger
of this warning is not reliable and add expand the data structure member
description instead. Expand a comment, why it is safe to rely on parent
pointer here (inside hierarchy update).

Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model")
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240701-tmigr-fixes-v3-1-25cd5de318fb@linutronix.de

---
 kernel/time/timer_migration.c | 33 +++++++++++++++------------------
 kernel/time/timer_migration.h | 12 +++++++++++-
 2 files changed, 26 insertions(+), 19 deletions(-)

diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 8441311..d91efe1 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -507,7 +507,14 @@ static void walk_groups(up_f up, void *data, struct tmigr_cpu *tmc)
  *			(get_next_timer_interrupt())
  * @firstexp:		Contains the first event expiry information when last
  *			active CPU of hierarchy is on the way to idle to make
- *			sure CPU will be back in time.
+ *			sure CPU will be back in time. It is updated in top
+ *			level group only. Be aware, there could occur a new top
+ *			level of the hierarchy between the 'top level call' in
+ *			tmigr_update_events() and the check for the parent group
+ *			in walk_groups(). Then @firstexp might contain a value
+ *			!= KTIME_MAX even if it was not the final top
+ *			level. This is not a problem, as the worst outcome is a
+ *			CPU which might wake up a little early.
  * @evt:		Pointer to tmigr_event which needs to be queued (of idle
  *			child group)
  * @childmask:		childmask of child group
@@ -649,7 +656,7 @@ static bool tmigr_active_up(struct tmigr_group *group,
 
 	} while (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state));
 
-	if ((walk_done == false) && group->parent)
+	if (walk_done == false)
 		data->childmask = group->childmask;
 
 	/*
@@ -1317,20 +1324,9 @@ static bool tmigr_inactive_up(struct tmigr_group *group,
 	/* Event Handling */
 	tmigr_update_events(group, child, data);
 
-	if (group->parent && (walk_done == false))
+	if (walk_done == false)
 		data->childmask = group->childmask;
 
-	/*
-	 * data->firstexp was set by tmigr_update_events() and contains the
-	 * expiry of the first global event which needs to be handled. It
-	 * differs from KTIME_MAX if:
-	 * - group is the top level group and
-	 * - group is idle (which means CPU was the last active CPU in the
-	 *   hierarchy) and
-	 * - there is a pending event in the hierarchy
-	 */
-	WARN_ON_ONCE(data->firstexp != KTIME_MAX && group->parent);
-
 	trace_tmigr_group_set_cpu_inactive(group, newstate, childmask);
 
 	return walk_done;
@@ -1552,10 +1548,11 @@ static void tmigr_connect_child_parent(struct tmigr_group *child,
 		data.childmask = child->childmask;
 
 		/*
-		 * There is only one new level per time. When connecting the
-		 * child and the parent and set the child active when the parent
-		 * is inactive, the parent needs to be the uppermost
-		 * level. Otherwise there went something wrong!
+		 * There is only one new level per time (which is protected by
+		 * tmigr_mutex). When connecting the child and the parent and
+		 * set the child active when the parent is inactive, the parent
+		 * needs to be the uppermost level. Otherwise there went
+		 * something wrong!
 		 */
 		WARN_ON(!tmigr_active_up(parent, child, &data) && parent->parent);
 	}
diff --git a/kernel/time/timer_migration.h b/kernel/time/timer_migration.h
index 6c37d94..494f68c 100644
--- a/kernel/time/timer_migration.h
+++ b/kernel/time/timer_migration.h
@@ -22,7 +22,17 @@ struct tmigr_event {
  * struct tmigr_group - timer migration hierarchy group
  * @lock:		Lock protecting the event information and group hierarchy
  *			information during setup
- * @parent:		Pointer to the parent group
+ * @parent:		Pointer to the parent group. Pointer is updated when a
+ *			new hierarchy level is added because of a CPU coming
+ *			online the first time. Once it is set, the pointer will
+ *			not be removed or updated. When accessing parent pointer
+ *			lock less to decide whether to abort a propagation or
+ *			not, it is not a problem. The worst outcome is an
+ *			unnecessary/early CPU wake up. But do not access parent
+ *			pointer several times in the same 'action' (like
+ *			activation, deactivation, check for remote expiry,...)
+ *			without holding the lock as it is not ensured that value
+ *			will not change.
  * @groupevt:		Next event of the group which is only used when the
  *			group is !active. The group event is then queued into
  *			the parent timer queue.

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 2/8] timers/migration: Move hierarchy setup into cpuhotplug prepare callback
  2024-07-03 20:28   ` [PATCH v4 " Anna-Maria Behnsen
  2024-07-03 21:24     ` Frederic Weisbecker
  2024-07-04 18:32     ` [tip: timers/core] " tip-bot2 for Anna-Maria Behnsen
@ 2024-07-11  8:56     ` Alexander Stein
  2024-07-15 10:39       ` Jon Hunter
  2 siblings, 1 reply; 32+ messages in thread
From: Alexander Stein @ 2024-07-11  8:56 UTC (permalink / raw)
  To: Frederic Weisbecker, Thomas Gleixner, linux-kernel,
	Anna-Maria Behnsen
  Cc: Anna-Maria Behnsen

Hi,

Am Mittwoch, 3. Juli 2024, 22:28:39 CEST schrieb Anna-Maria Behnsen:
> When a CPU comes online the first time, it is possible that a new top level
> group will be created. In general all propagation is done from the bottom
> to top. This minimizes complexity and prevents possible races. But when a
> new top level group is created, the formely top level group needs to be
> connected to the new level. This is the only time, when the direction to
> propagate changes is changed: the changes are propagated from top (new top
> level group) to bottom (formerly top level group).

Now that this commit is integrated in linux-next I'm starting to see the
kernel error message:
> Timer migration setup failed
on my arm64 i.MX8MP platform (imx8mp-tqma8mpql-mba8mpxl.dts).

Reverting the following commits
* 746770499be55cf375a108a321a818b238182446
* 2e0bd37f7b395173f879225b9d8b1811af4a8a35

* 633c77727d32ab3487a10ec8f125c26b416236ad
* 56f9a5fd69eae92d7051a228ee192452248a6bdc
* 6dbb59418c5c7b014e542db76595417c9b95ccde
* 6d8a8f54e045e2030eebb53b5ce859c80d9425f6
* 8cdb61838ee5c63556773ea2eed24deab6b15257

I could get rid of that error message.

Best regards,
Alexander

> This introduces two races (see (A) and (B)) as reported by Frederic:
> 
> (A) This race happens, when marking the formely top level group as active,
> but the last active CPU of the formerly top level group goes idle. Then
> it's likely that formerly group is no longer active, but marked
> nevertheless as active in new top level group:
> 
> 		  [GRP0:0]
> 	       migrator = 0
> 	       active   = 0
> 	       nextevt  = KTIME_MAX
> 	       /         \
> 	      0         1 .. 7
> 	  active         idle
> 
> 0) Hierarchy has for now only 8 CPUs and CPU 0 is the only active CPU.
> 
> 			     [GRP1:0]
> 			migrator = TMIGR_NONE
> 			active   = NONE
> 			nextevt  = KTIME_MAX
> 					 \
> 		 [GRP0:0]                  [GRP0:1]
> 	      migrator = 0              migrator = TMIGR_NONE
> 	      active   = 0              active   = NONE
> 	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
> 		/         \
> 	      0          1 .. 7                8
> 	  active         idle                !online
> 
> 1) CPU 8 is booting and creates a new group in first level GRP0:1 and
>    therefore also a new top group GRP1:0. For now the setup code proceeded
>    only until the connected between GRP0:1 to the new top group. The
>    connection between CPU8 and GRP0:1 is not yet established and CPU 8 is
>    still !online.
> 
> 			     [GRP1:0]
> 			migrator = TMIGR_NONE
> 			active   = NONE
> 			nextevt  = KTIME_MAX
> 		       /                  \
> 		 [GRP0:0]                  [GRP0:1]
> 	      migrator = 0              migrator = TMIGR_NONE
> 	      active   = 0              active   = NONE
> 	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
> 		/         \
> 	      0          1 .. 7                8
> 	  active         idle                !online
> 
> 2) Setup code now connects GRP0:0 to GRP1:0 and observes while in
>    tmigr_connect_child_parent() that GRP0:0 is not TMIGR_NONE. So it
>    prepares to call tmigr_active_up() on it. It hasn't done it yet.
> 
> 			     [GRP1:0]
> 			migrator = TMIGR_NONE
> 			active   = NONE
> 			nextevt  = KTIME_MAX
> 		       /                  \
> 		 [GRP0:0]                  [GRP0:1]
> 	      migrator = TMIGR_NONE        migrator = TMIGR_NONE
> 	      active   = NONE              active   = NONE
> 	      nextevt  = KTIME_MAX         nextevt  = KTIME_MAX
> 		/         \
> 	      0          1 .. 7                8
> 	    idle         idle                !online
> 
> 3) CPU 0 goes idle. Since GRP0:0->parent has been updated by CPU 8 with
>    GRP0:0->lock held, CPU 0 observes GRP1:0 after calling
>    tmigr_update_events() and it propagates the change to the top (no change
>    there and no wakeup programmed since there is no timer).
> 
> 			     [GRP1:0]
> 			migrator = GRP0:0
> 			active   = GRP0:0
> 			nextevt  = KTIME_MAX
> 		       /                  \
> 		 [GRP0:0]                  [GRP0:1]
> 	      migrator = TMIGR_NONE       migrator = TMIGR_NONE
> 	      active   = NONE             active   = NONE
> 	      nextevt  = KTIME_MAX        nextevt  = KTIME_MAX
> 		/         \
> 	      0          1 .. 7                8
> 	    idle         idle                !online
> 
> 4) Now the setup code finally calls tmigr_active_up() to and sets GRP0:0
>    active in GRP1:0
> 
> 			     [GRP1:0]
> 			migrator = GRP0:0
> 			active   = GRP0:0, GRP0:1
> 			nextevt  = KTIME_MAX
> 		       /                  \
> 		 [GRP0:0]                  [GRP0:1]
> 	      migrator = TMIGR_NONE       migrator = 8
> 	      active   = NONE             active   = 8
> 	      nextevt  = KTIME_MAX        nextevt  = KTIME_MAX
> 		/         \                    |
> 	      0          1 .. 7                8
> 	    idle         idle                active
> 
> 5) Now CPU 8 is connected with GRP0:1 and CPU 8 calls tmigr_active_up() out
>    of tmigr_cpu_online().
> 
> 			     [GRP1:0]
> 			migrator = GRP0:0
> 			active   = GRP0:0
> 			nextevt  = T8
> 		       /                  \
> 		 [GRP0:0]                  [GRP0:1]
> 	      migrator = TMIGR_NONE         migrator = TMIGR_NONE
> 	      active   = NONE               active   = NONE
> 	      nextevt  = KTIME_MAX          nextevt  = T8
> 		/         \                    |
> 	      0          1 .. 7                8
> 	    idle         idle                  idle
> 
> 5) CPU 8 goes idle with a timer T8 and relies on GRP0:0 as the migrator.
>    But it's not really active, so T8 gets ignored.
> 
> --> The update which is done in third step is not noticed by setup code. So
>     a wrong migrator is set to top level group and a timer could get
>     ignored.
> 
> (B) Reading group->parent and group->childmask when an hierarchy update is
> ongoing and reaches the formerly top level group is racy as those values
> could be inconsistent. (The notation of migrator and active now slightly
> changes in contrast to the above example, as now the childmasks are used.)
> 
> 			     [GRP1:0]
> 			migrator = TMIGR_NONE
> 			active   = 0x00
> 			nextevt  = KTIME_MAX
> 					 \
> 		 [GRP0:0]                  [GRP0:1]
> 	      migrator = TMIGR_NONE     migrator = TMIGR_NONE
> 	      active   = 0x00           active   = 0x00
> 	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
> 	      childmask= 0		childmask= 1
> 	      parent   = NULL		parent   = GRP1:0
> 		/         \
> 	      0          1 .. 7                8
> 	  idle           idle                !online
> 	  childmask=1
> 
> 1) Hierarchy has 8 CPUs. CPU 8 is at the moment in the process of onlining
>    but did not yet connect GRP0:0 to GRP1:0.
> 
> 			     [GRP1:0]
> 			migrator = TMIGR_NONE
> 			active   = 0x00
> 			nextevt  = KTIME_MAX
> 		       /                  \
> 		 [GRP0:0]                  [GRP0:1]
> 	      migrator = TMIGR_NONE     migrator = TMIGR_NONE
> 	      active   = 0x00           active   = 0x00
> 	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
> 	      childmask= 0		childmask= 1
> 	      parent   = GRP1:0		parent   = GRP1:0
> 		/         \
> 	      0          1 .. 7                8
> 	  idle           idle                !online
> 	  childmask=1
> 
> 2) Setup code (running on CPU 8) now connects GRP0:0 to GRP1:0, updates
>    parent pointer of GRP0:0 and ...
> 
> 			     [GRP1:0]
> 			migrator = TMIGR_NONE
> 			active   = 0x00
> 			nextevt  = KTIME_MAX
> 		       /                  \
> 		 [GRP0:0]                  [GRP0:1]
> 	      migrator = 0x01           migrator = TMIGR_NONE
> 	      active   = 0x01           active   = 0x00
> 	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
> 	      childmask= 0		childmask= 1
> 	      parent   = GRP1:0		parent   = GRP1:0
> 		/         \
> 	      0          1 .. 7                8
> 	  active          idle                !online
> 	  childmask=1
> 
> 	  tmigr_walk.childmask = 0
> 
> 3) ... CPU 0 comes active in the same time. As migrator in GRP0:0 was
>    TMIGR_NONE, childmask of GRP0:0 is stored in update propagation data
>    structure tmigr_walk (as update of childmask is not yet
>    visible/updated). And now ...
> 
> 			     [GRP1:0]
> 			migrator = TMIGR_NONE
> 			active   = 0x00
> 			nextevt  = KTIME_MAX
> 		       /                  \
> 		 [GRP0:0]                  [GRP0:1]
> 	      migrator = 0x01           migrator = TMIGR_NONE
> 	      active   = 0x01           active   = 0x00
> 	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
> 	      childmask= 2		childmask= 1
> 	      parent   = GRP1:0		parent   = GRP1:0
> 		/         \
> 	      0          1 .. 7                8
> 	  active          idle                !online
> 	  childmask=1
> 
> 	  tmigr_walk.childmask = 0
> 
> 4) ... childmask of GRP0:0 is updated by CPU 8 (still part of setup
>    code).
> 
> 			     [GRP1:0]
> 			migrator = 0x00
> 			active   = 0x00
> 			nextevt  = KTIME_MAX
> 		       /                  \
> 		 [GRP0:0]                  [GRP0:1]
> 	      migrator = 0x01           migrator = TMIGR_NONE
> 	      active   = 0x01           active   = 0x00
> 	      nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
> 	      childmask= 2		childmask= 1
> 	      parent   = GRP1:0		parent   = GRP1:0
> 		/         \
> 	      0          1 .. 7                8
> 	  active          idle                !online
> 	  childmask=1
> 
> 	  tmigr_walk.childmask = 0
> 
> 5) CPU 0 sees the connection to GRP1:0 and now propagates active state to
>    GRP1:0 but with childmask = 0 as stored in propagation data structure.
> 
> --> Now GRP1:0 always has a migrator as 0x00 != TMIGR_NONE and for all CPUs
>     it looks like GRP1:0 is always active.
> 
> To prevent those races, the setup of the hierarchy is moved into the
> cpuhotplug prepare callback. The prepare callback is not executed by the
> CPU which will come online, it is executed by the CPU which prepares
> onlining of the other CPU. This CPU is active while it is connecting the
> formerly top level to the new one. This prevents from (A) to happen and it
> also prevents from any further walk above the formerly top level until that
> active CPU becomes inactive, releasing the new ->parent and ->childmask
> updates to be visible by any subsequent walk up above the formerly top
> level hierarchy. This prevents from (B) to happen. The direction for the
> updates is now forced to look like "from bottom to top".
> 
> However if the active CPU prevents from tmigr_cpu_(in)active() to walk up
> with the update not-or-half visible, nothing prevents walking up to the new
> top with a 0 childmask in tmigr_handle_remote_up() or
> tmigr_requires_handle_remote_up() if the active CPU doing the prepare is
> not the migrator. But then it looks fine because:
> 
>   * tmigr_check_migrator() should just return false
>   * The migrator is active and should eventually observe the new childmask
>     at some point in a future tick.
> 
> Split setup functionality of online callback into the cpuhotplug prepare
> callback and setup hotplug state. Reorder the code, that all prepare
> related functions are close to each other and online and offline callbacks
> are also close together.
> 
> Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model")
> Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
> ---
>  include/linux/cpuhotplug.h    |   1 +
>  kernel/time/timer_migration.c | 181 +++++++++++++++++++---------------
>  2 files changed, 101 insertions(+), 81 deletions(-)
> 
> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> index 7a5785f405b6..df59666a2a66 100644
> --- a/include/linux/cpuhotplug.h
> +++ b/include/linux/cpuhotplug.h
> @@ -122,6 +122,7 @@ enum cpuhp_state {
>  	CPUHP_KVM_PPC_BOOK3S_PREPARE,
>  	CPUHP_ZCOMP_PREPARE,
>  	CPUHP_TIMERS_PREPARE,
> +	CPUHP_TMIGR_PREPARE,
>  	CPUHP_MIPS_SOC_PREPARE,
>  	CPUHP_BP_PREPARE_DYN,
>  	CPUHP_BP_PREPARE_DYN_END		= CPUHP_BP_PREPARE_DYN + 20,
> diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
> index d91efe1dc3bf..9b86efded4d5 100644
> --- a/kernel/time/timer_migration.c
> +++ b/kernel/time/timer_migration.c
> @@ -1438,6 +1438,66 @@ u64 tmigr_quick_check(u64 nextevt)
>  	return KTIME_MAX;
>  }
>  
> +/*
> + * tmigr_trigger_active() - trigger a CPU to become active again
> + *
> + * This function is executed on a CPU which is part of cpu_online_mask, when the
> + * last active CPU in the hierarchy is offlining. With this, it is ensured that
> + * the other CPU is active and takes over the migrator duty.
> + */
> +static long tmigr_trigger_active(void *unused)
> +{
> +	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
> +
> +	WARN_ON_ONCE(!tmc->online || tmc->idle);
> +
> +	return 0;
> +}
> +
> +static int tmigr_cpu_offline(unsigned int cpu)
> +{
> +	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
> +	int migrator;
> +	u64 firstexp;
> +
> +	raw_spin_lock_irq(&tmc->lock);
> +	tmc->online = false;
> +	WRITE_ONCE(tmc->wakeup, KTIME_MAX);
> +
> +	/*
> +	 * CPU has to handle the local events on his own, when on the way to
> +	 * offline; Therefore nextevt value is set to KTIME_MAX
> +	 */
> +	firstexp = __tmigr_cpu_deactivate(tmc, KTIME_MAX);
> +	trace_tmigr_cpu_offline(tmc);
> +	raw_spin_unlock_irq(&tmc->lock);
> +
> +	if (firstexp != KTIME_MAX) {
> +		migrator = cpumask_any_but(cpu_online_mask, cpu);
> +		work_on_cpu(migrator, tmigr_trigger_active, NULL);
> +	}
> +
> +	return 0;
> +}
> +
> +static int tmigr_cpu_online(unsigned int cpu)
> +{
> +	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
> +
> +	/* Check whether CPU data was successfully initialized */
> +	if (WARN_ON_ONCE(!tmc->tmgroup))
> +		return -EINVAL;
> +
> +	raw_spin_lock_irq(&tmc->lock);
> +	trace_tmigr_cpu_online(tmc);
> +	tmc->idle = timer_base_is_idle();
> +	if (!tmc->idle)
> +		__tmigr_cpu_activate(tmc);
> +	tmc->online = true;
> +	raw_spin_unlock_irq(&tmc->lock);
> +	return 0;
> +}
> +
>  static void tmigr_init_group(struct tmigr_group *group, unsigned int lvl,
>  			     int node)
>  {
> @@ -1512,7 +1572,7 @@ static struct tmigr_group *tmigr_get_group(unsigned int cpu, int node,
>  static void tmigr_connect_child_parent(struct tmigr_group *child,
>  				       struct tmigr_group *parent)
>  {
> -	union tmigr_state childstate;
> +	struct tmigr_walk data;
>  
>  	raw_spin_lock_irq(&child->lock);
>  	raw_spin_lock_nested(&parent->lock, SINGLE_DEPTH_NESTING);
> @@ -1540,22 +1600,24 @@ static void tmigr_connect_child_parent(struct tmigr_group *child,
>  	 *   child to the new parent. So tmigr_connect_child_parent() is
>  	 *   executed with the formerly top level group (child) and the newly
>  	 *   created group (parent).
> +	 *
> +	 * * It is ensured that the child is active, as this setup path is
> +	 *   executed in hotplug prepare callback. This is exectued by an
> +	 *   already connected and !idle CPU. Even if all other CPUs go idle,
> +	 *   the CPU executing the setup will be responsible up to current top
> +	 *   level group. And the next time it goes inactive, it will release
> +	 *   the new childmask and parent to subsequent walkers through this
> +	 *   @child. Therefore propagate active state unconditionally.
>  	 */
> -	childstate.state = atomic_read(&child->migr_state);
> -	if (childstate.migrator != TMIGR_NONE) {
> -		struct tmigr_walk data;
> -
> -		data.childmask = child->childmask;
> +	data.childmask = child->childmask;
>  
> -		/*
> -		 * There is only one new level per time (which is protected by
> -		 * tmigr_mutex). When connecting the child and the parent and
> -		 * set the child active when the parent is inactive, the parent
> -		 * needs to be the uppermost level. Otherwise there went
> -		 * something wrong!
> -		 */
> -		WARN_ON(!tmigr_active_up(parent, child, &data) && parent->parent);
> -	}
> +	/*
> +	 * There is only one new level per time (which is protected by
> +	 * tmigr_mutex). When connecting the child and the parent and set the
> +	 * child active when the parent is inactive, the parent needs to be the
> +	 * uppermost level. Otherwise there went something wrong!
> +	 */
> +	WARN_ON(!tmigr_active_up(parent, child, &data) && parent->parent);
>  }
>  
>  static int tmigr_setup_groups(unsigned int cpu, unsigned int node)
> @@ -1661,80 +1723,32 @@ static int tmigr_add_cpu(unsigned int cpu)
>  	return ret;
>  }
>  
> -static int tmigr_cpu_online(unsigned int cpu)
> +static int tmigr_cpu_prepare(unsigned int cpu)
>  {
> -	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
> -	int ret;
> -
> -	/* First online attempt? Initialize CPU data */
> -	if (!tmc->tmgroup) {
> -		raw_spin_lock_init(&tmc->lock);
> -
> -		ret = tmigr_add_cpu(cpu);
> -		if (ret < 0)
> -			return ret;
> -
> -		if (tmc->childmask == 0)
> -			return -EINVAL;
> +	struct tmigr_cpu *tmc = per_cpu_ptr(&tmigr_cpu, cpu);
> +	int ret = 0;
>  
> -		timerqueue_init(&tmc->cpuevt.nextevt);
> -		tmc->cpuevt.nextevt.expires = KTIME_MAX;
> -		tmc->cpuevt.ignore = true;
> -		tmc->cpuevt.cpu = cpu;
> -
> -		tmc->remote = false;
> -		WRITE_ONCE(tmc->wakeup, KTIME_MAX);
> -	}
> -	raw_spin_lock_irq(&tmc->lock);
> -	trace_tmigr_cpu_online(tmc);
> -	tmc->idle = timer_base_is_idle();
> -	if (!tmc->idle)
> -		__tmigr_cpu_activate(tmc);
> -	tmc->online = true;
> -	raw_spin_unlock_irq(&tmc->lock);
> -	return 0;
> -}
> -
> -/*
> - * tmigr_trigger_active() - trigger a CPU to become active again
> - *
> - * This function is executed on a CPU which is part of cpu_online_mask, when the
> - * last active CPU in the hierarchy is offlining. With this, it is ensured that
> - * the other CPU is active and takes over the migrator duty.
> - */
> -static long tmigr_trigger_active(void *unused)
> -{
> -	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
> +	/* Not first online attempt? */
> +	if (tmc->tmgroup)
> +		return ret;
>  
> -	WARN_ON_ONCE(!tmc->online || tmc->idle);
> +	raw_spin_lock_init(&tmc->lock);
>  
> -	return 0;
> -}
> +	ret = tmigr_add_cpu(cpu);
> +	if (ret < 0)
> +		return ret;
>  
> -static int tmigr_cpu_offline(unsigned int cpu)
> -{
> -	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
> -	int migrator;
> -	u64 firstexp;
> +	if (tmc->childmask == 0)
> +		return -EINVAL;
>  
> -	raw_spin_lock_irq(&tmc->lock);
> -	tmc->online = false;
> +	timerqueue_init(&tmc->cpuevt.nextevt);
> +	tmc->cpuevt.nextevt.expires = KTIME_MAX;
> +	tmc->cpuevt.ignore = true;
> +	tmc->cpuevt.cpu = cpu;
> +	tmc->remote = false;
>  	WRITE_ONCE(tmc->wakeup, KTIME_MAX);
>  
> -	/*
> -	 * CPU has to handle the local events on his own, when on the way to
> -	 * offline; Therefore nextevt value is set to KTIME_MAX
> -	 */
> -	firstexp = __tmigr_cpu_deactivate(tmc, KTIME_MAX);
> -	trace_tmigr_cpu_offline(tmc);
> -	raw_spin_unlock_irq(&tmc->lock);
> -
> -	if (firstexp != KTIME_MAX) {
> -		migrator = cpumask_any_but(cpu_online_mask, cpu);
> -		work_on_cpu(migrator, tmigr_trigger_active, NULL);
> -	}
> -
> -	return 0;
> +	return ret;
>  }
>  
>  static int __init tmigr_init(void)
> @@ -1793,6 +1807,11 @@ static int __init tmigr_init(void)
>  		tmigr_hierarchy_levels, TMIGR_CHILDREN_PER_GROUP,
>  		tmigr_crossnode_level);
>  
> +	ret = cpuhp_setup_state(CPUHP_AP_TMIGR_ONLINE, "tmigr:prepare",
> +				tmigr_cpu_prepare, NULL);
> +	if (ret)
> +		goto err;
> +
>  	ret = cpuhp_setup_state(CPUHP_AP_TMIGR_ONLINE, "tmigr:online",
>  				tmigr_cpu_online, tmigr_cpu_offline);
>  	if (ret)
> 


-- 
TQ-Systems GmbH | Mühlstraße 2, Gut Delling | 82229 Seefeld, Germany
Amtsgericht München, HRB 105018
Geschäftsführer: Detlef Schneider, Rüdiger Stahl, Stefan Schneider
http://www.tq-group.com/



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 0/8] timers/migration: Fix three possible races and some improvements
  2024-07-01 10:18 [PATCH v3 0/8] timers/migration: Fix three possible races and some improvements Anna-Maria Behnsen
                   ` (7 preceding siblings ...)
  2024-07-01 10:18 ` [PATCH v3 8/8] timers/migration: Fix grammar in comment Anna-Maria Behnsen
@ 2024-07-11 15:44 ` Anna-Maria Behnsen
  8 siblings, 0 replies; 32+ messages in thread
From: Anna-Maria Behnsen @ 2024-07-11 15:44 UTC (permalink / raw)
  To: Frederic Weisbecker, Thomas Gleixner, linux-kernel
  Cc: Borislav Petkov, Oliver Sang

Hi,

(cc Oliver Sang)

Anna-Maria Behnsen <anna-maria@linutronix.de> writes:

> Borislav reported a warning in timer migration deactive path
>
>   https://lore.kernel.org/r/20240612090347.GBZmlkc5PwlVpOG6vT@fat_crate.local
>
> Sadly it doesn't reproduce directly. But with the change of timing (by
> adding a trace prinkt before the warning), it is possible to trigger the
> warning reliable at least in my test setup. The problem here is a racy
> check agains group->parent pointer. This is also used in other places in
> the code and fixing this racy usage is adressed by the first patch.
>
> There were two other races reported by Frederic in setup path:
>
>   https://lore.kernel.org/r/ZnWOswTMML6ShzYO@localhost.localdomain
>
>   https://lore.kernel.org/r/ZnoIlO22habOyQRe@lothringen
>
> Those races are both is addressed by the change of patch 2.
>
> Some updates/cleanups are provided by patch 3-8. ("timers/migration:
> Improve tracing" and "timers/migration: Spare write when nothing changed"
> are the same as provided by v2).
>
> Patches are available here:
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/anna-maria/linux-devel.git timers/misc
>

Thomas, please remove this queue when possible from
tip/timers/urgent. There are some things broken and needs to be
fixed. Otherwise we get a Fixes-Fixes-Patch. See report of kernel test
robot:

  https://lore.kernel.org/r/202407101636.d9d4e8be-oliver.sang@intel.com

Two main problems are:
 - wrong CPU hotplug state is used for prepare in cpuhp_setup_state()
 - using this_cpu_ptr() instead of per_cpu_ptr()

Working on preparation of v4.

Thanks,

	Anna-Maria

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 2/8] timers/migration: Move hierarchy setup into cpuhotplug prepare callback
  2024-07-11  8:56     ` [PATCH v4 2/8] " Alexander Stein
@ 2024-07-15 10:39       ` Jon Hunter
  2024-07-15 10:44         ` Frederic Weisbecker
  0 siblings, 1 reply; 32+ messages in thread
From: Jon Hunter @ 2024-07-15 10:39 UTC (permalink / raw)
  To: Alexander Stein, Frederic Weisbecker, Thomas Gleixner,
	linux-kernel, Anna-Maria Behnsen
  Cc: linux-tegra@vger.kernel.org


On 11/07/2024 09:56, Alexander Stein wrote:
> Hi,
> 
> Am Mittwoch, 3. Juli 2024, 22:28:39 CEST schrieb Anna-Maria Behnsen:
>> When a CPU comes online the first time, it is possible that a new top level
>> group will be created. In general all propagation is done from the bottom
>> to top. This minimizes complexity and prevents possible races. But when a
>> new top level group is created, the formely top level group needs to be
>> connected to the new level. This is the only time, when the direction to
>> propagate changes is changed: the changes are propagated from top (new top
>> level group) to bottom (formerly top level group).
> 
> Now that this commit is integrated in linux-next I'm starting to see the
> kernel error message:
>> Timer migration setup failed
> on my arm64 i.MX8MP platform (imx8mp-tqma8mpql-mba8mpxl.dts).


I am also seeing the same on our 32-bit and 64-bit ARM Tegra boards.

Any feedback on this?

Thanks
Jon

-- 
nvpublic

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 2/8] timers/migration: Move hierarchy setup into cpuhotplug prepare callback
  2024-07-15 10:39       ` Jon Hunter
@ 2024-07-15 10:44         ` Frederic Weisbecker
  2024-07-15 13:25           ` Jon Hunter
  0 siblings, 1 reply; 32+ messages in thread
From: Frederic Weisbecker @ 2024-07-15 10:44 UTC (permalink / raw)
  To: Jon Hunter
  Cc: Alexander Stein, Thomas Gleixner, linux-kernel,
	Anna-Maria Behnsen, linux-tegra@vger.kernel.org

Hi,

Le Mon, Jul 15, 2024 at 11:39:40AM +0100, Jon Hunter a écrit :
> 
> On 11/07/2024 09:56, Alexander Stein wrote:
> > Hi,
> > 
> > Am Mittwoch, 3. Juli 2024, 22:28:39 CEST schrieb Anna-Maria Behnsen:
> > > When a CPU comes online the first time, it is possible that a new top level
> > > group will be created. In general all propagation is done from the bottom
> > > to top. This minimizes complexity and prevents possible races. But when a
> > > new top level group is created, the formely top level group needs to be
> > > connected to the new level. This is the only time, when the direction to
> > > propagate changes is changed: the changes are propagated from top (new top
> > > level group) to bottom (formerly top level group).
> > 
> > Now that this commit is integrated in linux-next I'm starting to see the
> > kernel error message:
> > > Timer migration setup failed
> > on my arm64 i.MX8MP platform (imx8mp-tqma8mpql-mba8mpxl.dts).
> 
> 
> I am also seeing the same on our 32-bit and 64-bit ARM Tegra boards.
> 
> Any feedback on this?

The patches have been reverted on -tip (still in -next?) and Anna-Maria is
working on a new version.

Thanks!



> 
> Thanks
> Jon
> 
> -- 
> nvpublic

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 2/8] timers/migration: Move hierarchy setup into cpuhotplug prepare callback
  2024-07-15 10:44         ` Frederic Weisbecker
@ 2024-07-15 13:25           ` Jon Hunter
  0 siblings, 0 replies; 32+ messages in thread
From: Jon Hunter @ 2024-07-15 13:25 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Alexander Stein, Thomas Gleixner, linux-kernel,
	Anna-Maria Behnsen, linux-tegra@vger.kernel.org


On 15/07/2024 11:44, Frederic Weisbecker wrote:
> Hi,
> 
> Le Mon, Jul 15, 2024 at 11:39:40AM +0100, Jon Hunter a écrit :
>>
>> On 11/07/2024 09:56, Alexander Stein wrote:
>>> Hi,
>>>
>>> Am Mittwoch, 3. Juli 2024, 22:28:39 CEST schrieb Anna-Maria Behnsen:
>>>> When a CPU comes online the first time, it is possible that a new top level
>>>> group will be created. In general all propagation is done from the bottom
>>>> to top. This minimizes complexity and prevents possible races. But when a
>>>> new top level group is created, the formely top level group needs to be
>>>> connected to the new level. This is the only time, when the direction to
>>>> propagate changes is changed: the changes are propagated from top (new top
>>>> level group) to bottom (formerly top level group).
>>>
>>> Now that this commit is integrated in linux-next I'm starting to see the
>>> kernel error message:
>>>> Timer migration setup failed
>>> on my arm64 i.MX8MP platform (imx8mp-tqma8mpql-mba8mpxl.dts).
>>
>>
>> I am also seeing the same on our 32-bit and 64-bit ARM Tegra boards.
>>
>> Any feedback on this?
> 
> The patches have been reverted on -tip (still in -next?) and Anna-Maria is
> working on a new version.


Thanks! They were in next-20240715 looking at this mornings tests.

Jon

-- 
nvpublic

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3 5/8] timers/migration: Read childmask and parent pointer in a single place
  2024-07-01 10:18 ` [PATCH v3 5/8] timers/migration: Read childmask and parent pointer in a single place Anna-Maria Behnsen
  2024-07-02 12:04   ` Frederic Weisbecker
  2024-07-04 18:32   ` [tip: timers/core] " tip-bot2 for Anna-Maria Behnsen
@ 2024-07-15 19:28   ` Frederic Weisbecker
  2 siblings, 0 replies; 32+ messages in thread
From: Frederic Weisbecker @ 2024-07-15 19:28 UTC (permalink / raw)
  To: Anna-Maria Behnsen; +Cc: Thomas Gleixner, linux-kernel

Le Mon, Jul 01, 2024 at 12:18:41PM +0200, Anna-Maria Behnsen a écrit :
> Reading the childmask and parent pointer is required when propagating
> changes through the hierarchy. At the moment this reads are spread all over
> the place which makes it harder to follow.
> 
> Move those reads to a single place to keep code clean.
> 
> Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
> ---
>  kernel/time/timer_migration.c | 20 +++++---------------
>  1 file changed, 5 insertions(+), 15 deletions(-)
> 
> diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
> index b4391abfb4a9..a681cf89910e 100644
> --- a/kernel/time/timer_migration.c
> +++ b/kernel/time/timer_migration.c
> @@ -535,6 +535,7 @@ static void __walk_groups(up_f up, struct tmigr_walk *data,
>  
>  		child = group;
>  		group = group->parent;
> +		data->childmask = group->childmask;

Should be "data->childmask = child->childmask;" ?

>  	} while (group);
>  }

Thanks.

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2024-07-15 19:28 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-07-01 10:18 [PATCH v3 0/8] timers/migration: Fix three possible races and some improvements Anna-Maria Behnsen
2024-07-01 10:18 ` [PATCH v3 1/8] timers/migration: Do not rely always on group->parent Anna-Maria Behnsen
2024-07-04 18:32   ` [tip: timers/core] " tip-bot2 for Anna-Maria Behnsen
2024-07-01 10:18 ` [PATCH v3 2/8] timers/migration: Move hierarchy setup into cpuhotplug prepare callback Anna-Maria Behnsen
2024-07-01 21:49   ` Frederic Weisbecker
2024-07-03 20:28   ` [PATCH v4 " Anna-Maria Behnsen
2024-07-03 21:24     ` Frederic Weisbecker
2024-07-04 18:32     ` [tip: timers/core] " tip-bot2 for Anna-Maria Behnsen
2024-07-11  8:56     ` [PATCH v4 2/8] " Alexander Stein
2024-07-15 10:39       ` Jon Hunter
2024-07-15 10:44         ` Frederic Weisbecker
2024-07-15 13:25           ` Jon Hunter
2024-07-01 10:18 ` [PATCH v3 3/8] timers/migration: Improve tracing Anna-Maria Behnsen
2024-07-04 18:32   ` [tip: timers/core] " tip-bot2 for Anna-Maria Behnsen
2024-07-01 10:18 ` [PATCH v3 4/8] timers/migration: Use a single struct for hierarchy walk data Anna-Maria Behnsen
2024-07-02 11:43   ` Frederic Weisbecker
2024-07-04 18:32   ` [tip: timers/core] " tip-bot2 for Anna-Maria Behnsen
2024-07-01 10:18 ` [PATCH v3 5/8] timers/migration: Read childmask and parent pointer in a single place Anna-Maria Behnsen
2024-07-02 12:04   ` Frederic Weisbecker
2024-07-04 18:32   ` [tip: timers/core] " tip-bot2 for Anna-Maria Behnsen
2024-07-15 19:28   ` [PATCH v3 5/8] " Frederic Weisbecker
2024-07-01 10:18 ` [PATCH v3 6/8] timers/migration: Rename childmask by parentmask to make naming more obvious Anna-Maria Behnsen
2024-07-02 12:45   ` Frederic Weisbecker
2024-07-03 20:31   ` [PATCH v4 6/8] timers/migration: Rename childmask by groupmask " Anna-Maria Behnsen
2024-07-03 21:33     ` Frederic Weisbecker
2024-07-04 18:32     ` [tip: timers/core] " tip-bot2 for Anna-Maria Behnsen
2024-07-01 10:18 ` [PATCH v3 7/8] timers/migration: Spare write when nothing changed Anna-Maria Behnsen
2024-07-04 18:32   ` [tip: timers/core] " tip-bot2 for Anna-Maria Behnsen
2024-07-01 10:18 ` [PATCH v3 8/8] timers/migration: Fix grammar in comment Anna-Maria Behnsen
2024-07-02 12:51   ` Frederic Weisbecker
2024-07-04 18:32   ` [tip: timers/core] " tip-bot2 for Anna-Maria Behnsen
2024-07-11 15:44 ` [PATCH v3 0/8] timers/migration: Fix three possible races and some improvements Anna-Maria Behnsen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox