public inbox for stable@vger.kernel.org
* [PATCH v2 6.12.y 0/7] timers/migration: Backport fixes and cleanups from mainline
@ 2026-03-21 10:24 Ionut Nechita (Wind River)
  2026-03-21 10:24 ` [PATCH v2 6.12.y 1/7] timer/migration: Fix kernel-doc warnings for union tmigr_state Ionut Nechita (Wind River)
                   ` (6 more replies)
  0 siblings, 7 replies; 10+ messages in thread
From: Ionut Nechita (Wind River) @ 2026-03-21 10:24 UTC (permalink / raw)
  To: stable
  Cc: frederic, tglx, linux-kernel, rdunlap, ptesarik,
	Ionut Nechita (Wind River)

This series backports 7 upstream commits to the timer migration
subsystem for linux-6.12.y stable.

The most important patch is 7/7 which fixes imbalanced NUMA trees in
the timer migration hierarchy (Fixes: 7ee988770326). Patches 5/7 and
6/7 are its stable dependencies. The remaining patches are cleanups
and annotations that complete the backport.

After applying this series, the touched functions in
kernel/time/timer_migration.c and kernel/time/timer_migration.h
match their mainline counterparts exactly.

All patches are clean cherry-picks with no conflicts.

Changes since v1:
  - Added upstream commit IDs to each patch (Greg KH)

Upstream commits:
  4477b0601471 ("timer/migration: Fix kernel-doc warnings for union tmigr_state")
  922efd298bb2 ("timers/migration: Annotate accesses to ignore flag")
  dcf6230555dc ("timers/migration: Simplify top level detection on group setup")
  ff56a3e2a861 ("timers/migration: Clean up the loop in tmigr_quick_check()")
  6c181b5667ee ("timers/migration: Convert "while" loops to use "for"")
  fa9620355d41 ("timers/migration: Remove locking on group connection")
  5eb579dfd46b ("timers/migration: Fix imbalanced NUMA trees")

Frederic Weisbecker (5):
  timers/migration: Annotate accesses to ignore flag
  timers/migration: Simplify top level detection on group setup
  timers/migration: Convert "while" loops to use "for"
  timers/migration: Remove locking on group connection
  timers/migration: Fix imbalanced NUMA trees

Petr Tesarik (1):
  timers/migration: Clean up the loop in tmigr_quick_check()

Randy Dunlap (1):
  timer/migration: Fix kernel-doc warnings for union tmigr_state

 kernel/time/timer_migration.c | 296 ++++++++++++++++++----------------
 kernel/time/timer_migration.h |  21 ++-
 2 files changed, 167 insertions(+), 150 deletions(-)

-- 
2.53.0


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v2 6.12.y 1/7] timer/migration: Fix kernel-doc warnings for union tmigr_state
  2026-03-21 10:24 [PATCH v2 6.12.y 0/7] timers/migration: Backport fixes and cleanups from mainline Ionut Nechita (Wind River)
@ 2026-03-21 10:24 ` Ionut Nechita (Wind River)
  2026-03-21 10:24 ` [PATCH v2 6.12.y 2/7] timers/migration: Annotate accesses to ignore flag Ionut Nechita (Wind River)
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Ionut Nechita (Wind River) @ 2026-03-21 10:24 UTC (permalink / raw)
  To: stable; +Cc: frederic, tglx, linux-kernel, rdunlap, ptesarik, Ionut Nechita

From: Ionut Nechita <ionut.nechita@windriver.com>

From: Randy Dunlap <rdunlap@infradead.org>

commit 4477b0601471ba4fc67501b62b78aebd327fefd7 upstream.

Use the correct kernel-doc notation for nested structs/unions to
eliminate warnings:

timer_migration.h:119: warning: Incorrect use of kernel-doc format:          * struct - split state of tmigr_group
timer_migration.h:134: warning: Function parameter or struct member 'active' not described in 'tmigr_state'
timer_migration.h:134: warning: Function parameter or struct member 'migrator' not described in 'tmigr_state'
timer_migration.h:134: warning: Function parameter or struct member 'seq' not described in 'tmigr_state'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250111063156.910903-1-rdunlap@infradead.org
---
 kernel/time/timer_migration.h | 21 +++++++++------------
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/kernel/time/timer_migration.h b/kernel/time/timer_migration.h
index 154accc7a543c..ae19f70f8170f 100644
--- a/kernel/time/timer_migration.h
+++ b/kernel/time/timer_migration.h
@@ -110,22 +110,19 @@ struct tmigr_cpu {
  * union tmigr_state - state of tmigr_group
  * @state:	Combined version of the state - only used for atomic
  *		read/cmpxchg function
- * @struct:	Split version of the state - only use the struct members to
+ * &anon struct: Split version of the state - only use the struct members to
  *		update information to stay independent of endianness
+ * @active:	Contains each mask bit of the active children
+ * @migrator:	Contains mask of the child which is migrator
+ * @seq:	Sequence counter needs to be increased when an update
+ *		to the tmigr_state is done. It prevents a race when
+ *		updates in the child groups are propagated in changed
+ *		order. Detailed information about the scenario is
+ *		given in the documentation at the begin of
+ *		timer_migration.c.
  */
 union tmigr_state {
 	u32 state;
-	/**
-	 * struct - split state of tmigr_group
-	 * @active:	Contains each mask bit of the active children
-	 * @migrator:	Contains mask of the child which is migrator
-	 * @seq:	Sequence counter needs to be increased when an update
-	 *		to the tmigr_state is done. It prevents a race when
-	 *		updates in the child groups are propagated in changed
-	 *		order. Detailed information about the scenario is
-	 *		given in the documentation at the begin of
-	 *		timer_migration.c.
-	 */
 	struct {
 		u8	active;
 		u8	migrator;
-- 
2.53.0



* [PATCH v2 6.12.y 2/7] timers/migration: Annotate accesses to ignore flag
  2026-03-21 10:24 [PATCH v2 6.12.y 0/7] timers/migration: Backport fixes and cleanups from mainline Ionut Nechita (Wind River)
  2026-03-21 10:24 ` [PATCH v2 6.12.y 1/7] timer/migration: Fix kernel-doc warnings for union tmigr_state Ionut Nechita (Wind River)
@ 2026-03-21 10:24 ` Ionut Nechita (Wind River)
  2026-03-21 11:03   ` Greg KH
  2026-03-21 10:24 ` [PATCH v2 6.12.y 3/7] timers/migration: Simplify top level detection on group setup Ionut Nechita (Wind River)
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 10+ messages in thread
From: Ionut Nechita (Wind River) @ 2026-03-21 10:24 UTC (permalink / raw)
  To: stable
  Cc: frederic, tglx, linux-kernel, rdunlap, ptesarik, Ionut Nechita,
	kernel test robot

From: Ionut Nechita <ionut.nechita@windriver.com>

From: Frederic Weisbecker <frederic@kernel.org>

commit 922efd298bb2636880408c00942dbd54d8bf6e0d upstream.

The group's ignore flag is:

_ read under the group's lock (idle entry, remote expiry)
_ turned on/off under the group's lock (idle entry, remote expiry)
_ turned on locklessly on idle exit

When idle entry or remote expiry clear the "ignore" flag of a group, the
operation must be synchronized against other concurrent idle entry or
remote expiry to make sure the related group timer is never missed. To
enforce this synchronization, both "ignore" clear and read are
performed under the group lock.

On the contrary, whether idle entry or remote expiry manage to observe
the "ignore" flag turned on by a CPU exiting idle is a matter of
optimization. If that flag set is missed or cleared concurrently, the
worst outcome is a migrator wasting time remotely handling a "ghost"
timer. This is why the ignore flag can be set locklessly.

Unfortunately, the related lockless accesses are bare and miss
appropriate annotations. KCSAN rightfully complains:

		 BUG: KCSAN: data-race in __tmigr_cpu_activate / print_report

		 write to 0xffff88842fc28004 of 1 bytes by task 0 on cpu 0:
		 __tmigr_cpu_activate
		 tmigr_cpu_activate
		 timer_clear_idle
		 tick_nohz_restart_sched_tick
		 tick_nohz_idle_exit
		 do_idle
		 cpu_startup_entry
		 kernel_init
		 do_initcalls
		 clear_bss
		 reserve_bios_regions
		 common_startup_64

		 read to 0xffff88842fc28004 of 1 bytes by task 0 on cpu 1:
		 print_report
		 kcsan_report_known_origin
		 kcsan_setup_watchpoint
		 tmigr_next_groupevt
		 tmigr_update_events
		 tmigr_inactive_up
		 __walk_groups+0x50/0x77
		 walk_groups
		 __tmigr_cpu_deactivate
		 tmigr_cpu_deactivate
		 __get_next_timer_interrupt
		 timer_base_try_to_set_idle
		 tick_nohz_stop_tick
		 tick_nohz_idle_stop_tick
		 cpuidle_idle_call
		 do_idle

Although the relevant accesses could be marked as data_race(), the
"ignore" flag being read several times within the same
tmigr_update_events() function is confusing and error prone. Prefer
reading it once in that function and make use of similar/paired accesses
elsewhere with appropriate comments when necessary.

Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250114231507.21672-4-frederic@kernel.org
Closes: https://lore.kernel.org/oe-lkp/202501031612.62e0c498-lkp@intel.com
---
 kernel/time/timer_migration.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 72538baa7a1fb..0707f1ef05f7e 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -569,7 +569,7 @@ static struct tmigr_event *tmigr_next_groupevt(struct tmigr_group *group)
 	while ((node = timerqueue_getnext(&group->events))) {
 		evt = container_of(node, struct tmigr_event, nextevt);
 
-		if (!evt->ignore) {
+		if (!READ_ONCE(evt->ignore)) {
 			WRITE_ONCE(group->next_expiry, evt->nextevt.expires);
 			return evt;
 		}
@@ -665,7 +665,7 @@ static bool tmigr_active_up(struct tmigr_group *group,
 	 * lock is held while updating the ignore flag in idle path. So this
 	 * state change will not be lost.
 	 */
-	group->groupevt.ignore = true;
+	WRITE_ONCE(group->groupevt.ignore, true);
 
 	return walk_done;
 }
@@ -726,6 +726,7 @@ bool tmigr_update_events(struct tmigr_group *group, struct tmigr_group *child,
 	union tmigr_state childstate, groupstate;
 	bool remote = data->remote;
 	bool walk_done = false;
+	bool ignore;
 	u64 nextexp;
 
 	if (child) {
@@ -744,11 +745,19 @@ bool tmigr_update_events(struct tmigr_group *group, struct tmigr_group *child,
 		nextexp = child->next_expiry;
 		evt = &child->groupevt;
 
-		evt->ignore = (nextexp == KTIME_MAX) ? true : false;
+		/*
+		 * This can race with concurrent idle exit (activate).
+		 * If the current writer wins, a useless remote expiration may
+		 * be scheduled. If the activate wins, the event is properly
+		 * ignored.
+		 */
+		ignore = (nextexp == KTIME_MAX) ? true : false;
+		WRITE_ONCE(evt->ignore, ignore);
 	} else {
 		nextexp = data->nextexp;
 
 		first_childevt = evt = data->evt;
+		ignore = evt->ignore;
 
 		/*
 		 * Walking the hierarchy is required in any case when a
@@ -774,7 +783,7 @@ bool tmigr_update_events(struct tmigr_group *group, struct tmigr_group *child,
 		 * first event information of the group is updated properly and
 		 * also handled properly, so skip this fast return path.
 		 */
-		if (evt->ignore && !remote && group->parent)
+		if (ignore && !remote && group->parent)
 			return true;
 
 		raw_spin_lock(&group->lock);
@@ -788,7 +797,7 @@ bool tmigr_update_events(struct tmigr_group *group, struct tmigr_group *child,
 	 * queue when the expiry time changed only or when it could be ignored.
 	 */
 	if (timerqueue_node_queued(&evt->nextevt)) {
-		if ((evt->nextevt.expires == nextexp) && !evt->ignore) {
+		if ((evt->nextevt.expires == nextexp) && !ignore) {
 			/* Make sure not to miss a new CPU event with the same expiry */
 			evt->cpu = first_childevt->cpu;
 			goto check_toplvl;
@@ -798,7 +807,7 @@ bool tmigr_update_events(struct tmigr_group *group, struct tmigr_group *child,
 			WRITE_ONCE(group->next_expiry, KTIME_MAX);
 	}
 
-	if (evt->ignore) {
+	if (ignore) {
 		/*
 		 * When the next child event could be ignored (nextexp is
 		 * KTIME_MAX) and there was no remote timer handling before or
-- 
2.53.0



* [PATCH v2 6.12.y 3/7] timers/migration: Simplify top level detection on group setup
  2026-03-21 10:24 [PATCH v2 6.12.y 0/7] timers/migration: Backport fixes and cleanups from mainline Ionut Nechita (Wind River)
  2026-03-21 10:24 ` [PATCH v2 6.12.y 1/7] timer/migration: Fix kernel-doc warnings for union tmigr_state Ionut Nechita (Wind River)
  2026-03-21 10:24 ` [PATCH v2 6.12.y 2/7] timers/migration: Annotate accesses to ignore flag Ionut Nechita (Wind River)
@ 2026-03-21 10:24 ` Ionut Nechita (Wind River)
  2026-03-21 10:24 ` [PATCH v2 6.12.y 4/7] timers/migration: Clean up the loop in tmigr_quick_check() Ionut Nechita (Wind River)
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Ionut Nechita (Wind River) @ 2026-03-21 10:24 UTC (permalink / raw)
  To: stable; +Cc: frederic, tglx, linux-kernel, rdunlap, ptesarik, Ionut Nechita

From: Ionut Nechita <ionut.nechita@windriver.com>

From: Frederic Weisbecker <frederic@kernel.org>

commit dcf6230555dcd0b05e8d2dd5b128dcc4b6fc04ef upstream.

Having a single group on a given level is enough to know this is the
top level, because a root has to have at least two children, unless that
root is the only group and the children are actual CPUs.

Simplify the test in tmigr_setup_groups() accordingly.
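For illustration only, the list_is_singular() test the simplification
relies on behaves as below (minimal userspace stand-ins, not the
kernel's <linux/list.h>):

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal userspace version of the kernel's circular list primitives. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *head)
{
	head->next = head->prev = head;
}

static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

static bool list_empty(const struct list_head *head)
{
	return head->next == head;
}

/*
 * True when the list holds exactly one entry: non-empty, and the
 * single element is both first and last.
 */
static bool list_is_singular(const struct list_head *head)
{
	return !list_empty(head) && (head->next == head->prev);
}
```

A level list holding a single group therefore identifies the top level
without the extra conditions the patch removes.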

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250114231507.21672-5-frederic@kernel.org
---
 kernel/time/timer_migration.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 0707f1ef05f7e..2f6330831f084 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -1670,9 +1670,7 @@ static int tmigr_setup_groups(unsigned int cpu, unsigned int node)
 		 * be different from tmigr_hierarchy_levels, contains only a
 		 * single group.
 		 */
-		if (group->parent || i == tmigr_hierarchy_levels ||
-		    (list_empty(&tmigr_level_list[i]) &&
-		     list_is_singular(&tmigr_level_list[i - 1])))
+		if (group->parent || list_is_singular(&tmigr_level_list[i - 1]))
 			break;
 
 	} while (i < tmigr_hierarchy_levels);
-- 
2.53.0



* [PATCH v2 6.12.y 4/7] timers/migration: Clean up the loop in tmigr_quick_check()
  2026-03-21 10:24 [PATCH v2 6.12.y 0/7] timers/migration: Backport fixes and cleanups from mainline Ionut Nechita (Wind River)
                   ` (2 preceding siblings ...)
  2026-03-21 10:24 ` [PATCH v2 6.12.y 3/7] timers/migration: Simplify top level detection on group setup Ionut Nechita (Wind River)
@ 2026-03-21 10:24 ` Ionut Nechita (Wind River)
  2026-03-21 10:24 ` [PATCH v2 6.12.y 5/7] timers/migration: Convert "while" loops to use "for" Ionut Nechita (Wind River)
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Ionut Nechita (Wind River) @ 2026-03-21 10:24 UTC (permalink / raw)
  To: stable; +Cc: frederic, tglx, linux-kernel, rdunlap, ptesarik, Ionut Nechita

From: Ionut Nechita <ionut.nechita@windriver.com>

From: Petr Tesarik <ptesarik@suse.com>

commit ff56a3e2a8613e8524f40ef2efa2c0169659e99e upstream.

Make the logic easier to follow:

  - Remove the final return statement, which is never reached, and move the
    actual walk-terminating return statement out of the do-while loop.

  - Remove the else-clause to reduce indentation. If a non-lonely group is
    encountered during the walk, the loop is immediately terminated with a
    return statement anyway; no need for an else.
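A toy userspace version of the restructured walk (simplified types,
illustrative only, not the kernel's tmigr code) shows the resulting
control flow:

```c
#include <assert.h>

typedef unsigned long long u64;
#define TOY_KTIME_MAX	(~0ULL)

/* Toy stand-in for struct tmigr_group. */
struct toy_group {
	u64 next_expiry;
	int lonely;			/* stand-in for tmigr_check_lonely() */
	struct toy_group *parent;
};

/*
 * Restructured walk: a non-lonely group terminates immediately with
 * KTIME_MAX, otherwise the lowest expiry seen so far is carried
 * upward without an else-clause; the walk-terminating return sits
 * after the loop and the former unreachable tail return is gone.
 */
static u64 toy_quick_check(struct toy_group *group, u64 nextevt)
{
	do {
		if (!group->lonely)
			return TOY_KTIME_MAX;

		if (group->next_expiry < nextevt)
			nextevt = group->next_expiry;
		group = group->parent;
	} while (group);

	return nextevt;
}
```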

Signed-off-by: Petr Tesarik <ptesarik@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20250606124818.455560-1-ptesarik@suse.com
---
 kernel/time/timer_migration.c | 23 ++++++++++-------------
 1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 2f6330831f084..c0c54dc5314c3 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -1405,23 +1405,20 @@ u64 tmigr_quick_check(u64 nextevt)
 		return KTIME_MAX;
 
 	do {
-		if (!tmigr_check_lonely(group)) {
+		if (!tmigr_check_lonely(group))
 			return KTIME_MAX;
-		} else {
-			/*
-			 * Since current CPU is active, events may not be sorted
-			 * from bottom to the top because the CPU's event is ignored
-			 * up to the top and its sibling's events not propagated upwards.
-			 * Thus keep track of the lowest observed expiry.
-			 */
-			nextevt = min_t(u64, nextevt, READ_ONCE(group->next_expiry));
-			if (!group->parent)
-				return nextevt;
-		}
+
+		/*
+		 * Since current CPU is active, events may not be sorted
+		 * from bottom to the top because the CPU's event is ignored
+		 * up to the top and its sibling's events not propagated upwards.
+		 * Thus keep track of the lowest observed expiry.
+		 */
+		nextevt = min_t(u64, nextevt, READ_ONCE(group->next_expiry));
 		group = group->parent;
 	} while (group);
 
-	return KTIME_MAX;
+	return nextevt;
 }
 
 /*
-- 
2.53.0



* [PATCH v2 6.12.y 5/7] timers/migration: Convert "while" loops to use "for"
  2026-03-21 10:24 [PATCH v2 6.12.y 0/7] timers/migration: Backport fixes and cleanups from mainline Ionut Nechita (Wind River)
                   ` (3 preceding siblings ...)
  2026-03-21 10:24 ` [PATCH v2 6.12.y 4/7] timers/migration: Clean up the loop in tmigr_quick_check() Ionut Nechita (Wind River)
@ 2026-03-21 10:24 ` Ionut Nechita (Wind River)
  2026-03-21 10:24 ` [PATCH v2 6.12.y 6/7] timers/migration: Remove locking on group connection Ionut Nechita (Wind River)
  2026-03-21 10:24 ` [PATCH v2 6.12.y 7/7] timers/migration: Fix imbalanced NUMA trees Ionut Nechita (Wind River)
  6 siblings, 0 replies; 10+ messages in thread
From: Ionut Nechita (Wind River) @ 2026-03-21 10:24 UTC (permalink / raw)
  To: stable; +Cc: frederic, tglx, linux-kernel, rdunlap, ptesarik, Ionut Nechita

From: Ionut Nechita <ionut.nechita@windriver.com>

From: Frederic Weisbecker <frederic@kernel.org>

commit 6c181b5667eea3e6564d334443536a5974190e15 upstream.

Both the "do while" and "while" loops in tmigr_setup_groups() eventually
mimic the behaviour of "for" loops.

Simplify accordingly.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251024132536.39841-2-frederic@kernel.org
---
 kernel/time/timer_migration.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index c0c54dc5314c3..1e371f1fdc86c 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -1642,22 +1642,23 @@ static void tmigr_connect_child_parent(struct tmigr_group *child,
 static int tmigr_setup_groups(unsigned int cpu, unsigned int node)
 {
 	struct tmigr_group *group, *child, **stack;
-	int top = 0, err = 0, i = 0;
+	int i, top = 0, err = 0;
 	struct list_head *lvllist;
 
 	stack = kcalloc(tmigr_hierarchy_levels, sizeof(*stack), GFP_KERNEL);
 	if (!stack)
 		return -ENOMEM;
 
-	do {
+	for (i = 0; i < tmigr_hierarchy_levels; i++) {
 		group = tmigr_get_group(cpu, node, i);
 		if (IS_ERR(group)) {
 			err = PTR_ERR(group);
+			i--;
 			break;
 		}
 
 		top = i;
-		stack[i++] = group;
+		stack[i] = group;
 
 		/*
 		 * When booting only less CPUs of a system than CPUs are
@@ -1667,16 +1668,18 @@ static int tmigr_setup_groups(unsigned int cpu, unsigned int node)
 		 * be different from tmigr_hierarchy_levels, contains only a
 		 * single group.
 		 */
-		if (group->parent || list_is_singular(&tmigr_level_list[i - 1]))
+		if (group->parent || list_is_singular(&tmigr_level_list[i]))
 			break;
+	}
 
-	} while (i < tmigr_hierarchy_levels);
-
-	/* Assert single root */
-	WARN_ON_ONCE(!err && !group->parent && !list_is_singular(&tmigr_level_list[top]));
+	/* Assert single root without parent */
+	if (WARN_ON_ONCE(i >= tmigr_hierarchy_levels))
+		return -EINVAL;
+	if (WARN_ON_ONCE(!err && !group->parent && !list_is_singular(&tmigr_level_list[top])))
+		return -EINVAL;
 
-	while (i > 0) {
-		group = stack[--i];
+	for (; i >= 0; i--) {
+		group = stack[i];
 
 		if (err < 0) {
 			list_del(&group->list);
-- 
2.53.0



* [PATCH v2 6.12.y 6/7] timers/migration: Remove locking on group connection
  2026-03-21 10:24 [PATCH v2 6.12.y 0/7] timers/migration: Backport fixes and cleanups from mainline Ionut Nechita (Wind River)
                   ` (4 preceding siblings ...)
  2026-03-21 10:24 ` [PATCH v2 6.12.y 5/7] timers/migration: Convert "while" loops to use "for" Ionut Nechita (Wind River)
@ 2026-03-21 10:24 ` Ionut Nechita (Wind River)
  2026-03-21 10:24 ` [PATCH v2 6.12.y 7/7] timers/migration: Fix imbalanced NUMA trees Ionut Nechita (Wind River)
  6 siblings, 0 replies; 10+ messages in thread
From: Ionut Nechita (Wind River) @ 2026-03-21 10:24 UTC (permalink / raw)
  To: stable; +Cc: frederic, tglx, linux-kernel, rdunlap, ptesarik, Ionut Nechita

From: Ionut Nechita <ionut.nechita@windriver.com>

From: Frederic Weisbecker <frederic@kernel.org>

commit fa9620355d4192200f15cb3d97c6eb9c02442249 upstream.

Initializing the tmc's group, the group's number of children and the
group's parent can all be done without locking because:

  1) Reading the group's parent and its group mask is done locklessly.

  2) The connections prepared for a given CPU hierarchy are visible to the
     target CPU once online, thanks to the CPU hotplug enforced memory
     ordering.

  3) In case of a newly created upper level, the new root and its
     connections and initialization are made visible by the CPU which made
     the connections. When that CPU goes idle in the future, the new link
     is published by tmigr_inactive_up() through the atomic RmW on
     ->migr_state.

  4) If CPUs were still walking up the active hierarchy, they could observe
     the new root earlier. In this case the ordering is enforced by an
     early initialization of the group mask and by barriers that maintain
     address dependency as explained in:

     b729cc1ec21a ("timers/migration: Fix another race between hotplug and idle entry/exit")
     de3ced72a792 ("timers/migration: Enforce group initialization visibility to tree walkers")

  5) Timers are propagated by a chain of group locking from the bottom to
     the top. And while doing so, the tree also propagates groups links
     and initialization. Therefore remote expiration, which also relies
     on group locking, will observe those links and initialization while
     holding the root lock before walking the tree remotely and update
     remote timers. This is especially important for migrators in the
     active hierarchy that may observe the new root early.

Therefore the locking is unnecessary at initialization. If anything, it
just brings confusion. Remove it.
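The publication pattern that makes this safe can be sketched in
userspace with C11 atomics standing in for smp_store_release() and the
paired acquire load (toy structures, not the kernel's):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Toy stand-in for struct tmigr_group. */
struct toy_group {
	int num_children;
	struct toy_group *_Atomic parent;	/* lazily published link */
};

/*
 * Writer side: fully initialize the parent, then publish the link.
 * The release store mirrors smp_store_release(&child->parent, parent):
 * everything written before it is visible to any reader that observes
 * the pointer through an acquire load, so no lock is needed.
 */
static void toy_connect(struct toy_group *child, struct toy_group *parent)
{
	parent->num_children = 2;	/* old root plus new sibling */
	atomic_store_explicit(&child->parent, parent, memory_order_release);
}

/* Reader side: the acquire load pairs with the release store above. */
static struct toy_group *toy_load_parent(struct toy_group *child)
{
	return atomic_load_explicit(&child->parent, memory_order_acquire);
}
```

Readers that have not yet observed the pointer simply see NULL and stop
walking, which is exactly the pre-publication state.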

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251024132536.39841-3-frederic@kernel.org
---
 kernel/time/timer_migration.c | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 1e371f1fdc86c..5f8aef94ca0f7 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -1573,9 +1573,6 @@ static void tmigr_connect_child_parent(struct tmigr_group *child,
 {
 	struct tmigr_walk data;
 
-	raw_spin_lock_irq(&child->lock);
-	raw_spin_lock_nested(&parent->lock, SINGLE_DEPTH_NESTING);
-
 	if (activate) {
 		/*
 		 * @child is the old top and @parent the new one. In this
@@ -1596,9 +1593,6 @@ static void tmigr_connect_child_parent(struct tmigr_group *child,
 	 */
 	smp_store_release(&child->parent, parent);
 
-	raw_spin_unlock(&parent->lock);
-	raw_spin_unlock_irq(&child->lock);
-
 	trace_tmigr_connect_child_parent(child);
 
 	if (!activate)
@@ -1695,13 +1689,9 @@ static int tmigr_setup_groups(unsigned int cpu, unsigned int node)
 		if (i == 0) {
 			struct tmigr_cpu *tmc = per_cpu_ptr(&tmigr_cpu, cpu);
 
-			raw_spin_lock_irq(&group->lock);
-
 			tmc->tmgroup = group;
 			tmc->groupmask = BIT(group->num_children++);
 
-			raw_spin_unlock_irq(&group->lock);
-
 			trace_tmigr_connect_cpu_parent(tmc);
 
 			/* There are no children that need to be connected */
-- 
2.53.0



* [PATCH v2 6.12.y 7/7] timers/migration: Fix imbalanced NUMA trees
  2026-03-21 10:24 [PATCH v2 6.12.y 0/7] timers/migration: Backport fixes and cleanups from mainline Ionut Nechita (Wind River)
                   ` (5 preceding siblings ...)
  2026-03-21 10:24 ` [PATCH v2 6.12.y 6/7] timers/migration: Remove locking on group connection Ionut Nechita (Wind River)
@ 2026-03-21 10:24 ` Ionut Nechita (Wind River)
  6 siblings, 0 replies; 10+ messages in thread
From: Ionut Nechita (Wind River) @ 2026-03-21 10:24 UTC (permalink / raw)
  To: stable; +Cc: frederic, tglx, linux-kernel, rdunlap, ptesarik, Ionut Nechita

From: Ionut Nechita <ionut.nechita@windriver.com>

From: Frederic Weisbecker <frederic@kernel.org>

commit 5eb579dfd46b4949117ecb0f1ba2f12d3dc9a6f2 upstream.

When a CPU from a new node boots, the old root may happen to be
connected to the new root even if their node mismatch, as depicted in
the following scenario:

1) CPU 0 boots and creates the first group for node 0.

   [GRP0:0]
    node 0
      |
    CPU 0

2) CPU 1 from node 1 boots and creates a new top that corresponds to
   node 1, but it also connects the old root from node 0 to the new root
   from node 1 by mistake.

             [GRP1:0]
              node 1
            /        \
           /          \
   [GRP0:0]             [GRP0:1]
    node 0               node 1
      |                    |
    CPU 0                CPU 1

3) This eventually leads to an imbalanced tree where some node 0 CPUs
   migrate node 1 timers (and vice versa) way before reaching the
   crossnode groups, resulting in more frequent remote memory accesses
   than expected.

                      [GRP2:0]
                      NUMA_NO_NODE
                     /             \
             [GRP1:0]              [GRP1:1]
              node 1               node 0
            /        \                |
           /          \             [...]
   [GRP0:0]             [GRP0:1]
    node 0               node 1
      |                    |
    CPU 0...              CPU 1...

A balanced tree should only contain groups having children that belong
to the same node:

                      [GRP2:0]
                      NUMA_NO_NODE
                     /             \
             [GRP1:0]              [GRP1:0]
              node 0               node 1
            /        \             /      \
           /          \           /        \
   [GRP0:0]          [...]      [...]    [GRP0:1]
    node 0                                node 1
      |                                     |
    CPU 0...                              CPU 1...

In order to fix this, the hierarchy must be unfolded up to the crossnode
level as soon as a node mismatch is detected. For example the stage 2
above should lead to this layout:

                      [GRP2:0]
                      NUMA_NO_NODE
                     /             \
             [GRP1:0]              [GRP1:1]
              node 0               node 1
              /                         \
             /                           \
        [GRP0:0]                        [GRP0:1]
        node 0                           node 1
          |                                |
       CPU 0                             CPU 1

This means that not only GRP1:0 must be created but also GRP1:1 and
GRP2:0 in order to prepare a balanced tree for next CPUs to boot.
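The invariant being restored can be illustrated with a hypothetical
helper (this is not the kernel's actual code, and NUMA_NO_NODE is
spelled with a TOY_ prefix to make that explicit): below the crossnode
level a child may only connect to a parent of the same node, while
groups at or above the crossnode level intentionally span nodes.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NUMA_NO_NODE	(-1)

/*
 * Hypothetical check: a connection is valid when either the parent
 * sits at or above the crossnode level (and therefore carries no
 * node affinity), or both groups belong to the same node.
 */
static bool toy_may_connect(int child_node, int parent_node,
			    unsigned int parent_lvl,
			    unsigned int crossnode_lvl)
{
	if (parent_lvl >= crossnode_lvl)
		return parent_node == TOY_NUMA_NO_NODE;
	return child_node == parent_node;
}
```

Under this rule the stage 2 mistake above, hanging the node 0 old root
below a node 1 parent at level 1, is rejected.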

Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/20251024132536.39841-4-frederic@kernel.org
---
 kernel/time/timer_migration.c | 231 +++++++++++++++++++---------------
 1 file changed, 127 insertions(+), 104 deletions(-)

diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 5f8aef94ca0f7..49635a2b7ee28 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -420,6 +420,8 @@ static struct list_head *tmigr_level_list __read_mostly;
 static unsigned int tmigr_hierarchy_levels __read_mostly;
 static unsigned int tmigr_crossnode_level __read_mostly;
 
+static struct tmigr_group *tmigr_root;
+
 static DEFINE_PER_CPU(struct tmigr_cpu, tmigr_cpu);
 
 #define TMIGR_NONE	0xFF
@@ -522,11 +524,9 @@ struct tmigr_walk {
 
 typedef bool (*up_f)(struct tmigr_group *, struct tmigr_group *, struct tmigr_walk *);
 
-static void __walk_groups(up_f up, struct tmigr_walk *data,
-			  struct tmigr_cpu *tmc)
+static void __walk_groups_from(up_f up, struct tmigr_walk *data,
+			       struct tmigr_group *child, struct tmigr_group *group)
 {
-	struct tmigr_group *child = NULL, *group = tmc->tmgroup;
-
 	do {
 		WARN_ON_ONCE(group->level >= tmigr_hierarchy_levels);
 
@@ -544,6 +544,12 @@ static void __walk_groups(up_f up, struct tmigr_walk *data,
 	} while (group);
 }
 
+static void __walk_groups(up_f up, struct tmigr_walk *data,
+			  struct tmigr_cpu *tmc)
+{
+	__walk_groups_from(up, data, NULL, tmc->tmgroup);
+}
+
 static void walk_groups(up_f up, struct tmigr_walk *data, struct tmigr_cpu *tmc)
 {
 	lockdep_assert_held(&tmc->lock);
@@ -1498,21 +1504,6 @@ static void tmigr_init_group(struct tmigr_group *group, unsigned int lvl,
 	s.seq = 0;
 	atomic_set(&group->migr_state, s.state);
 
-	/*
-	 * If this is a new top-level, prepare its groupmask in advance.
-	 * This avoids accidents where yet another new top-level is
-	 * created in the future and made visible before the current groupmask.
-	 */
-	if (list_empty(&tmigr_level_list[lvl])) {
-		group->groupmask = BIT(0);
-		/*
-		 * The previous top level has prepared its groupmask already,
-		 * simply account it as the first child.
-		 */
-		if (lvl > 0)
-			group->num_children = 1;
-	}
-
 	timerqueue_init_head(&group->events);
 	timerqueue_init(&group->groupevt.nextevt);
 	group->groupevt.nextevt.expires = KTIME_MAX;
@@ -1567,22 +1558,51 @@ static struct tmigr_group *tmigr_get_group(unsigned int cpu, int node,
 	return group;
 }
 
+static bool tmigr_init_root(struct tmigr_group *group, bool activate)
+{
+	if (!group->parent && group != tmigr_root) {
+		/*
+		 * This is the new top-level, prepare its groupmask in advance
+		 * to avoid accidents where yet another new top-level is
+		 * created in the future and made visible before this groupmask.
+		 */
+		group->groupmask = BIT(0);
+		WARN_ON_ONCE(activate);
+
+		return true;
+	}
+
+	return false;
+
+}
+
 static void tmigr_connect_child_parent(struct tmigr_group *child,
 				       struct tmigr_group *parent,
 				       bool activate)
 {
-	struct tmigr_walk data;
+	if (tmigr_init_root(parent, activate)) {
+		/*
+		 * The previous top level had prepared its groupmask already,
+		 * simply account it in advance as the first child. If some groups
+		 * have been created between the old and new root due to node
+		 * mismatch, the new root's child will be initialized accordingly.
+		 */
+		parent->num_children = 1;
+	}
 
-	if (activate) {
+	/* Connecting old root to new root ? */
+	if (!parent->parent && activate) {
 		/*
-		 * @child is the old top and @parent the new one. In this
-		 * case groupmask is pre-initialized and @child already
-		 * accounted, along with its new sibling corresponding to the
-		 * CPU going up.
+		 * @child is the old top, or in case of node mismatch, some
+		 * intermediate group between the old top and the new one in
+		 * @parent. In this case the @child must be pre-accounted above
+		 * as the first child. Its new inactive sibling corresponding
+		 * to the CPU going up has been accounted as the second child.
 		 */
-		WARN_ON_ONCE(child->groupmask != BIT(0) || parent->num_children != 2);
+		WARN_ON_ONCE(parent->num_children != 2);
+		child->groupmask = BIT(0);
 	} else {
-		/* Adding @child for the CPU going up to @parent. */
+		/* Common case adding @child for the CPU going up to @parent. */
 		child->groupmask = BIT(parent->num_children++);
 	}
 
@@ -1594,56 +1614,28 @@ static void tmigr_connect_child_parent(struct tmigr_group *child,
 	smp_store_release(&child->parent, parent);
 
 	trace_tmigr_connect_child_parent(child);
-
-	if (!activate)
-		return;
-
-	/*
-	 * To prevent inconsistent states, active children need to be active in
-	 * the new parent as well. Inactive children are already marked inactive
-	 * in the parent group:
-	 *
-	 * * When new groups were created by tmigr_setup_groups() starting from
-	 *   the lowest level (and not higher then one level below the current
-	 *   top level), then they are not active. They will be set active when
-	 *   the new online CPU comes active.
-	 *
-	 * * But if a new group above the current top level is required, it is
-	 *   mandatory to propagate the active state of the already existing
-	 *   child to the new parent. So tmigr_connect_child_parent() is
-	 *   executed with the formerly top level group (child) and the newly
-	 *   created group (parent).
-	 *
-	 * * It is ensured that the child is active, as this setup path is
-	 *   executed in hotplug prepare callback. This is exectued by an
-	 *   already connected and !idle CPU. Even if all other CPUs go idle,
-	 *   the CPU executing the setup will be responsible up to current top
-	 *   level group. And the next time it goes inactive, it will release
-	 *   the new childmask and parent to subsequent walkers through this
-	 *   @child. Therefore propagate active state unconditionally.
-	 */
-	data.childmask = child->groupmask;
-
-	/*
-	 * There is only one new level per time (which is protected by
-	 * tmigr_mutex). When connecting the child and the parent and set the
-	 * child active when the parent is inactive, the parent needs to be the
-	 * uppermost level. Otherwise there went something wrong!
-	 */
-	WARN_ON(!tmigr_active_up(parent, child, &data) && parent->parent);
 }
 
-static int tmigr_setup_groups(unsigned int cpu, unsigned int node)
+static int tmigr_setup_groups(unsigned int cpu, unsigned int node,
+			      struct tmigr_group *start, bool activate)
 {
 	struct tmigr_group *group, *child, **stack;
-	int i, top = 0, err = 0;
-	struct list_head *lvllist;
+	int i, top = 0, err = 0, start_lvl = 0;
+	bool root_mismatch = false;
 
 	stack = kcalloc(tmigr_hierarchy_levels, sizeof(*stack), GFP_KERNEL);
 	if (!stack)
 		return -ENOMEM;
 
-	for (i = 0; i < tmigr_hierarchy_levels; i++) {
+	if (start) {
+		stack[start->level] = start;
+		start_lvl = start->level + 1;
+	}
+
+	if (tmigr_root)
+		root_mismatch = tmigr_root->numa_node != node;
+
+	for (i = start_lvl; i < tmigr_hierarchy_levels; i++) {
 		group = tmigr_get_group(cpu, node, i);
 		if (IS_ERR(group)) {
 			err = PTR_ERR(group);
@@ -1656,23 +1648,25 @@ static int tmigr_setup_groups(unsigned int cpu, unsigned int node)
 
 		/*
 		 * When booting only less CPUs of a system than CPUs are
-		 * available, not all calculated hierarchy levels are required.
+		 * available, not all calculated hierarchy levels are required,
+		 * unless a node mismatch is detected.
 		 *
 		 * The loop is aborted as soon as the highest level, which might
 		 * be different from tmigr_hierarchy_levels, contains only a
-		 * single group.
+		 * single group, unless the nodes mismatch below tmigr_crossnode_level.
 		 */
-		if (group->parent || list_is_singular(&tmigr_level_list[i]))
+		if (group->parent)
+			break;
+		if ((!root_mismatch || i >= tmigr_crossnode_level) &&
+		    list_is_singular(&tmigr_level_list[i]))
 			break;
 	}
 
 	/* Assert single root without parent */
 	if (WARN_ON_ONCE(i >= tmigr_hierarchy_levels))
 		return -EINVAL;
-	if (WARN_ON_ONCE(!err && !group->parent && !list_is_singular(&tmigr_level_list[top])))
-		return -EINVAL;
 
-	for (; i >= 0; i--) {
+	for (; i >= start_lvl; i--) {
 		group = stack[i];
 
 		if (err < 0) {
@@ -1692,48 +1686,63 @@ static int tmigr_setup_groups(unsigned int cpu, unsigned int node)
 			tmc->tmgroup = group;
 			tmc->groupmask = BIT(group->num_children++);
 
+			tmigr_init_root(group, activate);
+
 			trace_tmigr_connect_cpu_parent(tmc);
 
 			/* There are no children that need to be connected */
 			continue;
 		} else {
 			child = stack[i - 1];
-			/* Will be activated at online time */
-			tmigr_connect_child_parent(child, group, false);
+			tmigr_connect_child_parent(child, group, activate);
 		}
+	}
 
-		/* check if uppermost level was newly created */
-		if (top != i)
-			continue;
-
-		WARN_ON_ONCE(top == 0);
+	if (err < 0)
+		goto out;
 
-		lvllist = &tmigr_level_list[top];
+	if (activate) {
+		struct tmigr_walk data;
 
 		/*
-		 * Newly created root level should have accounted the upcoming
-		 * CPU's child group and pre-accounted the old root.
+		 * To prevent inconsistent states, active children need to be active in
+		 * the new parent as well. Inactive children are already marked inactive
+		 * in the parent group:
+		 *
+		 * * When new groups were created by tmigr_setup_groups() starting from
+		 *   the lowest level, then they are not active. They will be set active
+		 *   when the new online CPU comes active.
+		 *
+		 * * But if new groups above the current top level are required, it is
+		 *   mandatory to propagate the active state of the already existing
+		 *   child to the new parents. So tmigr_active_up() activates the
+		 *   new parents while walking up from the old root to the new.
+		 *
+		 * * It is ensured that @start is active, as this setup path is
+		 *   executed in hotplug prepare callback. This is executed by an
+		 *   already connected and !idle CPU. Even if all other CPUs go idle,
+		 *   the CPU executing the setup will be responsible up to current top
+		 *   level group. And the next time it goes inactive, it will release
+		 *   the new childmask and parent to subsequent walkers through this
+		 *   @child. Therefore propagate active state unconditionally.
 		 */
-		if (group->num_children == 2 && list_is_singular(lvllist)) {
-			/*
-			 * The target CPU must never do the prepare work, except
-			 * on early boot when the boot CPU is the target. Otherwise
-			 * it may spuriously activate the old top level group inside
-			 * the new one (nevertheless whether old top level group is
-			 * active or not) and/or release an uninitialized childmask.
-			 */
-			WARN_ON_ONCE(cpu == raw_smp_processor_id());
-
-			lvllist = &tmigr_level_list[top - 1];
-			list_for_each_entry(child, lvllist, list) {
-				if (child->parent)
-					continue;
+		WARN_ON_ONCE(!start->parent);
+		data.childmask = start->groupmask;
+		__walk_groups_from(tmigr_active_up, &data, start, start->parent);
+	}
 
-				tmigr_connect_child_parent(child, group, true);
-			}
+	/* Root update */
+	if (list_is_singular(&tmigr_level_list[top])) {
+		group = list_first_entry(&tmigr_level_list[top],
+					 typeof(*group), list);
+		WARN_ON_ONCE(group->parent);
+		if (tmigr_root) {
+			/* Old root should be the same or below */
+			WARN_ON_ONCE(tmigr_root->level > top);
 		}
+		tmigr_root = group;
 	}
-
+out:
 	kfree(stack);
 
 	return err;
@@ -1741,12 +1750,26 @@ static int tmigr_setup_groups(unsigned int cpu, unsigned int node)
 
 static int tmigr_add_cpu(unsigned int cpu)
 {
+	struct tmigr_group *old_root = tmigr_root;
 	int node = cpu_to_node(cpu);
 	int ret;
 
-	mutex_lock(&tmigr_mutex);
-	ret = tmigr_setup_groups(cpu, node);
-	mutex_unlock(&tmigr_mutex);
+	guard(mutex)(&tmigr_mutex);
+
+	ret = tmigr_setup_groups(cpu, node, NULL, false);
+
+	/* Root has changed? Connect the old one to the new */
+	if (ret >= 0 && old_root && old_root != tmigr_root) {
+		/*
+		 * The target CPU must never do the prepare work, except
+		 * on early boot when the boot CPU is the target. Otherwise
+		 * it may spuriously activate the old top level group inside
+		 * the new one (nevertheless whether old top level group is
+		 * active or not) and/or release an uninitialized childmask.
+		 */
+		WARN_ON_ONCE(cpu == raw_smp_processor_id());
+		ret = tmigr_setup_groups(-1, old_root->numa_node, old_root, true);
+	}
 
 	return ret;
 }
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 6.12.y 2/7] timers/migration: Annotate accesses to ignore flag
  2026-03-21 10:24 ` [PATCH v2 6.12.y 2/7] timers/migration: Annotate accesses to ignore flag Ionut Nechita (Wind River)
@ 2026-03-21 11:03   ` Greg KH
  2026-03-24  7:40     ` Ionut Nechita (Sunlight Linux)
  0 siblings, 1 reply; 10+ messages in thread
From: Greg KH @ 2026-03-21 11:03 UTC (permalink / raw)
  To: Ionut Nechita (Wind River)
  Cc: stable, frederic, tglx, linux-kernel, rdunlap, ptesarik,
	kernel test robot

On Sat, Mar 21, 2026 at 12:24:35PM +0200, Ionut Nechita (Wind River) wrote:
> From: Ionut Nechita <ionut.nechita@windriver.com>

Why is this here?

And again, you did not send this properly, work with your kernel team to
learn the proper process, I've been over this multiple times with them
in the past...

greg k-h


* Re: [PATCH v2 6.12.y 2/7] timers/migration: Annotate accesses to ignore flag
  2026-03-21 11:03   ` Greg KH
@ 2026-03-24  7:40     ` Ionut Nechita (Sunlight Linux)
  0 siblings, 0 replies; 10+ messages in thread
From: Ionut Nechita (Sunlight Linux) @ 2026-03-24  7:40 UTC (permalink / raw)
  To: gregkh
  Cc: frederic, ionut.nechita, linux-kernel, oliver.sang, ptesarik,
	rdunlap, stable, tglx

Hi Greg,

Thank you for the feedback. I understand the issues you pointed out -
the incorrect From: header and the improper backport process.

I will take the time to learn the correct steps for submitting stable
backports and will work with the kernel team on this. I'll send a
properly formatted new series once I have the process right.

Sorry for the noise.

Best regards,
Ionut

