* [PATCH 0/6] timers/migration: Fix NUMA trees + cleanups
@ 2025-10-24 13:25 Frederic Weisbecker
2025-10-24 13:25 ` [PATCH 1/6] timers/migration: Convert "while" loops to use "for" Frederic Weisbecker
` (5 more replies)
0 siblings, 6 replies; 13+ messages in thread
From: Frederic Weisbecker @ 2025-10-24 13:25 UTC (permalink / raw)
To: Thomas Gleixner; +Cc: LKML, Frederic Weisbecker, Anna-Maria Behnsen
Hi,
A while ago I found out that NUMA tree layout had flaws but I thought it
was only about rare setups. Having had a closer look while thinking
about the big.LITTLE handling, I realized the NUMA timer tree are
actually imbalanced in many cases, except perhaps when 0-7 CPUs are lucky
enough to belong to the same node, and yet up to a certain amount of
CPUs.
Here is a fix proposal, plus a bunch of cleanups.
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
timers/core
HEAD: 4adecc004d35dbe8a48df3d23ed2b8cf42772716
Thanks,
Frederic
---
Frederic Weisbecker (6):
timers/migration: Convert "while" loops to use "for"
timers/migration: Remove locking on group connection
timers/migration: Fix imbalanced NUMA trees
timers/migration: Assert that hotplug preparing CPU is part of stable active hierarchy
timers/migration: Remove unused "cpu" parameter from tmigr_get_group()
timers/migration: Remove dead code handling idle CPU checking for remote timers
kernel/time/timer_migration.c | 279 ++++++++++++++++++++++--------------------
1 file changed, 143 insertions(+), 136 deletions(-)
^ permalink raw reply [flat|nested] 13+ messages in thread* [PATCH 1/6] timers/migration: Convert "while" loops to use "for" 2025-10-24 13:25 [PATCH 0/6] timers/migration: Fix NUMA trees + cleanups Frederic Weisbecker @ 2025-10-24 13:25 ` Frederic Weisbecker 2025-11-01 19:41 ` [tip: timers/core] " tip-bot2 for Frederic Weisbecker 2025-10-24 13:25 ` [PATCH 2/6] timers/migration: Remove locking on group connection Frederic Weisbecker ` (4 subsequent siblings) 5 siblings, 1 reply; 13+ messages in thread From: Frederic Weisbecker @ 2025-10-24 13:25 UTC (permalink / raw) To: Thomas Gleixner; +Cc: LKML, Frederic Weisbecker, Anna-Maria Behnsen Both the "do while" and "while" loops in tmigr_setup_groups() eventually mimic the behaviour of "for" loops. Simplify accordingly. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> --- kernel/time/timer_migration.c | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-) diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c index c0c54dc5314c..1e371f1fdc86 100644 --- a/kernel/time/timer_migration.c +++ b/kernel/time/timer_migration.c @@ -1642,22 +1642,23 @@ static void tmigr_connect_child_parent(struct tmigr_group *child, static int tmigr_setup_groups(unsigned int cpu, unsigned int node) { struct tmigr_group *group, *child, **stack; - int top = 0, err = 0, i = 0; + int i, top = 0, err = 0; struct list_head *lvllist; stack = kcalloc(tmigr_hierarchy_levels, sizeof(*stack), GFP_KERNEL); if (!stack) return -ENOMEM; - do { + for (i = 0; i < tmigr_hierarchy_levels; i++) { group = tmigr_get_group(cpu, node, i); if (IS_ERR(group)) { err = PTR_ERR(group); + i--; break; } top = i; - stack[i++] = group; + stack[i] = group; /* * When booting only less CPUs of a system than CPUs are @@ -1667,16 +1668,18 @@ static int tmigr_setup_groups(unsigned int cpu, unsigned int node) * be different from tmigr_hierarchy_levels, contains only a * single group. */ - if (group->parent || list_is_singular(&tmigr_level_list[i - 1])) + if (group->parent || list_is_singular(&tmigr_level_list[i])) break; + } - } while (i < tmigr_hierarchy_levels); + /* Assert single root without parent */ + if (WARN_ON_ONCE(i >= tmigr_hierarchy_levels)) + return -EINVAL; + if (WARN_ON_ONCE(!err && !group->parent && !list_is_singular(&tmigr_level_list[top]))) + return -EINVAL; - /* Assert single root */ - WARN_ON_ONCE(!err && !group->parent && !list_is_singular(&tmigr_level_list[top])); - - while (i > 0) { - group = stack[--i]; + for (; i >= 0; i--) { + group = stack[i]; if (err < 0) { list_del(&group->list); -- 2.51.0 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* [tip: timers/core] timers/migration: Convert "while" loops to use "for" 2025-10-24 13:25 ` [PATCH 1/6] timers/migration: Convert "while" loops to use "for" Frederic Weisbecker @ 2025-11-01 19:41 ` tip-bot2 for Frederic Weisbecker 0 siblings, 0 replies; 13+ messages in thread From: tip-bot2 for Frederic Weisbecker @ 2025-11-01 19:41 UTC (permalink / raw) To: linux-tip-commits; +Cc: Frederic Weisbecker, Thomas Gleixner, x86, linux-kernel The following commit has been merged into the timers/core branch of tip: Commit-ID: 6c181b5667eea3e6564d334443536a5974190e15 Gitweb: https://git.kernel.org/tip/6c181b5667eea3e6564d334443536a5974190e15 Author: Frederic Weisbecker <frederic@kernel.org> AuthorDate: Fri, 24 Oct 2025 15:25:31 +02:00 Committer: Thomas Gleixner <tglx@linutronix.de> CommitterDate: Sat, 01 Nov 2025 20:38:24 +01:00 timers/migration: Convert "while" loops to use "for" Both the "do while" and "while" loops in tmigr_setup_groups() eventually mimic the behaviour of "for" loops. Simplify accordingly. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251024132536.39841-2-frederic@kernel.org --- kernel/time/timer_migration.c | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-) diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c index c0c54dc..1e371f1 100644 --- a/kernel/time/timer_migration.c +++ b/kernel/time/timer_migration.c @@ -1642,22 +1642,23 @@ static void tmigr_connect_child_parent(struct tmigr_group *child, static int tmigr_setup_groups(unsigned int cpu, unsigned int node) { struct tmigr_group *group, *child, **stack; - int top = 0, err = 0, i = 0; + int i, top = 0, err = 0; struct list_head *lvllist; stack = kcalloc(tmigr_hierarchy_levels, sizeof(*stack), GFP_KERNEL); if (!stack) return -ENOMEM; - do { + for (i = 0; i < tmigr_hierarchy_levels; i++) { group = tmigr_get_group(cpu, node, i); if (IS_ERR(group)) { err = PTR_ERR(group); + i--; break; } top = i; - stack[i++] = group; + stack[i] = group; /* * When booting only less CPUs of a system than CPUs are @@ -1667,16 +1668,18 @@ static int tmigr_setup_groups(unsigned int cpu, unsigned int node) * be different from tmigr_hierarchy_levels, contains only a * single group. */ - if (group->parent || list_is_singular(&tmigr_level_list[i - 1])) + if (group->parent || list_is_singular(&tmigr_level_list[i])) break; + } - } while (i < tmigr_hierarchy_levels); - - /* Assert single root */ - WARN_ON_ONCE(!err && !group->parent && !list_is_singular(&tmigr_level_list[top])); + /* Assert single root without parent */ + if (WARN_ON_ONCE(i >= tmigr_hierarchy_levels)) + return -EINVAL; + if (WARN_ON_ONCE(!err && !group->parent && !list_is_singular(&tmigr_level_list[top]))) + return -EINVAL; - while (i > 0) { - group = stack[--i]; + for (; i >= 0; i--) { + group = stack[i]; if (err < 0) { list_del(&group->list); ^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 2/6] timers/migration: Remove locking on group connection 2025-10-24 13:25 [PATCH 0/6] timers/migration: Fix NUMA trees + cleanups Frederic Weisbecker 2025-10-24 13:25 ` [PATCH 1/6] timers/migration: Convert "while" loops to use "for" Frederic Weisbecker @ 2025-10-24 13:25 ` Frederic Weisbecker 2025-11-01 19:41 ` [tip: timers/core] " tip-bot2 for Frederic Weisbecker 2025-10-24 13:25 ` [PATCH 3/6] timers/migration: Fix imbalanced NUMA trees Frederic Weisbecker ` (3 subsequent siblings) 5 siblings, 1 reply; 13+ messages in thread From: Frederic Weisbecker @ 2025-10-24 13:25 UTC (permalink / raw) To: Thomas Gleixner; +Cc: LKML, Frederic Weisbecker, Anna-Maria Behnsen Initializing the tmc's group, the group's number of children and the group's parent can all be done without locking because: 1) Reading the group's parent and its groupmask is done locklessly. 2) The connections prepared for a given CPU hierarchy are visible to the target CPU once online, thanks to the CPU hotplug enforced memory ordering. 3) In case of a newly created upper level, the new root and its connections/initializations are made visible by the CPU which made the connections. When that CPUs goes idle in the future, the new link is published by tmigr_inactive_up() through the atomic RmW on ->migr_state. 4) If CPUs were still walking up the active hierarchy, they could observe the new root earlier. In this case the ordering is enforced by an early initialization of the group mask and by barriers that maintain address dependency as explained in: b729cc1ec21a ("timers/migration: Fix another race between hotplug and idle entry/exit") de3ced72a792 ("timers/migration: Enforce group initialization visibility to tree walkers") 5) Timers are propagated by a chain of group locking from the bottom to the top. And while doing so, the tree also propagates groups links and initializations. Therefore remote expiration, which also relies on group locking, will observe those links and initialization while holding the root lock before walking the tree remotely and update remote timers. This is especially important for migrators in the active hierarchy that may observe the new root early. Therefore the locking is unecesserary at initialization. If anything, it just brings confusion. Remove it. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> --- kernel/time/timer_migration.c | 10 ---------- 1 file changed, 10 deletions(-) diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c index 1e371f1fdc86..5f8aef94ca0f 100644 --- a/kernel/time/timer_migration.c +++ b/kernel/time/timer_migration.c @@ -1573,9 +1573,6 @@ static void tmigr_connect_child_parent(struct tmigr_group *child, { struct tmigr_walk data; - raw_spin_lock_irq(&child->lock); - raw_spin_lock_nested(&parent->lock, SINGLE_DEPTH_NESTING); - if (activate) { /* * @child is the old top and @parent the new one. In this @@ -1596,9 +1593,6 @@ static void tmigr_connect_child_parent(struct tmigr_group *child, */ smp_store_release(&child->parent, parent); - raw_spin_unlock(&parent->lock); - raw_spin_unlock_irq(&child->lock); - trace_tmigr_connect_child_parent(child); if (!activate) @@ -1695,13 +1689,9 @@ static int tmigr_setup_groups(unsigned int cpu, unsigned int node) if (i == 0) { struct tmigr_cpu *tmc = per_cpu_ptr(&tmigr_cpu, cpu); - raw_spin_lock_irq(&group->lock); - tmc->tmgroup = group; tmc->groupmask = BIT(group->num_children++); - raw_spin_unlock_irq(&group->lock); - trace_tmigr_connect_cpu_parent(tmc); /* There are no children that need to be connected */ -- 2.51.0 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* [tip: timers/core] timers/migration: Remove locking on group connection 2025-10-24 13:25 ` [PATCH 2/6] timers/migration: Remove locking on group connection Frederic Weisbecker @ 2025-11-01 19:41 ` tip-bot2 for Frederic Weisbecker 0 siblings, 0 replies; 13+ messages in thread From: tip-bot2 for Frederic Weisbecker @ 2025-11-01 19:41 UTC (permalink / raw) To: linux-tip-commits; +Cc: Frederic Weisbecker, Thomas Gleixner, x86, linux-kernel The following commit has been merged into the timers/core branch of tip: Commit-ID: fa9620355d4192200f15cb3d97c6eb9c02442249 Gitweb: https://git.kernel.org/tip/fa9620355d4192200f15cb3d97c6eb9c02442249 Author: Frederic Weisbecker <frederic@kernel.org> AuthorDate: Fri, 24 Oct 2025 15:25:32 +02:00 Committer: Thomas Gleixner <tglx@linutronix.de> CommitterDate: Sat, 01 Nov 2025 20:38:25 +01:00 timers/migration: Remove locking on group connection Initializing the tmc's group, the group's number of children and the group's parent can all be done without locking because: 1) Reading the group's parent and its group mask is done locklessly. 2) The connections prepared for a given CPU hierarchy are visible to the target CPU once online, thanks to the CPU hotplug enforced memory ordering. 3) In case of a newly created upper level, the new root and its connections and initialization are made visible by the CPU which made the connections. When that CPUs goes idle in the future, the new link is published by tmigr_inactive_up() through the atomic RmW on ->migr_state. 4) If CPUs were still walking up the active hierarchy, they could observe the new root earlier. In this case the ordering is enforced by an early initialization of the group mask and by barriers that maintain address dependency as explained in: b729cc1ec21a ("timers/migration: Fix another race between hotplug and idle entry/exit") de3ced72a792 ("timers/migration: Enforce group initialization visibility to tree walkers") 5) Timers are propagated by a chain of group locking from the bottom to the top. And while doing so, the tree also propagates groups links and initialization. Therefore remote expiration, which also relies on group locking, will observe those links and initialization while holding the root lock before walking the tree remotely and update remote timers. This is especially important for migrators in the active hierarchy that may observe the new root early. Therefore the locking is unnecessary at initialization. If anything, it just brings confusion. Remove it. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251024132536.39841-3-frederic@kernel.org --- kernel/time/timer_migration.c | 10 ---------- 1 file changed, 10 deletions(-) diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c index 1e371f1..5f8aef9 100644 --- a/kernel/time/timer_migration.c +++ b/kernel/time/timer_migration.c @@ -1573,9 +1573,6 @@ static void tmigr_connect_child_parent(struct tmigr_group *child, { struct tmigr_walk data; - raw_spin_lock_irq(&child->lock); - raw_spin_lock_nested(&parent->lock, SINGLE_DEPTH_NESTING); - if (activate) { /* * @child is the old top and @parent the new one. In this @@ -1596,9 +1593,6 @@ static void tmigr_connect_child_parent(struct tmigr_group *child, */ smp_store_release(&child->parent, parent); - raw_spin_unlock(&parent->lock); - raw_spin_unlock_irq(&child->lock); - trace_tmigr_connect_child_parent(child); if (!activate) @@ -1695,13 +1689,9 @@ static int tmigr_setup_groups(unsigned int cpu, unsigned int node) if (i == 0) { struct tmigr_cpu *tmc = per_cpu_ptr(&tmigr_cpu, cpu); - raw_spin_lock_irq(&group->lock); - tmc->tmgroup = group; tmc->groupmask = BIT(group->num_children++); - raw_spin_unlock_irq(&group->lock); - trace_tmigr_connect_cpu_parent(tmc); /* There are no children that need to be connected */ ^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 3/6] timers/migration: Fix imbalanced NUMA trees 2025-10-24 13:25 [PATCH 0/6] timers/migration: Fix NUMA trees + cleanups Frederic Weisbecker 2025-10-24 13:25 ` [PATCH 1/6] timers/migration: Convert "while" loops to use "for" Frederic Weisbecker 2025-10-24 13:25 ` [PATCH 2/6] timers/migration: Remove locking on group connection Frederic Weisbecker @ 2025-10-24 13:25 ` Frederic Weisbecker 2025-11-01 19:41 ` [tip: timers/core] " tip-bot2 for Frederic Weisbecker 2025-10-24 13:25 ` [PATCH 4/6] timers/migration: Assert that hotplug preparing CPU is part of stable active hierarchy Frederic Weisbecker ` (2 subsequent siblings) 5 siblings, 1 reply; 13+ messages in thread From: Frederic Weisbecker @ 2025-10-24 13:25 UTC (permalink / raw) To: Thomas Gleixner; +Cc: LKML, Frederic Weisbecker, Anna-Maria Behnsen When a CPU from a new node boots, the old root may happen to be connected to the new root even if their node mismatch, as depicted in the following scenario: 1) CPU 0 boots and creates the first group for node 0. [GRP0:0] node 0 | CPU 0 2) CPU 1 from node 1 boots and creates a new top that corresponds to node 1, but it also connects the old root from node 0 to the new root from node 1 by mistake. [GRP1:0] node 1 / \ / \ [GRP0:0] [GRP0:1] node 0 node 1 | | CPU 0 CPU 1 3) This eventually leads to an imbalanced tree where some node 0 CPUs migrate node 1 timers (and vice versa) way before reaching the crossnode groups, resulting in more frequent remote memory accesses than expected. [GRP2:0] NUMA_NO_NODE / \ [GRP1:0] [GRP1:1] node 1 node 0 / \ | / \ [...] [GRP0:0] [GRP0:1] node 0 node 1 | | CPU 0... CPU 1... A balanced tree should only contain groups having children that belong to the same node: [GRP2:0] NUMA_NO_NODE / \ [GRP1:0] [GRP1:0] node 0 node 1 / \ / \ / \ / \ [GRP0:0] [...] [...] [GRP0:1] node 0 node 1 | | CPU 0... CPU 1... In order to fix this, the hierarchy must be unfolded up to the crossnode level as soon as a node mismatch is detected. For example the stage 2 above should lead to this layout: [GRP2:0] NUMA_NO_NODE / \ [GRP1:0] [GRP1:1] node 0 node 1 / \ / \ [GRP0:0] [GRP0:1] node 0 node 1 | | CPU 0 CPU 1 This means that not only GRP1:0 must be created but also GRP1:1 and GRP2:0 in order to prepare a balanced tree for next CPUs to boot. Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> --- kernel/time/timer_migration.c | 231 +++++++++++++++++++--------------- 1 file changed, 127 insertions(+), 104 deletions(-) diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c index 5f8aef94ca0f..49635a2b7ee2 100644 --- a/kernel/time/timer_migration.c +++ b/kernel/time/timer_migration.c @@ -420,6 +420,8 @@ static struct list_head *tmigr_level_list __read_mostly; static unsigned int tmigr_hierarchy_levels __read_mostly; static unsigned int tmigr_crossnode_level __read_mostly; +static struct tmigr_group *tmigr_root; + static DEFINE_PER_CPU(struct tmigr_cpu, tmigr_cpu); #define TMIGR_NONE 0xFF @@ -522,11 +524,9 @@ struct tmigr_walk { typedef bool (*up_f)(struct tmigr_group *, struct tmigr_group *, struct tmigr_walk *); -static void __walk_groups(up_f up, struct tmigr_walk *data, - struct tmigr_cpu *tmc) +static void __walk_groups_from(up_f up, struct tmigr_walk *data, + struct tmigr_group *child, struct tmigr_group *group) { - struct tmigr_group *child = NULL, *group = tmc->tmgroup; - do { WARN_ON_ONCE(group->level >= tmigr_hierarchy_levels); @@ -544,6 +544,12 @@ static void __walk_groups(up_f up, struct tmigr_walk *data, } while (group); } +static void __walk_groups(up_f up, struct tmigr_walk *data, + struct tmigr_cpu *tmc) +{ + __walk_groups_from(up, data, NULL, tmc->tmgroup); +} + static void walk_groups(up_f up, struct tmigr_walk *data, struct tmigr_cpu *tmc) { lockdep_assert_held(&tmc->lock); @@ -1498,21 +1504,6 @@ static void tmigr_init_group(struct tmigr_group *group, unsigned int lvl, s.seq = 0; atomic_set(&group->migr_state, s.state); - /* - * If this is a new top-level, prepare its groupmask in advance. - * This avoids accidents where yet another new top-level is - * created in the future and made visible before the current groupmask. - */ - if (list_empty(&tmigr_level_list[lvl])) { - group->groupmask = BIT(0); - /* - * The previous top level has prepared its groupmask already, - * simply account it as the first child. - */ - if (lvl > 0) - group->num_children = 1; - } - timerqueue_init_head(&group->events); timerqueue_init(&group->groupevt.nextevt); group->groupevt.nextevt.expires = KTIME_MAX; @@ -1567,22 +1558,51 @@ static struct tmigr_group *tmigr_get_group(unsigned int cpu, int node, return group; } +static bool tmigr_init_root(struct tmigr_group *group, bool activate) +{ + if (!group->parent && group != tmigr_root) { + /* + * This is the new top-level, prepare its groupmask in advance + * to avoid accidents where yet another new top-level is + * created in the future and made visible before this groupmask. + */ + group->groupmask = BIT(0); + WARN_ON_ONCE(activate); + + return true; + } + + return false; + +} + static void tmigr_connect_child_parent(struct tmigr_group *child, struct tmigr_group *parent, bool activate) { - struct tmigr_walk data; + if (tmigr_init_root(parent, activate)) { + /* + * The previous top level had prepared its groupmask already, + * simply account it in advance as the first child. If some groups + * have been created between the old and new root due to node + * mismatch, the new root's child will be intialized accordingly. + */ + parent->num_children = 1; + } - if (activate) { + /* Connecting old root to new root ? */ + if (!parent->parent && activate) { /* - * @child is the old top and @parent the new one. In this - * case groupmask is pre-initialized and @child already - * accounted, along with its new sibling corresponding to the - * CPU going up. + * @child is the old top, or in case of node mismatch, some + * intermediate group between the old top and the new one in + * @parent. In this case the @child must be pre-accounted above + * as the first child. Its new inactive sibling corresponding + * to the CPU going up has been accounted as the second child. */ - WARN_ON_ONCE(child->groupmask != BIT(0) || parent->num_children != 2); + WARN_ON_ONCE(parent->num_children != 2); + child->groupmask = BIT(0); } else { - /* Adding @child for the CPU going up to @parent. */ + /* Common case adding @child for the CPU going up to @parent. */ child->groupmask = BIT(parent->num_children++); } @@ -1594,56 +1614,28 @@ static void tmigr_connect_child_parent(struct tmigr_group *child, smp_store_release(&child->parent, parent); trace_tmigr_connect_child_parent(child); - - if (!activate) - return; - - /* - * To prevent inconsistent states, active children need to be active in - * the new parent as well. Inactive children are already marked inactive - * in the parent group: - * - * * When new groups were created by tmigr_setup_groups() starting from - * the lowest level (and not higher then one level below the current - * top level), then they are not active. They will be set active when - * the new online CPU comes active. - * - * * But if a new group above the current top level is required, it is - * mandatory to propagate the active state of the already existing - * child to the new parent. So tmigr_connect_child_parent() is - * executed with the formerly top level group (child) and the newly - * created group (parent). - * - * * It is ensured that the child is active, as this setup path is - * executed in hotplug prepare callback. This is exectued by an - * already connected and !idle CPU. Even if all other CPUs go idle, - * the CPU executing the setup will be responsible up to current top - * level group. And the next time it goes inactive, it will release - * the new childmask and parent to subsequent walkers through this - * @child. Therefore propagate active state unconditionally. - */ - data.childmask = child->groupmask; - - /* - * There is only one new level per time (which is protected by - * tmigr_mutex). When connecting the child and the parent and set the - * child active when the parent is inactive, the parent needs to be the - * uppermost level. Otherwise there went something wrong! - */ - WARN_ON(!tmigr_active_up(parent, child, &data) && parent->parent); } -static int tmigr_setup_groups(unsigned int cpu, unsigned int node) +static int tmigr_setup_groups(unsigned int cpu, unsigned int node, + struct tmigr_group *start, bool activate) { struct tmigr_group *group, *child, **stack; - int i, top = 0, err = 0; - struct list_head *lvllist; + int i, top = 0, err = 0, start_lvl = 0; + bool root_mismatch = false; stack = kcalloc(tmigr_hierarchy_levels, sizeof(*stack), GFP_KERNEL); if (!stack) return -ENOMEM; - for (i = 0; i < tmigr_hierarchy_levels; i++) { + if (start) { + stack[start->level] = start; + start_lvl = start->level + 1; + } + + if (tmigr_root) + root_mismatch = tmigr_root->numa_node != node; + + for (i = start_lvl; i < tmigr_hierarchy_levels; i++) { group = tmigr_get_group(cpu, node, i); if (IS_ERR(group)) { err = PTR_ERR(group); @@ -1656,23 +1648,25 @@ static int tmigr_setup_groups(unsigned int cpu, unsigned int node) /* * When booting only less CPUs of a system than CPUs are - * available, not all calculated hierarchy levels are required. + * available, not all calculated hierarchy levels are required, + * unless a node mismatch is detected. * * The loop is aborted as soon as the highest level, which might * be different from tmigr_hierarchy_levels, contains only a - * single group. + * single group, unless the nodes mismatch below tmigr_crossnode_level */ - if (group->parent || list_is_singular(&tmigr_level_list[i])) + if (group->parent) + break; + if ((!root_mismatch || i >= tmigr_crossnode_level) && + list_is_singular(&tmigr_level_list[i])) break; } /* Assert single root without parent */ if (WARN_ON_ONCE(i >= tmigr_hierarchy_levels)) return -EINVAL; - if (WARN_ON_ONCE(!err && !group->parent && !list_is_singular(&tmigr_level_list[top]))) - return -EINVAL; - for (; i >= 0; i--) { + for (; i >= start_lvl; i--) { group = stack[i]; if (err < 0) { @@ -1692,48 +1686,63 @@ static int tmigr_setup_groups(unsigned int cpu, unsigned int node) tmc->tmgroup = group; tmc->groupmask = BIT(group->num_children++); + tmigr_init_root(group, activate); + trace_tmigr_connect_cpu_parent(tmc); /* There are no children that need to be connected */ continue; } else { child = stack[i - 1]; - /* Will be activated at online time */ - tmigr_connect_child_parent(child, group, false); + tmigr_connect_child_parent(child, group, activate); } + } - /* check if uppermost level was newly created */ - if (top != i) - continue; + if (err < 0) + goto out; - WARN_ON_ONCE(top == 0); - - lvllist = &tmigr_level_list[top]; + if (activate) { + struct tmigr_walk data; /* - * Newly created root level should have accounted the upcoming - * CPU's child group and pre-accounted the old root. + * To prevent inconsistent states, active children need to be active in + * the new parent as well. Inactive children are already marked inactive + * in the parent group: + * + * * When new groups were created by tmigr_setup_groups() starting from + * the lowest level, then they are not active. They will be set active + * when the new online CPU comes active. + * + * * But if new groups above the current top level are required, it is + * mandatory to propagate the active state of the already existing + * child to the new parents. So tmigr_active_up() activates the + * new parents while walking up from the old root to the new. + * + * * It is ensured that @start is active, as this setup path is + * executed in hotplug prepare callback. This is executed by an + * already connected and !idle CPU. Even if all other CPUs go idle, + * the CPU executing the setup will be responsible up to current top + * level group. And the next time it goes inactive, it will release + * the new childmask and parent to subsequent walkers through this + * @child. Therefore propagate active state unconditionally. */ - if (group->num_children == 2 && list_is_singular(lvllist)) { - /* - * The target CPU must never do the prepare work, except - * on early boot when the boot CPU is the target. Otherwise - * it may spuriously activate the old top level group inside - * the new one (nevertheless whether old top level group is - * active or not) and/or release an uninitialized childmask. - */ - WARN_ON_ONCE(cpu == raw_smp_processor_id()); + WARN_ON_ONCE(!start->parent); + data.childmask = start->groupmask; + __walk_groups_from(tmigr_active_up, &data, start, start->parent); + } - lvllist = &tmigr_level_list[top - 1]; - list_for_each_entry(child, lvllist, list) { - if (child->parent) - continue; - - tmigr_connect_child_parent(child, group, true); - } + /* Root update */ + if (list_is_singular(&tmigr_level_list[top])) { + group = list_first_entry(&tmigr_level_list[top], + typeof(*group), list); + WARN_ON_ONCE(group->parent); + if (tmigr_root) { + /* Old root should be the same or below */ + WARN_ON_ONCE(tmigr_root->level > top); } + tmigr_root = group; } - +out: kfree(stack); return err; @@ -1741,12 +1750,26 @@ static int tmigr_setup_groups(unsigned int cpu, unsigned int node) static int tmigr_add_cpu(unsigned int cpu) { + struct tmigr_group *old_root = tmigr_root; int node = cpu_to_node(cpu); int ret; - mutex_lock(&tmigr_mutex); - ret = tmigr_setup_groups(cpu, node); - mutex_unlock(&tmigr_mutex); + guard(mutex)(&tmigr_mutex); + + ret = tmigr_setup_groups(cpu, node, NULL, false); + + /* Root has changed? Connect the old one to the new */ + if (ret >= 0 && old_root && old_root != tmigr_root) { + /* + * The target CPU must never do the prepare work, except + * on early boot when the boot CPU is the target. Otherwise + * it may spuriously activate the old top level group inside + * the new one (nevertheless whether old top level group is + * active or not) and/or release an uninitialized childmask. + */ + WARN_ON_ONCE(cpu == raw_smp_processor_id()); + ret = tmigr_setup_groups(-1, old_root->numa_node, old_root, true); + } return ret; } -- 2.51.0 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* [tip: timers/core] timers/migration: Fix imbalanced NUMA trees 2025-10-24 13:25 ` [PATCH 3/6] timers/migration: Fix imbalanced NUMA trees Frederic Weisbecker @ 2025-11-01 19:41 ` tip-bot2 for Frederic Weisbecker 0 siblings, 0 replies; 13+ messages in thread From: tip-bot2 for Frederic Weisbecker @ 2025-11-01 19:41 UTC (permalink / raw) To: linux-tip-commits; +Cc: Frederic Weisbecker, Thomas Gleixner, x86, linux-kernel The following commit has been merged into the timers/core branch of tip: Commit-ID: 5eb579dfd46b4949117ecb0f1ba2f12d3dc9a6f2 Gitweb: https://git.kernel.org/tip/5eb579dfd46b4949117ecb0f1ba2f12d3dc9a6f2 Author: Frederic Weisbecker <frederic@kernel.org> AuthorDate: Fri, 24 Oct 2025 15:25:33 +02:00 Committer: Thomas Gleixner <tglx@linutronix.de> CommitterDate: Sat, 01 Nov 2025 20:38:25 +01:00 timers/migration: Fix imbalanced NUMA trees When a CPU from a new node boots, the old root may happen to be connected to the new root even if their node mismatch, as depicted in the following scenario: 1) CPU 0 boots and creates the first group for node 0. [GRP0:0] node 0 | CPU 0 2) CPU 1 from node 1 boots and creates a new top that corresponds to node 1, but it also connects the old root from node 0 to the new root from node 1 by mistake. [GRP1:0] node 1 / \ / \ [GRP0:0] [GRP0:1] node 0 node 1 | | CPU 0 CPU 1 3) This eventually leads to an imbalanced tree where some node 0 CPUs migrate node 1 timers (and vice versa) way before reaching the crossnode groups, resulting in more frequent remote memory accesses than expected. [GRP2:0] NUMA_NO_NODE / \ [GRP1:0] [GRP1:1] node 1 node 0 / \ | / \ [...] [GRP0:0] [GRP0:1] node 0 node 1 | | CPU 0... CPU 1... A balanced tree should only contain groups having children that belong to the same node: [GRP2:0] NUMA_NO_NODE / \ [GRP1:0] [GRP1:0] node 0 node 1 / \ / \ / \ / \ [GRP0:0] [...] [...] [GRP0:1] node 0 node 1 | | CPU 0... CPU 1... In order to fix this, the hierarchy must be unfolded up to the crossnode level as soon as a node mismatch is detected. For example the stage 2 above should lead to this layout: [GRP2:0] NUMA_NO_NODE / \ [GRP1:0] [GRP1:1] node 0 node 1 / \ / \ [GRP0:0] [GRP0:1] node 0 node 1 | | CPU 0 CPU 1 This means that not only GRP1:0 must be created but also GRP1:1 and GRP2:0 in order to prepare a balanced tree for next CPUs to boot. Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251024132536.39841-4-frederic@kernel.org --- kernel/time/timer_migration.c | 231 ++++++++++++++++++--------------- 1 file changed, 127 insertions(+), 104 deletions(-) diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c index 5f8aef9..49635a2 100644 --- a/kernel/time/timer_migration.c +++ b/kernel/time/timer_migration.c @@ -420,6 +420,8 @@ static struct list_head *tmigr_level_list __read_mostly; static unsigned int tmigr_hierarchy_levels __read_mostly; static unsigned int tmigr_crossnode_level __read_mostly; +static struct tmigr_group *tmigr_root; + static DEFINE_PER_CPU(struct tmigr_cpu, tmigr_cpu); #define TMIGR_NONE 0xFF @@ -522,11 +524,9 @@ struct tmigr_walk { typedef bool (*up_f)(struct tmigr_group *, struct tmigr_group *, struct tmigr_walk *); -static void __walk_groups(up_f up, struct tmigr_walk *data, - struct tmigr_cpu *tmc) +static void __walk_groups_from(up_f up, struct tmigr_walk *data, + struct tmigr_group *child, struct tmigr_group *group) { - struct tmigr_group *child = NULL, *group = tmc->tmgroup; - do { WARN_ON_ONCE(group->level >= tmigr_hierarchy_levels); @@ -544,6 +544,12 @@ static void __walk_groups(up_f up, struct tmigr_walk *data, } while (group); } +static void __walk_groups(up_f up, struct tmigr_walk *data, + struct tmigr_cpu *tmc) +{ + __walk_groups_from(up, data, NULL, tmc->tmgroup); +} + static void walk_groups(up_f up, struct tmigr_walk *data, struct tmigr_cpu *tmc) { lockdep_assert_held(&tmc->lock); @@ -1498,21 +1504,6 @@ static void tmigr_init_group(struct tmigr_group *group, unsigned int lvl, s.seq = 0; atomic_set(&group->migr_state, s.state); - /* - * If this is a new top-level, prepare its groupmask in advance. - * This avoids accidents where yet another new top-level is - * created in the future and made visible before the current groupmask. - */ - if (list_empty(&tmigr_level_list[lvl])) { - group->groupmask = BIT(0); - /* - * The previous top level has prepared its groupmask already, - * simply account it as the first child. - */ - if (lvl > 0) - group->num_children = 1; - } - timerqueue_init_head(&group->events); timerqueue_init(&group->groupevt.nextevt); group->groupevt.nextevt.expires = KTIME_MAX; @@ -1567,22 +1558,51 @@ static struct tmigr_group *tmigr_get_group(unsigned int cpu, int node, return group; } +static bool tmigr_init_root(struct tmigr_group *group, bool activate) +{ + if (!group->parent && group != tmigr_root) { + /* + * This is the new top-level, prepare its groupmask in advance + * to avoid accidents where yet another new top-level is + * created in the future and made visible before this groupmask. + */ + group->groupmask = BIT(0); + WARN_ON_ONCE(activate); + + return true; + } + + return false; + +} + static void tmigr_connect_child_parent(struct tmigr_group *child, struct tmigr_group *parent, bool activate) { - struct tmigr_walk data; + if (tmigr_init_root(parent, activate)) { + /* + * The previous top level had prepared its groupmask already, + * simply account it in advance as the first child. If some groups + * have been created between the old and new root due to node + * mismatch, the new root's child will be intialized accordingly. + */ + parent->num_children = 1; + } - if (activate) { + /* Connecting old root to new root ? */ + if (!parent->parent && activate) { /* - * @child is the old top and @parent the new one. In this - * case groupmask is pre-initialized and @child already - * accounted, along with its new sibling corresponding to the - * CPU going up. + * @child is the old top, or in case of node mismatch, some + * intermediate group between the old top and the new one in + * @parent. In this case the @child must be pre-accounted above + * as the first child. Its new inactive sibling corresponding + * to the CPU going up has been accounted as the second child. */ - WARN_ON_ONCE(child->groupmask != BIT(0) || parent->num_children != 2); + WARN_ON_ONCE(parent->num_children != 2); + child->groupmask = BIT(0); } else { - /* Adding @child for the CPU going up to @parent. */ + /* Common case adding @child for the CPU going up to @parent. */ child->groupmask = BIT(parent->num_children++); } @@ -1594,56 +1614,28 @@ static void tmigr_connect_child_parent(struct tmigr_group *child, smp_store_release(&child->parent, parent); trace_tmigr_connect_child_parent(child); - - if (!activate) - return; - - /* - * To prevent inconsistent states, active children need to be active in - * the new parent as well. Inactive children are already marked inactive - * in the parent group: - * - * * When new groups were created by tmigr_setup_groups() starting from - * the lowest level (and not higher then one level below the current - * top level), then they are not active. They will be set active when - * the new online CPU comes active. - * - * * But if a new group above the current top level is required, it is - * mandatory to propagate the active state of the already existing - * child to the new parent. So tmigr_connect_child_parent() is - * executed with the formerly top level group (child) and the newly - * created group (parent). - * - * * It is ensured that the child is active, as this setup path is - * executed in hotplug prepare callback. This is exectued by an - * already connected and !idle CPU. Even if all other CPUs go idle, - * the CPU executing the setup will be responsible up to current top - * level group. And the next time it goes inactive, it will release - * the new childmask and parent to subsequent walkers through this - * @child. Therefore propagate active state unconditionally. - */ - data.childmask = child->groupmask; - - /* - * There is only one new level per time (which is protected by - * tmigr_mutex). When connecting the child and the parent and set the - * child active when the parent is inactive, the parent needs to be the - * uppermost level. Otherwise there went something wrong! - */ - WARN_ON(!tmigr_active_up(parent, child, &data) && parent->parent); } -static int tmigr_setup_groups(unsigned int cpu, unsigned int node) +static int tmigr_setup_groups(unsigned int cpu, unsigned int node, + struct tmigr_group *start, bool activate) { struct tmigr_group *group, *child, **stack; - int i, top = 0, err = 0; - struct list_head *lvllist; + int i, top = 0, err = 0, start_lvl = 0; + bool root_mismatch = false; stack = kcalloc(tmigr_hierarchy_levels, sizeof(*stack), GFP_KERNEL); if (!stack) return -ENOMEM; - for (i = 0; i < tmigr_hierarchy_levels; i++) { + if (start) { + stack[start->level] = start; + start_lvl = start->level + 1; + } + + if (tmigr_root) + root_mismatch = tmigr_root->numa_node != node; + + for (i = start_lvl; i < tmigr_hierarchy_levels; i++) { group = tmigr_get_group(cpu, node, i); if (IS_ERR(group)) { err = PTR_ERR(group); @@ -1656,23 +1648,25 @@ static int tmigr_setup_groups(unsigned int cpu, unsigned int node) /* * When booting only less CPUs of a system than CPUs are - * available, not all calculated hierarchy levels are required. + * available, not all calculated hierarchy levels are required, + * unless a node mismatch is detected. * * The loop is aborted as soon as the highest level, which might * be different from tmigr_hierarchy_levels, contains only a - * single group. + * single group, unless the nodes mismatch below tmigr_crossnode_level */ - if (group->parent || list_is_singular(&tmigr_level_list[i])) + if (group->parent) + break; + if ((!root_mismatch || i >= tmigr_crossnode_level) && + list_is_singular(&tmigr_level_list[i])) break; } /* Assert single root without parent */ if (WARN_ON_ONCE(i >= tmigr_hierarchy_levels)) return -EINVAL; - if (WARN_ON_ONCE(!err && !group->parent && !list_is_singular(&tmigr_level_list[top]))) - return -EINVAL; - for (; i >= 0; i--) { + for (; i >= start_lvl; i--) { group = stack[i]; if (err < 0) { @@ -1692,48 +1686,63 @@ static int tmigr_setup_groups(unsigned int cpu, unsigned int node) tmc->tmgroup = group; tmc->groupmask = BIT(group->num_children++); + tmigr_init_root(group, activate); + trace_tmigr_connect_cpu_parent(tmc); /* There are no children that need to be connected */ continue; } else { child = stack[i - 1]; - /* Will be activated at online time */ - tmigr_connect_child_parent(child, group, false); + tmigr_connect_child_parent(child, group, activate); } + } - /* check if uppermost level was newly created */ - if (top != i) - continue; - - WARN_ON_ONCE(top == 0); + if (err < 0) + goto out; - lvllist = &tmigr_level_list[top]; + if (activate) { + struct tmigr_walk data; /* - * Newly created root level should have accounted the upcoming - * CPU's child group and pre-accounted the old root. + * To prevent inconsistent states, active children need to be active in + * the new parent as well. Inactive children are already marked inactive + * in the parent group: + * + * * When new groups were created by tmigr_setup_groups() starting from + * the lowest level, then they are not active. They will be set active + * when the new online CPU comes active. + * + * * But if new groups above the current top level are required, it is + * mandatory to propagate the active state of the already existing + * child to the new parents. So tmigr_active_up() activates the + * new parents while walking up from the old root to the new. + * + * * It is ensured that @start is active, as this setup path is + * executed in hotplug prepare callback. This is executed by an + * already connected and !idle CPU. Even if all other CPUs go idle, + * the CPU executing the setup will be responsible up to current top + * level group. And the next time it goes inactive, it will release + * the new childmask and parent to subsequent walkers through this + * @child. Therefore propagate active state unconditionally. */ - if (group->num_children == 2 && list_is_singular(lvllist)) { - /* - * The target CPU must never do the prepare work, except - * on early boot when the boot CPU is the target. Otherwise - * it may spuriously activate the old top level group inside - * the new one (nevertheless whether old top level group is - * active or not) and/or release an uninitialized childmask. - */ - WARN_ON_ONCE(cpu == raw_smp_processor_id()); - - lvllist = &tmigr_level_list[top - 1]; - list_for_each_entry(child, lvllist, list) { - if (child->parent) - continue; + WARN_ON_ONCE(!start->parent); + data.childmask = start->groupmask; + __walk_groups_from(tmigr_active_up, &data, start, start->parent); + } - tmigr_connect_child_parent(child, group, true); - } + /* Root update */ + if (list_is_singular(&tmigr_level_list[top])) { + group = list_first_entry(&tmigr_level_list[top], + typeof(*group), list); + WARN_ON_ONCE(group->parent); + if (tmigr_root) { + /* Old root should be the same or below */ + WARN_ON_ONCE(tmigr_root->level > top); } + tmigr_root = group; } - +out: kfree(stack); return err; @@ -1741,12 +1750,26 @@ static int tmigr_setup_groups(unsigned int cpu, unsigned int node) static int tmigr_add_cpu(unsigned int cpu) { + struct tmigr_group *old_root = tmigr_root; int node = cpu_to_node(cpu); int ret; - mutex_lock(&tmigr_mutex); - ret = tmigr_setup_groups(cpu, node); - mutex_unlock(&tmigr_mutex); + guard(mutex)(&tmigr_mutex); + + ret = tmigr_setup_groups(cpu, node, NULL, false); + + /* Root has changed? Connect the old one to the new */ + if (ret >= 0 && old_root && old_root != tmigr_root) { + /* + * The target CPU must never do the prepare work, except + * on early boot when the boot CPU is the target. Otherwise + * it may spuriously activate the old top level group inside + * the new one (nevertheless whether old top level group is + * active or not) and/or release an uninitialized childmask. + */ + WARN_ON_ONCE(cpu == raw_smp_processor_id()); + ret = tmigr_setup_groups(-1, old_root->numa_node, old_root, true); + } return ret; } ^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 4/6] timers/migration: Assert that hotplug preparing CPU is part of stable active hierarchy 2025-10-24 13:25 [PATCH 0/6] timers/migration: Fix NUMA trees + cleanups Frederic Weisbecker ` (2 preceding siblings ...) 2025-10-24 13:25 ` [PATCH 3/6] timers/migration: Fix imbalanced NUMA trees Frederic Weisbecker @ 2025-10-24 13:25 ` Frederic Weisbecker 2025-11-01 19:41 ` [tip: timers/core] " tip-bot2 for Frederic Weisbecker 2025-10-24 13:25 ` [PATCH 5/6] timers/migration: Remove unused "cpu" parameter from tmigr_get_group() Frederic Weisbecker 2025-10-24 13:25 ` [PATCH 6/6] timers/migration: Remove dead code handling idle CPU checking for remote timers Frederic Weisbecker 5 siblings, 1 reply; 13+ messages in thread From: Frederic Weisbecker @ 2025-10-24 13:25 UTC (permalink / raw) To: Thomas Gleixner; +Cc: LKML, Frederic Weisbecker, Anna-Maria Behnsen The CPU doing the prepare work for a remote target must be online from the tree point of view and its hierarchy must be active, otherwise propagating its active state up to the new root branch would be either incorrect or racy. Assert those conditions with more sanity checks. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> --- kernel/time/timer_migration.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c index 49635a2b7ee2..bddd816faaeb 100644 --- a/kernel/time/timer_migration.c +++ b/kernel/time/timer_migration.c @@ -1703,6 +1703,7 @@ static int tmigr_setup_groups(unsigned int cpu, unsigned int node, if (activate) { struct tmigr_walk data; + union tmigr_state state; /* * To prevent inconsistent states, active children need to be active in @@ -1726,6 +1727,8 @@ static int tmigr_setup_groups(unsigned int cpu, unsigned int node, * the new childmask and parent to subsequent walkers through this * @child. Therefore propagate active state unconditionally. */ + state.state = atomic_read(&start->migr_state); + WARN_ON_ONCE(!state.active); WARN_ON_ONCE(!start->parent); data.childmask = start->groupmask; __walk_groups_from(tmigr_active_up, &data, start, start->parent); @@ -1768,6 +1771,11 @@ static int tmigr_add_cpu(unsigned int cpu) * active or not) and/or release an uninitialized childmask. */ WARN_ON_ONCE(cpu == raw_smp_processor_id()); + /* + * The (likely) current CPU is expected to be online in the hierarchy, + * otherwise the old root may not be active as expected. + */ + WARN_ON_ONCE(!per_cpu_ptr(&tmigr_cpu, raw_smp_processor_id())->online); ret = tmigr_setup_groups(-1, old_root->numa_node, old_root, true); } -- 2.51.0 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* [tip: timers/core] timers/migration: Assert that hotplug preparing CPU is part of stable active hierarchy 2025-10-24 13:25 ` [PATCH 4/6] timers/migration: Assert that hotplug preparing CPU is part of stable active hierarchy Frederic Weisbecker @ 2025-11-01 19:41 ` tip-bot2 for Frederic Weisbecker 0 siblings, 0 replies; 13+ messages in thread From: tip-bot2 for Frederic Weisbecker @ 2025-11-01 19:41 UTC (permalink / raw) To: linux-tip-commits; +Cc: Frederic Weisbecker, Thomas Gleixner, x86, linux-kernel The following commit has been merged into the timers/core branch of tip: Commit-ID: 3c8eb36e2a46786d50dbef417ef782ff37b372ca Gitweb: https://git.kernel.org/tip/3c8eb36e2a46786d50dbef417ef782ff37b372ca Author: Frederic Weisbecker <frederic@kernel.org> AuthorDate: Fri, 24 Oct 2025 15:25:34 +02:00 Committer: Thomas Gleixner <tglx@linutronix.de> CommitterDate: Sat, 01 Nov 2025 20:38:25 +01:00 timers/migration: Assert that hotplug preparing CPU is part of stable active hierarchy The CPU doing the prepare work for a remote target must be online from the tree point of view and its hierarchy must be active, otherwise propagating its active state up to the new root branch would be either incorrect or racy. Assert those conditions with more sanity checks. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251024132536.39841-5-frederic@kernel.org --- kernel/time/timer_migration.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c index 49635a2..bddd816 100644 --- a/kernel/time/timer_migration.c +++ b/kernel/time/timer_migration.c @@ -1703,6 +1703,7 @@ static int tmigr_setup_groups(unsigned int cpu, unsigned int node, if (activate) { struct tmigr_walk data; + union tmigr_state state; /* * To prevent inconsistent states, active children need to be active in @@ -1726,6 +1727,8 @@ static int tmigr_setup_groups(unsigned int cpu, unsigned int node, * the new childmask and parent to subsequent walkers through this * @child. Therefore propagate active state unconditionally. */ + state.state = atomic_read(&start->migr_state); + WARN_ON_ONCE(!state.active); WARN_ON_ONCE(!start->parent); data.childmask = start->groupmask; __walk_groups_from(tmigr_active_up, &data, start, start->parent); @@ -1768,6 +1771,11 @@ static int tmigr_add_cpu(unsigned int cpu) * active or not) and/or release an uninitialized childmask. */ WARN_ON_ONCE(cpu == raw_smp_processor_id()); + /* + * The (likely) current CPU is expected to be online in the hierarchy, + * otherwise the old root may not be active as expected. + */ + WARN_ON_ONCE(!per_cpu_ptr(&tmigr_cpu, raw_smp_processor_id())->online); ret = tmigr_setup_groups(-1, old_root->numa_node, old_root, true); } ^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 5/6] timers/migration: Remove unused "cpu" parameter from tmigr_get_group() 2025-10-24 13:25 [PATCH 0/6] timers/migration: Fix NUMA trees + cleanups Frederic Weisbecker ` (3 preceding siblings ...) 2025-10-24 13:25 ` [PATCH 4/6] timers/migration: Assert that hotplug preparing CPU is part of stable active hierarchy Frederic Weisbecker @ 2025-10-24 13:25 ` Frederic Weisbecker 2025-11-01 19:41 ` [tip: timers/core] " tip-bot2 for Frederic Weisbecker 2025-10-24 13:25 ` [PATCH 6/6] timers/migration: Remove dead code handling idle CPU checking for remote timers Frederic Weisbecker 5 siblings, 1 reply; 13+ messages in thread From: Frederic Weisbecker @ 2025-10-24 13:25 UTC (permalink / raw) To: Thomas Gleixner; +Cc: LKML, Frederic Weisbecker, Anna-Maria Behnsen Signed-off-by: Frederic Weisbecker <frederic@kernel.org> --- kernel/time/timer_migration.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c index bddd816faaeb..73d9b0648116 100644 --- a/kernel/time/timer_migration.c +++ b/kernel/time/timer_migration.c @@ -1511,8 +1511,7 @@ static void tmigr_init_group(struct tmigr_group *group, unsigned int lvl, group->groupevt.ignore = true; } -static struct tmigr_group *tmigr_get_group(unsigned int cpu, int node, - unsigned int lvl) +static struct tmigr_group *tmigr_get_group(int node, unsigned int lvl) { struct tmigr_group *tmp, *group = NULL; @@ -1636,7 +1635,7 @@ static int tmigr_setup_groups(unsigned int cpu, unsigned int node, root_mismatch = tmigr_root->numa_node != node; for (i = start_lvl; i < tmigr_hierarchy_levels; i++) { - group = tmigr_get_group(cpu, node, i); + group = tmigr_get_group(node, i); if (IS_ERR(group)) { err = PTR_ERR(group); i--; -- 2.51.0 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* [tip: timers/core] timers/migration: Remove unused "cpu" parameter from tmigr_get_group() 2025-10-24 13:25 ` [PATCH 5/6] timers/migration: Remove unused "cpu" parameter from tmigr_get_group() Frederic Weisbecker @ 2025-11-01 19:41 ` tip-bot2 for Frederic Weisbecker 0 siblings, 0 replies; 13+ messages in thread From: tip-bot2 for Frederic Weisbecker @ 2025-11-01 19:41 UTC (permalink / raw) To: linux-tip-commits; +Cc: Frederic Weisbecker, Thomas Gleixner, x86, linux-kernel The following commit has been merged into the timers/core branch of tip: Commit-ID: 93643b90d6c141cb90dca7c24eabee800f51f908 Gitweb: https://git.kernel.org/tip/93643b90d6c141cb90dca7c24eabee800f51f908 Author: Frederic Weisbecker <frederic@kernel.org> AuthorDate: Fri, 24 Oct 2025 15:25:35 +02:00 Committer: Thomas Gleixner <tglx@linutronix.de> CommitterDate: Sat, 01 Nov 2025 20:38:25 +01:00 timers/migration: Remove unused "cpu" parameter from tmigr_get_group() Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251024132536.39841-6-frederic@kernel.org --- kernel/time/timer_migration.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c index bddd816..73d9b06 100644 --- a/kernel/time/timer_migration.c +++ b/kernel/time/timer_migration.c @@ -1511,8 +1511,7 @@ static void tmigr_init_group(struct tmigr_group *group, unsigned int lvl, group->groupevt.ignore = true; } -static struct tmigr_group *tmigr_get_group(unsigned int cpu, int node, - unsigned int lvl) +static struct tmigr_group *tmigr_get_group(int node, unsigned int lvl) { struct tmigr_group *tmp, *group = NULL; @@ -1636,7 +1635,7 @@ static int tmigr_setup_groups(unsigned int cpu, unsigned int node, root_mismatch = tmigr_root->numa_node != node; for (i = start_lvl; i < tmigr_hierarchy_levels; i++) { - group = tmigr_get_group(cpu, node, i); + group = tmigr_get_group(node, i); if (IS_ERR(group)) { err = PTR_ERR(group); i--; ^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 6/6] timers/migration: Remove dead code handling idle CPU checking for remote timers 2025-10-24 13:25 [PATCH 0/6] timers/migration: Fix NUMA trees + cleanups Frederic Weisbecker ` (4 preceding siblings ...) 2025-10-24 13:25 ` [PATCH 5/6] timers/migration: Remove unused "cpu" parameter from tmigr_get_group() Frederic Weisbecker @ 2025-10-24 13:25 ` Frederic Weisbecker 2025-11-01 19:41 ` [tip: timers/core] " tip-bot2 for Frederic Weisbecker 5 siblings, 1 reply; 13+ messages in thread From: Frederic Weisbecker @ 2025-10-24 13:25 UTC (permalink / raw) To: Thomas Gleixner; +Cc: LKML, Frederic Weisbecker, Anna-Maria Behnsen Idle migrators don't walk the whole tree in order to find out if there are timers to migrate because they recorded the next deadline to be verified within a single check in tmigr_requires_handle_remote(). Remove the related dead code and data. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> --- kernel/time/timer_migration.c | 16 ---------------- 1 file changed, 16 deletions(-) diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c index 73d9b0648116..19ddfa96b9df 100644 --- a/kernel/time/timer_migration.c +++ b/kernel/time/timer_migration.c @@ -504,11 +504,6 @@ static bool tmigr_check_lonely(struct tmigr_group *group) * @now: timer base monotonic * @check: is set if there is the need to handle remote timers; * required in tmigr_requires_handle_remote() only - * @tmc_active: this flag indicates, whether the CPU which triggers - * the hierarchy walk is !idle in the timer migration - * hierarchy. When the CPU is idle and the whole hierarchy is - * idle, only the first event of the top level has to be - * considered. */ struct tmigr_walk { u64 nextexp; @@ -519,7 +514,6 @@ struct tmigr_walk { unsigned long basej; u64 now; bool check; - bool tmc_active; }; typedef bool (*up_f)(struct tmigr_group *, struct tmigr_group *, struct tmigr_walk *); @@ -1119,15 +1113,6 @@ static bool tmigr_requires_handle_remote_up(struct tmigr_group *group, */ if (!tmigr_check_migrator(group, childmask)) return true; - - /* - * When there is a parent group and the CPU which triggered the - * hierarchy walk is not active, proceed the walk to reach the top level - * group before reading the next_expiry value. - */ - if (group->parent && !data->tmc_active) - return false; - /* * The lock is required on 32bit architectures to read the variable * consistently with a concurrent writer. On 64bit the lock is not @@ -1172,7 +1157,6 @@ bool tmigr_requires_handle_remote(void) data.now = get_jiffies_update(&jif); data.childmask = tmc->groupmask; data.firstexp = KTIME_MAX; - data.tmc_active = !tmc->idle; data.check = false; /* -- 2.51.0 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* [tip: timers/core] timers/migration: Remove dead code handling idle CPU checking for remote timers 2025-10-24 13:25 ` [PATCH 6/6] timers/migration: Remove dead code handling idle CPU checking for remote timers Frederic Weisbecker @ 2025-11-01 19:41 ` tip-bot2 for Frederic Weisbecker 0 siblings, 0 replies; 13+ messages in thread From: tip-bot2 for Frederic Weisbecker @ 2025-11-01 19:41 UTC (permalink / raw) To: linux-tip-commits; +Cc: Frederic Weisbecker, Thomas Gleixner, x86, linux-kernel The following commit has been merged into the timers/core branch of tip: Commit-ID: ba14500e4bfcab5e841fbf8d7fcbbc80e98d6b9e Gitweb: https://git.kernel.org/tip/ba14500e4bfcab5e841fbf8d7fcbbc80e98d6b9e Author: Frederic Weisbecker <frederic@kernel.org> AuthorDate: Fri, 24 Oct 2025 15:25:36 +02:00 Committer: Thomas Gleixner <tglx@linutronix.de> CommitterDate: Sat, 01 Nov 2025 20:38:25 +01:00 timers/migration: Remove dead code handling idle CPU checking for remote timers Idle migrators don't walk the whole tree in order to find out if there are timers to migrate because they recorded the next deadline to be verified within a single check in tmigr_requires_handle_remote(). Remove the related dead code and data. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251024132536.39841-7-frederic@kernel.org --- kernel/time/timer_migration.c | 16 ---------------- 1 file changed, 16 deletions(-) diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c index 73d9b06..19ddfa9 100644 --- a/kernel/time/timer_migration.c +++ b/kernel/time/timer_migration.c @@ -504,11 +504,6 @@ static bool tmigr_check_lonely(struct tmigr_group *group) * @now: timer base monotonic * @check: is set if there is the need to handle remote timers; * required in tmigr_requires_handle_remote() only - * @tmc_active: this flag indicates, whether the CPU which triggers - * the hierarchy walk is !idle in the timer migration - * hierarchy. When the CPU is idle and the whole hierarchy is - * idle, only the first event of the top level has to be - * considered. */ struct tmigr_walk { u64 nextexp; @@ -519,7 +514,6 @@ struct tmigr_walk { unsigned long basej; u64 now; bool check; - bool tmc_active; }; typedef bool (*up_f)(struct tmigr_group *, struct tmigr_group *, struct tmigr_walk *); @@ -1119,15 +1113,6 @@ static bool tmigr_requires_handle_remote_up(struct tmigr_group *group, */ if (!tmigr_check_migrator(group, childmask)) return true; - - /* - * When there is a parent group and the CPU which triggered the - * hierarchy walk is not active, proceed the walk to reach the top level - * group before reading the next_expiry value. - */ - if (group->parent && !data->tmc_active) - return false; - /* * The lock is required on 32bit architectures to read the variable * consistently with a concurrent writer. On 64bit the lock is not @@ -1172,7 +1157,6 @@ bool tmigr_requires_handle_remote(void) data.now = get_jiffies_update(&jif); data.childmask = tmc->groupmask; data.firstexp = KTIME_MAX; - data.tmc_active = !tmc->idle; data.check = false; /* ^ permalink raw reply related [flat|nested] 13+ messages in thread
end of thread, other threads:[~2025-11-01 19:41 UTC | newest] Thread overview: 13+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-10-24 13:25 [PATCH 0/6] timers/migration: Fix NUMA trees + cleanups Frederic Weisbecker 2025-10-24 13:25 ` [PATCH 1/6] timers/migration: Convert "while" loops to use "for" Frederic Weisbecker 2025-11-01 19:41 ` [tip: timers/core] " tip-bot2 for Frederic Weisbecker 2025-10-24 13:25 ` [PATCH 2/6] timers/migration: Remove locking on group connection Frederic Weisbecker 2025-11-01 19:41 ` [tip: timers/core] " tip-bot2 for Frederic Weisbecker 2025-10-24 13:25 ` [PATCH 3/6] timers/migration: Fix imbalanced NUMA trees Frederic Weisbecker 2025-11-01 19:41 ` [tip: timers/core] " tip-bot2 for Frederic Weisbecker 2025-10-24 13:25 ` [PATCH 4/6] timers/migration: Assert that hotplug preparing CPU is part of stable active hierarchy Frederic Weisbecker 2025-11-01 19:41 ` [tip: timers/core] " tip-bot2 for Frederic Weisbecker 2025-10-24 13:25 ` [PATCH 5/6] timers/migration: Remove unused "cpu" parameter from tmigr_get_group() Frederic Weisbecker 2025-11-01 19:41 ` [tip: timers/core] " tip-bot2 for Frederic Weisbecker 2025-10-24 13:25 ` [PATCH 6/6] timers/migration: Remove dead code handling idle CPU checking for remote timers Frederic Weisbecker 2025-11-01 19:41 ` [tip: timers/core] " tip-bot2 for Frederic Weisbecker
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox