[PATCH 0/3] timer_migration: Fix a possible race and improvements

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/3] timer_migration: Fix a possible race and improvements
@ 2024-06-21  9:37 Anna-Maria Behnsen
  2024-06-21  9:37 ` [PATCH 1/3] timer_migration: Do not rely always on group->parent Anna-Maria Behnsen
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Anna-Maria Behnsen @ 2024-06-21  9:37 UTC (permalink / raw)
  To: Frederic Weisbecker, Thomas Gleixner, linux-kernel
  Cc: Borislav Petkov, Anna-Maria Behnsen, Narasimhan V

Borislav reported a warning in timer migration deactive path

  https://lore.kernel.org/r/20240612090347.GBZmlkc5PwlVpOG6vT@fat_crate.local

Sadly it doesn't reproduce directly. But with the change of timing (by
adding a trace prinkt before the warning), it is possible to trigger the
warning reliable at least in my test setup. The problem here is a racy
check agains group->parent pointer. This is also used in other places in
the code and fixing this racy usage is adressed by the first patch.

While working with the code, I saw two things which could be improved
(tracing and update of per cpu group wakeup value). This improvements are
adressed by the other two patches.

Patches are available here:

  https://git.kernel.org/pub/scm/linux/kernel/git/anna-maria/linux-devel.git timers/misc

Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org

Thanks,

Anna-Maria

---
Anna-Maria Behnsen (3):
      timer_migration: Do not rely always on group->parent
      timer_migration: Spare write when nothing changed
      timer_migration: Improve tracing

 kernel/time/timer_migration.c | 55 ++++++++++++++++++++-----------------------
 kernel/time/timer_migration.h | 12 +++++++++-
 2 files changed, 36 insertions(+), 31 deletions(-)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/3] timer_migration: Do not rely always on group->parent
  2024-06-21  9:37 [PATCH 0/3] timer_migration: Fix a possible race and improvements Anna-Maria Behnsen
@ 2024-06-21  9:37 ` Anna-Maria Behnsen
  2024-06-21  9:37 ` [PATCH 2/3] timer_migration: Spare write when nothing changed Anna-Maria Behnsen
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: Anna-Maria Behnsen @ 2024-06-21  9:37 UTC (permalink / raw)
  To: Frederic Weisbecker, Thomas Gleixner, linux-kernel
  Cc: Borislav Petkov, Anna-Maria Behnsen, Narasimhan V

When reading group->parent without holding the group lock it is racy
against CPUs coming online the first time and thereby creating another
level of the hierarchy. This is not a problem when this value is read once
to decide whether to abort a propagation or not. The worst outcome is an
unnecessary/early CPU wake up. But it is racy when reading it several times
during a single 'action' (like activation, deactivation, checking for
remote timer expiry,...) and relying on the consitency of this value
without holding the lock. This happens at the moment e.g. in
tmigr_inactive_up() which is also calling tmigr_udpate_events(). Code relys
on group->parent not to change during this 'action'.

Update parent struct member description to explain the above only
once. Remove parent pointer checks when they are not mandatory (like update
of data->childmask). Remove a warning, which would be nice but the trigger
of this warning is not reliable and add expand the data structure member
description instead. Expand a comment, why it is safe to rely on parent
pointer here (inside hierarchy update).

Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model")
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: Narasimhan V <Narasimhan.V@amd.com>
---
 kernel/time/timer_migration.c | 33 +++++++++++++++------------------
 kernel/time/timer_migration.h | 12 +++++++++++-
 2 files changed, 26 insertions(+), 19 deletions(-)

diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 84413114db5c..d91efe1dc3bf 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -507,7 +507,14 @@ static void walk_groups(up_f up, void *data, struct tmigr_cpu *tmc)
  *			(get_next_timer_interrupt())
  * @firstexp:		Contains the first event expiry information when last
  *			active CPU of hierarchy is on the way to idle to make
- *			sure CPU will be back in time.
+ *			sure CPU will be back in time. It is updated in top
+ *			level group only. Be aware, there could occur a new top
+ *			level of the hierarchy between the 'top level call' in
+ *			tmigr_update_events() and the check for the parent group
+ *			in walk_groups(). Then @firstexp might contain a value
+ *			!= KTIME_MAX even if it was not the final top
+ *			level. This is not a problem, as the worst outcome is a
+ *			CPU which might wake up a little early.
  * @evt:		Pointer to tmigr_event which needs to be queued (of idle
  *			child group)
  * @childmask:		childmask of child group
@@ -649,7 +656,7 @@ static bool tmigr_active_up(struct tmigr_group *group,
 
 	} while (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state));
 
-	if ((walk_done == false) && group->parent)
+	if (walk_done == false)
 		data->childmask = group->childmask;
 
 	/*
@@ -1317,20 +1324,9 @@ static bool tmigr_inactive_up(struct tmigr_group *group,
 	/* Event Handling */
 	tmigr_update_events(group, child, data);
 
-	if (group->parent && (walk_done == false))
+	if (walk_done == false)
 		data->childmask = group->childmask;
 
-	/*
-	 * data->firstexp was set by tmigr_update_events() and contains the
-	 * expiry of the first global event which needs to be handled. It
-	 * differs from KTIME_MAX if:
-	 * - group is the top level group and
-	 * - group is idle (which means CPU was the last active CPU in the
-	 *   hierarchy) and
-	 * - there is a pending event in the hierarchy
-	 */
-	WARN_ON_ONCE(data->firstexp != KTIME_MAX && group->parent);
-
 	trace_tmigr_group_set_cpu_inactive(group, newstate, childmask);
 
 	return walk_done;
@@ -1552,10 +1548,11 @@ static void tmigr_connect_child_parent(struct tmigr_group *child,
 		data.childmask = child->childmask;
 
 		/*
-		 * There is only one new level per time. When connecting the
-		 * child and the parent and set the child active when the parent
-		 * is inactive, the parent needs to be the uppermost
-		 * level. Otherwise there went something wrong!
+		 * There is only one new level per time (which is protected by
+		 * tmigr_mutex). When connecting the child and the parent and
+		 * set the child active when the parent is inactive, the parent
+		 * needs to be the uppermost level. Otherwise there went
+		 * something wrong!
 		 */
 		WARN_ON(!tmigr_active_up(parent, child, &data) && parent->parent);
 	}
diff --git a/kernel/time/timer_migration.h b/kernel/time/timer_migration.h
index 6c37d94a37d9..494f68cc13f4 100644
--- a/kernel/time/timer_migration.h
+++ b/kernel/time/timer_migration.h
@@ -22,7 +22,17 @@ struct tmigr_event {
  * struct tmigr_group - timer migration hierarchy group
  * @lock:		Lock protecting the event information and group hierarchy
  *			information during setup
- * @parent:		Pointer to the parent group
+ * @parent:		Pointer to the parent group. Pointer is updated when a
+ *			new hierarchy level is added because of a CPU coming
+ *			online the first time. Once it is set, the pointer will
+ *			not be removed or updated. When accessing parent pointer
+ *			lock less to decide whether to abort a propagation or
+ *			not, it is not a problem. The worst outcome is an
+ *			unnecessary/early CPU wake up. But do not access parent
+ *			pointer several times in the same 'action' (like
+ *			activation, deactivation, check for remote expiry,...)
+ *			without holding the lock as it is not ensured that value
+ *			will not change.
  * @groupevt:		Next event of the group which is only used when the
  *			group is !active. The group event is then queued into
  *			the parent timer queue.

-- 
2.39.2


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/3] timer_migration: Spare write when nothing changed
  2024-06-21  9:37 [PATCH 0/3] timer_migration: Fix a possible race and improvements Anna-Maria Behnsen
  2024-06-21  9:37 ` [PATCH 1/3] timer_migration: Do not rely always on group->parent Anna-Maria Behnsen
@ 2024-06-21  9:37 ` Anna-Maria Behnsen
  2024-06-21  9:37 ` [PATCH 3/3] timer_migration: Improve tracing Anna-Maria Behnsen
  2024-06-21 14:31 ` [PATCH 0/3] timer_migration: Fix a possible race and improvements Frederic Weisbecker
  3 siblings, 0 replies; 8+ messages in thread
From: Anna-Maria Behnsen @ 2024-06-21  9:37 UTC (permalink / raw)
  To: Frederic Weisbecker, Thomas Gleixner, linux-kernel; +Cc: Anna-Maria Behnsen

The wakeup value is written unconditionally in tmigr_cpu_new_timer(). When
there was no new next timer expiry that needs to be propagated, then the
value that was read before is written. This is not required.

Move write to the place where wakeup value could have changed.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
---
 kernel/time/timer_migration.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index d91efe1dc3bf..f55be5411ad9 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -1237,14 +1237,13 @@ u64 tmigr_cpu_new_timer(u64 nextexp)
 		if (nextexp != tmc->cpuevt.nextevt.expires ||
 		    tmc->cpuevt.ignore) {
 			ret = tmigr_new_timer(tmc, nextexp);
+			/*
+			 * Make sure the reevaluation of timers in idle path
+			 * will not miss an event.
+			 */
+			WRITE_ONCE(tmc->wakeup, ret);
 		}
 	}
-	/*
-	 * Make sure the reevaluation of timers in idle path will not miss an
-	 * event.
-	 */
-	WRITE_ONCE(tmc->wakeup, ret);
-
 	trace_tmigr_cpu_new_timer_idle(tmc, nextexp);
 	raw_spin_unlock(&tmc->lock);
 	return ret;

-- 
2.39.2


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 3/3] timer_migration: Improve tracing
  2024-06-21  9:37 [PATCH 0/3] timer_migration: Fix a possible race and improvements Anna-Maria Behnsen
  2024-06-21  9:37 ` [PATCH 1/3] timer_migration: Do not rely always on group->parent Anna-Maria Behnsen
  2024-06-21  9:37 ` [PATCH 2/3] timer_migration: Spare write when nothing changed Anna-Maria Behnsen
@ 2024-06-21  9:37 ` Anna-Maria Behnsen
  2024-06-21 14:31 ` [PATCH 0/3] timer_migration: Fix a possible race and improvements Frederic Weisbecker
  3 siblings, 0 replies; 8+ messages in thread
From: Anna-Maria Behnsen @ 2024-06-21  9:37 UTC (permalink / raw)
  To: Frederic Weisbecker, Thomas Gleixner, linux-kernel; +Cc: Anna-Maria Behnsen

Trace points of inactive and active propagation are located at the end of
the related functions. The interesting information of those trace points is
the updated group state. When trace points are not located directly at the
place where group state changed, order of trace points in traces could be
confusing.

Move inactive and active propagation trace points directly after update of
group state values.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
---
 kernel/time/timer_migration.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index f55be5411ad9..41ea48d08c65 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -656,6 +656,8 @@ static bool tmigr_active_up(struct tmigr_group *group,
 
 	} while (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state));
 
+	trace_tmigr_group_set_cpu_active(group, newstate, childmask);
+
 	if (walk_done == false)
 		data->childmask = group->childmask;
 
@@ -673,8 +675,6 @@ static bool tmigr_active_up(struct tmigr_group *group,
 	 */
 	group->groupevt.ignore = true;
 
-	trace_tmigr_group_set_cpu_active(group, newstate, childmask);
-
 	return walk_done;
 }
 
@@ -1305,9 +1305,10 @@ static bool tmigr_inactive_up(struct tmigr_group *group,
 
 		WARN_ON_ONCE((newstate.migrator != TMIGR_NONE) && !(newstate.active));
 
-		if (atomic_try_cmpxchg(&group->migr_state, &curstate.state,
-				       newstate.state))
+		if (atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state)) {
+			trace_tmigr_group_set_cpu_inactive(group, newstate, childmask);
 			break;
+		}
 
 		/*
 		 * The memory barrier is paired with the cmpxchg() in
@@ -1326,8 +1327,6 @@ static bool tmigr_inactive_up(struct tmigr_group *group,
 	if (walk_done == false)
 		data->childmask = group->childmask;
 
-	trace_tmigr_group_set_cpu_inactive(group, newstate, childmask);
-
 	return walk_done;
 }
 

-- 
2.39.2


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 0/3] timer_migration: Fix a possible race and improvements
  2024-06-21  9:37 [PATCH 0/3] timer_migration: Fix a possible race and improvements Anna-Maria Behnsen
                   ` (2 preceding siblings ...)
  2024-06-21  9:37 ` [PATCH 3/3] timer_migration: Improve tracing Anna-Maria Behnsen
@ 2024-06-21 14:31 ` Frederic Weisbecker
  2024-06-24  8:58   ` Anna-Maria Behnsen
  3 siblings, 1 reply; 8+ messages in thread
From: Frederic Weisbecker @ 2024-06-21 14:31 UTC (permalink / raw)
  To: Anna-Maria Behnsen
  Cc: Thomas Gleixner, linux-kernel, Borislav Petkov, Narasimhan V

Le Fri, Jun 21, 2024 at 11:37:05AM +0200, Anna-Maria Behnsen a écrit :
> Borislav reported a warning in timer migration deactive path
> 
>   https://lore.kernel.org/r/20240612090347.GBZmlkc5PwlVpOG6vT@fat_crate.local
> 
> Sadly it doesn't reproduce directly. But with the change of timing (by
> adding a trace prinkt before the warning), it is possible to trigger the
> warning reliable at least in my test setup. The problem here is a racy
> check agains group->parent pointer. This is also used in other places in
> the code and fixing this racy usage is adressed by the first patch.
> 
> While working with the code, I saw two things which could be improved
> (tracing and update of per cpu group wakeup value). This improvements are
> adressed by the other two patches.
> 
> Patches are available here:
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/anna-maria/linux-devel.git timers/misc
> 
> Cc: Frederic Weisbecker <frederic@kernel.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: linux-kernel@vger.kernel.org
> 
> Thanks,
> 
> Anna-Maria
> 
> ---

This made me stare at the group creation again and I might have found
something. Does the following race look plausible to you?


                  [GRP0:0]
               migrator = 0
               active   = 0
               nextevt  = KTIME_MAX
               /         \
              0         1 .. 7
          active         idle

0) Hierarchy has only 8 CPUs (single node for now with only CPU 0
   as active.

   
                             [GRP1:0]
                        migrator = TMIGR_NONE
                        active   = NONE
                        nextevt  = KTIME_MAX
                                         \
                 [GRP0:0]                  [GRP0:1]
              migrator = 0              migrator = TMIGR_NONE
              active   = 0              active   = NONE
              nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
                /         \                    |
              0          1 .. 7                8
          active         idle                !online

1) CPU 8 is booting and creates a new node and a new top. For now it's
   only connected to GRP0:1, not yet to GRP0:0. Also CPU 8 hasn't called
   __tmigr_cpu_activate() on itself yet.


                             [GRP1:0]
                        migrator = TMIGR_NONE
                        active   = NONE
                        nextevt  = KTIME_MAX
                       /                  \
                 [GRP0:0]                  [GRP0:1]
              migrator = 0              migrator = TMIGR_NONE
              active   = 0              active   = NONE
              nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
                /         \                    |
              0          1 .. 7                8
          active         idle                active

2) CPU 8 connects GRP0:0 to GRP1:0 and observes while in
   tmigr_connect_child_parent() that GRP0:0 is not TMIGR_NONE. So it
   prepares to call tmigr_active_up() on it. It hasn't done it yet.


                             [GRP1:0]
                        migrator = TMIGR_NONE
                        active   = NONE
                        nextevt  = KTIME_MAX
                       /                  \
                 [GRP0:0]                  [GRP0:1]
              migrator = TMIGR_NONE        migrator = TMIGR_NONE
              active   = NONE              active   = NONE
              nextevt  = KTIME_MAX         nextevt  = KTIME_MAX
                /         \                    |
              0          1 .. 7                8
            idle         idle                active

3) CPU 0 goes idle. Since GRP0:0->parent has been updated by CPU 8 with
   GRP0:0->lock held, CPU 0 observes GRP1:0 after calling tmigr_update_events()
   and it propagates the change to the top (no change there and no wakeup
   programmed since there is no timer).


                             [GRP1:0]
                        migrator = GRP0:0
                        active   = GRP0:0
                        nextevt  = KTIME_MAX
                       /                  \
                 [GRP0:0]                  [GRP0:1]
              migrator = TMIGR_NONE       migrator = TMIGR_NONE
              active   = NONE             active   = NONE
              nextevt  = KTIME_MAX        nextevt  = KTIME_MAX
                /         \                    |
              0          1 .. 7                8
            idle         idle                active

4) Now CPU 8 finally calls tmigr_active_up() to GRP0:0

                             [GRP1:0]
                        migrator = GRP0:0
                        active   = GRP0:0, GRP0:1
                        nextevt  = KTIME_MAX
                       /                  \
                 [GRP0:0]                  [GRP0:1]
              migrator = TMIGR_NONE       migrator = 8
              active   = NONE             active   = 8
              nextevt  = KTIME_MAX        nextevt  = KTIME_MAX
                /         \                    |
              0          1 .. 7                8
            idle         idle                active

5) And out of tmigr_cpu_online() CPU 8 calls tmigr_active_up() on itself

                             [GRP1:0]
                        migrator = GRP0:0
                        active   = GRP0:0
                        nextevt  = T8
                       /                  \
                 [GRP0:0]                  [GRP0:1]
              migrator = TMIGR_NONE         migrator = TMIGR_NONE
              active   = NONE               active   = NONE
              nextevt  = KTIME_MAX          nextevt  = T8
                /         \                    |
              0          1 .. 7                8
            idle         idle                  idle

5) CPU 8 goes idle with a timer T8 and relies on GRP0:0 as the migrator.
   But it's not really active, so T8 gets ignored.


And if that race looks plausible, does the following fix look good?

diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 84413114db5c..0609cb8c770e 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -1525,7 +1525,6 @@ static void tmigr_connect_child_parent(struct tmigr_group *child,
 	child->childmask = BIT(parent->num_children++);
 
 	raw_spin_unlock(&parent->lock);
-	raw_spin_unlock_irq(&child->lock);
 
 	trace_tmigr_connect_child_parent(child);
 
@@ -1559,6 +1558,14 @@ static void tmigr_connect_child_parent(struct tmigr_group *child,
 		 */
 		WARN_ON(!tmigr_active_up(parent, child, &data) && parent->parent);
 	}
+	/*
+	 * Keep the lock up to that point so that if the child goes idle
+	 * concurrently, either it sees the new parent with its active state
+	 * after locking on tmigr_update_events() and propagates afterwards
+	 * its idle state up, or the current booting CPU will observe TMIGR_NONE
+	 * on the remote child and it won't propagate a spurious active state.
+	 */
+	raw_spin_unlock_irq(&child->lock);
 }
 
 static int tmigr_setup_groups(unsigned int cpu, unsigned int node)

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 0/3] timer_migration: Fix a possible race and improvements
  2024-06-21 14:31 ` [PATCH 0/3] timer_migration: Fix a possible race and improvements Frederic Weisbecker
@ 2024-06-24  8:58   ` Anna-Maria Behnsen
  2024-06-24 11:04     ` Frederic Weisbecker
  0 siblings, 1 reply; 8+ messages in thread
From: Anna-Maria Behnsen @ 2024-06-24  8:58 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Thomas Gleixner, linux-kernel, Borislav Petkov, Narasimhan V

Frederic Weisbecker <frederic@kernel.org> writes:

> Le Fri, Jun 21, 2024 at 11:37:05AM +0200, Anna-Maria Behnsen a écrit :
>> Borislav reported a warning in timer migration deactive path
>> 
>>   https://lore.kernel.org/r/20240612090347.GBZmlkc5PwlVpOG6vT@fat_crate.local
>> 
>> Sadly it doesn't reproduce directly. But with the change of timing (by
>> adding a trace prinkt before the warning), it is possible to trigger the
>> warning reliable at least in my test setup. The problem here is a racy
>> check agains group->parent pointer. This is also used in other places in
>> the code and fixing this racy usage is adressed by the first patch.
>> 
>> While working with the code, I saw two things which could be improved
>> (tracing and update of per cpu group wakeup value). This improvements are
>> adressed by the other two patches.
>> 
>> Patches are available here:
>> 
>>   https://git.kernel.org/pub/scm/linux/kernel/git/anna-maria/linux-devel.git timers/misc
>> 
>> Cc: Frederic Weisbecker <frederic@kernel.org>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: linux-kernel@vger.kernel.org
>> 
>> Thanks,
>> 
>> Anna-Maria
>> 
>> ---
>
> This made me stare at the group creation again and I might have found
> something. Does the following race look plausible to you?

Yes...

>
>                   [GRP0:0]
>                migrator = 0
>                active   = 0
>                nextevt  = KTIME_MAX
>                /         \
>               0         1 .. 7
>           active         idle
>
> 0) Hierarchy has only 8 CPUs (single node for now with only CPU 0
>    as active.
>
>    
>                              [GRP1:0]
>                         migrator = TMIGR_NONE
>                         active   = NONE
>                         nextevt  = KTIME_MAX
>                                          \
>                  [GRP0:0]                  [GRP0:1]
>               migrator = 0              migrator = TMIGR_NONE
>               active   = 0              active   = NONE
>               nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
>                 /         \                    |
>               0          1 .. 7                8
>           active         idle                !online
>
> 1) CPU 8 is booting and creates a new node and a new top. For now it's
>    only connected to GRP0:1, not yet to GRP0:0. Also CPU 8 hasn't called
>    __tmigr_cpu_activate() on itself yet.

NIT: In step 1) CPU8 is not yet connected to GRP0:1 as the second while
loop is not yet finished, but nevertheless...

>
>                              [GRP1:0]
>                         migrator = TMIGR_NONE
>                         active   = NONE
>                         nextevt  = KTIME_MAX
>                        /                  \
>                  [GRP0:0]                  [GRP0:1]
>               migrator = 0              migrator = TMIGR_NONE
>               active   = 0              active   = NONE
>               nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
>                 /         \                    |
>               0          1 .. 7                8
>           active         idle                active
>
> 2) CPU 8 connects GRP0:0 to GRP1:0 and observes while in
>    tmigr_connect_child_parent() that GRP0:0 is not TMIGR_NONE. So it
>    prepares to call tmigr_active_up() on it. It hasn't done it yet.

NIT: CPU8 keeps its state !online until step 5.

>
>
>                              [GRP1:0]
>                         migrator = TMIGR_NONE
>                         active   = NONE
>                         nextevt  = KTIME_MAX
>                        /                  \
>                  [GRP0:0]                  [GRP0:1]
>               migrator = TMIGR_NONE        migrator = TMIGR_NONE
>               active   = NONE              active   = NONE
>               nextevt  = KTIME_MAX         nextevt  = KTIME_MAX
>                 /         \                    |
>               0          1 .. 7                8
>             idle         idle                active
>
> 3) CPU 0 goes idle. Since GRP0:0->parent has been updated by CPU 8 with
>    GRP0:0->lock held, CPU 0 observes GRP1:0 after calling tmigr_update_events()
>    and it propagates the change to the top (no change there and no wakeup
>    programmed since there is no timer).
>
>
>                              [GRP1:0]
>                         migrator = GRP0:0
>                         active   = GRP0:0
>                         nextevt  = KTIME_MAX
>                        /                  \
>                  [GRP0:0]                  [GRP0:1]
>               migrator = TMIGR_NONE       migrator = TMIGR_NONE
>               active   = NONE             active   = NONE
>               nextevt  = KTIME_MAX        nextevt  = KTIME_MAX
>                 /         \                    |
>               0          1 .. 7                8
>             idle         idle                active
>
> 4) Now CPU 8 finally calls tmigr_active_up() to GRP0:0

... oh no. Here it starts to go mad. Good catch!

>
>                              [GRP1:0]
>                         migrator = GRP0:0
>                         active   = GRP0:0, GRP0:1
>                         nextevt  = KTIME_MAX
>                        /                  \
>                  [GRP0:0]                  [GRP0:1]
>               migrator = TMIGR_NONE       migrator = 8
>               active   = NONE             active   = 8
>               nextevt  = KTIME_MAX        nextevt  = KTIME_MAX
>                 /         \                    |
>               0          1 .. 7                8
>             idle         idle                active
>
> 5) And out of tmigr_cpu_online() CPU 8 calls tmigr_active_up() on itself
>
>                              [GRP1:0]
>                         migrator = GRP0:0
>                         active   = GRP0:0
>                         nextevt  = T8
>                        /                  \
>                  [GRP0:0]                  [GRP0:1]
>               migrator = TMIGR_NONE         migrator = TMIGR_NONE
>               active   = NONE               active   = NONE
>               nextevt  = KTIME_MAX          nextevt  = T8
>                 /         \                    |
>               0          1 .. 7                8
>             idle         idle                  idle
>
> 5) CPU 8 goes idle with a timer T8 and relies on GRP0:0 as the migrator.
>    But it's not really active, so T8 gets ignored.
>
>
> And if that race looks plausible, does the following fix look good?

Holding the lock will not help as the state is not protected by the
lock. Step 4) would nevertheless happen. To properly fix this, we need a
memory barrier. I added a comment back then in tmigr_active_up() why the
barrier is not required there but without having this corner case in
mind. Or am I wrong?

To solve it, we could

a) add a memory barrier also in tmigr_active_up() and read the
   childstate there always (similar to tmigr_inactive_up()).

b) create a separate tmigr_setup_active_up() function with this memory
   barrier and with reading the childstate there after the memory
   barrier.

I would propose to go with b) to do not impact active_up().

I dropped the walk_hierarchy information for tmigr_setup_active_up() and
also do not set the group event ignore bit. This shouldn't be required,
as this active update could only happen between the new top level and
the formerly top level group. And the event ignore bit on the top level
isn't taken into account.

This is not tested yet, but it could look like the following. When it
makes sense to you and if I didn't miss something else - I would start
testing and functionality verification ;)

---8<----
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -623,12 +623,37 @@ static u64 tmigr_next_groupevt_expires(s
 		return evt->nextevt.expires;
 }
 
+static bool __tmigr_active_up(struct tmigr_group *group, bool *walk_done, union tmigr_state *curstate, u8 childmask)
+{
+	union tmigr_state newstate;
+
+	newstate = *curstate;
+	*walk_done = true;
+
+	if (newstate.migrator == TMIGR_NONE) {
+		newstate.migrator = childmask;
+
+		/* Changes need to be propagated */
+		*walk_done = false;
+	}
+
+	newstate.active |= childmask;
+	newstate.seq++;
+
+	if (atomic_try_cmpxchg(&group->migr_state, &curstate->state, newstate.state)) {
+		trace_tmigr_group_set_cpu_active(group, newstate, childmask);
+		return true;
+	}
+
+	return false;
+}
+
 static bool tmigr_active_up(struct tmigr_group *group,
 			    struct tmigr_group *child,
 			    void *ptr)
 {
-	union tmigr_state curstate, newstate;
 	struct tmigr_walk *data = ptr;
+	union tmigr_state curstate;
 	bool walk_done;
 	u8 childmask;
 
@@ -640,23 +665,10 @@ static bool tmigr_active_up(struct tmigr
 	 */
 	curstate.state = atomic_read(&group->migr_state);
 
-	do {
-		newstate = curstate;
-		walk_done = true;
-
-		if (newstate.migrator == TMIGR_NONE) {
-			newstate.migrator = childmask;
-
-			/* Changes need to be propagated */
-			walk_done = false;
-		}
-
-		newstate.active |= childmask;
-		newstate.seq++;
-
-	} while (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state));
-
-	trace_tmigr_group_set_cpu_active(group, newstate, childmask);
+	for(;;) {
+		if (__tmigr_active_up(group, &walk_done, &curstate, childmask))
+		    break;
+	}
 
 	if (walk_done == false)
 		data->childmask = group->childmask;
@@ -1436,6 +1448,35 @@ u64 tmigr_quick_check(u64 nextevt)
 	return KTIME_MAX;
 }
 
+static void tmigr_setup_active_up(struct tmigr_group *group, struct tmigr_group *child)
+{
+	union tmigr_state curstate, childstate;
+	bool walk_done;
+
+	/*
+	 * FIXME: Memory barrier is required here as the child state
+	 * could have changed in the meantime
+	 */
+	curstate.state = atomic_read_acquire(&group->migr_state);
+
+	for (;;) {
+		childstate.state = atomic_read(&child->migr_state);
+		if (!childstate.active)
+			return;
+
+		if (__tmigr_active_up(group, &walk_done, &curstate, child->childmask))
+			break;
+
+		/*
+		 * The memory barrier is paired with the cmpxchg() in
+		 * tmigr_inactive_up() to make sure the updates of child and group
+		 * states are ordered. It is required only when the
+		 * try_cmpxchg() in __tmigr_active_up() fails.
+		 */
+		smp_mb__after_atomic();
+	}
+}
+
 static void tmigr_init_group(struct tmigr_group *group, unsigned int lvl,
 			     int node)
 {
@@ -1510,8 +1551,6 @@ static struct tmigr_group *tmigr_get_gro
 static void tmigr_connect_child_parent(struct tmigr_group *child,
 				       struct tmigr_group *parent)
 {
-	union tmigr_state childstate;
-
 	raw_spin_lock_irq(&child->lock);
 	raw_spin_lock_nested(&parent->lock, SINGLE_DEPTH_NESTING);
 
@@ -1539,21 +1578,7 @@ static void tmigr_connect_child_parent(s
 	 *   executed with the formerly top level group (child) and the newly
 	 *   created group (parent).
 	 */
-	childstate.state = atomic_read(&child->migr_state);
-	if (childstate.migrator != TMIGR_NONE) {
-		struct tmigr_walk data;
-
-		data.childmask = child->childmask;
-
-		/*
-		 * There is only one new level per time (which is protected by
-		 * tmigr_mutex). When connecting the child and the parent and
-		 * set the child active when the parent is inactive, the parent
-		 * needs to be the uppermost level. Otherwise there went
-		 * something wrong!
-		 */
-		WARN_ON(!tmigr_active_up(parent, child, &data) && parent->parent);
-	}
+	tmigr_setup_active_up(parent, child);
 }
 
 static int tmigr_setup_groups(unsigned int cpu, unsigned int node)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 0/3] timer_migration: Fix a possible race and improvements
  2024-06-24  8:58   ` Anna-Maria Behnsen
@ 2024-06-24 11:04     ` Frederic Weisbecker
  2024-06-24 14:48       ` Anna-Maria Behnsen
  0 siblings, 1 reply; 8+ messages in thread
From: Frederic Weisbecker @ 2024-06-24 11:04 UTC (permalink / raw)
  To: Anna-Maria Behnsen
  Cc: Thomas Gleixner, linux-kernel, Borislav Petkov, Narasimhan V

On Mon, Jun 24, 2024 at 10:58:26AM +0200, Anna-Maria Behnsen wrote:
> Frederic Weisbecker <frederic@kernel.org> writes:
> > 0) Hierarchy has only 8 CPUs (single node for now with only CPU 0
> >    as active.
> >
> >    
> >                              [GRP1:0]
> >                         migrator = TMIGR_NONE
> >                         active   = NONE
> >                         nextevt  = KTIME_MAX
> >                                          \
> >                  [GRP0:0]                  [GRP0:1]
> >               migrator = 0              migrator = TMIGR_NONE
> >               active   = 0              active   = NONE
> >               nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
> >                 /         \                    |
> >               0          1 .. 7                8
> >           active         idle                !online
> >
> > 1) CPU 8 is booting and creates a new node and a new top. For now it's
> >    only connected to GRP0:1, not yet to GRP0:0. Also CPU 8 hasn't called
> >    __tmigr_cpu_activate() on itself yet.
> 
> NIT: In step 1) CPU8 is not yet connected to GRP0:1 as the second while
> loop is not yet finished, but nevertheless...

Yes, I was in the second loop in my mind, but that didn't transcribe well :-)

> 
> >
> >                              [GRP1:0]
> >                         migrator = TMIGR_NONE
> >                         active   = NONE
> >                         nextevt  = KTIME_MAX
> >                        /                  \
> >                  [GRP0:0]                  [GRP0:1]
> >               migrator = 0              migrator = TMIGR_NONE
> >               active   = 0              active   = NONE
> >               nextevt  = KTIME_MAX      nextevt  = KTIME_MAX
> >                 /         \                    |
> >               0          1 .. 7                8
> >           active         idle                active
> >
> > 2) CPU 8 connects GRP0:0 to GRP1:0 and observes while in
> >    tmigr_connect_child_parent() that GRP0:0 is not TMIGR_NONE. So it
> >    prepares to call tmigr_active_up() on it. It hasn't done it yet.
> 
> NIT: CPU8 keeps its state !online until step 5.

Oops, copy paste hazards.

> Holding the lock will not help as the state is not protected by the
> lock.

No but any access happening before an UNLOCK is guaranteed to be visible
after the next LOCK. This makes sure that either:

1) If the remote CPU going inactive has passed tmigr_update_events() with
its unlock of group->lock then the onlining CPU will see the TMIGR_NONE

or:

1) If the onlining CPU locks before the remote CPU calls tmigr_update_events(),
   then the subsequent LOCK on group->lock in tmigr_update_events() will acquire
   the new parent link and propagate the up the inactive state.

And yes there is an optimization case where we don't lock:

		if (evt->ignore && !remote && group->parent)
			return true;

And that would fall through the cracks. So, forget about that lock idea.


> +static void tmigr_setup_active_up(struct tmigr_group *group, struct tmigr_group *child)
> +{
> +	union tmigr_state curstate, childstate;
> +	bool walk_done;
> +
> +	/*
> +	 * FIXME: Memory barrier is required here as the child state
> +	 * could have changed in the meantime
> +	 */
> +	curstate.state = atomic_read_acquire(&group->migr_state);
> +
> +	for (;;) {
> +		childstate.state = atomic_read(&child->migr_state);
> +		if (!childstate.active)
> +			return;

Ok there could have been a risk that we miss the remote CPU going active. But
again thanks to the lock this makes sure that either we observe the childstate
as active or the remote CPU sees the link and propagates its active state. And
the unlocked optimization tmigr_update_events() still works because either it
sees the new parent and proceeds or it sees an intermediate parent and the next
ones will be locked.

Phew!

I'll do a deeper review this evening but it _looks_ ok.

Thanks.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 0/3] timer_migration: Fix a possible race and improvements
  2024-06-24 11:04     ` Frederic Weisbecker
@ 2024-06-24 14:48       ` Anna-Maria Behnsen
  0 siblings, 0 replies; 8+ messages in thread
From: Anna-Maria Behnsen @ 2024-06-24 14:48 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Thomas Gleixner, linux-kernel, Borislav Petkov, Narasimhan V

Frederic Weisbecker <frederic@kernel.org> writes:

> On Mon, Jun 24, 2024 at 10:58:26AM +0200, Anna-Maria Behnsen wrote:
>> Frederic Weisbecker <frederic@kernel.org> writes:
>> +static void tmigr_setup_active_up(struct tmigr_group *group, struct tmigr_group *child)
>> +{
>> +	union tmigr_state curstate, childstate;
>> +	bool walk_done;
>> +
>> +	/*
>> +	 * FIXME: Memory barrier is required here as the child state
>> +	 * could have changed in the meantime
>> +	 */
>> +	curstate.state = atomic_read_acquire(&group->migr_state);
>> +
>> +	for (;;) {
>> +		childstate.state = atomic_read(&child->migr_state);
>> +		if (!childstate.active)
>> +			return;
>
> Ok there could have been a risk that we miss the remote CPU going active. But
> again thanks to the lock this makes sure that either we observe the childstate
> as active or the remote CPU sees the link and propagates its active state. And
> the unlocked optimization tmigr_update_events() still works because either it
> sees the new parent and proceeds or it sees an intermediate parent and the next
> ones will be locked.
>
> Phew!
>
> I'll do a deeper review this evening but it _looks_ ok.
>

I will send you a v2 of the whole timer_migration series where the fix
is also splitted. And review should then be a little easier.

Thanks,

        Anna-Maria


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2024-06-24 14:48 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-21  9:37 [PATCH 0/3] timer_migration: Fix a possible race and improvements Anna-Maria Behnsen
2024-06-21  9:37 ` [PATCH 1/3] timer_migration: Do not rely always on group->parent Anna-Maria Behnsen
2024-06-21  9:37 ` [PATCH 2/3] timer_migration: Spare write when nothing changed Anna-Maria Behnsen
2024-06-21  9:37 ` [PATCH 3/3] timer_migration: Improve tracing Anna-Maria Behnsen
2024-06-21 14:31 ` [PATCH 0/3] timer_migration: Fix a possible race and improvements Frederic Weisbecker
2024-06-24  8:58   ` Anna-Maria Behnsen
2024-06-24 11:04     ` Frederic Weisbecker
2024-06-24 14:48       ` Anna-Maria Behnsen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox