From: Frederic Weisbecker <frederic@kernel.org>
To: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: linux-kernel@vger.kernel.org,
Peter Zijlstra <peterz@infradead.org>,
John Stultz <jstultz@google.com>,
Thomas Gleixner <tglx@linutronix.de>,
Eric Dumazet <edumazet@google.com>,
"Rafael J . Wysocki" <rafael.j.wysocki@intel.com>,
Arjan van de Ven <arjan@infradead.org>,
"Paul E . McKenney" <paulmck@kernel.org>,
Rik van Riel <riel@surriel.com>,
Steven Rostedt <rostedt@goodmis.org>,
Sebastian Siewior <bigeasy@linutronix.de>,
Giovanni Gherdovich <ggherdovich@suse.cz>,
Lukasz Luba <lukasz.luba@arm.com>,
"Gautham R . Shenoy" <gautham.shenoy@amd.com>,
Srinivas Pandruvada <srinivas.pandruvada@intel.com>,
K Prateek Nayak <kprateek.nayak@amd.com>
Subject: Re: [PATCH v10 18/20] timers: Implement the hierarchical pull model
Date: Fri, 26 Jan 2024 13:53:13 +0100
Message-ID: <ZbOrOV8kWUd59h9q@lothringen>
In-Reply-To: <20240115143743.27827-19-anna-maria@linutronix.de>
On Mon, Jan 15, 2024 at 03:37:41PM +0100, Anna-Maria Behnsen wrote:
> + * Protection of the tmigr group state information:
> + * ------------------------------------------------
> + *
> + * The state information with the list of active children and migrator needs to
> + * be protected by a sequence counter. It prevents a race when updates in child
> + * groups are propagated in changed order. The state update is performed
> + * lockless and group wise. The following scenario describes what happens
> + * without updating the sequence counter:
> + *
> + * Therefore, let's take three groups and four CPUs (CPU2 and CPU3 as well
> + * as GRP0:1 will not change during the scenario):
> + *
> + *    LVL 1            [GRP1:0]
> + *                     migrator = GRP0:1
> + *                     active   = GRP0:0, GRP0:1
> + *                   /                \
> + *    LVL 0  [GRP0:0]                  [GRP0:1]
> + *           migrator = CPU0           migrator = CPU2
> + *           active   = CPU0           active   = CPU2
> + *              /         \                /         \
> + *    CPUs     0           1              2           3
> + *             active      idle           active      idle
> + *
> + *
> + * 1. CPU0 goes idle. As the update is performed group wise, in the first step
> + * only GRP0:0 is updated. The update of GRP1:0 is pending as CPU0 has to
> + * walk the hierarchy.
> + *
> + *    LVL 1            [GRP1:0]
> + *                     migrator = GRP0:1
> + *                     active   = GRP0:0, GRP0:1
> + *                   /                \
> + *    LVL 0  [GRP0:0]                  [GRP0:1]
> + *       --> migrator = TMIGR_NONE     migrator = CPU2
> + *       --> active   =                active   = CPU2
> + *              /         \                /         \
> + *    CPUs     0           1              2           3
> + *         --> idle        idle           active      idle
> + *
> + * 2. While CPU0 goes idle and continues to update the state, CPU1 comes out of
> + * idle. CPU1 updates GRP0:0. The update for GRP1:0 is pending as CPU1 also
> + * has to walk the hierarchy. Both CPUs (CPU0 and CPU1) now walk the hierarchy to
> + * perform the needed update from their point of view. The currently visible
> + * state looks as follows:
> + *
> + *    LVL 1            [GRP1:0]
> + *                     migrator = GRP0:1
> + *                     active   = GRP0:0, GRP0:1
> + *                   /                \
> + *    LVL 0  [GRP0:0]                  [GRP0:1]
> + *       --> migrator = CPU1           migrator = CPU2
> + *       --> active   = CPU1           active   = CPU2
> + *              /         \                /         \
> + *    CPUs     0           1              2           3
> + *             idle    --> active         active      idle
> + *
> + * 3. Here is the race condition: CPU1 managed to propagate its changes (from
> + * step 2) through the hierarchy to GRP1:0 before CPU0 (step 1) did. The
> + * active members of GRP1:0 remain unchanged after the update since it is
> + * still valid from CPU1's current point of view:
> + *
> + *    LVL 1            [GRP1:0]
> + *                 --> migrator = GRP0:1
> + *                 --> active   = GRP0:0, GRP0:1
> + *                   /                \
> + *    LVL 0  [GRP0:0]                  [GRP0:1]
> + *           migrator = CPU1           migrator = CPU2
> + *           active   = CPU1           active   = CPU2
> + *              /         \                /         \
> + *    CPUs     0           1              2           3
> + *             idle        active         active      idle
So let's take this scenario and suppose we are at this stage. CPU 1
has propagated the state up to [GRP1:0] and CPU 0 is going to do it
but hasn't yet.
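To make the role of the sequence counter at this stage concrete, here is a minimal userspace sketch in C11 atomics. It is not the patch's code: the STATE() packing, the GRP0_0/GRP0_1 bit values and the function name are assumptions made up for illustration. It replays step 3 with a single thread: CPU 0 snapshots GRP1:0, CPU 1 publishes an update that leaves active and migrator unchanged, then CPU 0 tries its compare-and-exchange against the old snapshot.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical packing, for illustration only: active mask, migrator mask, seq. */
#define STATE(active, migrator, seq) \
	((uint32_t)(active) | ((uint32_t)(migrator) << 8) | ((uint32_t)(seq) << 16))
#define STATE_SEQ(s)	((uint16_t)((s) >> 16))

#define GRP0_0	0x01u	/* assumed child bit of GRP0:0 in GRP1:0 */
#define GRP0_1	0x02u	/* assumed child bit of GRP0:1 in GRP1:0 */

/*
 * Returns true if CPU 0's update, computed from a stale snapshot of GRP1:0,
 * still goes through after CPU 1 has published its own update.
 */
static bool stale_update_succeeds(bool bump_seq)
{
	_Atomic uint32_t grp1_0 = STATE(GRP0_0 | GRP0_1, GRP0_1, 0);
	unsigned int inc = bump_seq ? 1 : 0;

	/* CPU 0: snapshot taken before CPU 1's update lands. */
	uint32_t cpu0_cur = atomic_load(&grp1_0);

	/* CPU 1: GRP0:0 is active again, so active/migrator do not change. */
	uint32_t cpu1_cur = atomic_load(&grp1_0);
	uint32_t cpu1_new = STATE(GRP0_0 | GRP0_1, GRP0_1,
				  STATE_SEQ(cpu1_cur) + inc);
	bool ok = atomic_compare_exchange_strong(&grp1_0, &cpu1_cur, cpu1_new);
	assert(ok);
	(void)ok;

	/* CPU 0: wants to clear GRP0:0, based on its old view of the child. */
	uint32_t cpu0_new = STATE(GRP0_1, GRP0_1, STATE_SEQ(cpu0_cur) + inc);
	return atomic_compare_exchange_strong(&grp1_0, &cpu0_cur, cpu0_new);
}
```

With bump_seq == false the stale update is accepted, because the word CPU 0 compares against is bit-for-bit identical to what CPU 1 wrote. With bump_seq == true the compare-and-exchange fails and CPU 0 has to go around the loop again, which is where the ordering of the child re-read starts to matter.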
> +static bool tmigr_inactive_up(struct tmigr_group *group,
> + struct tmigr_group *child,
> + void *ptr)
> +{
> + union tmigr_state curstate, newstate, childstate;
> + struct tmigr_walk *data = ptr;
> + bool walk_done;
> + u8 childmask;
> +
> + childmask = data->childmask;
> + curstate.state = atomic_read(&group->migr_state);
And now suppose CPU 0 arrives here and sees the group->migr_state change
performed by CPU 1. So it's all good, right? The atomic_try_cmpxchg() below
will succeed on the first try.
> + childstate.state = 0;
> +
> + do {
> + if (child)
> + childstate.state = atomic_read(&child->migr_state);
But then how do you guarantee that CPU 0 will load here the version of
child->migr_state modified by CPU 1? What prevents it from loading the stale
value instead, the one that was written by CPU 0? Nothing, because the two reads
above are unordered. As a result, CPU 0 may ignore the fact that CPU 1 is up and
wrongly report GRP0:0 as inactive up to GRP1:0.
One way to solve this is to change the above atomic_read(&group->migr_state)
into atomic_read_acquire(&group->migr_state). It's cheap and pairs with the
order enforced by the upwards successful cmpxchg calls.
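A rough userspace analogy of that pairing, written with C11 atomics rather than the kernel API, may help. Everything here is a sketch under assumptions: struct grp, struct snapshot and both function names are hypothetical, and a sequentially consistent compare-and-exchange stands in for the kernel's successful cmpxchg.

```c
#include <stdatomic.h>
#include <stdint.h>

struct grp {
	_Atomic uint32_t migr_state;
};

struct snapshot {
	uint32_t group;
	uint32_t child;
};

/* CPU 1's side: update the child state first, then the parent group state. */
static void publish_up(struct grp *child, struct grp *group,
		       uint32_t child_new, uint32_t group_new)
{
	uint32_t cur;

	/* Sequentially consistent RMW, standing in for a successful cmpxchg. */
	cur = atomic_load_explicit(&child->migr_state, memory_order_relaxed);
	while (!atomic_compare_exchange_weak(&child->migr_state, &cur, child_new))
		;

	cur = atomic_load_explicit(&group->migr_state, memory_order_relaxed);
	while (!atomic_compare_exchange_weak(&group->migr_state, &cur, group_new))
		;
}

/* CPU 0's side: read the group state with acquire, then the child state. */
static struct snapshot read_group_then_child(struct grp *child, struct grp *group)
{
	struct snapshot s;

	/*
	 * Acquire: if this load returns the value written by publish_up(),
	 * the child load below cannot return anything older than what
	 * publish_up() wrote to the child.
	 */
	s.group = atomic_load_explicit(&group->migr_state, memory_order_acquire);
	s.child = atomic_load_explicit(&child->migr_state, memory_order_relaxed);
	return s;
}
```

If the first load in read_group_then_child() were relaxed as well, nothing in the C11 model would forbid seeing the new group value together with the old child value, which is the situation described above.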
> +
> + newstate = curstate;
> + walk_done = true;
> +
> + /* Reset active bit when the child is no longer active */
> + if (!childstate.active)
> + newstate.active &= ~childmask;
> +
> + if (newstate.migrator == childmask) {
> + /*
> + * Find a new migrator for the group, because the child
> + * group is idle!
> + */
> + if (!childstate.active) {
> + unsigned long new_migr_bit, active = newstate.active;
> +
> + new_migr_bit = find_first_bit(&active, BIT_CNT);
> +
> + if (new_migr_bit != BIT_CNT) {
> + newstate.migrator = BIT(new_migr_bit);
> + } else {
> + newstate.migrator = TMIGR_NONE;
> +
> + /* Changes need to be propagated */
> + walk_done = false;
> + }
> + }
> + }
> +
> + newstate.seq++;
> +
> + WARN_ON_ONCE((newstate.migrator != TMIGR_NONE) && !(newstate.active));
> +
> + } while (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state));
Similarly, I seem to remember that a failing cmpxchg() doesn't imply a full
memory barrier. If that's the case, you may need to reload group->migr_state
using an acquire barrier. But lemme check that...
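For illustration, and assuming a failed compare-and-exchange really gives no ordering, the retry loop could look like the following in C11 atomics, where the failure ordering is an explicit argument. This is a userspace sketch, not the kernel API: the function name, the flat uint32_t state words and the 8-bit active mask are assumptions.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Clear childmask from *group if the child reports no active bits.
 * The low 8 bits of each word are treated as a hypothetical "active" mask.
 * Returns the value that was finally stored in *group.
 */
static uint32_t clear_if_child_idle(_Atomic uint32_t *group,
				    _Atomic uint32_t *child,
				    uint32_t childmask)
{
	uint32_t cur, new, childstate;

	cur = atomic_load_explicit(group, memory_order_acquire);
	do {
		/*
		 * Ordered after the acquire load above on the first pass, and
		 * after the acquire reload of cur done by a failed
		 * compare-and-exchange on every retry.
		 */
		childstate = atomic_load_explicit(child, memory_order_relaxed);

		new = cur;
		if (!(childstate & 0xffu))
			new &= ~childmask;
	} while (!atomic_compare_exchange_strong_explicit(group, &cur, new,
				memory_order_seq_cst,	/* on success */
				memory_order_acquire));	/* on failure, cur is reloaded */

	return new;
}
```

The point of the sketch is only the last argument: the value of cur written back on failure is obtained with acquire semantics, so the child re-read on the next iteration cannot be satisfied with a value older than the one the winning updater published before touching the group.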
Thanks.