All of lore.kernel.org
 help / color / mirror / Atom feed
From: Frederic Weisbecker <frederic@kernel.org>
To: "Paul E. McKenney" <paulmck@kernel.org>
Cc: rcu@vger.kernel.org, linux-kernel@vger.kernel.org,
	kernel-team@meta.com, rostedt@goodmis.org
Subject: Re: [PATCH RFC v2 rcu] Fix get_state_synchronize_rcu_full() GP-start detection
Date: Fri, 24 Jan 2025 17:42:29 +0100	[thread overview]
Message-ID: <Z5PC9Rs1C2NIBR8k@localhost.localdomain> (raw)
In-Reply-To: <19388b60-5e71-463d-ba68-9064d0caa224@paulmck-laptop>

Le Fri, Jan 24, 2025 at 07:58:20AM -0800, Paul E. McKenney a écrit :
> On Fri, Jan 24, 2025 at 03:49:24PM +0100, Frederic Weisbecker wrote:
> > Le Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney a écrit :
> > > The get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full()
> > > functions use the root rcu_node structure's ->gp_seq field to detect
> > > the beginnings and ends of grace periods, respectively.  This choice is
> > > necessary for the poll_state_synchronize_rcu_full() function because
> > > (give or take counter wrap), the following sequence is guaranteed not
> > > to trigger:
> > > 
> > >         get_state_synchronize_rcu_full(&rgos);
> > >         synchronize_rcu();
> > >         WARN_ON_ONCE(!poll_state_synchronize_rcu_full(&rgos));
> > > 
> > > The RCU callbacks that awaken synchronize_rcu() instances are
> > > guaranteed not to be invoked before the root rcu_node structure's
> > > ->gp_seq field is updated to indicate the end of the grace period.
> > > However, these callbacks might start being invoked immediately
> > > thereafter, in particular, before rcu_state.gp_seq has been updated.
> > > Therefore, poll_state_synchronize_rcu_full() must refer to the
> > > root rcu_node structure's ->gp_seq field.  Because this field is
> > > updated under this structure's ->lock, any code following a call to
> > > poll_state_synchronize_rcu_full() will be fully ordered after the
> > > full grace-period computation, as is required by RCU's memory-ordering
> > > semantics.
> > > 
> > > By symmetry, the get_state_synchronize_rcu_full() function should also
> > > use this same root rcu_node structure's ->gp_seq field.  But it turns out
> > > that symmetry is profoundly (though extremely infrequently) destructive
> > > in this case.  To see this, consider the following sequence of events:
> > > 
> > > 1.      CPU 0 starts a new grace period, and updates rcu_state.gp_seq
> > >         accordingly.
> > > 
> > > 2.      As its first step of grace-period initialization, CPU 0 examines
> > >         the current CPU hotplug state and decides that it need not wait
> > >         for CPU 1, which is currently offline.
> > > 
> > > 3.      CPU 1 comes online, and updates its state.  But this does not
> > >         affect the current grace period, but rather the one after that.
> > >         After all, CPU 1 was offline when the current grace period
> > >         started, so all pre-existing RCU readers on CPU 1 must have
> > >         completed or been preempted before it last went offline.
> > >         The current grace period therefore has nothing it needs to wait
> > >         for on CPU 1.
> > > 
> > > 4.      CPU 1 switches to an rcutorture kthread which is running
> > >         rcutorture's rcu_torture_reader() function, which starts a new
> > >         RCU reader.
> > > 
> > > 5.      CPU 2 is running rcutorture's rcu_torture_writer() function
> > >         and collects a new polled grace-period "cookie" using
> > >         get_state_synchronize_rcu_full().  Because the newly started
> > >         grace period has not completed initialization, the root rcu_node
> > >         structure's ->gp_seq field has not yet been updated to indicate
> > >         that this new grace period has already started.
> > > 
> > >         This cookie is therefore set up for the end of the current grace
> > >         period (rather than the end of the following grace period).
> > > 
> > > 6.      CPU 0 finishes grace-period initialization.
> > > 
> > > 7.      If CPU 1’s rcutorture reader is preempted, it will be added to
> > >         the ->blkd_tasks list, but because CPU 1’s ->qsmask bit is not
> > >         set in CPU 1's leaf rcu_node structure, the ->gp_tasks pointer
> > >         will not be updated.  Thus, this grace period will not wait on
> > >         it.  Which is only fair, given that the CPU did not come online
> > >         until after the grace period officially started.
> > > 
> > > 8.      CPUs 0 and 2 then detect the new grace period and then report
> > >         a quiescent state to the RCU core.
> > > 
> > > 9.      Because CPU 1 was offline at the start of the current grace
> > >         period, CPUs 0 and 2 are the only CPUs that this grace period
> > >         needs to wait on.  So the grace period ends and post-grace-period
> > >         cleanup starts.  In particular, the root rcu_node structure's
> > >         ->gp_seq field is updated to indicate that this grace period
> > >         has now ended.
> > > 
> > > 10.     CPU 2 continues running rcu_torture_writer() and sees that,
> > >         from the viewpoint of the root rcu_node structure consulted by
> > >         the poll_state_synchronize_rcu_full() function, the grace period
> > >         has ended.  It therefore updates state accordingly.
> > > 
> > > 11.     CPU 1 is still running the same RCU reader, which notices this
> > >         update and thus complains about the too-short grace period.
> > 
> > I think I get the race but I must confess I'm not very familiar with how this
> > all materialize on CPU 2's rcu_torture_writer() and CPU 1's rcu_torture_reader().
> > 
> > Basically this could trigger on CPU 1 with just doing the following, right?
> > 
> > rcu_read_lock()
> > get_state_synchronize_rcu_full(&rgos);
> > WARN_ON_ONCE(poll_state_synchronize_rcu_full(&rgos))
> > rcu_read_unlock()
> 
> CPU 1 would do rcu_read_lock()/checks/rcu_read_unlock() as the
> reader, and CPU 2 would do get_state_synchronize_rcu_full(), later
> poll_state_synchronize_rcu_full(), which would (erroneously) indicate
> a completed grace period, so it would update the state, triggering CPU
> 1's checks.

I see, so if I generalize the problem beyond rcutorture, this looks like this,
right?

CPU 0                                    CPU 1                       CPU 2
-----                                    ------                      -----

rcu_seq_start(rcu_state.gp_seq)
                                         // CPU boots
                                         rcu_read_lock()
                                         r0 = rcu_dereference(*X)
                                                                     r1 = *X
                                                                     *X = NULL
                                                                     // relies on rnp->gp_seq and not rcu_state.gp_seq
                                                                     get_state_synchronize_rcu_full(&gros)
WRITE_ONCE(rnp->gp_seq, rcu_state.gp_seq);
rcu_report_qs_rdp()
                                                                      rcu_report_qs_rdp()
                                                                      rcu_report_qs_rnp()
                                                                      rcu_report_qs_rsp()
                                                                      if (poll_state_synchronize_rcu_full(&rgos))
                                                                          kfree(r1)
                                          // play with r0 and crash


> > I'm wondering, what prevents us from removing rcu_state.gp_seq and rely only on
> > the root node for the global state ?
> 
> One scenario comes to mind immediately.  There may be others.
> 
> Suppose we were running with default configuration on a system with
> "only" eight CPUs.  Then there is only the one rcu_node structure,
> which is both root and leaf.  Without rcu_state.gp_seq, there
> would be no way to communicate the beginning of the grace period to
> get_state_synchronize_rcu_full() without also allowing quiescent states
> to be reported.  There would thus be no time in which to check for newly
> onlined/offlined CPUs.

Heh, that makes sense! Though perhaps that qsmaskinit[next] handling part
could be done before rcu_seq_start()?

Thanks.

> 
> 							Thanx, Paul

  reply	other threads:[~2025-01-24 16:42 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-12-13  0:59 [PATCH RFC rcu] Fix get_state_synchronize_rcu_full() GP-start detection Paul E. McKenney
2024-12-13 19:49 ` [PATCH RFC v2 " Paul E. McKenney
2025-01-24 14:49   ` Frederic Weisbecker
2025-01-24 15:58     ` Paul E. McKenney
2025-01-24 16:42       ` Frederic Weisbecker [this message]
2025-01-24 19:40         ` Paul E. McKenney
2025-01-24 22:25           ` Frederic Weisbecker
2025-01-24 22:50             ` Paul E. McKenney
2025-01-24 23:03   ` Frederic Weisbecker
2025-01-25  0:01     ` Paul E. McKenney
2025-01-25 14:56       ` Frederic Weisbecker
2025-01-25 18:39         ` Paul E. McKenney
2025-01-27  1:13           ` Joel Fernandes
2025-01-27  1:22             ` Joel Fernandes
2025-01-27  2:03               ` Paul E. McKenney
2025-01-27  2:55                 ` Joel Fernandes
2025-01-27  2:58                   ` Joel Fernandes
2025-01-27 16:49                     ` Paul E. McKenney
2025-01-27 18:45                       ` Joel Fernandes
2025-02-11  0:28                         ` Joel Fernandes
2025-02-11  1:22                           ` Joel Fernandes
2025-02-12 10:14                             ` Paul E. McKenney
2025-02-12 10:42                               ` Paul E. McKenney
2025-01-24  1:49 ` [PATCH RFC " Joel Fernandes
2025-01-24 14:56   ` Frederic Weisbecker
2025-01-24 20:21     ` Joel Fernandes

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z5PC9Rs1C2NIBR8k@localhost.localdomain \
    --to=frederic@kernel.org \
    --cc=kernel-team@meta.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=paulmck@kernel.org \
    --cc=rcu@vger.kernel.org \
    --cc=rostedt@goodmis.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.