From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 528033C0B; Fri, 24 Jan 2025 14:49:27 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1737730168; cv=none; b=osq4LR84D0+D50whkBGjE5PKRLrCxY+mNztEy4ihitT9PQQjfCOIkEQyCbr4q39R/WdFOWlQr9jpo4wUuc+W4qz7hQB2AhDy6l6AYFFEGpT34/big6aY0ildGLigcRLQGMISKJVxlsUH3U6u3Wn86rAB/DFbCj2SwgYKzZ9O8eg= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1737730168; c=relaxed/simple; bh=qv/T1UPlqlN2gRfcaNhB4pqtvqWzFX0EBJSz+nNFHhQ=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=VrApiyxJxSWyiyDr+iCRBtLOvY+Oo5pyxrQK0CuEgNWv4GqLTN8hA9R7PMnfCdTIIS7BwiOLGPIlVS4wkgLPEAPvSWNIi1Q/htGhz7ik7do7GAaQEEf14ZsK3rt+UatQl7nd2HV3mFG/c4YHyq1iybObVYjJmwzXXQ+M7WcQIVY= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=CTTB+jxr; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="CTTB+jxr" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 4C6E0C4CED2; Fri, 24 Jan 2025 14:49:27 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1737730167; bh=qv/T1UPlqlN2gRfcaNhB4pqtvqWzFX0EBJSz+nNFHhQ=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=CTTB+jxremOAq6t9tJvs4bjhHrA/A1VprHBgqOTwGQW6SfoklIvo1jwbyqEm2Kkhz EK41YJSamhNy9ZgeNiSRoJJC5ycMfrWKLVE4DpwBXC/geDZbY5Gt2S2Kbzc5uMV8nC /jsimqX6X3iwO7fIJreHJXO5cdFpRNS8+upq5RaKqMnLy+CcRD6E85SDhYL19je2dV v5zyYRDWGBIT0rZI9wRqQ1AjWaamVjz4cDJNol5yGYT3XTgdkM1zlEKhAhDhCQJ60u WTXCyRGTSMDH4KaR90nMr1Kv/k4DRB9OUqU2yGdfkBI2WLpY1cShVCsj5F58XVnCJl jcDLQw7FjSRQg== Date: Fri, 24 Jan 2025 15:49:24 +0100 From: Frederic Weisbecker To: "Paul E. McKenney" Cc: rcu@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@meta.com, rostedt@goodmis.org Subject: Re: [PATCH RFC v2 rcu] Fix get_state_synchronize_rcu_full() GP-start detection Message-ID: References: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: Le Fri, Dec 13, 2024 at 11:49:49AM -0800, Paul E. McKenney a écrit : > The get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full() > functions use the root rcu_node structure's ->gp_seq field to detect > the beginnings and ends of grace periods, respectively. This choice is > necessary for the poll_state_synchronize_rcu_full() function because > (give or take counter wrap), the following sequence is guaranteed not > to trigger: > > get_state_synchronize_rcu_full(&rgos); > synchronize_rcu(); > WARN_ON_ONCE(!poll_state_synchronize_rcu_full(&rgos)); > > The RCU callbacks that awaken synchronize_rcu() instances are > guaranteed not to be invoked before the root rcu_node structure's > ->gp_seq field is updated to indicate the end of the grace period. > However, these callbacks might start being invoked immediately > thereafter, in particular, before rcu_state.gp_seq has been updated. > Therefore, poll_state_synchronize_rcu_full() must refer to the > root rcu_node structure's ->gp_seq field. Because this field is > updated under this structure's ->lock, any code following a call to > poll_state_synchronize_rcu_full() will be fully ordered after the > full grace-period computation, as is required by RCU's memory-ordering > semantics. > > By symmetry, the get_state_synchronize_rcu_full() function should also > use this same root rcu_node structure's ->gp_seq field. But it turns out > that symmetry is profoundly (though extremely infrequently) destructive > in this case. To see this, consider the following sequence of events: > > 1. CPU 0 starts a new grace period, and updates rcu_state.gp_seq > accordingly. > > 2. As its first step of grace-period initialization, CPU 0 examines > the current CPU hotplug state and decides that it need not wait > for CPU 1, which is currently offline. > > 3. CPU 1 comes online, and updates its state. But this does not > affect the current grace period, but rather the one after that. > After all, CPU 1 was offline when the current grace period > started, so all pre-existing RCU readers on CPU 1 must have > completed or been preempted before it last went offline. > The current grace period therefore has nothing it needs to wait > for on CPU 1. > > 4. CPU 1 switches to an rcutorture kthread which is running > rcutorture's rcu_torture_reader() function, which starts a new > RCU reader. > > 5. CPU 2 is running rcutorture's rcu_torture_writer() function > and collects a new polled grace-period "cookie" using > get_state_synchronize_rcu_full(). Because the newly started > grace period has not completed initialization, the root rcu_node > structure's ->gp_seq field has not yet been updated to indicate > that this new grace period has already started. > > This cookie is therefore set up for the end of the current grace > period (rather than the end of the following grace period). > > 6. CPU 0 finishes grace-period initialization. > > 7. If CPU 1’s rcutorture reader is preempted, it will be added to > the ->blkd_tasks list, but because CPU 1’s ->qsmask bit is not > set in CPU 1's leaf rcu_node structure, the ->gp_tasks pointer > will not be updated.  Thus, this grace period will not wait on > it.  Which is only fair, given that the CPU did not come online > until after the grace period officially started. > > 8. CPUs 0 and 2 then detect the new grace period and then report > a quiescent state to the RCU core. > > 9. Because CPU 1 was offline at the start of the current grace > period, CPUs 0 and 2 are the only CPUs that this grace period > needs to wait on. So the grace period ends and post-grace-period > cleanup starts. In particular, the root rcu_node structure's > ->gp_seq field is updated to indicate that this grace period > has now ended. > > 10. CPU 2 continues running rcu_torture_writer() and sees that, > from the viewpoint of the root rcu_node structure consulted by > the poll_state_synchronize_rcu_full() function, the grace period > has ended.  It therefore updates state accordingly. > > 11. CPU 1 is still running the same RCU reader, which notices this > update and thus complains about the too-short grace period. I think I get the race but I must confess I'm not very familiar with how this all materialize on CPU 2's rcu_torture_writer() and CPU 1's rcu_torture_reader(). Basically this could trigger on CPU 1 with just doing the following, right? rcu_read_lock() get_state_synchronize_rcu_full(&rgos); WARN_ON_ONCE(poll_state_synchronize_rcu_full(&rgos)) rcu_read_unlock() > > The fix is for the get_state_synchronize_rcu_full() function to use > rcu_state.gp_seq instead of the the root rcu_node structure's ->gp_seq > field. With this change in place, if step 5's cookie indicates that the > grace period has not yet started, then any prior code executed by CPU 2 > must have happened before CPU 1 came online. This will in turn prevent > CPU 1's code in steps 3 and 11 from spanning CPU 2's grace-period wait, > thus preventing CPU 1 from being subjected to a too-short grace period. > > This commit therefore makes this change. Note that there is no change to > the poll_state_synchronize_rcu_full() function, which as noted above, > must continue to use the root rcu_node structure's ->gp_seq field. > This is of course an asymmetry between these two functions, but is an > asymmetry that is absolutely required for correct operation. It is a > common human tendency to greatly value symmetry, and sometimes symmetry > is a wonderful thing. Other times, symmetry results in poor performance. > But in this case, symmetry is just plain wrong. > > Nevertheless, the asymmetry does require an additional adjustment. > It is possible for get_state_synchronize_rcu_full() to see a given > grace period as having started, but for an immediately following > poll_state_synchronize_rcu_full() to see it as having not yet started. > Given the current rcu_seq_done_exact() implementation, this will > result in a false-positive indication that the grace period is done > from poll_state_synchronize_rcu_full(). This is dealt with by making > rcu_seq_done_exact() reach back three grace periods rather than just > two of them. > > Although this fixes 91a967fd6934 ("rcu: Add full-sized polling for > get_completed*() and poll_state*()"), it is not clear that it is worth > backporting this commit. First, it took me many weeks to convince > rcutorture to reproduce this more frequently than once per year. Second, > this cannot be reproduced at all without frequent CPU-hotplug operations, > as in waiting all of 50 milliseconds from the end of the previous > operation until starting the next one. Third, the TREE03.boot settings > cause multi-millisecond delays during RCU grace-period initialization, > which greatly increase the probability of the above sequence of events. > (Don't do this in production workloads!) Fourth, extremely heavy use of > get_state_synchronize_rcu_full() and/or poll_state_synchronize_rcu_full() > is required to reproduce this, and as of v6.12, only kfree_rcu() uses it, > and even then not particularly heavily. I'm wondering, what prevents us from removing rcu_state.gp_seq and rely only on the root node for the global state ? Thanks.