cgroups.vger.kernel.org archive mirror
* [PATCH 1/2] llist: avoid memory tearing for llist_node
@ 2025-07-03 20:00 Shakeel Butt
  2025-07-03 20:00 ` [PATCH 2/2] cgroup: explain the race between updater and flusher Shakeel Butt
  2025-07-03 22:24 ` [PATCH 1/2] llist: avoid memory tearing for llist_node Paul E. McKenney
  0 siblings, 2 replies; 10+ messages in thread
From: Shakeel Butt @ 2025-07-03 20:00 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Paul E . McKenney, Andrew Morton, JP Kobryn, Johannes Weiner,
	Ying Huang, Vlastimil Babka, Alexei Starovoitov,
	Sebastian Andrzej Siewior, Michal Koutný, bpf, linux-mm,
	cgroups, linux-kernel, Meta kernel team

Before commit 36df6e3dbd7e ("cgroup: make css_rstat_updated nmi safe"),
the struct llist_node was expected to be private to the context
inserting the node into the lockless list or removing it from the list.
After that commit, the llist_node in the rstat code is per-cpu and
shared between stacked contexts, i.e. process, softirq, hardirq and
nmi. The compiler may therefore tear the loads or stores of llist_node.
Let's avoid that.
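
For illustration only (a sketch of the hazard, not code from this
series): with a plain access such as

	node->next = node;

the compiler is free to split the pointer store (or the corresponding
load) into smaller pieces, which a reader in a stacked context such as
nmi could observe half-written. The marked accessors

	WRITE_ONCE(node->next, node);
	READ_ONCE(node->next);

force a single access-sized store and load respectively.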

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 include/linux/llist.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/llist.h b/include/linux/llist.h
index 27b17f64bcee..607b2360c938 100644
--- a/include/linux/llist.h
+++ b/include/linux/llist.h
@@ -83,7 +83,7 @@ static inline void init_llist_head(struct llist_head *list)
  */
 static inline void init_llist_node(struct llist_node *node)
 {
-	node->next = node;
+	WRITE_ONCE(node->next, node);
 }
 
 /**
@@ -97,7 +97,7 @@ static inline void init_llist_node(struct llist_node *node)
  */
 static inline bool llist_on_list(const struct llist_node *node)
 {
-	return node->next != node;
+	return READ_ONCE(node->next) != node;
 }
 
 /**
@@ -220,7 +220,7 @@ static inline bool llist_empty(const struct llist_head *head)
 
 static inline struct llist_node *llist_next(struct llist_node *node)
 {
-	return node->next;
+	return READ_ONCE(node->next);
 }
 
 /**
-- 
2.47.1



* [PATCH 2/2] cgroup: explain the race between updater and flusher
  2025-07-03 20:00 [PATCH 1/2] llist: avoid memory tearing for llist_node Shakeel Butt
@ 2025-07-03 20:00 ` Shakeel Butt
  2025-07-03 22:29   ` Paul E. McKenney
  2025-07-03 22:24 ` [PATCH 1/2] llist: avoid memory tearing for llist_node Paul E. McKenney
  1 sibling, 1 reply; 10+ messages in thread
From: Shakeel Butt @ 2025-07-03 20:00 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Paul E . McKenney, Andrew Morton, JP Kobryn, Johannes Weiner,
	Ying Huang, Vlastimil Babka, Alexei Starovoitov,
	Sebastian Andrzej Siewior, Michal Koutný, bpf, linux-mm,
	cgroups, linux-kernel, Meta kernel team

Currently the rstat updater and the flusher can race and cause a
scenario where the stats updater skips adding the css to the lockless
list but the flusher does not see the updates done by that skipped
updater. This race is benign: the next flusher will flush those stats,
and at the moment all rstat users are fine with this behavior. However,
a future user might want a stricter guarantee, so add appropriate
comments and data_race() tags to ease the job of such future users.
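
A sketch of the window (one possible interleaving, shown for
illustration only):

	updater (on cpu X)			flusher (processing cpu X)
	------------------			--------------------------
	update per-cpu stats			llist_del_first_init(&rstatc->lnode)
	llist_on_list(&rstatc->lnode)		read/flush the per-cpu stats
	  reads a stale "on list" value		  without seeing the update
	  -> returns without re-adding

Without a barrier pairing the two sides, the updater's read of the
lnode and the flusher's read of the stats can each miss the other
side's store, so the update is neither flushed now nor queued for this
flush; the next flusher picks it up.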

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 kernel/cgroup/rstat.c | 32 +++++++++++++++++++++++++++++---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index c8a48cf83878..b98c03b1af25 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -60,6 +60,12 @@ static inline struct llist_head *ss_lhead_cpu(struct cgroup_subsys *ss, int cpu)
  * Atomically inserts the css in the ss's llist for the given cpu. This is
  * reentrant safe i.e. safe against softirq, hardirq and nmi. The ss's llist
  * will be processed at the flush time to create the update tree.
+ *
+ * NOTE: if a user needs the guarantee that the updater either adds itself to
+ * the lockless list or the concurrent flusher flushes its updated stats, a
+ * memory barrier is needed before the call to css_rstat_updated(), i.e. a
+ * barrier after updating the per-cpu stats and before calling
+ * css_rstat_updated().
  */
 __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
 {
@@ -86,8 +92,13 @@ __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
 		return;
 
 	rstatc = css_rstat_cpu(css, cpu);
-	/* If already on list return. */
-	if (llist_on_list(&rstatc->lnode))
+	/*
+	 * If already on the list, return. This check is racy and an smp_mb()
+	 * is needed here, pairing with the smp_mb() in css_process_update_tree(),
+	 * if the guarantee that the updated stats are visible to a concurrent
+	 * flusher is required.
+	 */
+	if (data_race(llist_on_list(&rstatc->lnode)))
 		return;
 
 	/*
@@ -145,9 +156,24 @@ static void css_process_update_tree(struct cgroup_subsys *ss, int cpu)
 	struct llist_head *lhead = ss_lhead_cpu(ss, cpu);
 	struct llist_node *lnode;
 
-	while ((lnode = llist_del_first_init(lhead))) {
+	while ((lnode = data_race(llist_del_first_init(lhead)))) {
 		struct css_rstat_cpu *rstatc;
 
+		/*
+		 * An smp_mb() is needed here (more specifically, between
+		 * init_llist_node() and the per-cpu stats flush) if an
+		 * rstat user requires the guarantee that either the
+		 * updater adds itself to the lockless list or the flusher
+		 * flushes the stats updated by an updater that has
+		 * observed it is already on the list. The pairing barrier
+		 * for this one should be placed before css_rstat_updated()
+		 * by that user.
+		 *
+		 * For now, there is no such user, so the barrier is not
+		 * added here, but if such a use-case arises, please add
+		 * smp_mb() here.
+		 */
+
 		rstatc = container_of(lnode, struct css_rstat_cpu, lnode);
 		__css_process_update_tree(rstatc->owner, cpu);
 	}
-- 
2.47.1



* Re: [PATCH 1/2] llist: avoid memory tearing for llist_node
  2025-07-03 20:00 [PATCH 1/2] llist: avoid memory tearing for llist_node Shakeel Butt
  2025-07-03 20:00 ` [PATCH 2/2] cgroup: explain the race between updater and flusher Shakeel Butt
@ 2025-07-03 22:24 ` Paul E. McKenney
  1 sibling, 0 replies; 10+ messages in thread
From: Paul E. McKenney @ 2025-07-03 22:24 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Tejun Heo, Andrew Morton, JP Kobryn, Johannes Weiner, Ying Huang,
	Vlastimil Babka, Alexei Starovoitov, Sebastian Andrzej Siewior,
	Michal Koutný, bpf, linux-mm, cgroups, linux-kernel,
	Meta kernel team

On Thu, Jul 03, 2025 at 01:00:11PM -0700, Shakeel Butt wrote:
> Before the commit 36df6e3dbd7e ("cgroup: make css_rstat_updated nmi
> safe"), the struct llist_node is expected to be private to the one
> inserting the node to the lockless list or the one removing the node
> from the lockless list. After the mentioned commit, the llist_node in
> the rstat code is per-cpu shared between the stacked contexts i.e.
> process, softirq, hardirq & nmi. It is possible the compiler may tear
> the loads or stores of llist_node. Let's avoid that.
> 
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>

> ---
>  include/linux/llist.h | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/llist.h b/include/linux/llist.h
> index 27b17f64bcee..607b2360c938 100644
> --- a/include/linux/llist.h
> +++ b/include/linux/llist.h
> @@ -83,7 +83,7 @@ static inline void init_llist_head(struct llist_head *list)
>   */
>  static inline void init_llist_node(struct llist_node *node)
>  {
> -	node->next = node;
> +	WRITE_ONCE(node->next, node);
>  }
>  
>  /**
> @@ -97,7 +97,7 @@ static inline void init_llist_node(struct llist_node *node)
>   */
>  static inline bool llist_on_list(const struct llist_node *node)
>  {
> -	return node->next != node;
> +	return READ_ONCE(node->next) != node;
>  }
>  
>  /**
> @@ -220,7 +220,7 @@ static inline bool llist_empty(const struct llist_head *head)
>  
>  static inline struct llist_node *llist_next(struct llist_node *node)
>  {
> -	return node->next;
> +	return READ_ONCE(node->next);
>  }
>  
>  /**
> -- 
> 2.47.1
> 


* Re: [PATCH 2/2] cgroup: explain the race between updater and flusher
  2025-07-03 20:00 ` [PATCH 2/2] cgroup: explain the race between updater and flusher Shakeel Butt
@ 2025-07-03 22:29   ` Paul E. McKenney
  2025-07-03 22:46     ` Shakeel Butt
  0 siblings, 1 reply; 10+ messages in thread
From: Paul E. McKenney @ 2025-07-03 22:29 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Tejun Heo, Andrew Morton, JP Kobryn, Johannes Weiner, Ying Huang,
	Vlastimil Babka, Alexei Starovoitov, Sebastian Andrzej Siewior,
	Michal Koutný, bpf, linux-mm, cgroups, linux-kernel,
	Meta kernel team

On Thu, Jul 03, 2025 at 01:00:12PM -0700, Shakeel Butt wrote:
> Currently the rstat updater and the flusher can race and cause a
> scenario where the stats updater skips adding the css to the lockless
> list but the flusher might not see those updates done by the skipped
> updater. This is benign race and the subsequent flusher will flush those
> stats and at the moment there aren't any rstat users which are not fine
> with this kind of race. However some future user might want more
> stricter guarantee, so let's add appropriate comments and data_race()
> tags to ease the job of future users.
> 
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> ---
>  kernel/cgroup/rstat.c | 32 +++++++++++++++++++++++++++++---
>  1 file changed, 29 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index c8a48cf83878..b98c03b1af25 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -60,6 +60,12 @@ static inline struct llist_head *ss_lhead_cpu(struct cgroup_subsys *ss, int cpu)
>   * Atomically inserts the css in the ss's llist for the given cpu. This is
>   * reentrant safe i.e. safe against softirq, hardirq and nmi. The ss's llist
>   * will be processed at the flush time to create the update tree.
> + *
> + * NOTE: if the user needs the guarantee that the updater either add itself in
> + * the lockless list or the concurrent flusher flushes its updated stats, a
> + * memory barrier is needed before the call to css_rstat_updated() i.e. a
> + * barrier after updating the per-cpu stats and before calling
> + * css_rstat_updated().
>   */
>  __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
>  {
> @@ -86,8 +92,13 @@ __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
>  		return;
>  
>  	rstatc = css_rstat_cpu(css, cpu);
> -	/* If already on list return. */
> -	if (llist_on_list(&rstatc->lnode))
> +	/*
> +	 * If already on list return. This check is racy and smp_mb() is needed
> +	 * to pair it with the smp_mb() in css_process_update_tree() if the
> +	 * guarantee that the updated stats are visible to concurrent flusher is
> +	 * needed.
> +	 */
> +	if (data_race(llist_on_list(&rstatc->lnode)))

OK, I will bite...

Why is this needed given the READ_ONCE() that the earlier patch added to
llist_on_list()?

>  		return;
>  
>  	/*
> @@ -145,9 +156,24 @@ static void css_process_update_tree(struct cgroup_subsys *ss, int cpu)
>  	struct llist_head *lhead = ss_lhead_cpu(ss, cpu);
>  	struct llist_node *lnode;
>  
> -	while ((lnode = llist_del_first_init(lhead))) {
> +	while ((lnode = data_race(llist_del_first_init(lhead)))) {

And for this one, why not make init_llist_node(), which is invoked from
llist_del_first_init(), do a WRITE_ONCE()?

							Thanx, Paul

>  		struct css_rstat_cpu *rstatc;
>  
> +		/*
> +		 * smp_mb() is needed here (more specifically in between
> +		 * init_llist_node() and per-cpu stats flushing) if the
> +		 * guarantee is required by a rstat user where etiher the
> +		 * updater should add itself on the lockless list or the
> +		 * flusher flush the stats updated by the updater who have
> +		 * observed that they are already on the list. The
> +		 * corresponding barrier pair for this one should be before
> +		 * css_rstat_updated() by the user.
> +		 *
> +		 * For now, there aren't any such user, so not adding the
> +		 * barrier here but if such a use-case arise, please add
> +		 * smp_mb() here.
> +		 */
> +
>  		rstatc = container_of(lnode, struct css_rstat_cpu, lnode);
>  		__css_process_update_tree(rstatc->owner, cpu);
>  	}
> -- 
> 2.47.1
> 


* Re: [PATCH 2/2] cgroup: explain the race between updater and flusher
  2025-07-03 22:29   ` Paul E. McKenney
@ 2025-07-03 22:46     ` Shakeel Butt
  2025-07-03 23:53       ` Paul E. McKenney
  0 siblings, 1 reply; 10+ messages in thread
From: Shakeel Butt @ 2025-07-03 22:46 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Tejun Heo, Andrew Morton, JP Kobryn, Johannes Weiner, Ying Huang,
	Vlastimil Babka, Alexei Starovoitov, Sebastian Andrzej Siewior,
	Michal Koutný, bpf, linux-mm, cgroups, linux-kernel,
	Meta kernel team

On Thu, Jul 03, 2025 at 03:29:16PM -0700, Paul E. McKenney wrote:
> On Thu, Jul 03, 2025 at 01:00:12PM -0700, Shakeel Butt wrote:
> > Currently the rstat updater and the flusher can race and cause a
> > scenario where the stats updater skips adding the css to the lockless
> > list but the flusher might not see those updates done by the skipped
> > updater. This is benign race and the subsequent flusher will flush those
> > stats and at the moment there aren't any rstat users which are not fine
> > with this kind of race. However some future user might want more
> > stricter guarantee, so let's add appropriate comments and data_race()
> > tags to ease the job of future users.
> > 
> > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> > ---
> >  kernel/cgroup/rstat.c | 32 +++++++++++++++++++++++++++++---
> >  1 file changed, 29 insertions(+), 3 deletions(-)
> > 
> > diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> > index c8a48cf83878..b98c03b1af25 100644
> > --- a/kernel/cgroup/rstat.c
> > +++ b/kernel/cgroup/rstat.c
> > @@ -60,6 +60,12 @@ static inline struct llist_head *ss_lhead_cpu(struct cgroup_subsys *ss, int cpu)
> >   * Atomically inserts the css in the ss's llist for the given cpu. This is
> >   * reentrant safe i.e. safe against softirq, hardirq and nmi. The ss's llist
> >   * will be processed at the flush time to create the update tree.
> > + *
> > + * NOTE: if the user needs the guarantee that the updater either add itself in
> > + * the lockless list or the concurrent flusher flushes its updated stats, a
> > + * memory barrier is needed before the call to css_rstat_updated() i.e. a
> > + * barrier after updating the per-cpu stats and before calling
> > + * css_rstat_updated().
> >   */
> >  __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
> >  {
> > @@ -86,8 +92,13 @@ __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
> >  		return;
> >  
> >  	rstatc = css_rstat_cpu(css, cpu);
> > -	/* If already on list return. */
> > -	if (llist_on_list(&rstatc->lnode))
> > +	/*
> > +	 * If already on list return. This check is racy and smp_mb() is needed
> > +	 * to pair it with the smp_mb() in css_process_update_tree() if the
> > +	 * guarantee that the updated stats are visible to concurrent flusher is
> > +	 * needed.
> > +	 */
> > +	if (data_race(llist_on_list(&rstatc->lnode)))
> 
> OK, I will bite...
> 
> Why is this needed given the READ_ONCE() that the earlier patch added to
> llist_on_list()?
> 
> >  		return;
> >  
> >  	/*
> > @@ -145,9 +156,24 @@ static void css_process_update_tree(struct cgroup_subsys *ss, int cpu)
> >  	struct llist_head *lhead = ss_lhead_cpu(ss, cpu);
> >  	struct llist_node *lnode;
> >  
> > -	while ((lnode = llist_del_first_init(lhead))) {
> > +	while ((lnode = data_race(llist_del_first_init(lhead)))) {
> 
> And for this one, why not make init_llist_node(), which is invoked from
> llist_del_first_init(), do a WRITE_ONCE()?
> 

Let me answer this one first. The previous patch actually made
init_llist_node() do WRITE_ONCE().

So the actual question is: why do we need
data_race([READ|WRITE]_ONCE()) instead of just [READ|WRITE]_ONCE()?
Actually I had a similar question myself and found the following
comment in include/linux/compiler.h:

/**
 * data_race - mark an expression as containing intentional data races
 *
 * This data_race() macro is useful for situations in which data races
 * should be forgiven.  One example is diagnostic code that accesses
 * shared variables but is not a part of the core synchronization design.
 * For example, if accesses to a given variable are protected by a lock,
 * except for diagnostic code, then the accesses under the lock should
 * be plain C-language accesses and those in the diagnostic code should
 * use data_race().  This way, KCSAN will complain if buggy lockless
 * accesses to that variable are introduced, even if the buggy accesses
 * are protected by READ_ONCE() or WRITE_ONCE().
 *
 * This macro *does not* affect normal code generation, but is a hint
 * to tooling that data races here are to be ignored.  If the access must
 * be atomic *and* KCSAN should ignore the access, use both data_race()
 * and READ_ONCE(), for example, data_race(READ_ONCE(x)).
 */

IIUC, since I need to protect the llist_node against tearing and also
tell KCSAN to ignore the racy access, I should use both. KCSAN already
treats [READ|WRITE]_ONCE() similarly to data_race(), so it may seem
redundant, but using both conveys that we need protection against
tearing and also want KCSAN to ignore the access. Let me know if you
think otherwise.
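
Concretely, with patch 1 applied the racy check effectively expands to
something like (spelled out by hand here, not how the code is written):

	if (data_race(READ_ONCE(rstatc->lnode.next) != &rstatc->lnode))
		return;

i.e. READ_ONCE() keeps the load single-copy atomic while data_race()
only tells KCSAN that the race is intentional.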

thanks a lot for taking a look.




* Re: [PATCH 2/2] cgroup: explain the race between updater and flusher
  2025-07-03 22:46     ` Shakeel Butt
@ 2025-07-03 23:53       ` Paul E. McKenney
  2025-07-04  1:54         ` Shakeel Butt
  0 siblings, 1 reply; 10+ messages in thread
From: Paul E. McKenney @ 2025-07-03 23:53 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Tejun Heo, Andrew Morton, JP Kobryn, Johannes Weiner, Ying Huang,
	Vlastimil Babka, Alexei Starovoitov, Sebastian Andrzej Siewior,
	Michal Koutný, bpf, linux-mm, cgroups, linux-kernel,
	Meta kernel team

On Thu, Jul 03, 2025 at 03:46:07PM -0700, Shakeel Butt wrote:
> On Thu, Jul 03, 2025 at 03:29:16PM -0700, Paul E. McKenney wrote:
> > On Thu, Jul 03, 2025 at 01:00:12PM -0700, Shakeel Butt wrote:
> > > Currently the rstat updater and the flusher can race and cause a
> > > scenario where the stats updater skips adding the css to the lockless
> > > list but the flusher might not see those updates done by the skipped
> > > updater. This is benign race and the subsequent flusher will flush those
> > > stats and at the moment there aren't any rstat users which are not fine
> > > with this kind of race. However some future user might want more
> > > stricter guarantee, so let's add appropriate comments and data_race()
> > > tags to ease the job of future users.
> > > 
> > > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> > > ---
> > >  kernel/cgroup/rstat.c | 32 +++++++++++++++++++++++++++++---
> > >  1 file changed, 29 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> > > index c8a48cf83878..b98c03b1af25 100644
> > > --- a/kernel/cgroup/rstat.c
> > > +++ b/kernel/cgroup/rstat.c
> > > @@ -60,6 +60,12 @@ static inline struct llist_head *ss_lhead_cpu(struct cgroup_subsys *ss, int cpu)
> > >   * Atomically inserts the css in the ss's llist for the given cpu. This is
> > >   * reentrant safe i.e. safe against softirq, hardirq and nmi. The ss's llist
> > >   * will be processed at the flush time to create the update tree.
> > > + *
> > > + * NOTE: if the user needs the guarantee that the updater either add itself in
> > > + * the lockless list or the concurrent flusher flushes its updated stats, a
> > > + * memory barrier is needed before the call to css_rstat_updated() i.e. a
> > > + * barrier after updating the per-cpu stats and before calling
> > > + * css_rstat_updated().
> > >   */
> > >  __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
> > >  {
> > > @@ -86,8 +92,13 @@ __bpf_kfunc void css_rstat_updated(struct cgroup_subsys_state *css, int cpu)
> > >  		return;
> > >  
> > >  	rstatc = css_rstat_cpu(css, cpu);
> > > -	/* If already on list return. */
> > > -	if (llist_on_list(&rstatc->lnode))
> > > +	/*
> > > +	 * If already on list return. This check is racy and smp_mb() is needed
> > > +	 * to pair it with the smp_mb() in css_process_update_tree() if the
> > > +	 * guarantee that the updated stats are visible to concurrent flusher is
> > > +	 * needed.
> > > +	 */
> > > +	if (data_race(llist_on_list(&rstatc->lnode)))
> > 
> > OK, I will bite...
> > 
> > Why is this needed given the READ_ONCE() that the earlier patch added to
> > llist_on_list()?
> > 
> > >  		return;
> > >  
> > >  	/*
> > > @@ -145,9 +156,24 @@ static void css_process_update_tree(struct cgroup_subsys *ss, int cpu)
> > >  	struct llist_head *lhead = ss_lhead_cpu(ss, cpu);
> > >  	struct llist_node *lnode;
> > >  
> > > -	while ((lnode = llist_del_first_init(lhead))) {
> > > +	while ((lnode = data_race(llist_del_first_init(lhead)))) {
> > 
> > And for this one, why not make init_llist_node(), which is invoked from
> > llist_del_first_init(), do a WRITE_ONCE()?
> > 
> 
> Let me answer this one first. The previous patch actually made
> init_llist_node() do WRITE_ONCE().
> 
> So the actual question is why do we need
> data_race([READ|WRITE]_ONCE()) instead of just [READ|WRITE]_ONCE()?

You should *almost* always use [READ|WRITE]_ONCE() instead of data_race().

> Actually I had the similar question myself and found the following
> comment in include/linux/compiler.h:
> 
> /**
>  * data_race - mark an expression as containing intentional data races
>  *
>  * This data_race() macro is useful for situations in which data races
>  * should be forgiven.  One example is diagnostic code that accesses
>  * shared variables but is not a part of the core synchronization design.
>  * For example, if accesses to a given variable are protected by a lock,
>  * except for diagnostic code, then the accesses under the lock should
>  * be plain C-language accesses and those in the diagnostic code should
>  * use data_race().  This way, KCSAN will complain if buggy lockless
>  * accesses to that variable are introduced, even if the buggy accesses
>  * are protected by READ_ONCE() or WRITE_ONCE().
>  *
>  * This macro *does not* affect normal code generation, but is a hint
>  * to tooling that data races here are to be ignored.  If the access must
>  * be atomic *and* KCSAN should ignore the access, use both data_race()
>  * and READ_ONCE(), for example, data_race(READ_ONCE(x)).
>  */
> 
> IIUC correctly, I need to protect llist_node against tearing and as well
> as tell KCSAN to ignore the access for race then I should use both.
> Though I think KCSAN treat [READ|WRITE]_ONCE similar to data_race(), so
> it kind of seem redundant but I think at least I want to convey that we
> need protection against tearing and ignore KCSAN and using both conveys
> that. Let me know if you think otherwise.
> 
> thanks a lot for taking a look.

The thing to remember is that data_race() does not affect the
generated code (except of course when running KCSAN), and thus does
absolutely nothing to prevent load/store tearing.  You need things like
[READ|WRITE]_ONCE() to prevent tearing.

So if it does not affect the generated code, what is the point of
data_race()?

One answer to this question is for diagnostics where you want KCSAN
to check the main algorithm, but you don't want KCSAN to be confused
by the diagnostic accesses.  For example, you might use something like
ASSERT_EXCLUSIVE_ACCESS() as in __list_splice_init_rcu(), and not want
your diagnostic accesses to result in false-positive KCSAN reports
due to interactions with ASSERT_EXCLUSIVE_ACCESS() on some particular
memory location.  And if you were to use READ_ONCE() to access that same
memory location in your diagnostics, KCSAN would complain if they ran
concurrently with that ASSERT_EXCLUSIVE_ACCESS().  So you would instead
use data_race() to suppress such complaints.
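
To make that concrete, here is a minimal sketch of that pattern (all
names are made up, not from any real subsystem):

	static DEFINE_SPINLOCK(foo_lock);
	static unsigned long foo_count;

	/* Core algorithm: updates are lock-protected plain C accesses,
	 * so KCSAN will flag any buggy lockless access added later. */
	void foo_inc(void)
	{
		spin_lock(&foo_lock);
		ASSERT_EXCLUSIVE_ACCESS(foo_count);
		foo_count++;
		spin_unlock(&foo_lock);
	}

	/* Diagnostic only: intentionally racy, so KCSAN forgives it and
	 * it does not hide bugs in the core algorithm above. */
	void foo_dump(void)
	{
		pr_debug("foo_count is about %lu\n", data_race(foo_count));
	}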

Does that make sense?

						Thanx, Paul


* Re: [PATCH 2/2] cgroup: explain the race between updater and flusher
  2025-07-03 23:53       ` Paul E. McKenney
@ 2025-07-04  1:54         ` Shakeel Butt
  2025-07-04  4:44           ` Paul E. McKenney
  0 siblings, 1 reply; 10+ messages in thread
From: Shakeel Butt @ 2025-07-04  1:54 UTC (permalink / raw)
  To: paulmck
  Cc: Tejun Heo, Andrew Morton, JP Kobryn, Johannes Weiner, Ying Huang,
	Vlastimil Babka, Alexei Starovoitov, Sebastian Andrzej Siewior,
	Michal Koutný, bpf, linux-mm, cgroups, linux-kernel,
	Meta kernel team

On Thu, Jul 3, 2025 at 4:53 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Thu, Jul 03, 2025 at 03:46:07PM -0700, Shakeel Butt wrote:
[...]
> > Let me answer this one first. The previous patch actually made
> > init_llist_node() do WRITE_ONCE().
> >
> > So the actual question is why do we need
> > data_race([READ|WRITE]_ONCE()) instead of just [READ|WRITE]_ONCE()?
>
> You should *almost* always use [READ|WRITE]_ONCE() instead of data_race().
>
> > Actually I had the similar question myself and found the following
> > comment in include/linux/compiler.h:
> >
> > /**
> >  * data_race - mark an expression as containing intentional data races
> >  *
> >  * This data_race() macro is useful for situations in which data races
> >  * should be forgiven.  One example is diagnostic code that accesses
> >  * shared variables but is not a part of the core synchronization design.
> >  * For example, if accesses to a given variable are protected by a lock,
> >  * except for diagnostic code, then the accesses under the lock should
> >  * be plain C-language accesses and those in the diagnostic code should
> >  * use data_race().  This way, KCSAN will complain if buggy lockless
> >  * accesses to that variable are introduced, even if the buggy accesses
> >  * are protected by READ_ONCE() or WRITE_ONCE().
> >  *
> >  * This macro *does not* affect normal code generation, but is a hint
> >  * to tooling that data races here are to be ignored.  If the access must
> >  * be atomic *and* KCSAN should ignore the access, use both data_race()
> >  * and READ_ONCE(), for example, data_race(READ_ONCE(x)).
> >  */
> >
> > IIUC correctly, I need to protect llist_node against tearing and as well
> > as tell KCSAN to ignore the access for race then I should use both.
> > Though I think KCSAN treat [READ|WRITE]_ONCE similar to data_race(), so
> > it kind of seem redundant but I think at least I want to convey that we
> > need protection against tearing and ignore KCSAN and using both conveys
> > that. Let me know if you think otherwise.
> >
> > thanks a lot for taking a look.
>
> The thing to remember is that data_race() does not affect the
> generated code (except of course when running KCSAN), and thus does
> absolutely nothing to prevent load/store tearing.  You need things like
> [READ|WRITE]_ONCE() to prevent tearing.
>
> So if it does not affect the generated code, what is the point of
> data_race()?
>
> One answer to this question is for diagnostics where you want KCSAN
> to check the main algorithm, but you don't want KCSAN to be confused
> by the diagnostic accesses.  For example, you might use something like
> ASSERT_EXCLUSIVE_ACCESS() as in __list_splice_init_rcu(), and not want
> your diagnostic accesses to result in false-positive KCSAN reports
> due to interactions with ASSERT_EXCLUSIVE_ACCESS() on some particular
> memory location.  And if you were to use READ_ONCE() to access that same
> memory location in your diagnostics, KCSAN would complain if they ran
> concurrently with that ASSERT_EXCLUSIVE_ACCESS().  So you would instead
> use data_race() to suppress such complaints.
>
> Does that make sense?
>

Thanks a lot, Paul, for the awesome explanation. Do you think keeping
data_race() here would be harmful, in the sense that it might cause
confusion in the future?


* Re: [PATCH 2/2] cgroup: explain the race between updater and flusher
  2025-07-04  1:54         ` Shakeel Butt
@ 2025-07-04  4:44           ` Paul E. McKenney
  2025-07-04 17:45             ` Shakeel Butt
  0 siblings, 1 reply; 10+ messages in thread
From: Paul E. McKenney @ 2025-07-04  4:44 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Tejun Heo, Andrew Morton, JP Kobryn, Johannes Weiner, Ying Huang,
	Vlastimil Babka, Alexei Starovoitov, Sebastian Andrzej Siewior,
	Michal Koutný, bpf, linux-mm, cgroups, linux-kernel,
	Meta kernel team

On Thu, Jul 03, 2025 at 06:54:02PM -0700, Shakeel Butt wrote:
> On Thu, Jul 3, 2025 at 4:53 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Thu, Jul 03, 2025 at 03:46:07PM -0700, Shakeel Butt wrote:
> [...]
> > > Let me answer this one first. The previous patch actually made
> > > init_llist_node() do WRITE_ONCE().
> > >
> > > So the actual question is why do we need
> > > data_race([READ|WRITE]_ONCE()) instead of just [READ|WRITE]_ONCE()?
> >
> > You should *almost* always use [READ|WRITE]_ONCE() instead of data_race().
> >
> > > Actually I had the similar question myself and found the following
> > > comment in include/linux/compiler.h:
> > >
> > > /**
> > >  * data_race - mark an expression as containing intentional data races
> > >  *
> > >  * This data_race() macro is useful for situations in which data races
> > >  * should be forgiven.  One example is diagnostic code that accesses
> > >  * shared variables but is not a part of the core synchronization design.
> > >  * For example, if accesses to a given variable are protected by a lock,
> > >  * except for diagnostic code, then the accesses under the lock should
> > >  * be plain C-language accesses and those in the diagnostic code should
> > >  * use data_race().  This way, KCSAN will complain if buggy lockless
> > >  * accesses to that variable are introduced, even if the buggy accesses
> > >  * are protected by READ_ONCE() or WRITE_ONCE().
> > >  *
> > >  * This macro *does not* affect normal code generation, but is a hint
> > >  * to tooling that data races here are to be ignored.  If the access must
> > >  * be atomic *and* KCSAN should ignore the access, use both data_race()
> > >  * and READ_ONCE(), for example, data_race(READ_ONCE(x)).
> > >  */
> > >
> > > IIUC correctly, I need to protect llist_node against tearing and as well
> > > as tell KCSAN to ignore the access for race then I should use both.
> > > Though I think KCSAN treat [READ|WRITE]_ONCE similar to data_race(), so
> > > it kind of seem redundant but I think at least I want to convey that we
> > > need protection against tearing and ignore KCSAN and using both conveys
> > > that. Let me know if you think otherwise.
> > >
> > > thanks a lot for taking a look.
> >
> > The thing to remember is that data_race() does not affect the
> > generated code (except of course when running KCSAN), and thus does
> > absolutely nothing to prevent load/store tearing.  You need things like
> > [READ|WRITE]_ONCE() to prevent tearing.
> >
> > So if it does not affect the generated code, what is the point of
> > data_race()?
> >
> > One answer to this question is for diagnostics where you want KCSAN
> > to check the main algorithm, but you don't want KCSAN to be confused
> > by the diagnostic accesses.  For example, you might use something like
> > ASSERT_EXCLUSIVE_ACCESS() as in __list_splice_init_rcu(), and not want
> > your diagnostic accesses to result in false-positive KCSAN reports
> > due to interactions with ASSERT_EXCLUSIVE_ACCESS() on some particular
> > memory location.  And if you were to use READ_ONCE() to access that same
> > memory location in your diagnostics, KCSAN would complain if they ran
> > concurrently with that ASSERT_EXCLUSIVE_ACCESS().  So you would instead
> > use data_race() to suppress such complaints.
> >
> > Does that make sense?
> 
> Thanks a lot Paul for the awesome explanation. Do you think keeping
> data_race() here would be harmful in a sense that it might cause
> confusion in future?

Yes, plus it might incorrectly suppress a KCSAN warning for a very
real bug.  So I strongly recommend removing the data_race() in this case.

							Thanx, Paul


* Re: [PATCH 2/2] cgroup: explain the race between updater and flusher
  2025-07-04  4:44           ` Paul E. McKenney
@ 2025-07-04 17:45             ` Shakeel Butt
  2025-07-04 17:58               ` Paul E. McKenney
  0 siblings, 1 reply; 10+ messages in thread
From: Shakeel Butt @ 2025-07-04 17:45 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Tejun Heo, Andrew Morton, JP Kobryn, Johannes Weiner, Ying Huang,
	Vlastimil Babka, Alexei Starovoitov, Sebastian Andrzej Siewior,
	Michal Koutný, bpf, linux-mm, cgroups, linux-kernel,
	Meta kernel team

On Thu, Jul 03, 2025 at 09:44:58PM -0700, Paul E. McKenney wrote:
[...]
> > 
> > Thanks a lot Paul for the awesome explanation. Do you think keeping
> > data_race() here would be harmful in a sense that it might cause
> > confusion in future?
> 
> Yes, plus it might incorrectly suppress a KCSAN warning for a very
> real bug.  So I strongly recommend removing the data_race() in this case.
> 

I will remove the data_race() tags but keep the comments, and squash
this into the first patch. I will keep your Reviewed-by tag unless you
disagree.

thanks,
Shakeel


* Re: [PATCH 2/2] cgroup: explain the race between updater and flusher
  2025-07-04 17:45             ` Shakeel Butt
@ 2025-07-04 17:58               ` Paul E. McKenney
  0 siblings, 0 replies; 10+ messages in thread
From: Paul E. McKenney @ 2025-07-04 17:58 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Tejun Heo, Andrew Morton, JP Kobryn, Johannes Weiner, Ying Huang,
	Vlastimil Babka, Alexei Starovoitov, Sebastian Andrzej Siewior,
	Michal Koutný, bpf, linux-mm, cgroups, linux-kernel,
	Meta kernel team

On Fri, Jul 04, 2025 at 10:45:25AM -0700, Shakeel Butt wrote:
> On Thu, Jul 03, 2025 at 09:44:58PM -0700, Paul E. McKenney wrote:
> [...]
> > > 
> > > Thanks a lot Paul for the awesome explanation. Do you think keeping
> > > data_race() here would be harmful in a sense that it might cause
> > > confusion in future?
> > 
> > Yes, plus it might incorrectly suppress a KCSAN warning for a very
> > real bug.  So I strongly recommend removing the data_race() in this case.
> 
> I will remove data_race() tags but keep the comments and squash into the
> first one. I will keep your reviewed-by tag unless you disagree.

Works for me!

							Thanx, Paul

