From: Tejun Heo <tj@kernel.org>
To: Shakeel Butt <shakeel.butt@linux.dev>
Cc: "Paul E . McKenney" <paulmck@kernel.org>,
"Andrew Morton" <akpm@linux-foundation.org>,
"JP Kobryn" <inwardvessel@gmail.com>,
"Johannes Weiner" <hannes@cmpxchg.org>,
"Ying Huang" <huang.ying.caritas@gmail.com>,
"Vlastimil Babka" <vbabka@suse.cz>,
"Alexei Starovoitov" <ast@kernel.org>,
"Sebastian Andrzej Siewior" <bigeasy@linutronix.de>,
"Michal Koutný" <mkoutny@suse.com>,
bpf@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org,
linux-kernel@vger.kernel.org,
"Meta kernel team" <kernel-team@meta.com>
Subject: Re: [PATCH v3] cgroup: llist: avoid memory tears for llist_node
Date: Thu, 17 Jul 2025 07:44:41 -1000
Message-ID: <aHk2iXXJkgaDkXVe@slm.duckdns.org>
In-Reply-To: <20250704180804.3598503-1-shakeel.butt@linux.dev>

On Fri, Jul 04, 2025 at 11:08:04AM -0700, Shakeel Butt wrote:
> Before the commit 36df6e3dbd7e ("cgroup: make css_rstat_updated nmi
> safe"), the struct llist_node is expected to be private to the one
> inserting the node to the lockless list or the one removing the node
> from the lockless list. After the mentioned commit, the llist_node in
> the rstat code is per-cpu shared between the stacked contexts i.e.
> process, softirq, hardirq & nmi. It is possible the compiler may tear
> the loads or stores of llist_node. Let's avoid that.
>
> KCSAN reported the following race:
>
> Reported by Kernel Concurrency Sanitizer on:
> CPU: 60 UID: 0 PID: 5425 ... 6.16.0-rc3-next-20250626 #1 NONE
> Tainted: [E]=UNSIGNED_MODULE
> Hardware name: ...
> ==================================================================
> ==================================================================
> BUG: KCSAN: data-race in css_rstat_flush / css_rstat_updated
> write to 0xffffe8fffe1c85f0 of 8 bytes by task 1061 on cpu 1:
> css_rstat_flush+0x1b8/0xeb0
> __mem_cgroup_flush_stats+0x184/0x190
> flush_memcg_stats_dwork+0x22/0x50
> process_one_work+0x335/0x630
> worker_thread+0x5f1/0x8a0
> kthread+0x197/0x340
> ret_from_fork+0xd3/0x110
> ret_from_fork_asm+0x11/0x20
> read to 0xffffe8fffe1c85f0 of 8 bytes by task 3551 on cpu 15:
> css_rstat_updated+0x81/0x180
> mod_memcg_lruvec_state+0x113/0x2d0
> __mod_lruvec_state+0x3d/0x50
> lru_add+0x21e/0x3f0
> folio_batch_move_lru+0x80/0x1b0
> __folio_batch_add_and_move+0xd7/0x160
> folio_add_lru_vma+0x42/0x50
> do_anonymous_page+0x892/0xe90
> __handle_mm_fault+0xfaa/0x1520
> handle_mm_fault+0xdc/0x350
> do_user_addr_fault+0x1dc/0x650
> exc_page_fault+0x5c/0x110
> asm_exc_page_fault+0x22/0x30
> value changed: 0xffffe8fffe18e0d0 -> 0xffffe8fffe1c85f0
>
> $ ./scripts/faddr2line vmlinux css_rstat_flush+0x1b8/0xeb0
> css_rstat_flush+0x1b8/0xeb0:
> init_llist_node at include/linux/llist.h:86
> (inlined by) llist_del_first_init at include/linux/llist.h:308
> (inlined by) css_process_update_tree at kernel/cgroup/rstat.c:148
> (inlined by) css_rstat_updated_list at kernel/cgroup/rstat.c:258
> (inlined by) css_rstat_flush at kernel/cgroup/rstat.c:389
>
> $ ./scripts/faddr2line vmlinux css_rstat_updated+0x81/0x180
> css_rstat_updated+0x81/0x180:
> css_rstat_updated at kernel/cgroup/rstat.c:90 (discriminator 1)
>
> These are expected races and a simple READ_ONCE/WRITE_ONCE resolves these
> reports. However let's add comments to explain the race and the need for
> memory barriers if stronger guarantees are needed.
>
> More specifically the rstat updater and the flusher can race and cause a
> scenario where the stats updater skips adding the css to the lockless
> list but the flusher might not see those updates done by the skipped
> updater. This is a benign race: the subsequent flusher will flush those
> stats, and at the moment there aren't any rstat users which are not fine
> with this kind of race. However, some future user might want a
> stricter guarantee, so let's add appropriate comments to ease the job of
> future users.
>
> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
> Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
> Fixes: 36df6e3dbd7e ("cgroup: make css_rstat_updated nmi safe")

Applied to cgroup/for-6.17. Sorry about the delay. I'm on vacation and
ended up a lot more offline than I expected to be.

Thanks.
--
tejun
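
For readers following along, the READ_ONCE/WRITE_ONCE idea from the quoted
commit message can be sketched in userspace roughly as below. This is only an
illustration, not the applied patch: the two macros are simplified stand-ins
for the kernel ones (they assume GCC or Clang for __typeof__), the helper
bodies are modeled on the init_llist_node()/llist_on_list() names from
include/linux/llist.h, and llist_selftest() is a hypothetical name added for
this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Simplified userspace stand-ins for the kernel macros (assumption:
 * GCC/Clang __typeof__). The volatile access asks the compiler to emit
 * a single load or store instead of tearing or refetching it.
 */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

struct llist_node {
	struct llist_node *next;
};

/* A node that points at itself is treated as "not on any list". */
static inline void init_llist_node(struct llist_node *node)
{
	WRITE_ONCE(node->next, node);
}

/* May run in a context that interrupted a writer, hence READ_ONCE. */
static inline bool llist_on_list(const struct llist_node *node)
{
	return READ_ONCE(node->next) != node;
}

/* Hypothetical single-threaded walk-through of the two states. */
static inline void llist_selftest(void)
{
	struct llist_node a, b;

	init_llist_node(&a);
	init_llist_node(&b);
	assert(!llist_on_list(&a));

	/* Linking a in front of b makes a look "on list". */
	WRITE_ONCE(a.next, &b);
	assert(llist_on_list(&a));

	/* Re-initialising, as the flusher side does, clears it again. */
	init_llist_node(&a);
	assert(!llist_on_list(&a));
}
```

Note that this only keeps the compiler from tearing the pointer access. As the
commit message says, it adds no memory ordering, so a user that needs the
flusher to observe the updater's stats would still need explicit barriers.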