From: Long Li <leo.lilong@huawei.com>
To: Dave Chinner <david@fromorbit.com>
Cc: <djwong@kernel.org>, <cem@kernel.org>,
<linux-xfs@vger.kernel.org>, <yi.zhang@huawei.com>,
<houtao1@huawei.com>, <yangerkun@huawei.com>,
<lonuxli.64@gmail.com>
Subject: Re: [PATCH] xfs: fix race condition in inodegc list and cpumask handling
Date: Thu, 5 Dec 2024 14:59:16 +0800 [thread overview]
Message-ID: <Z1FPRKXZrsAxsoZ5@localhost.localdomain> (raw)
In-Reply-To: <Z0TmLzSmLr78T8Im@dread.disaster.area>
On Tue, Nov 26, 2024 at 08:03:43AM +1100, Dave Chinner wrote:
Sorry for reply so late, because I want to make the problem as clear
as possible, but there are still some doubts.
> On Mon, Nov 25, 2024 at 09:52:58AM +0800, Long Li wrote:
> > There is a race condition between inodegc queue and inodegc worker where
> > the cpumask bit may not be set when concurrent operations occur.
>
> What problems does this cause? i.e. how do we identify systems
> hitting this issue?
>
I haven't encountered any actual issues, but while reviewing 62334fab4762
("xfs: use per-mount cpumask to track nonempty percpu inodegc lists"), I
noticed there is a potential problem.
When the gc worker runs on a CPU other than the specified one due to
loadbalancing, it could race with xfs_inodegc_queue() processing the
same struct xfs_inodegc. If xfs_inodegc_queue() adds the last inode
to the gc list during this race, that inode might never be processed
and reclaimed due to cpumask not set. This maybe lead to memory leaks
after filesystem unmount, I'm unsure if there are other more serious
implications.
> >
> > Current problematic sequence:
> >
> > CPU0 CPU1
> > -------------------- ---------------------
> > xfs_inodegc_queue() xfs_inodegc_worker()
> > llist_del_all(&gc->list)
> > llist_add(&ip->i_gclist, &gc->list)
> > cpumask_test_and_set_cpu()
> > cpumask_clear_cpu()
> > < cpumask not set >
> >
> > Fix this by moving llist_del_all() after cpumask_clear_cpu() to ensure
> > proper ordering. This change ensures that when the worker thread clears
> > the cpumask, any concurrent queue operations will either properly set
> > the cpumask bit or have already emptied the list.
> >
> > Also remove unnecessary smp_mb__{before/after}_atomic() barriers since
> > the llist_* operations already provide required ordering semantics. it
> > make the code cleaner.
>
> IIRC, the barriers were for ordering the cpumask bitmap ops against
> llist operations. There are calls elsewhere to for_each_cpu() that
> then use llist_empty() checks (e.g xfs_inodegc_queue_all/wait_all),
> so on relaxed architectures (like alpha) I think we have to ensure
> the bitmask ops carried full ordering against the independent llist
> ops themselves. i.e. llist_empty() just uses READ_ONCE, so it only
> orders against other llist ops and won't guarantee any specific
> ordering against against cpumask modifications.
>
> I could be remembering incorrectly, but I think that was the
> original reason for the barriers. Can you please confirm that the
> cpumask iteration/llist_empty checks do not need these bitmask
> barriers anymore? If that's ok, then the change looks fine.
>
Even on architectures with relaxed memory ordering (like alpha), I noticed
that llist_add() already has full barrier semantics, so I think the
smp_mb__before_atomic barrier in xfs_inodegc_queue() can be removed.
llist_add()
try_cmpxchg
raw_try_cmpxchg
arch_cmpxchg
arch_cmpxchg of alpha in file arch/alpha/include/asm/cmpxchg.h
#define arch_cmpxchg(ptr, o, n) \
({ \
__typeof__(*(ptr)) __ret; \
__typeof__(*(ptr)) _o_ = (o); \
__typeof__(*(ptr)) _n_ = (n); \
smp_mb(); \
__ret = (__typeof__(*(ptr))) ____cmpxchg((ptr), \
(unsigned long)_o_, (unsigned long)_n_, sizeof(*(ptr)));\
smp_mb(); \
^^^^^^^
__ret; \
})
I'm wondering if we really need to "Ensure the list add is always seen
by xfs_inodegc_queue_all() who finds the cpumask bit set". It seems
harmless even if we don't see the list add completion - it can be
processed in the next round. xfs_inodegc_queue_all() doesn't guarantee
processing all inodes that are being or will be added to the gc llist
anyway.
If we do need that guarantee, should we add a barrier between reading
m_inodegc_cpumask and gc->list in xfs_inodegc_queue_all() to prevent
load-load reordering?
Maybe I'm misunderstanding something.
Thanks,
Long Li
next prev parent reply other threads:[~2024-12-05 7:01 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-11-25 1:52 [PATCH] xfs: fix race condition in inodegc list and cpumask handling Long Li
2024-11-25 21:03 ` Dave Chinner
2024-12-05 6:59 ` Long Li [this message]
2024-12-10 6:20 ` Dave Chinner
2024-12-25 12:41 ` Long Li
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Z1FPRKXZrsAxsoZ5@localhost.localdomain \
--to=leo.lilong@huawei.com \
--cc=cem@kernel.org \
--cc=david@fromorbit.com \
--cc=djwong@kernel.org \
--cc=houtao1@huawei.com \
--cc=linux-xfs@vger.kernel.org \
--cc=lonuxli.64@gmail.com \
--cc=yangerkun@huawei.com \
--cc=yi.zhang@huawei.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox