From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: chandan.babu@gmail.com, linux-xfs@vger.kernel.org,
ritesh.list@gmail.com, sandeen@sandeen.net
Subject: Re: [PATCH 2/3] xfs: use per-mount cpumask to track nonempty percpu inodegc lists
Date: Thu, 24 Aug 2023 21:07:30 -0700
Message-ID: <20230825040730.GG17912@frogsfrogsfrogs>
In-Reply-To: <ZOfyASUg/A+GWE37@dread.disaster.area>
On Fri, Aug 25, 2023 at 10:12:49AM +1000, Dave Chinner wrote:
> On Thu, Aug 24, 2023 at 04:21:25PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Directly track which CPUs have contributed to the inodegc percpu lists
> > instead of trusting the cpu online mask. This eliminates a theoretical
> > problem where the inodegc flush functions might fail to flush a CPU's
> > inodes if that CPU happened to be dying at exactly the same time. Most
> > likely nobody's noticed this because the CPU dead hook moves the percpu
> > inodegc list to another CPU and schedules that worker immediately. But
> > it's quite possible that this is a subtle race leading to UAF if the
> > inodegc flush were part of an unmount.
> >
> > Further benefits: This reduces the overhead of the inodegc flush code
> > slightly by allowing us to ignore CPUs that have empty lists. Better
> > yet, it reduces our dependence on the cpu online masks, which have been
> > the cause of confusion and drama lately.
> >
> > Fixes: ab23a7768739 ("xfs: per-cpu deferred inode inactivation queues")
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> > fs/xfs/xfs_icache.c | 60 +++++++++++----------------------------------------
> > fs/xfs/xfs_icache.h | 1 -
> > fs/xfs/xfs_mount.h | 6 +++--
> > fs/xfs/xfs_super.c | 4 +--
> > 4 files changed, 18 insertions(+), 53 deletions(-)
> >
> >
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index e541f5c0bc25..7fd876e94ecb 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -443,7 +443,7 @@ xfs_inodegc_queue_all(
> > int cpu;
> > bool ret = false;
> >
> > - for_each_online_cpu(cpu) {
> > + for_each_cpu(cpu, &mp->m_inodegc_cpumask) {
> > gc = per_cpu_ptr(mp->m_inodegc, cpu);
> > if (!llist_empty(&gc->list)) {
> > mod_delayed_work_on(cpu, mp->m_inodegc_wq, &gc->work, 0);
> > @@ -463,7 +463,7 @@ xfs_inodegc_wait_all(
> > int error = 0;
> >
> > flush_workqueue(mp->m_inodegc_wq);
> > - for_each_online_cpu(cpu) {
> > + for_each_cpu(cpu, &mp->m_inodegc_cpumask) {
> > struct xfs_inodegc *gc;
> >
> > gc = per_cpu_ptr(mp->m_inodegc, cpu);
> > @@ -1845,10 +1845,12 @@ xfs_inodegc_worker(
> > struct xfs_inodegc, work);
> > struct llist_node *node = llist_del_all(&gc->list);
> > struct xfs_inode *ip, *n;
> > + struct xfs_mount *mp = gc->mp;
> > unsigned int nofs_flag;
> >
> > ASSERT(gc->cpu == smp_processor_id());
> >
> > + cpumask_test_and_clear_cpu(gc->cpu, &mp->m_inodegc_cpumask);
>
> Why does this need to be a test-and-clear operation? If it is set,
> we clear it. If it is not set, clearing it is a no-op. Hence we
> don't need to test whether the bit is set first. Also,
> cpumask_clear_cpu() uses clear_bit(), which is an atomic operation,
> so clearing the bit isn't going to race with any other updates.
Aha, I didn't realize that clear_bit is atomic. Oh, I guess
Documentation/atomic_bitops.txt mentions that an RMW bitop is atomic
when there's a separate non-atomic '__'-prefixed version of it.
Subtle. :(
> As it is, we probably want acquire semantics for the gc structure
> here (see below), so I think this likely should be:
>
> /*
> * Clear the cpu mask bit and ensure that we have seen the
> * latest update of the gc structure associated with this
> * CPU. This matches with the release semantics used when
> * setting the cpumask bit in xfs_inodegc_queue.
> */
> cpumask_clear_cpu(gc->cpu, &mp->m_inodegc_cpumask);
> smp_mb__after_atomic();
<nod>
> > WRITE_ONCE(gc->items, 0);
> >
> > if (!node)
> > @@ -1862,7 +1864,7 @@ xfs_inodegc_worker(
> > nofs_flag = memalloc_nofs_save();
> >
> > ip = llist_entry(node, struct xfs_inode, i_gclist);
> > - trace_xfs_inodegc_worker(ip->i_mount, READ_ONCE(gc->shrinker_hits));
> > + trace_xfs_inodegc_worker(mp, READ_ONCE(gc->shrinker_hits));
> >
> > WRITE_ONCE(gc->shrinker_hits, 0);
> > llist_for_each_entry_safe(ip, n, node, i_gclist) {
> > @@ -2057,6 +2059,7 @@ xfs_inodegc_queue(
> > struct xfs_inodegc *gc;
> > int items;
> > unsigned int shrinker_hits;
> > + unsigned int cpu_nr;
> > unsigned long queue_delay = 1;
> >
> > trace_xfs_inode_set_need_inactive(ip);
> > @@ -2064,12 +2067,16 @@ xfs_inodegc_queue(
> > ip->i_flags |= XFS_NEED_INACTIVE;
> > spin_unlock(&ip->i_flags_lock);
> >
> > - gc = get_cpu_ptr(mp->m_inodegc);
> > + cpu_nr = get_cpu();
> > + gc = this_cpu_ptr(mp->m_inodegc);
> > llist_add(&ip->i_gclist, &gc->list);
> > items = READ_ONCE(gc->items);
> > WRITE_ONCE(gc->items, items + 1);
> > shrinker_hits = READ_ONCE(gc->shrinker_hits);
> >
> > + if (!cpumask_test_cpu(cpu_nr, &mp->m_inodegc_cpumask))
> > + cpumask_test_and_set_cpu(cpu_nr, &mp->m_inodegc_cpumask);
> > +
> > /*
> > * We queue the work while holding the current CPU so that the work
> > * is scheduled to run on this CPU.
>
> I think we need release/acquire memory ordering on this atomic bit
> set now. i.e. to guarantee that if the worker sees the cpumask bit
> set (with acquire semantics), it will always see the latest item
> added to the list. i.e.
>
> /*
> * Ensure the list add is always seen by anyone that
> * find the cpumask bit set. This effectively gives
> * the cpumask bit set operation release ordering semantics.
> */
> smp_mb__before_atomic();
> if (!cpumask_test_cpu(cpu_nr, &mp->m_inodegc_cpumask))
> cpumask_test_and_set_cpu(cpu_nr, &mp->m_inodegc_cpumask);
<nod>
> Also, same comment about put_cpu() vs put_cpu_var() as the last patch.
<nod>
> Otherwise this seems sane.
Thanks!
--D
> -Dave.
>
> --
> Dave Chinner
> david@fromorbit.com