From: Dave Chinner <david@fromorbit.com>
To: Al Viro <viro@ZenIV.linux.org.uk>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>,
Stephen Boyd <sboyd@codeaurora.org>,
mc@linux.vnet.ibm.com, linux-kernel@vger.kernel.org,
linux-fsdevel@vger.kernel.org, Nick Piggin <npiggin@kernel.dk>,
"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
Maciej Rutecki <maciej.rutecki@gmail.com>
Subject: Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
Date: Tue, 20 Dec 2011 10:56:59 +1100 [thread overview]
Message-ID: <20111219235659.GT23662@dastard> (raw)
In-Reply-To: <20111219121100.GI2203@ZenIV.linux.org.uk>
On Mon, Dec 19, 2011 at 12:11:00PM +0000, Al Viro wrote:
> On Mon, Dec 19, 2011 at 04:33:47PM +0530, Srivatsa S. Bhat wrote:
>
> > IMHO, we don't need to be concerned here because, {get,put}_online_cpus()
> > implement a refcounting solution, and they don't really serialize stuff
> > unnecessarily. The readers (those who prevent cpu hotplug, such as this lock-
> > unlock code) are fast and can be concurrent, while the writers (the task that
> > is doing the cpu hotplug) waits till all existing readers are gone/done with
> > their work.
> >
> > So, since we are readers here, IMO, we don't have to worry about performance.
> > (I know that we get serialized just for a moment while incrementing the
> > refcount, but that should not be worrisome right?)
> >
> > Moreover, using for_each_online_cpu() without using {get,put}_online_cpus()
> > around that, is plain wrong, because of the unhandled race with cpu hotplug.
> > IOW, our primary concern here is functionality, isn't it?
> >
> > To summarize, in the current design of these VFS locks, using
> > {get,put}_online_cpus() is *essential* to fix a functionality-related bug,
> > (and not so bad performance-wise as well).
> >
> > The following patch (v2) incorporates your comments:
>
> I really don't like that. Amount of contention is not a big issue, but the
> fact that now br_write_lock(vfsmount_lock) became blocking is really nasty.
> Moreover, we suddenly get cpu_hotplug.lock nested inside namespace_sem...
> BTW, it's seriously blocking - if nothing else, it waits for cpu_down()
> in progress to complete. Which can involve any number of interesting
> locks taken by notifiers.
>
> Dave's variant is also no good; consider this:
> CPU1: br_write_lock(); spinlocks grabbed
> CPU2: br_read_lock(); spinning on one of them
> CPU3: try to take CPU2 down. We *can't* proceed to the end, notifiers or no
> notifiers, until CPU2 gets through the critical area. Which can't happen
> until the spinlock is unlocked, i.e. until CPU1 does br_write_unlock().
> Notifier can't silently do spin_unlock() here or we'll get CPU2 free to go
> into the critical area when it's really not safe there.
Yeah, XFS has some, er, complexity to handle this.
Basically, it has global state (the on-disk superblock) that the
per-cpu counters synchronised back to every so often and hence the
per-cpu counters can be switched on and off. There's also a global
state lock that is held through a counter modification slow path and
during notifier operations and the combination of these is used to
avoid such race conditions. Hence when a cpu dies, we do:
case CPU_DEAD:
case CPU_DEAD_FROZEN:
/* Disable all the counters, then fold the dead cpu's
* count into the total on the global superblock and
* re-enable the counters. */
xfs_icsb_lock(mp);
spin_lock(&mp->m_sb_lock);
xfs_icsb_disable_counter(mp, XFS_SBS_ICOUNT);
xfs_icsb_disable_counter(mp, XFS_SBS_IFREE);
xfs_icsb_disable_counter(mp, XFS_SBS_FDBLOCKS);
....
Which is basically:
1. take global counter state modification mutex
2. take in-core superblock lock (global in-core fs state
that the per-cpu counters sync to)
3. disable each online per-cpu counter
a. lock all online per-cpu locks for the counter
b. clear counter enabled bit
c. unlock all online per-cpu locks
And when the counter is re-enabled after the cleanup of the per-cpu
counter state on the dead CPU, it does it via a rebalancing
operation:
1. disable each online per-cpu counter
a. lock all online per-cpu locks for the counter
b. clear counter enabled bit
c. unlock all online per-cpu locks
2. balance counter across all online per-cpu structures
3. enable each online epr-cpu counter:
a. lock all online per-cpu locks for the counter
b. set counter enabled bit
c. unlock all online per-cpu locks
4. drop in-core superblock lock
5. drop global counter state modification mutex
Hence, in the situation you describe above, if CPU 2 gets the lock
before the notifier, all is well. In the case it doesn't, it get
blocked like this:
prempt_disable()
if (counter disabled)
goto slow path
lock_local_counter() <<<< spin here
cpu notifier disables counter
and unlocks it. We get the lock
if (counter disabled) {
unlock_local_counter()
goto slow path
}
.....
slow_path:
preempt_enable()
xfs_icsb_lock(mp) <<<< serialises on global notifier lock
not on any of the spin locks
Like I said, there's quite a bit of complexity in all this to handle
the cpu notifiers in (what I think is) a race free manner. I've been
looking at replacing all this complexity (it's close to a 1000 lines
of code) with the generic per-cpu counters, but that's got it's own
problems that involve adding lots of complexity....
> That got one hell of a deadlock potential ;-/ So far I'm more or less
> in favor of doing get_online_cpus() explicitly in fs/namespace.c, outside
> of namespace_sem. But I still have not convinced myself that it's
> really safe ;-/
Agreed, it looks like a lot simpler solution to this problem than a
notifier. But I don't think I know enough about the usage context to
determine if it is safe, either, so i can't really help you there. :/
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
next prev parent reply other threads:[~2011-12-19 23:57 UTC|newest]
Thread overview: 37+ messages / expand[flat|nested] mbox.gz Atom feed top
2011-12-19 3:36 [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs mengcong
2011-12-19 4:11 ` Al Viro
2011-12-19 5:00 ` Dave Chinner
2011-12-19 6:07 ` mengcong
2011-12-19 7:31 ` Srivatsa S. Bhat
2011-12-19 9:12 ` Stephen Boyd
2011-12-19 11:03 ` Srivatsa S. Bhat
2011-12-19 12:11 ` Al Viro
2011-12-19 20:23 ` Srivatsa S. Bhat
2011-12-19 20:52 ` Al Viro
2011-12-20 4:56 ` Srivatsa S. Bhat
2011-12-20 6:27 ` Al Viro
2011-12-20 7:28 ` Srivatsa S. Bhat
2011-12-20 9:37 ` mengcong
2011-12-20 10:36 ` Srivatsa S. Bhat
2011-12-20 11:08 ` Srivatsa S. Bhat
2011-12-20 12:50 ` Srivatsa S. Bhat
2011-12-20 14:06 ` Al Viro
2011-12-20 14:35 ` Srivatsa S. Bhat
2011-12-20 17:59 ` Al Viro
2011-12-20 19:12 ` Srivatsa S. Bhat
2011-12-20 19:58 ` Al Viro
2011-12-20 22:27 ` Dave Chinner
2011-12-20 23:31 ` Al Viro
2011-12-21 21:15 ` Srivatsa S. Bhat
2011-12-21 22:02 ` Al Viro
2011-12-21 22:12 ` Andrew Morton
2011-12-22 7:02 ` Al Viro
2011-12-22 7:20 ` Andrew Morton
2011-12-22 8:08 ` Al Viro
2011-12-22 8:17 ` Andi Kleen
2011-12-22 8:39 ` Al Viro
2011-12-22 8:22 ` Andi Kleen
2011-12-20 7:30 ` mengcong
2011-12-20 7:37 ` Srivatsa S. Bhat
2011-12-19 23:56 ` Dave Chinner [this message]
2011-12-20 4:05 ` Al Viro
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20111219235659.GT23662@dastard \
--to=david@fromorbit.com \
--cc=akpm@linux-foundation.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=maciej.rutecki@gmail.com \
--cc=mc@linux.vnet.ibm.com \
--cc=npiggin@kernel.dk \
--cc=sboyd@codeaurora.org \
--cc=srivatsa.bhat@linux.vnet.ibm.com \
--cc=viro@ZenIV.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).