From: Al Viro <viro@ZenIV.linux.org.uk>
To: Jeff Layton <jeff.layton@primarydata.com>
Cc: linux-fsdevel@vger.kernel.org, trond.myklebust@primarydata.com,
	tao.peng@primarydata.com
Subject: Re: [PATCH] vfs: remove unneeded hlist_unhashed check from get_active_super
Date: Wed, 15 Oct 2014 00:05:20 +0100
Message-ID: <20141014230520.GA7996@ZenIV.linux.org.uk>
In-Reply-To: <20141014151911.2581f362@tlielax.poochiereds.net>

On Tue, Oct 14, 2014 at 03:19:11PM -0400, Jeff Layton wrote:
> Ok, got it. Thanks for the clarification, Al!

FWIW, the life cycle for superblocks looks like this:

* Allocated: all instances are created in this state, all by alloc_super().
Invisible to global data structures.  ->s_umount held exclusive, ->s_count is
1, ->s_active is 1, no MS_BORN in flags.  Possible successors: Implanted,
Freeing; transition happens within sget() (which is the sole caller of
alloc_super()).

* Implanted: set up by 'set()' callback of sget() sufficiently to be
recognizable by 'test()' callback of the same and inserted into super_blocks
and type->fs_supers (all under sb_lock).  Possible successors: SettingUp,
Freeing.  The latter happens if 'set()' fails, the former - if it succeeds;
sb_lock is dropped upon either transition (both are within sget()).
->s_count is 1, ->s_active is 1, ->s_umount held exclusive, !MS_BORN

* SettingUp: in super_blocks and type->fs_supers, ->s_active is still 1,
->s_count > 0, !MS_BORN.  That's the state in which new instances are
returned by sget() to its callers.  ->s_umount might be dropped and
regained; in the end it is dropped.  Subsequent sget() attempts on the
same fs will block until this instance leaves that state.  No ->s_active
increments are allowed.  That's when the bulk of filesystem setup is being
done.  Possible successors: Born, ShuttingDown (depending on whether that
setup attempt succeeds or fails).  Instances in that state are seen by
->mount().  Transition to Born consists of setting MS_BORN and dropping
->s_umount; transition to ShuttingDown - call of deactivate_locked_super().

* Born: in super_blocks and type->fs_supers, ->s_umount is not held,
->s_active > 0, ->s_count > 0, MS_BORN is set, ->s_root is non-NULL.
That's the normal state; fs is fully set up and active.  ->s_active
increments and decrements are possible.  ->s_active can reach 0 only
with ->s_umount held exclusive - that happens only in deactivate_locked_super()
and moves us to ShuttingDown state.  That's the only possible successor.

* ShuttingDown: still in super_blocks and type->fs_supers, ->s_umount is
held exclusive, ->s_active is 0 (and will never increment after that point),
->s_count is positive, MS_BORN is set.  That's where the fs shutdown happens.
At some point in ->kill_sb() we must call generic_shutdown_super(), which
will do the type-independent part of shutdown, including dentry tree
freeing, inode eviction, etc.  And, in the end, removes it from
type->fs_supers (protected by sb_lock) and drops ->s_umount.  At that point
we are in RunDown state.  In principle, dropping and regaining ->s_umount
in ShuttingDown state is allowed, as long as it's not done before ->s_root
has been made NULL (by shrink_dcache_for_umount() from
generic_shutdown_super()), but I don't know of any fs type that would
do it for any reason.

* RunDown: still in super_blocks. ->s_count > 0, ->s_active is 0, MS_BORN is
set, ->s_root is NULL.  No increments of ->s_count from that point; there
might be processes blocked on contended attempts to get ->s_umount, but as
soon as they get it and see that superblock is in that state, they drop
->s_umount and decrement ->s_count.  Once ->s_count reaches 0, we remove
it from super_blocks and move to Freeing (again, all manipulations of lists
and of ->s_count happen under sb_lock).

* Freeing: what it says.  We free that sucker.  That's in destroy_super().
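
As a quick cross-check of the list above, here is a toy userspace model of the "possible successors" relation.  It is only a sketch: State, SUCCESSORS, can_transition() and is_valid_path() are made-up names for illustration and do not correspond to anything in fs/super.c.

```python
from enum import Enum, auto


class State(Enum):
    ALLOCATED = auto()
    IMPLANTED = auto()
    SETTING_UP = auto()
    BORN = auto()
    SHUTTING_DOWN = auto()
    RUN_DOWN = auto()
    FREEING = auto()


# Possible successors, copied from the state descriptions above.
SUCCESSORS = {
    State.ALLOCATED: {State.IMPLANTED, State.FREEING},
    State.IMPLANTED: {State.SETTING_UP, State.FREEING},
    State.SETTING_UP: {State.BORN, State.SHUTTING_DOWN},
    State.BORN: {State.SHUTTING_DOWN},
    State.SHUTTING_DOWN: {State.RUN_DOWN},
    State.RUN_DOWN: {State.FREEING},
    State.FREEING: set(),
}


def can_transition(src, dst):
    """True if dst is listed as a possible successor of src."""
    return dst in SUCCESSORS[src]


def is_valid_path(path):
    """True if every consecutive pair of states in path is an allowed step."""
    return all(can_transition(a, b) for a, b in zip(path, path[1:]))
```

A failed setup would be the path Allocated, Implanted, SettingUp, ShuttingDown, RunDown, Freeing; a normal mount takes the same route with Born inserted before ShuttingDown.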

There are some extra complications from RCU pathwalk.  Namely, we need to
make sure that all freeing of data structures needed by LOOKUP_RCU ->lookup(),
LOOKUP_RCU ->d_revalidate(), ->d_manage(..., true), ->d_hash() and
->d_compare() (i.e. all fs code that can be called from RCU pathwalk) won't
happen without RCU delay.  For struct super_block itself it is guaranteed
by use of kfree_rcu() in destroy_super(); for the fs type *module* it's the
usual logic in module_put() (we hold a reference to module from a bit before
entering SettingUp to a bit after transition to RunDown).  Grace periods for
dentries and inodes are dealt with by VFS, provided that references to
them are not leaked by fs driver.  Filesystem-private data structures
are responsibility of fs driver itself, of course.

Note that instance remains on the super_blocks until it's about to get
freed.  That simplifies a lot of logic in list traversals - we walk it
under sb_lock and as long as we bump ->s_count before dropping sb_lock,
we can be sure that the instance will stay on list until we do the matching
decrement.  That's the reason for asymmetry between super_blocks and
type->fs_supers.
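
The pinning rule can be illustrated with a small userspace sketch.  Everything here is an assumption made for illustration (the SuperBlock and Registry classes, grab() and put(), a threading.Lock standing in for sb_lock); the real code lives in fs/super.c and looks nothing like this.

```python
import threading


class SuperBlock:
    def __init__(self, name):
        self.name = name
        self.s_count = 1  # reference held by whoever created it


class Registry:
    """Toy stand-in for the global super_blocks list."""

    def __init__(self):
        self.sb_lock = threading.Lock()
        self.super_blocks = []

    def add(self, sb):
        with self.sb_lock:
            self.super_blocks.append(sb)

    def grab(self, sb):
        # Bump ->s_count while still holding sb_lock, as a list walker would.
        with self.sb_lock:
            sb.s_count += 1

    def put(self, sb):
        # Drop a reference; only the final decrement unlinks the instance.
        with self.sb_lock:
            sb.s_count -= 1
            if sb.s_count == 0:
                self.super_blocks.remove(sb)
```

A walker that has called grab() can drop sb_lock and still find the instance on the list later, since only the final put() unlinks it.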

To grab ->s_umount you need either guaranteed positive ->s_count or
guaranteed positive ->s_active (the latter guarantees the former).
Holding ->s_umount is enough to stabilize the state.
With ->s_umount grabbed by somebody other than the process doing the
lifecycle transitions, the following is true:
	->s_root is non-NULL => it's in SettingUp or Born state
	->s_active is positive => it's in SettingUp or Born state
	MS_BORN is set => it's in Born, ShuttingDown or RunDown state.
Note that checking MS_BORN *is* needed - in SettingUp ->s_umount may be dropped
and regained, so the first two tests do not distinguish SettingUp
from Born.  OTOH, ->mnt_sb is guaranteed to be in Born state and to remain
such for as long as vfsmount is held.
With sb_lock held the following is true:
	it's on type->fs_supers <=> it's in SettingUp, Born or ShuttingDown

We are not allowed to increment ->s_active unless in Born state.
We are not allowed to increment ->s_count when in RunDown state.
Decrements of ->s_active from 1 to 0 are all done under ->s_umount by
deactivate_locked_super().
Changes of ->s_count and list manipulations are done under sb_lock.
All callers of sget() are in ->mount() instances.
Superblock stays on type->fs_supers from the beginning of setup to the end
of shutdown.  sget() while a matching fs is in SettingUp or ShuttingDown state
will block; sget() while such fs is in Born state will bump its ->s_active,
and return it with s_umount held exclusive.  Callers can tell whether they
got a new instance or an extra reference to existing one by checking ->s_root
of what they get (NULL <=> it's a new instance).
All callers of generic_shutdown_super() are from ->kill_sb() instances and
it must be called by any such instance.
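
The sget() rules above can be modelled roughly as follows.  This is a userspace toy, not the kernel function: the real sget() takes test()/set() callbacks and really blocks, whereas this sketch matches on a plain key and raises a made-up WouldBlock exception instead.  SuperBlock, finish_setup() and the ms_born field are likewise assumptions for illustration, and ->s_umount is not modelled at all.

```python
class WouldBlock(Exception):
    """Stands in for sget() sleeping until the match leaves its state."""


class SuperBlock:
    def __init__(self, key):
        self.key = key
        self.s_root = None    # NULL until setup gets far enough
        self.s_active = 1
        self.ms_born = False  # models MS_BORN in the flags


def sget(fs_supers, key):
    """Return a matching Born instance with ->s_active bumped, or a new one."""
    for sb in fs_supers:
        if sb.key == key:
            if not sb.ms_born or sb.s_active == 0:
                # Match is in SettingUp or ShuttingDown: the caller would wait.
                raise WouldBlock(key)
            sb.s_active += 1
            return sb
    sb = SuperBlock(key)
    fs_supers.append(sb)
    return sb


def finish_setup(sb):
    """Hypothetical helper: successful setup, i.e. SettingUp -> Born."""
    sb.s_root = object()
    sb.ms_born = True
```

The caller tells the two cases apart exactly as described: a NULL (here None) ->s_root means a fresh instance that still has to be set up.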

Note, BTW, that once we are in SettingUp state, the call of ->kill_sb() is
guaranteed, whether on normal fs shutdown or in case of setup failure, no
matter how early in setup that failure happens.  The rules are different for
older method (->put_super()), which is called by generic_shutdown_super()
(itself called by ->kill_sb()) only if ->s_root has already been made non-NULL.
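
The difference between the two methods, as a toy sketch: the functions below only record which callbacks would run, sb is a plain dict, and log is a list.  All of that is made up for illustration and says nothing about what the real generic_shutdown_super() does beyond the rule stated above.

```python
def generic_shutdown_super(sb, log):
    # ->put_super() is called only if ->s_root has been made non-NULL.
    if sb["s_root"] is not None:
        log.append("put_super")
        sb["s_root"] = None


def kill_sb(sb, log):
    # Always reached once the instance has entered SettingUp,
    # whether setup succeeded or failed early.
    log.append("kill_sb")
    generic_shutdown_super(sb, log)
```

So a filesystem that allocates private data before setting ->s_root has to free it from ->kill_sb(), not rely on ->put_super().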

The braindump above probably needs to be reordered into more readable form and
put into Documentation/filesystems/<something>; for now let it just sit in
list archives and wait...
