From: Al Viro <viro@ZenIV.linux.org.uk>
To: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.
Date: Mon, 29 Sep 2014 16:04:45 +0100 [thread overview]
Message-ID: <20140929150445.GE7996@ZenIV.linux.org.uk> (raw)
In-Reply-To: <20140929131613.GL5015@linux.vnet.ibm.com>
On Mon, Sep 29, 2014 at 06:16:13AM -0700, Paul E. McKenney wrote:
> The "textbook" approach is to avoid starting the grace period until
> the after the final reference count is dropped (and of course after
> the name has been made inaccessible to all readers). Not sure what
> the best way to adapt the current code to this approach (if it is
> in fact feasible to begin with), but one approach would be something
> like this:
> static void __d_free(struct rcu_head *head)
> {
> struct dentry *dentry = container_of(head, struct dentry, d_u.d_rcu);
>
> WARN_ON(!hlist_unhashed(&dentry->d_alias));
> if (dname_external(dentry) && atomic_dec_return(&dentry->refcnt))
> call_rcu(&dentry->rh, __d_free_name_rcu);
> kmem_cache_free(dentry_cache, dentry);
> }
>
> Of course, this means that the link side has to do something like
> atomic_add_unless(&&dentry->refcnt, 1, 0), and create a new name
> if the old one is on its way out. If that is too painful, another
What do you mean, "link"? Rename, perhaps?
> approach is to increment the count for the grace period and decrement
> it at the end of the grace period, and to skip the free if non-zero
> after that decrement. And this means adding an rcu_head to the
> qstr structure, which might not help memory footprint.
> Presumably the write_seqcount_begin() prevents confusion from ephemeral
> names...
We hold rename_lock, so __d_move() is serialized. *And* both arguments
are pinned, so they couldn't have reached free_dentry() yet.
> > +static void copy_name(struct dentry *dentry, struct dentry *target)
> > +{
> > + struct ext_name *old_name = ext_name(dentry);
> > + if (unlikely(dname_external(target))) {
> > + atomic_inc(&ext_name(target)->count);
>
> I might be missing something, but I believe that the above needs to
> be atomic_add_unless().
It would better not; both dentry and target are pinned.
> Might also need to be in an RCU read-side critical section, but I am not
> so sure about that. If it is not in an RCU read-side critical section,
> I don't see how the kfree_rcu() knows how long to wait before freeing
> the name.
It is a *writer*, not reader. Readers of ->d_name are either under
rcu_read_lock(), or have hard exclusion wrt that __d_move() one way or another
and hold a reference to dentry, preventing dentry_free(). We really
need to care only about the ones under rcu_read_lock(), so kfree_rcu()
will do. Non-RCU readers are relying on something like "I'm holding
a reference to dentry and ->i_mutex on the parent directory or
->s_vfs_rename_mutex on the entire filesystem". Or "I'm holding that
reference and I know that on this filesystem nothing could lead to
d_move() and friends".
Anyway, see followup to that patch; it's equivalent to this one, but instead
of separately delayed freeing of dentry and ext name it makes __d_free()
free just the dentry itself and __d_free_ext() freeing both. With
free_dentry() choosing which one to call_rcu().
FWIW, the thing I had been worrying about turned out to be a red herring (for
the same reason why RCU list removals don't need barriers - call_rcu() acts
as a barrier, so that reader managing to see the old pointer after
rcu_read_lock() will be seen by call_rcu() from writer, making the callback
delayed until that reader does rcu_read_unlock(); AFAICS, that's guaranteed
by smp_mb() in __call_rcu()). However, the readers are, indeed, in trouble.
Already. There's a data dependency - we want to make sure that the string
we observe via ->d_name.name is NUL-terminated. We have dentry allocation
doing this
dname[name->len] = 0;
/* Make sure we always see the terminating NUL character */
smp_wmb();
dentry->d_name.name = dname;
and dentry_cmp() (one of the RCU readers) has
cs = ACCESS_ONCE(dentry->d_name.name);
smp_read_barrier_depends();
before looking at the string itself. However, another such reader
(prepend_name()) doesn't - it just does ACCESS_ONCE() and assumes that
store of terminating NUL will be seen. AFAICS, not guaranteed on Alpha,
so we need the same smp_read_barrier_depends() there.
Another thing: dentry_free() is called only when all non-RCU readers can't
see the dentry. So that's where we need to decrement the refcount of ext
name, AFAICS. Same in copy_name() (from __d_move()) where we lose the
reference to old name. Again, non-RCU readers are not a problem - they
have exclusion wrt __d_move(). For them it's atomic.
Refcount is equal to the number of dentries visible to non-RCU readers with
->d_name.name pointing to that external name. Freeing isn't scheduled until
that reaches 0 and is done via kfree_rcu() (or, in the later version of patch,
kfree_rcu() on one path and call_rcu() on another)...
next prev parent reply other threads:[~2014-09-29 15:04 UTC|newest]
Thread overview: 37+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-09-24 18:14 [PATCH v2] vfs: Don't exchange "short" filenames unconditionally Mikhail Efremov
2014-09-24 18:55 ` Al Viro
2014-09-24 19:20 ` Linus Torvalds
2014-09-24 19:27 ` Linus Torvalds
2014-09-24 20:18 ` Al Viro
2014-09-25 4:46 ` Al Viro
2014-09-26 16:44 ` Al Viro
2014-09-27 4:45 ` Al Viro
2014-09-27 17:56 ` Linus Torvalds
2014-09-27 18:31 ` Al Viro
2014-09-27 19:16 ` Al Viro
2014-09-27 19:37 ` Linus Torvalds
2014-09-27 19:39 ` Linus Torvalds
2014-09-27 19:49 ` Al Viro
2014-09-27 19:55 ` Linus Torvalds
2014-09-27 21:48 ` [git pull] vfs.git for 3.17-rc7 Al Viro
2014-09-27 19:45 ` [PATCH v2] vfs: Don't exchange "short" filenames unconditionally Al Viro
2014-09-28 7:47 ` Al Viro
2014-09-28 18:05 ` Al Viro
2014-09-28 21:51 ` Al Viro
2014-09-29 1:06 ` [PATCH] missing data dependency barrier in prepend_name() Al Viro
2014-09-29 15:15 ` [PATCH v2] vfs: Don't exchange "short" filenames unconditionally Linus Torvalds
2014-09-29 15:59 ` Al Viro
2014-09-29 16:07 ` Linus Torvalds
2014-09-29 16:27 ` Al Viro
2014-09-29 17:54 ` Linus Torvalds
2014-09-29 19:04 ` Al Viro
2014-09-29 20:45 ` Linus Torvalds
2014-09-29 18:42 ` Paul E. McKenney
2014-10-01 0:16 ` Al Viro
2014-10-02 5:38 ` Paul E. McKenney
2014-10-02 10:35 ` Chuck Ebbert
2014-10-03 2:11 ` Paul E. McKenney
2014-09-29 13:16 ` Paul E. McKenney
2014-09-29 15:04 ` Al Viro [this message]
2014-09-28 15:01 ` Mikhail Efremov
2014-09-26 20:23 ` Al Viro
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20140929150445.GE7996@ZenIV.linux.org.uk \
--to=viro@zeniv.linux.org.uk \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=paulmck@linux.vnet.ibm.com \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).