From: NeilBrown <neilb@suse.de>
To: Kinglong Mee <kinglongmee@gmail.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>,
"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH RFC] NFSD: fix cannot umounting mount points under pseudo root
Date: Fri, 24 Apr 2015 13:00:45 +1000 [thread overview]
Message-ID: <20150424130045.6bbdb2f9@notabene.brown> (raw)
In-Reply-To: <5538EB18.7080802@gmail.com>
[-- Attachment #1: Type: text/plain, Size: 8882 bytes --]
On Thu, 23 Apr 2015 20:52:40 +0800 Kinglong Mee <kinglongmee@gmail.com> wrote:
> On 4/23/2015 7:44 AM, NeilBrown wrote:
> > On Wed, 22 Apr 2015 11:07:03 -0400 "J. Bruce Fields" <bfields@fieldses.org>
> > wrote:
> >>> Reference of dentry/mnt is like a cache, avoids re-finding of them,
> >>> it is necessary to store them in svc_export.
> >>>
> >>> Neil points out another way of 'fs_pin', I will check that.
> >>
> >> Yes, that'd be interesting. On a very quick look--I don't understand
> >> what it's meant to do at all. But if it does provide a way to get a
> >> callback on umount, that'd certainly be interesting....
> >
> > Yeah, on a quick look it isn't really obvious at all.
> >
> > But I didn't read the code. I read
> > https://lwn.net/Articles/636730/
> >
> > which says:
> >
> > In its place is a mechanism by which an object can be added to a vfsmount
> > structure (which represents a mounted filesystem); that object supports
> > only one method: kill(). These "pin" objects hang around until the final
> > reference to the vfsmount goes away, at which point each one's kill() function
> > is called. It thus is a clean mechanism for performing cleanup when a filesystem
> > goes away.
> >
> > This is used to close "BSD process accounting" files on umount, and sound
> > like the perfect interface for flushing things from the sunrpc cache on
> > umount.
>
> I have review those codes in fs/fs_pin.c and kernel/acct.c.
> 'fs_pin' is used to fix the race between file->f_path.mnt for acct
> and umount, file->f_path.mnt just holds the reference of vfsmount
> but not keep busy, umount will check the reference and return busy.
>
> The logical is same as sunrpc cache for exports.
>
> Before commit 3064c3563b ("death to mnt_pinned"), acct uses a hack
> method, acct get a pin to vfsmount and then put the reference,
> so umount finds no other reference and callback those pins in vfsmount,
> at last umount success.
>
> After that commit, besides pin to original vfsmount and put the reference
> of the original vfsmount, acct clone the vfsmount storing in file->f_path.mnt.
>
> The different between those two methods is the value of file->f_path.mnt,
> the first one, it contains the original vfsmnt without reference,
> the second one, contains a cloned vfsmnt with reference.
>
> I have test the first method, pins to vfsmount for all exports cache,
> nfsd can work but sometimes will delay or hang at rpc.mountd's calling
> of name_to_handle_at, I can't find the reason.
A kernel stack trace of exactly where it is hanging would help a lot.
>
> Also test the second method, there are many problem caused by the cloned
> vfsmount, mnt_clone_internal() change mnt_root values to the path dentry.
>
> Those cases are tested for all exports cache, but, I think nfsd should
> put the reference of vfsmount when finding exports upcall fail.
> The success exports cache should also take it.
>
> The following is a draft of the first method.
I think the first method sounds a lot better than the second.
>
> thanks,
> Kinglong Mee
>
> ----------------------------------------------------------------------
> diff --git a/fs/fs_pin.c b/fs/fs_pin.c
> index 611b540..553e8b1 100644
> --- a/fs/fs_pin.c
> +++ b/fs/fs_pin.c
> @@ -17,6 +17,7 @@ void pin_remove(struct fs_pin *pin)
> wake_up_locked(&pin->wait);
> spin_unlock_irq(&pin->wait.lock);
> }
> +EXPORT_SYMBOL(pin_remove);
>
> void pin_insert_group(struct fs_pin *pin, struct vfsmount *m, struct hlist_head *p)
> {
> @@ -26,11 +27,13 @@ void pin_insert_group(struct fs_pin *pin, struct vfsmount *m, struct hlist_head
> hlist_add_head(&pin->m_list, &real_mount(m)->mnt_pins);
> spin_unlock(&pin_lock);
> }
> +EXPORT_SYMBOL(pin_insert_group);
>
> void pin_insert(struct fs_pin *pin, struct vfsmount *m)
> {
> pin_insert_group(pin, m, &m->mnt_sb->s_pins);
> }
> +EXPORT_SYMBOL(pin_insert);
>
> void pin_kill(struct fs_pin *p)
> {
> diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
> index c3e3b6e..3e3df8c 100644
> --- a/fs/nfsd/export.c
> +++ b/fs/nfsd/export.c
> @@ -309,7 +309,7 @@ static void nfsd4_fslocs_free(struct nfsd4_fs_locations *fsloc)
> static void svc_export_put(struct kref *ref)
> {
> struct svc_export *exp = container_of(ref, struct svc_export, h.ref);
> - path_put(&exp->ex_path);
> + pin_remove(&exp->ex_pin);
> auth_domain_put(exp->ex_client);
> nfsd4_fslocs_free(&exp->ex_fslocs);
> kfree(exp->ex_uuid);
> @@ -695,15 +695,26 @@ static int svc_export_match(struct cache_head *a, struct cache_head *b)
> orig->ex_path.mnt == new->ex_path.mnt;
> }
>
> +static void export_pin_kill(struct fs_pin *pin)
> +{
> + struct svc_export *exp = container_of(pin, struct svc_export, ex_pin);
> +
> + write_lock(&exp->cd->hash_lock);
> + cache_mark_expired(&exp->h);
> + pin_remove(pin);
> + write_unlock(&exp->cd->hash_lock);
> +}
I think you need to add some sort of delay here. The svc_export entry might
still be in active use by an nfsd thread and you need to wait for that to
complete.
I think you need to wait for the refcnt to drop to '1'. Maybe create a
global wait_queue to wait for that.
Actually... svc_expkey contains a reference to a 'path' too, so you need to
use pinning to purge those when the filesystem is unmounted.
Oh.. and you don't want to call 'pin_remove' from export_pin_kill(). It is
called from svc_export_put() and calling from both could be problematic.
Hmmm... Ahhhh.
If you get export_pin_kill to only call cache_mark_expired() (maybe move the
locking into that function) and *not* call pin_remove(), then the pin will
remain on the list and ->done will be -1.
So mnt_pin_kill will loop around and this time pin_kill() will wait for
p->done to be set.
So we really need that thing to be removed from cache promptly. I don't
think we can wait for the next time cache_clean will be run. We could
call cache_flush, but I think I'd rather remove just that one entry.
Also.... this requires that the pin (and the struct containing it) be freed
using RCU.
So:
- add an rcu_head to svc_export and svc_expkey
- use rcu_kfree to free both those objects
- write a 'cache_force_expire()' which sets the expiry time, and then
removes the entry from the cache.
- Use pin_insert_group rather then mntget for both svc_export and svc_expkey
- have the 'kill' functions for both call cache_force_expire(), but *not*
pin_remove
> +
> static void svc_export_init(struct cache_head *cnew, struct cache_head *citem)
> {
> struct svc_export *new = container_of(cnew, struct svc_export, h);
> struct svc_export *item = container_of(citem, struct svc_export, h);
>
> + init_fs_pin(&new->ex_pin, export_pin_kill);
> kref_get(&item->ex_client->ref);
> new->ex_client = item->ex_client;
> new->ex_path = item->ex_path;
> - path_get(&item->ex_path);
> + pin_insert_group(&new->ex_pin, new->ex_path.mnt, NULL);
This doesn't look right. path_get does a 'mntget()' and a 'dget()'.
You are replacing 'mntget()' with 'pin_insert_group()', but I think you still
need the dget().
Similarly you need a dput() up where you removed path_put().
Thanks for working on this!
NeilBrown
> new->ex_fslocs.locations = NULL;
> new->ex_fslocs.locations_count = 0;
> new->ex_fslocs.migrated = 0;
> diff --git a/fs/nfsd/export.h b/fs/nfsd/export.h
> index 1f52bfc..718a27e 100644
> --- a/fs/nfsd/export.h
> +++ b/fs/nfsd/export.h
> @@ -4,6 +4,7 @@
> #ifndef NFSD_EXPORT_H
> #define NFSD_EXPORT_H
>
> +#include <linux/fs_pin.h>
> #include <linux/sunrpc/cache.h>
> #include <uapi/linux/nfsd/export.h>
>
> @@ -49,6 +50,7 @@ struct svc_export {
> struct auth_domain * ex_client;
> int ex_flags;
> struct path ex_path;
> + struct fs_pin ex_pin;
> kuid_t ex_anon_uid;
> kgid_t ex_anon_gid;
> int ex_fsid;
> diff --git a/include/linux/sunrpc/cache.h b/include/linux/sunrpc/cache.h
> index 437ddb6..6936684 100644
> --- a/include/linux/sunrpc/cache.h
> +++ b/include/linux/sunrpc/cache.h
> @@ -206,6 +206,11 @@ static inline int cache_is_expired(struct cache_detail *detail, struct cache_hea
> (detail->flush_time > h->last_refresh);
> }
>
> +static inline void cache_mark_expired(struct cache_head *h)
> +{
> + h->expiry_time = seconds_since_boot() - 1;
> +}
> +
> extern int cache_check(struct cache_detail *detail,
> struct cache_head *h, struct cache_req *rqstp);
> extern void cache_flush(void);
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]
next prev parent reply other threads:[~2015-04-24 3:00 UTC|newest]
Thread overview: 48+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-04-21 14:50 [PATCH RFC] NFSD: fix cannot umounting mount points under pseudo root Kinglong Mee
2015-04-21 21:54 ` J. Bruce Fields
2015-04-22 5:07 ` NeilBrown
2015-04-22 11:11 ` Kinglong Mee
2015-04-22 15:07 ` J. Bruce Fields
2015-04-22 23:44 ` NeilBrown
2015-04-23 12:52 ` Kinglong Mee
2015-04-24 3:00 ` NeilBrown [this message]
2015-04-27 12:11 ` Kinglong Mee
2015-04-29 2:57 ` NeilBrown
2015-04-29 8:45 ` Kinglong Mee
2015-04-29 19:19 ` J. Bruce Fields
2015-04-29 21:52 ` NeilBrown
2015-04-30 21:36 ` J. Bruce Fields
2015-05-01 1:53 ` NeilBrown
2015-05-01 2:03 ` Al Viro
2015-05-01 2:23 ` NeilBrown
2015-05-01 2:29 ` Al Viro
2015-05-01 3:08 ` NeilBrown
2015-05-01 13:29 ` J. Bruce Fields
2015-05-02 23:16 ` NeilBrown
2015-05-03 0:37 ` J. Bruce Fields
2015-05-04 4:11 ` NeilBrown
2015-05-04 21:48 ` J. Bruce Fields
2015-05-05 22:27 ` NeilBrown
2015-05-04 22:01 ` J. Bruce Fields
2015-05-05 13:54 ` Kinglong Mee
2015-05-05 14:18 ` J. Bruce Fields
2015-05-05 15:52 ` J. Bruce Fields
2015-05-05 22:26 ` NeilBrown
2015-05-08 16:15 ` J. Bruce Fields
2015-05-08 20:01 ` [PATCH] nfsd: don't hold i_mutex over userspace upcalls J. Bruce Fields
2015-06-03 15:18 ` J. Bruce Fields
2015-07-05 11:27 ` Kinglong Mee
2015-07-06 18:22 ` J. Bruce Fields
2015-08-18 19:10 ` J. Bruce Fields
2015-11-12 21:22 ` J. Bruce Fields
2015-05-07 15:31 ` [PATCH RFC] NFSD: fix cannot umounting mount points under pseudo root J. Bruce Fields
2015-05-07 22:42 ` NeilBrown
2015-05-08 14:10 ` J. Bruce Fields
2015-05-05 3:53 ` Kinglong Mee
2015-05-05 4:19 ` NeilBrown
2015-05-05 8:32 ` Kinglong Mee
2015-05-05 13:52 ` J. Bruce Fields
2015-06-26 23:14 ` Kinglong Mee
2015-06-26 23:35 ` NeilBrown
2015-07-02 9:42 ` Kinglong Mee
2015-05-01 1:55 ` Al Viro
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20150424130045.6bbdb2f9@notabene.brown \
--to=neilb@suse.de \
--cc=bfields@fieldses.org \
--cc=kinglongmee@gmail.com \
--cc=linux-nfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).