From: Kinglong Mee <kinglongmee@gmail.com>
To: NeilBrown <neilb@suse.de>
Cc: "J. Bruce Fields" <bfields@fieldses.org>,
"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
kinglongmee@gmail.com
Subject: Re: [PATCH RFC] NFSD: fix cannot umounting mount points under pseudo root
Date: Mon, 27 Apr 2015 20:11:48 +0800 [thread overview]
Message-ID: <553E2784.6020906@gmail.com> (raw)
In-Reply-To: <20150424130045.6bbdb2f9@notabene.brown>
On 4/24/2015 11:00 AM, NeilBrown wrote:
> On Thu, 23 Apr 2015 20:52:40 +0800 Kinglong Mee <kinglongmee@gmail.com> wrote:
>
>> On 4/23/2015 7:44 AM, NeilBrown wrote:
>>> On Wed, 22 Apr 2015 11:07:03 -0400 "J. Bruce Fields" <bfields@fieldses.org>
>>> wrote:
>>>>> The dentry/mnt references act like a cache that avoids re-finding
>>>>> them, so it is necessary to store them in svc_export.
>>>>>
>>>>> Neil pointed out another approach, 'fs_pin'; I will check that.
>>>>
>>>> Yes, that'd be interesting. On a very quick look--I don't understand
>>>> what it's meant to do at all. But if it does provide a way to get a
>>>> callback on umount, that'd certainly be interesting....
>>>
>>> Yeah, on a quick look it isn't really obvious at all.
>>>
>>> But I didn't read the code. I read
>>> https://lwn.net/Articles/636730/
>>>
>>> which says:
>>>
>>> In its place is a mechanism by which an object can be added to a vfsmount
>>> structure (which represents a mounted filesystem); that object supports
>>> only one method: kill(). These "pin" objects hang around until the final
>>> reference to the vfsmount goes away, at which point each one's kill() function
>>> is called. It thus is a clean mechanism for performing cleanup when a filesystem
>>> goes away.
>>>
>>> This is used to close "BSD process accounting" files on umount, and sounds
>>> like the perfect interface for flushing things from the sunrpc cache on
>>> umount.
>>
>> I have reviewed the code in fs/fs_pin.c and kernel/acct.c.
>> 'fs_pin' is used to fix the conflict between acct's file->f_path.mnt
>> and umount: file->f_path.mnt holds a reference on the vfsmount without
>> meaning to keep it busy, yet umount checks the reference count and
>> returns EBUSY.
>>
>> The logic is the same for the sunrpc cache of exports.
>>
>> Before commit 3064c3563b ("death to mnt_pinned"), acct used a hack:
>> it took a pin on the vfsmount and then dropped its reference, so
>> umount found no other references, invoked the pins' callbacks on the
>> vfsmount, and finally succeeded.
>>
>> After that commit, besides pinning the original vfsmount and dropping
>> its reference, acct clones the vfsmount and stores the clone in
>> file->f_path.mnt.
>>
>> The difference between the two methods is the value of file->f_path.mnt:
>> with the first it contains the original vfsmount without a reference;
>> with the second, a cloned vfsmount with a reference.
>>
>> I have tested the first method, pinning the vfsmount for all cached
>> exports. nfsd works, but rpc.mountd's call to name_to_handle_at()
>> sometimes delays or hangs, and I could not find the reason.
>
> A kernel stack trace of exactly where it is hanging would help a lot.
With fs_pin added to the exports, it seems there is a mutex deadlock
between nfsd and rpc.mountd over the parent inode's i_mutex.
Apr 27 19:57:03 ntest kernel: rpc.mountd D ffff88006ac5fc28 0 1152 1 0x00000000
Apr 27 19:57:03 ntest kernel: ffff88006ac5fc28 ffff880068c38970 ffff880068c3de60 ffff88006ac5fc08
Apr 27 19:57:03 ntest kernel: ffff88006ac60000 ffff880035ecf694 ffff880068c3de60 00000000ffffffff
Apr 27 19:57:03 ntest kernel: ffff880035ecf698 ffff88006ac5fc48 ffffffff8177ffd7 0000000000000000
Apr 27 19:57:03 ntest kernel: Call Trace:
Apr 27 19:57:03 ntest kernel: [<ffffffff8177ffd7>] schedule+0x37/0x90
Apr 27 19:57:03 ntest kernel: [<ffffffff8178031e>] schedule_preempt_disabled+0xe/0x10
Apr 27 19:57:03 ntest kernel: [<ffffffff81781e92>] __mutex_lock_slowpath+0xb2/0x120
Apr 27 19:57:03 ntest kernel: [<ffffffff81781f23>] mutex_lock+0x23/0x40
Apr 27 19:57:03 ntest kernel: [<ffffffff81230714>] lookup_slow+0x34/0xc0
Apr 27 19:57:03 ntest kernel: [<ffffffff812358ee>] path_lookupat+0x89e/0xc60
Apr 27 19:57:03 ntest kernel: [<ffffffff81205192>] ? kmem_cache_alloc+0x1e2/0x260
Apr 27 19:57:03 ntest kernel: [<ffffffff812367a6>] ? getname_flags+0x56/0x200
Apr 27 19:57:03 ntest kernel: [<ffffffff81235cd7>] filename_lookup+0x27/0xc0
Apr 27 19:57:03 ntest kernel: [<ffffffff81237b53>] user_path_at_empty+0x63/0xd0
Apr 27 19:57:03 ntest kernel: [<ffffffff8121d722>] ? put_object+0x32/0x60
Apr 27 19:57:03 ntest kernel: [<ffffffff8121d94d>] ? delete_object_full+0x2d/0x40
Apr 27 19:57:03 ntest kernel: [<ffffffff8123e273>] ? dput+0x33/0x230
Apr 27 19:57:03 ntest kernel: [<ffffffff81237bd1>] user_path_at+0x11/0x20
Apr 27 19:57:03 ntest kernel: [<ffffffff81286489>] SyS_name_to_handle_at+0x59/0x200
Apr 27 19:57:03 ntest kernel: [<ffffffff81783f6e>] system_call_fastpath+0x12/0x71
Apr 27 19:57:03 ntest kernel: nfsd S ffff88006e92b708 0 1170 2 0x00000000
Apr 27 19:57:03 ntest kernel: ffff88006e92b708 ffffffff81c12480 ffff88006acbcb80 ffff88006e92b758
Apr 27 19:57:03 ntest kernel: ffff88006e92c000 ffffffff82290040 ffff88006e92b758 ffffffff82290040
Apr 27 19:57:03 ntest kernel: 00000000fffffff5 ffff88006e92b728 ffffffff8177ffd7 0000000100043a09
Apr 27 19:57:03 ntest kernel: Call Trace:
Apr 27 19:57:03 ntest kernel: [<ffffffff8177ffd7>] schedule+0x37/0x90
Apr 27 19:57:03 ntest kernel: [<ffffffff81782e97>] schedule_timeout+0x117/0x230
Apr 27 19:57:03 ntest kernel: [<ffffffff8110b240>] ? internal_add_timer+0xb0/0xb0
Apr 27 19:57:03 ntest kernel: [<ffffffff81780ff3>] wait_for_completion_interruptible_timeout+0xf3/0x150
Apr 27 19:57:03 ntest kernel: [<ffffffff810cb090>] ? wake_up_state+0x20/0x20
Apr 27 19:57:03 ntest kernel: [<ffffffffa01350cc>] cache_wait_req.isra.10+0x9c/0x110 [sunrpc]
Apr 27 19:57:03 ntest kernel: [<ffffffffa0134760>] ? sunrpc_init_cache_detail+0xc0/0xc0 [sunrpc]
Apr 27 19:57:03 ntest kernel: [<ffffffffa0135c54>] cache_check+0x1d4/0x380 [sunrpc]
Apr 27 19:57:03 ntest kernel: [<ffffffff8133248c>] ? inode_doinit_with_dentry+0x48c/0x6a0
Apr 27 19:57:03 ntest kernel: [<ffffffffa045e132>] exp_get_by_name+0x82/0xb0 [nfsd]
Apr 27 19:57:03 ntest kernel: [<ffffffff81776d0a>] ? kmemleak_free+0x3a/0xa0
Apr 27 19:57:03 ntest kernel: [<ffffffff81332210>] ? inode_doinit_with_dentry+0x210/0x6a0
Apr 27 19:57:03 ntest kernel: [<ffffffff813326bc>] ? selinux_d_instantiate+0x1c/0x20
Apr 27 19:57:03 ntest kernel: [<ffffffff8123def7>] ? _d_rehash+0x37/0x40
Apr 27 19:57:03 ntest kernel: [<ffffffff8123f776>] ? d_splice_alias+0xa6/0x2d0
Apr 27 19:57:03 ntest kernel: [<ffffffff812be90b>] ? ext4_lookup+0xdb/0x160
Apr 27 19:57:03 ntest kernel: [<ffffffffa0460114>] rqst_exp_get_by_name+0x64/0x140 [nfsd]
Apr 27 19:57:03 ntest kernel: [<ffffffffa045a8f6>] nfsd_cross_mnt+0x76/0x1b0 [nfsd]
Apr 27 19:57:03 ntest kernel: [<ffffffffa0475300>] nfsd4_encode_dirent+0x160/0x3d0 [nfsd]
Apr 27 19:57:03 ntest kernel: [<ffffffffa04751a0>] ? nfsd4_encode_getattr+0x40/0x40 [nfsd]
Apr 27 19:57:03 ntest kernel: [<ffffffffa045c631>] nfsd_readdir+0x1c1/0x2a0 [nfsd]
Apr 27 19:57:03 ntest kernel: [<ffffffffa045a1e0>] ? nfsd_direct_splice_actor+0x20/0x20 [nfsd]
Apr 27 19:57:03 ntest kernel: [<ffffffffa046bd70>] nfsd4_encode_readdir+0x120/0x220 [nfsd]
Apr 27 19:57:03 ntest kernel: [<ffffffffa047573d>] nfsd4_encode_operation+0x7d/0x190 [nfsd]
Apr 27 19:57:03 ntest kernel: [<ffffffffa046a51d>] nfsd4_proc_compound+0x24d/0x6f0 [nfsd]
Apr 27 19:57:03 ntest kernel: [<ffffffffa0455f83>] nfsd_dispatch+0xc3/0x220 [nfsd]
Apr 27 19:57:03 ntest kernel: [<ffffffffa012a01b>] svc_process_common+0x43b/0x690 [sunrpc]
Apr 27 19:57:03 ntest kernel: [<ffffffffa012b133>] svc_process+0x103/0x1b0 [sunrpc]
Apr 27 19:57:03 ntest kernel: [<ffffffffa045596f>] nfsd+0xff/0x170 [nfsd]
Apr 27 19:57:03 ntest kernel: [<ffffffffa0455870>] ? nfsd_destroy+0x80/0x80 [nfsd]
Apr 27 19:57:03 ntest kernel: [<ffffffff810bd378>] kthread+0xd8/0xf0
Apr 27 19:57:03 ntest kernel: [<ffffffff810bd2a0>] ? kthread_worker_fn+0x180/0x180
Apr 27 19:57:03 ntest kernel: [<ffffffff81784362>] ret_from_fork+0x42/0x70
Apr 27 19:57:03 ntest kernel: [<ffffffff810bd2a0>] ? kthread_worker_fn+0x180/0x180
>>
>> I also tested the second method; the cloned vfsmount causes many
>> problems, because mnt_clone_internal() changes mnt_root to the
>> path's dentry.
>>
>> Both cases were tested with all exports cached, but I think nfsd
>> should drop the vfsmount reference when the exports upcall fails.
>> A successful exports cache entry should take the reference too.
>>
>> The following is a draft of the first method.
>
> I think the first method sounds a lot better than the second.
>
>>
>> thanks,
>> Kinglong Mee
>>
>> ----------------------------------------------------------------------
>> diff --git a/fs/fs_pin.c b/fs/fs_pin.c
>> index 611b540..553e8b1 100644
>> --- a/fs/fs_pin.c
>> +++ b/fs/fs_pin.c
>> @@ -17,6 +17,7 @@ void pin_remove(struct fs_pin *pin)
>> wake_up_locked(&pin->wait);
>> spin_unlock_irq(&pin->wait.lock);
>> }
>> +EXPORT_SYMBOL(pin_remove);
>>
>> void pin_insert_group(struct fs_pin *pin, struct vfsmount *m, struct hlist_head *p)
>> {
>> @@ -26,11 +27,13 @@ void pin_insert_group(struct fs_pin *pin, struct vfsmount *m, struct hlist_head
>> hlist_add_head(&pin->m_list, &real_mount(m)->mnt_pins);
>> spin_unlock(&pin_lock);
>> }
>> +EXPORT_SYMBOL(pin_insert_group);
>>
>> void pin_insert(struct fs_pin *pin, struct vfsmount *m)
>> {
>> pin_insert_group(pin, m, &m->mnt_sb->s_pins);
>> }
>> +EXPORT_SYMBOL(pin_insert);
>>
>> void pin_kill(struct fs_pin *p)
>> {
>> diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
>> index c3e3b6e..3e3df8c 100644
>> --- a/fs/nfsd/export.c
>> +++ b/fs/nfsd/export.c
>> @@ -309,7 +309,7 @@ static void nfsd4_fslocs_free(struct nfsd4_fs_locations *fsloc)
>> static void svc_export_put(struct kref *ref)
>> {
>> struct svc_export *exp = container_of(ref, struct svc_export, h.ref);
>> - path_put(&exp->ex_path);
>> + pin_remove(&exp->ex_pin);
>> auth_domain_put(exp->ex_client);
>> nfsd4_fslocs_free(&exp->ex_fslocs);
>> kfree(exp->ex_uuid);
>> @@ -695,15 +695,26 @@ static int svc_export_match(struct cache_head *a, struct cache_head *b)
>> orig->ex_path.mnt == new->ex_path.mnt;
>> }
>>
>> +static void export_pin_kill(struct fs_pin *pin)
>> +{
>> + struct svc_export *exp = container_of(pin, struct svc_export, ex_pin);
>> +
>> + write_lock(&exp->cd->hash_lock);
>> + cache_mark_expired(&exp->h);
>> + pin_remove(pin);
>> + write_unlock(&exp->cd->hash_lock);
>> +}
>
> I think you need to add some sort of delay here. The svc_export entry might
> still be in active use by an nfsd thread and you need to wait for that to
> complete.
>
> I think you need to wait for the refcnt to drop to '1'. Maybe create a
> global wait_queue to wait for that.
>
> Actually... svc_expkey contains a reference to a 'path' too, so you need to
> use pinning to purge those when the filesystem is unmounted.
>
> Oh.. and you don't want to call 'pin_remove' from export_pin_kill(). It is
> called from svc_export_put() and calling from both could be problematic.
>
> Hmmm... Ahhhh.
>
> If you get export_pin_kill to only call cache_mark_expired() (maybe move the
> locking into that function) and *not* call pin_remove(), then the pin will
> remain on the list and ->done will be -1.
> So mnt_pin_kill will loop around and this time pin_kill() will wait for
> p->done to be set.
> So we really need that thing to be removed from cache promptly. I don't
> think we can wait for the next time cache_clean will be run. We could
> call cache_flush, but I think I'd rather remove just that one entry.
>
> Also.... this requires that the pin (and the struct containing it) be freed
> using RCU.
>
> So:
> - add an rcu_head to svc_export and svc_expkey
> - use rcu_kfree to free both those objects
> - write a 'cache_force_expire()' which sets the expiry time, and then
> removes the entry from the cache.
> - Use pin_insert_group rather than mntget for both svc_export and svc_expkey
> - have the 'kill' functions for both call cache_force_expire(), but *not*
> pin_remove
Thanks very much for your detailed comments.
I will send a new patch after fixing the above race and running basic tests.
thanks,
Kinglong Mee
>
>> +
>> static void svc_export_init(struct cache_head *cnew, struct cache_head *citem)
>> {
>> struct svc_export *new = container_of(cnew, struct svc_export, h);
>> struct svc_export *item = container_of(citem, struct svc_export, h);
>>
>> + init_fs_pin(&new->ex_pin, export_pin_kill);
>> kref_get(&item->ex_client->ref);
>> new->ex_client = item->ex_client;
>> new->ex_path = item->ex_path;
>> - path_get(&item->ex_path);
>> + pin_insert_group(&new->ex_pin, new->ex_path.mnt, NULL);
>
> This doesn't look right. path_get does a 'mntget()' and a 'dget()'.
> You are replacing 'mntget()' with 'pin_insert_group()', but I think you still
> need the dget().
>
> Similarly you need a dput() up where you removed path_put().
>
>
> Thanks for working on this!
>
> NeilBrown
>
>
>
>> new->ex_fslocs.locations = NULL;
>> new->ex_fslocs.locations_count = 0;
>> new->ex_fslocs.migrated = 0;
>> diff --git a/fs/nfsd/export.h b/fs/nfsd/export.h
>> index 1f52bfc..718a27e 100644
>> --- a/fs/nfsd/export.h
>> +++ b/fs/nfsd/export.h
>> @@ -4,6 +4,7 @@
>> #ifndef NFSD_EXPORT_H
>> #define NFSD_EXPORT_H
>>
>> +#include <linux/fs_pin.h>
>> #include <linux/sunrpc/cache.h>
>> #include <uapi/linux/nfsd/export.h>
>>
>> @@ -49,6 +50,7 @@ struct svc_export {
>> struct auth_domain * ex_client;
>> int ex_flags;
>> struct path ex_path;
>> + struct fs_pin ex_pin;
>> kuid_t ex_anon_uid;
>> kgid_t ex_anon_gid;
>> int ex_fsid;
>> diff --git a/include/linux/sunrpc/cache.h b/include/linux/sunrpc/cache.h
>> index 437ddb6..6936684 100644
>> --- a/include/linux/sunrpc/cache.h
>> +++ b/include/linux/sunrpc/cache.h
>> @@ -206,6 +206,11 @@ static inline int cache_is_expired(struct cache_detail *detail, struct cache_hea
>> (detail->flush_time > h->last_refresh);
>> }
>>
>> +static inline void cache_mark_expired(struct cache_head *h)
>> +{
>> + h->expiry_time = seconds_since_boot() - 1;
>> +}
>> +
>> extern int cache_check(struct cache_detail *detail,
>> struct cache_head *h, struct cache_req *rqstp);
>> extern void cache_flush(void);
>>
>