linux-nfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Kinglong Mee <kinglongmee@gmail.com>
To: NeilBrown <neilb@suse.de>
Cc: "J. Bruce Fields" <bfields@fieldses.org>,
	"linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
	kinglongmee@gmail.com
Subject: Re: [PATCH RFC] NFSD: fix cannot umounting mount points under pseudo root
Date: Wed, 29 Apr 2015 16:45:16 +0800	[thread overview]
Message-ID: <55409A1C.60604@gmail.com> (raw)
In-Reply-To: <20150429125728.69ddfc6c@notabene.brown>

On 4/29/2015 10:57 AM, NeilBrown wrote:
> On Mon, 27 Apr 2015 20:11:48 +0800 Kinglong Mee <kinglongmee@gmail.com> wrote:
> 
>> On 4/24/2015 11:00 AM, NeilBrown wrote:
>>> On Thu, 23 Apr 2015 20:52:40 +0800 Kinglong Mee <kinglongmee@gmail.com> wrote:
>>>
>>>> On 4/23/2015 7:44 AM, NeilBrown wrote:
>>>>> On Wed, 22 Apr 2015 11:07:03 -0400 "J. Bruce Fields" <bfields@fieldses.org>
>>>>> wrote:
>>>>>>> Reference of dentry/mnt is like a cache, avoids re-finding of them,
>>>>>>> it is necessary to store them in svc_export.
>>>>>>>
>>>>>>> Neil points out another way of 'fs_pin', I will check that.
>>>>>>
>>>>>> Yes, that'd be interesting.  On a very quick look--I don't understand
>>>>>> what it's meant to do at all.  But if it does provide a way to get a
>>>>>> callback on umount, that'd certainly be interesting....
>>>>>
>>>>> Yeah, on a quick look it isn't really obvious at all.
>>>>>
>>>>> But I didn't read the code.  I read
>>>>>  https://lwn.net/Articles/636730/
>>>>>
>>>>> which says:
>>>>>
>>>>>     In its place is a mechanism by which an object can be added to a vfsmount
>>>>>     structure (which represents a mounted filesystem); that object supports 
>>>>>     only one method: kill(). These "pin" objects hang around until the final
>>>>>     reference to the vfsmount goes away, at which point each one's kill() function
>>>>>     is called. It thus is a clean mechanism for performing cleanup when a filesystem
>>>>>     goes away.
>>>>>
>>>>> This is used to close "BSD process accounting" files on umount, and sound
>>>>> like the perfect interface for flushing things from the sunrpc cache on
>>>>> umount.
>>>>
>>>> I have review those codes in fs/fs_pin.c and kernel/acct.c.
>>>> 'fs_pin' is used to fix the race between file->f_path.mnt for acct
>>>> and umount, file->f_path.mnt just holds the reference of vfsmount
>>>> but not keep busy, umount will check the reference and return busy.
>>>>
>>>> The logical is same as sunrpc cache for exports.
>>>>
>>>> Before commit 3064c3563b ("death to mnt_pinned"), acct uses a hack
>>>> method, acct get a pin to vfsmount and then put the reference,
>>>> so umount finds no other reference and callback those pins in vfsmount,
>>>> at last umount success.
>>>>
>>>> After that commit, besides pin to original vfsmount and put the reference
>>>> of the original vfsmount, acct clone the vfsmount storing in file->f_path.mnt.
>>>>
>>>> The different between those two methods is the value of file->f_path.mnt,
>>>> the first one, it contains the original vfsmnt without reference,
>>>> the second one, contains a cloned vfsmnt with reference.
>>>>
>>>> I have test the first method, pins to vfsmount for all exports cache,
>>>> nfsd can work but sometimes will delay or hang at rpc.mountd's calling
>>>> of name_to_handle_at, I can't find the reason.
>>>
>>> A kernel stack trace of exactly where it is hanging would help a lot.
>>
>> Witch adding fs_pin to exports, it seems there is a mutex race between
>> nfsd and rpc.mountd for locking parent inode.
>>
>> Apr 27 19:57:03 ntest kernel: rpc.mountd      D ffff88006ac5fc28     0  1152      1 0x00000000
>> Apr 27 19:57:03 ntest kernel: ffff88006ac5fc28 ffff880068c38970 ffff880068c3de60 ffff88006ac5fc08
>> Apr 27 19:57:03 ntest kernel: ffff88006ac60000 ffff880035ecf694 ffff880068c3de60 00000000ffffffff
>> Apr 27 19:57:03 ntest kernel: ffff880035ecf698 ffff88006ac5fc48 ffffffff8177ffd7 0000000000000000
>> Apr 27 19:57:03 ntest kernel: Call Trace:
>> Apr 27 19:57:03 ntest kernel: [<ffffffff8177ffd7>] schedule+0x37/0x90
>> Apr 27 19:57:03 ntest kernel: [<ffffffff8178031e>] schedule_preempt_disabled+0xe/0x10
>> Apr 27 19:57:03 ntest kernel: [<ffffffff81781e92>] __mutex_lock_slowpath+0xb2/0x120
>> Apr 27 19:57:03 ntest kernel: [<ffffffff81781f23>] mutex_lock+0x23/0x40
>> Apr 27 19:57:03 ntest kernel: [<ffffffff81230714>] lookup_slow+0x34/0xc0
>> Apr 27 19:57:03 ntest kernel: [<ffffffff812358ee>] path_lookupat+0x89e/0xc60
>> Apr 27 19:57:03 ntest kernel: [<ffffffff81205192>] ? kmem_cache_alloc+0x1e2/0x260
>> Apr 27 19:57:03 ntest kernel: [<ffffffff812367a6>] ? getname_flags+0x56/0x200
>> Apr 27 19:57:03 ntest kernel: [<ffffffff81235cd7>] filename_lookup+0x27/0xc0
>> Apr 27 19:57:03 ntest kernel: [<ffffffff81237b53>] user_path_at_empty+0x63/0xd0
>> Apr 27 19:57:03 ntest kernel: [<ffffffff8121d722>] ? put_object+0x32/0x60
>> Apr 27 19:57:03 ntest kernel: [<ffffffff8121d94d>] ? delete_object_full+0x2d/0x40
>> Apr 27 19:57:03 ntest kernel: [<ffffffff8123e273>] ? dput+0x33/0x230
>> Apr 27 19:57:03 ntest kernel: [<ffffffff81237bd1>] user_path_at+0x11/0x20
>> Apr 27 19:57:03 ntest kernel: [<ffffffff81286489>] SyS_name_to_handle_at+0x59/0x200
>> Apr 27 19:57:03 ntest kernel: [<ffffffff81783f6e>] system_call_fastpath+0x12/0x71
>> Apr 27 19:57:03 ntest kernel: nfsd            S ffff88006e92b708     0  1170      2 0x00000000
>> Apr 27 19:57:03 ntest kernel: ffff88006e92b708 ffffffff81c12480 ffff88006acbcb80 ffff88006e92b758
>> Apr 27 19:57:03 ntest kernel: ffff88006e92c000 ffffffff82290040 ffff88006e92b758 ffffffff82290040
>> Apr 27 19:57:03 ntest kernel: 00000000fffffff5 ffff88006e92b728 ffffffff8177ffd7 0000000100043a09
>> Apr 27 19:57:03 ntest kernel: Call Trace:
>> Apr 27 19:57:03 ntest kernel: [<ffffffff8177ffd7>] schedule+0x37/0x90
>> Apr 27 19:57:03 ntest kernel: [<ffffffff81782e97>] schedule_timeout+0x117/0x230
>> Apr 27 19:57:03 ntest kernel: [<ffffffff8110b240>] ? internal_add_timer+0xb0/0xb0
>> Apr 27 19:57:03 ntest kernel: [<ffffffff81780ff3>] wait_for_completion_interruptible_timeout+0xf3/0x150
>> Apr 27 19:57:03 ntest kernel: [<ffffffff810cb090>] ? wake_up_state+0x20/0x20
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa01350cc>] cache_wait_req.isra.10+0x9c/0x110 [sunrpc]
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa0134760>] ? sunrpc_init_cache_detail+0xc0/0xc0 [sunrpc]
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa0135c54>] cache_check+0x1d4/0x380 [sunrpc]
>> Apr 27 19:57:03 ntest kernel: [<ffffffff8133248c>] ? inode_doinit_with_dentry+0x48c/0x6a0
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa045e132>] exp_get_by_name+0x82/0xb0 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [<ffffffff81776d0a>] ? kmemleak_free+0x3a/0xa0
>> Apr 27 19:57:03 ntest kernel: [<ffffffff81332210>] ? inode_doinit_with_dentry+0x210/0x6a0
>> Apr 27 19:57:03 ntest kernel: [<ffffffff813326bc>] ? selinux_d_instantiate+0x1c/0x20
>> Apr 27 19:57:03 ntest kernel: [<ffffffff8123def7>] ? _d_rehash+0x37/0x40
>> Apr 27 19:57:03 ntest kernel: [<ffffffff8123f776>] ? d_splice_alias+0xa6/0x2d0
>> Apr 27 19:57:03 ntest kernel: [<ffffffff812be90b>] ? ext4_lookup+0xdb/0x160
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa0460114>] rqst_exp_get_by_name+0x64/0x140 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa045a8f6>] nfsd_cross_mnt+0x76/0x1b0 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa0475300>] nfsd4_encode_dirent+0x160/0x3d0 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa04751a0>] ? nfsd4_encode_getattr+0x40/0x40 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa045c631>] nfsd_readdir+0x1c1/0x2a0 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa045a1e0>] ? nfsd_direct_splice_actor+0x20/0x20 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa046bd70>] nfsd4_encode_readdir+0x120/0x220 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa047573d>] nfsd4_encode_operation+0x7d/0x190 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa046a51d>] nfsd4_proc_compound+0x24d/0x6f0 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa0455f83>] nfsd_dispatch+0xc3/0x220 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa012a01b>] svc_process_common+0x43b/0x690 [sunrpc]
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa012b133>] svc_process+0x103/0x1b0 [sunrpc]
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa045596f>] nfsd+0xff/0x170 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [<ffffffffa0455870>] ? nfsd_destroy+0x80/0x80 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [<ffffffff810bd378>] kthread+0xd8/0xf0
>> Apr 27 19:57:03 ntest kernel: [<ffffffff810bd2a0>] ? kthread_worker_fn+0x180/0x180
>> Apr 27 19:57:03 ntest kernel: [<ffffffff81784362>] ret_from_fork+0x42/0x70
>> Apr 27 19:57:03 ntest kernel: [<ffffffff810bd2a0>] ? kthread_worker_fn+0x180/0x180
>>
> 
> Hmm... That looks bad.  I wonder why we haven't hit the deadlock before.

It appears after nfs-utils commit 6091c0a4c4 
("mountd: add support for case-insensitive file names")
adds using name_to_handle_at which will locking parent.

> 
> nfsd gets a LOOKUP request for $DIR/name which is a mountpoint, or a READDIR
> request for $DIR which contains 'name' which is a mountpoint.
> 
> nfsd_lookup_dentry of nfsd_buffered_readdir lock i_mutex on $DIR, then do
> the mountpoint check.  If the mounted filesystem is not exported, but the $DIR
> filesystem has 'crossmnt', an upcall to mountd asks for export information.
> When mountd try to do a lookup on the name, it works fine if lookup_fast finds
> the child, but it deadlocks on the parent's i_mutex if it has to go to
> lookup_slow.

I have send a patch for it,
"[PATCH] NFSD: Avoid race of locking parent's mutex at cross mount"

> 
> It should only go to lookup_slow() if:
>   - the child isn't in cache .. but as it is mounted on, it must be

Yes, it is.
If after some operations of those files, they are cached,
the race will not appear.

>   - d_revalidate returns 0
> 
> This probably means that $DIR is on an NFS filesystem and it failed the
> revalidation for some reason....

It is caused by readdir of pseudo root, there are many unexported files in it.

> 
> Kinglong: can you reproduce this easily?  If so can you enable nfs debugging 
> and collect the output?
> I want to confirm that the dfprintk in nfs_lookup_revalidate is saying that
> the name is invalid.

Yes, I will send the reproduce method in the other threads, add cc you.

thanks,
Kinglong Mee


  reply	other threads:[~2015-04-29  8:45 UTC|newest]

Thread overview: 48+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-04-21 14:50 [PATCH RFC] NFSD: fix cannot umounting mount points under pseudo root Kinglong Mee
2015-04-21 21:54 ` J. Bruce Fields
2015-04-22  5:07   ` NeilBrown
2015-04-22 11:11   ` Kinglong Mee
2015-04-22 15:07     ` J. Bruce Fields
2015-04-22 23:44       ` NeilBrown
2015-04-23 12:52         ` Kinglong Mee
2015-04-24  3:00           ` NeilBrown
2015-04-27 12:11             ` Kinglong Mee
2015-04-29  2:57               ` NeilBrown
2015-04-29  8:45                 ` Kinglong Mee [this message]
2015-04-29 19:19                 ` J. Bruce Fields
2015-04-29 21:52                   ` NeilBrown
2015-04-30 21:36                     ` J. Bruce Fields
2015-05-01  1:53                       ` NeilBrown
2015-05-01  2:03                         ` Al Viro
2015-05-01  2:23                           ` NeilBrown
2015-05-01  2:29                             ` Al Viro
2015-05-01  3:08                               ` NeilBrown
2015-05-01 13:29                                 ` J. Bruce Fields
2015-05-02 23:16                                   ` NeilBrown
2015-05-03  0:37                                     ` J. Bruce Fields
2015-05-04  4:11                                       ` NeilBrown
2015-05-04 21:48                                     ` J. Bruce Fields
2015-05-05 22:27                                       ` NeilBrown
2015-05-04 22:01                         ` J. Bruce Fields
2015-05-05 13:54                           ` Kinglong Mee
2015-05-05 14:18                             ` J. Bruce Fields
2015-05-05 15:52                               ` J. Bruce Fields
2015-05-05 22:26                                 ` NeilBrown
2015-05-08 16:15                                   ` J. Bruce Fields
2015-05-08 20:01                                     ` [PATCH] nfsd: don't hold i_mutex over userspace upcalls J. Bruce Fields
2015-06-03 15:18                                       ` J. Bruce Fields
2015-07-05 11:27                                         ` Kinglong Mee
2015-07-06 18:22                                           ` J. Bruce Fields
2015-08-18 19:10                                         ` J. Bruce Fields
2015-11-12 21:22                                           ` J. Bruce Fields
2015-05-07 15:31                                 ` [PATCH RFC] NFSD: fix cannot umounting mount points under pseudo root J. Bruce Fields
2015-05-07 22:42                                   ` NeilBrown
2015-05-08 14:10                                     ` J. Bruce Fields
2015-05-05  3:53                       ` Kinglong Mee
2015-05-05  4:19                         ` NeilBrown
2015-05-05  8:32                           ` Kinglong Mee
2015-05-05 13:52                             ` J. Bruce Fields
2015-06-26 23:14                             ` Kinglong Mee
2015-06-26 23:35                               ` NeilBrown
2015-07-02  9:42                                 ` Kinglong Mee
2015-05-01  1:55                     ` Al Viro

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=55409A1C.60604@gmail.com \
    --to=kinglongmee@gmail.com \
    --cc=bfields@fieldses.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=neilb@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).