linux-nfs.vger.kernel.org archive mirror
From: Jeff Layton <jlayton@poochiereds.net>
To: Oleg Drokin <green@linuxhacker.ru>
Cc: "J. Bruce Fields" <bfields@fieldses.org>,
	linux-nfs@vger.kernel.org,
	"<linux-kernel@vger.kernel.org> Mailing List"
	<linux-kernel@vger.kernel.org>
Subject: Re: Files leak from nfsd in 4.7.1-rc1 (and more?)
Date: Tue, 07 Jun 2016 20:03:25 -0400	[thread overview]
Message-ID: <1465344205.3024.42.camel@poochiereds.net> (raw)
In-Reply-To: <752F7196-1EE7-4FB3-8769-177131C8A793@linuxhacker.ru>

On Tue, 2016-06-07 at 19:39 -0400, Oleg Drokin wrote:
> On Jun 7, 2016, at 4:04 PM, Jeff Layton wrote:
> 
> > 
> > On Tue, 2016-06-07 at 13:30 -0400, Oleg Drokin wrote:
> > > 
> > > On Jun 7, 2016, at 1:10 PM, Jeff Layton wrote:
> > > 
> > > > 
> > > > 
> > > > On Tue, 2016-06-07 at 11:37 -0400, Oleg Drokin wrote:
> > > > > 
> > > > > 
> > > > > Hello!
> > > > > 
> > > > >    I've been trying to better understand this problem I was having where sometimes
> > > > >    a formerly NFS-exported mountpoint becomes unmountable (after nfsd stop).
> > > > > 
> > > > >    I finally traced it to a leaked file descriptor that was allocated from
> > > > >    nfsd4_open()->nfsd4_process_open2()->nfs4_get_vfs_file()->nfsd_open().
> > > > > 
> > > > >    Together with it, we also see leaked credentials allocated along the same path from
> > > > >    fh_verify() and groups allocated from svcauth_unix_accept()->groups_alloc() that
> > > > >    are presumably used by the credentials.
> > > > > 
> > > > >    Unfortunately I was not able to make total sense out of the state handling in nfsd,
> > > > >    but it's clear that one of the file descriptors inside struct nfs4_file is
> > > > >    lost. I added a patch like this (always a good idea; I'm surprised it was not
> > > > >    there already):
> > > > > @@ -271,6 +274,9 @@ static void nfsd4_free_file_rcu(struct rcu_head *rcu)
> > > > >  {
> > > > >         struct nfs4_file *fp = container_of(rcu, struct nfs4_file, fi_rcu);
> > > > >  
> > > > > +       WARN_ON(fp->fi_fds[0]);
> > > > > +       WARN_ON(fp->fi_fds[1]);
> > > > > +       WARN_ON(fp->fi_fds[2]);
> > > > >         kmem_cache_free(file_slab, fp);
> > > > >  }
> > > > > 
> > > > >    And when the problem is hit, I also trigger this warning (always this one, which is fd[1]):
> > > > > [ 3588.143002] ------------[ cut here ]------------
> > > > > [ 3588.143662] WARNING: CPU: 5 PID: 9 at /home/green/bk/linux/fs/nfsd/nfs4state.c:278 nfsd4_free_file_rcu+0x65/0x80 [nfsd]
> > > > > [ 3588.144947] Modules linked in: loop rpcsec_gss_krb5 joydev acpi_cpufreq tpm_tis i2c_piix4 tpm virtio_console pcspkr nfsd ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm floppy serio_raw virtio_blk
> > > > > [ 3588.147135] CPU: 5 PID: 9 Comm: rcuos/0 Not tainted 4.7.0-rc1-vm-nfs+ #120
> > > > > [ 3588.153826] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> > > > > [ 3588.153830]  0000000000000286 00000000e2d5ccdf ffff88011965bd50 ffffffff814a11a5
> > > > > [ 3588.153832]  0000000000000000 0000000000000000 ffff88011965bd90 ffffffff8108806b
> > > > > [ 3588.153834]  0000011600000000 ffff8800c476a0b8 ffff8800c476a048 ffffffffc0110fc0
> > > > > [ 3588.153834] Call Trace:
> > > > > [ 3588.153839]  [] dump_stack+0x86/0xc1
> > > > > [ 3588.153841]  [] __warn+0xcb/0xf0
> > > > > [ 3588.153852]  [] ? trace_raw_output_fh_want_write+0x60/0x60 [nfsd]
> > > > > [ 3588.153853]  [] warn_slowpath_null+0x1d/0x20
> > > > > [ 3588.153859]  [] nfsd4_free_file_rcu+0x65/0x80 [nfsd]
> > > > > [ 3588.153861]  [] rcu_nocb_kthread+0x335/0x510
> > > > > [ 3588.153862]  [] ? rcu_nocb_kthread+0x27f/0x510
> > > > > [ 3588.153863]  [] ? rcu_cpu_notify+0x3e0/0x3e0
> > > > > [ 3588.153866]  [] kthread+0x101/0x120
> > > > > [ 3588.153868]  [] ? trace_hardirqs_on_caller+0xf4/0x1b0
> > > > > [ 3588.153871]  [] ret_from_fork+0x1f/0x40
> > > > > [ 3588.153872]  [] ? kthread_create_on_node+0x250/0x250
> > > > > 
> > > > > 
> > > > >   release_all_access() seems to be doing a correct job of all that cleaning, so
> > > > >   there must be some other path that I do not quite see.
> > > > > 
> > > > >   Hopefully you are more familiar with the code and can see the problem right away ;)
> > > > Hmm...well I'm not seeing it right offhand, and haven't been able to
> > > > reproduce the problem so far after a couple of attempts by hand. What
> > > > sort of workload are you running before you see that warning pop?
> > > Well, the workload is a crazed set of twisty, conflicting operations in a very
> > > limited namespace (called racer.sh,
> > > http://git.whamcloud.com/fs/lustrerelease.git/tree/refs/heads/master:/lustre/tests/racer )
> > > 
> > > It does all sorts of crazy, unimaginable stuff that is hard to predict, and I typically
> > > run it on several mountpoints all pointing at the same spot
> > > (now 3 ways: one instance directly on the ext4 mount point, one on an NFS mount
> > > (mount -t nfs) and one on an NFSv4 mount (mount -t nfs4)).
> > > It must be hitting some rare race, or a race plus operations combination, that leads to
> > > the leak.
> > > 
> > > To reproduce I run it like this until it stops (and it usually stops in
> > > 2 or 3 iterations, so it takes some time):
> > > $ cat ~/stress.sh 
> > > #!/bin/bash
> > > 
> > > set -x
> > > 
> > > cd /home/green/git/lustre-release/lustre/tests/racer
> > > dd if=/dev/zero of=/tmp/loop bs=1024k count=1024
> > > mkfs.ext4 /tmp/loop
> > > service rpcbind start
> > > mount none /var/lib/nfs -t tmpfs
> > > touch /var/lib/nfs/etab
> > > service nfs-mountd start
> > > 
> > > while :; do
> > > 
> > > mount /tmp/loop /mnt/lustre -o loop || exit 1
> > > mkdir /mnt/lustre/racer
> > > service nfs-server start || exit 2
> > > mount localhost:/mnt/lustre /mnt/nfs -t nfs -o nolock || exit 3
> > > mount localhost:/ /mnt/nfs2 -t nfs4 || exit 4
> > > DURATION=3600 sh racer.sh /mnt/nfs/racer &
> > > DURATION=3600 sh racer.sh /mnt/nfs2/racer &
> > > DURATION=3600 sh racer.sh /mnt/lustre/racer &
> > > wait %1 %2 %3
> > > umount /mnt/nfs || exit 5
> > > umount /mnt/nfs2 || exit 6
> > > service nfs-server stop || exit 7
> > > umount /mnt/lustre || exit 8
> > > e2fsck -n -f /tmp/loop || exit 9
> > > done
> > > 
> > > Please ignore all the Lustre references; the test is generic. It's just that I am
> > > running off an nfs-readonly-root setup, so I want to reuse the existing mountpoint
> > > locations; there's no Lustre in the picture here.
> > > With all the other problems from before fixed, now it typically stops either for
> > > this leak discussed here, or for this other problem in nfs4 I described here:
> > > http://marc.info/?l=linux-nfs&m=146518324823160&w=2
> > Thanks. So it seems likely we're looking at a race of some sort then...
> > 
> > So, the way this is supposed to work is that the stateids each hold a
> > reference to the nfs4_file. They also take an fi_access reference for
> > the open modes that they represent. This is how we know when it's OK to
> > close one of the files in the fd[] slots. It might be interesting to
> > also print out the fi_access counters if the slot hasn't been zeroed
> > out if you feel up to it.
> Ok, I have primed that and will let you know next time it triggers.
> 

Thanks. Maybe that will give us some idea of where to look...
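If it helps, here's a sketch of that extra diagnostic as a delta on top of your WARN_ON patch (untested, and assuming fi_access is still the two-element atomic_t array in struct nfs4_file indexed by O_RDONLY/O_WRONLY):

```diff
@@ static void nfsd4_free_file_rcu(struct rcu_head *rcu)
 {
 	struct nfs4_file *fp = container_of(rcu, struct nfs4_file, fi_rcu);
 
-	WARN_ON(fp->fi_fds[0]);
-	WARN_ON(fp->fi_fds[1]);
-	WARN_ON(fp->fi_fds[2]);
+	WARN(fp->fi_fds[0] || fp->fi_fds[1] || fp->fi_fds[2],
+	     "nfs4_file leak: fds=%p/%p/%p access=rd:%d/wr:%d\n",
+	     fp->fi_fds[0], fp->fi_fds[1], fp->fi_fds[2],
+	     atomic_read(&fp->fi_access[O_RDONLY]),
+	     atomic_read(&fp->fi_access[O_WRONLY]));
 	kmem_cache_free(file_slab, fp);
 }
```

Seeing which of the two counters is nonzero when a slot survives should tell us which mode's put is being skipped.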

> > 
> > I think though that the open upgrade accounting may be wrong.
> > __nfs4_file_get_access does an unconditional atomic_inc, so if the
> > client does an open for write and then opens again for read/write, then
> > we may end up with too many write fi_access references taken.
> 
> I think the read/write access accounting is fine, because when dropping read or write
> it also checks whether the read/write slot is filled?
> Also, whoever did that atomic_inc in get_access would eventually do the put,
> and that should free the file, no?
> I was mostly thinking there's an error exit path somewhere that forgets to do
> the put.
> This also would explain the other pieces leaked that are not directly attached
> into the nfs4_file.
> 
> > 
> > That said, this code is quite subtle. I'd need to look over it in more
> > detail before I offer up any fixes. I'd also appreciate it if anyone
> > else wants to sanity check my analysis there.
> > 

Yeah, I think you're right. It's fine, since r/w opens have a distinct
slot even though the refcounting just tracks the number of read and
write references. So the leak is probably in an error path someplace,
or maybe a race.

-- 
Jeff Layton <jlayton@poochiereds.net>

Thread overview: 45+ messages
2016-06-07 15:37 Files leak from nfsd in 4.7.1-rc1 (and more?) Oleg Drokin
2016-06-07 17:10 ` Jeff Layton
2016-06-07 17:30   ` Oleg Drokin
2016-06-07 20:04     ` Jeff Layton
2016-06-07 23:39       ` Oleg Drokin
2016-06-08  0:03         ` Jeff Layton [this message]
2016-06-08  0:46           ` Oleg Drokin
2016-06-08  2:22           ` Oleg Drokin
2016-06-08  3:55             ` Oleg Drokin
2016-06-08 10:58             ` Jeff Layton
2016-06-08 14:44               ` Oleg Drokin
2016-06-08 16:10               ` Oleg Drokin
2016-06-08 17:22                 ` Jeff Layton
2016-06-08 17:37                   ` Oleg Drokin
2016-06-09  2:55                   ` [PATCH] nfsd: Always lock state exclusively Oleg Drokin
2016-06-09 10:13                     ` Jeff Layton
2016-06-09 21:01                   ` [PATCH] nfsd: Close a race between access checking/setting in nfs4_get_vfs_file Oleg Drokin
2016-06-10  4:18                     ` Oleg Drokin
2016-06-10 10:50                       ` Jeff Layton
2016-06-10 20:55                         ` J . Bruce Fields
2016-06-11 15:41                           ` Oleg Drokin
2016-06-12  1:33                             ` Jeff Layton
2016-06-12  2:06                               ` Oleg Drokin
2016-06-12  2:50                                 ` Jeff Layton
2016-06-12  3:15                                   ` Oleg Drokin
2016-06-12 13:13                                     ` Jeff Layton
2016-06-13  1:26                                     ` [PATCH v2] nfsd: Always lock state exclusively Oleg Drokin
2016-06-14 15:38                                       ` J . Bruce Fields
2016-06-14 15:53                                         ` Oleg Drokin
2016-06-14 18:50                                           ` J . Bruce Fields
2016-06-14 22:52                                             ` Jeff Layton
2016-06-14 22:54                                               ` Oleg Drokin
2016-06-14 22:57                                                 ` Jeff Layton
2016-06-15  3:28                                                   ` [PATCH 0/3] nfsd state handling fixes Oleg Drokin
2016-06-15  3:28                                                     ` [PATCH 1/3] nfsd: Always lock state exclusively Oleg Drokin
2016-06-15  3:28                                                     ` [PATCH 2/3] nfsd: Extend the mutex holding region around in nfsd4_process_open2() Oleg Drokin
2016-06-15  3:28                                                     ` [PATCH 3/3] nfsd: Make init_open_stateid() a bit more whole Oleg Drokin
2016-06-16  1:54                                                     ` [PATCH 0/3] nfsd state handling fixes Oleg Drokin
2016-06-16  2:07                                                       ` J . Bruce Fields
2016-06-14 15:46                                       ` [PATCH v2] nfsd: Always lock state exclusively J . Bruce Fields
2016-06-14 15:56                                         ` Oleg Drokin
2016-06-14 18:46                                           ` J . Bruce Fields
2016-06-15  2:19                                             ` Oleg Drokin
2016-06-15 13:31                                               ` J . Bruce Fields
2016-06-09 12:13               ` Files leak from nfsd in 4.7.1-rc1 (and more?) Andrew W Elble
