Re: NFSroot hangs with bad unlock balance in Linux next

From: Al Viro <viro@ZenIV.linux.org.uk>
To: Tony Lindgren <tony@atomide.com>
Cc: Christoph Hellwig <hch@lst.de>,
	Trond Myklebust <trond.myklebust@primarydata.com>,
	Anna Schumaker <anna.schumaker@netapp.com>,
	linux-nfs@vger.kernel.org, linux-omap@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org,
	Eric Dumazet <edumazet@google.com>,
	linux-net@vger.kernel.org
Subject: Re: NFSroot hangs with bad unlock balance in Linux next
Date: Mon, 9 May 2016 08:32:35 +0100
Message-ID: <20160509073235.GI2694@ZenIV.linux.org.uk>
In-Reply-To: <20160508141629.GF2694@ZenIV.linux.org.uk>

On Sun, May 08, 2016 at 03:16:29PM +0100, Al Viro wrote:

> Very strange.  We grab that rwsem at the entry into nfs_call_unlink()
> and then either release it there and return or call nfs_do_call_unlink().
> Which arranges for eventual call of nfs_async_unlink_release() (via
> ->rpc_release); nfs_async_unlink_release() releases the rwsem.  Nobody else
> releases it (on the read side, that is).
> 
> The only kinda-sorta possibility I see here is that the inode we are
> unlocking in that nfs_async_unlink_release() is not the one we'd locked
> in nfs_call_unlink() that has led to it.  That really shouldn't happen,
> though...  Just to verify whether that's what we are hitting, could you
> try to reproduce that thing with the patch below on top of -next and see
> if it triggers any of those WARN_ON?

D'oh...  Lockdep warnings are easy to trigger (and, AFAICS, bogus).
up_read/down_read in fs/nfs/unlink.c should be replaced with
up_read_non_owner/down_read_non_owner, lest lockdep get confused.
Hangs are different - I've no idea what's triggering those.  I've seen
something similar on that -next, but not on work.lookups.
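
To illustrate why the lockdep complaint can be bogus here, below is a toy
userspace model, not kernel code.  Every toy_* name is made up for this
sketch, and the real lockdep bookkeeping is far more involved.  The assumption
being modelled is only this: the checker tracks held locks per context, the
rwsem is taken in the nfs_call_unlink() caller and dropped later from
->rpc_release in a different context, so the releasing side has no record of
ever taking it.  The *_non_owner variants skip that bookkeeping.

```c
#include <assert.h>

enum { TOY_MAX_HELD = 8 };

/* Toy read-side semaphore: only counts readers. */
struct toy_rwsem { int readers; };

/* Stand-in for a task: remembers which locks it took (hypothetical checker). */
struct toy_ctx {
    const struct toy_rwsem *held[TOY_MAX_HELD];
    int nheld;
    int warnings;   /* number of "bad unlock balance" style complaints */
};

/* Owner-tracked acquire: record the lock in the acquiring context. */
static void toy_down_read(struct toy_ctx *c, struct toy_rwsem *s)
{
    s->readers++;
    if (c->nheld < TOY_MAX_HELD)
        c->held[c->nheld++] = s;
}

/* Owner-tracked release: complain if this context never took the lock. */
static void toy_up_read(struct toy_ctx *c, struct toy_rwsem *s)
{
    int i;

    s->readers--;
    for (i = 0; i < c->nheld; i++) {
        if (c->held[i] == s) {
            c->held[i] = c->held[--c->nheld];
            return;
        }
    }
    c->warnings++;  /* the lock itself is fine, only the bookkeeping objects */
}

/* Non-owner variants: same locking, no per-context bookkeeping. */
static void toy_down_read_non_owner(struct toy_rwsem *s) { s->readers++; }
static void toy_up_read_non_owner(struct toy_rwsem *s)   { s->readers--; }

/*
 * Acquire in one context, release in another.  Returns how many warnings
 * the releasing context produced.
 */
int toy_cross_context_warnings(int use_non_owner)
{
    struct toy_rwsem sem = { 0 };
    struct toy_ctx caller = { 0 };       /* plays the nfs_call_unlink() side */
    struct toy_ctx rpc_release = { 0 };  /* plays the ->rpc_release side */

    if (use_non_owner) {
        toy_down_read_non_owner(&sem);
        toy_up_read_non_owner(&sem);
    } else {
        toy_down_read(&caller, &sem);
        toy_up_read(&rpc_release, &sem);
    }

    assert(sem.readers == 0);  /* balanced in both cases */
    return rpc_release.warnings;
}
```

In this model the reader count ends up balanced either way; the only
difference is whether the checker complains.  Calling
toy_cross_context_warnings() from a small main() with 0 and then 1 shows
both cases.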

The joy of bisecting -next...  <a couple of hours later>
9317bb69824ec8d078b0b786b6971aedb0af3d4f is the first bad commit
commit 9317bb69824ec8d078b0b786b6971aedb0af3d4f
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Apr 25 10:39:32 2016 -0700

    net: SOCKWQ_ASYNC_NOSPACE optimizations

Reverting changes to sk_set_bit/sk_clear_bit gets rid of the hangs.  Plain
revert gives a conflict, since there had been additional change in
"net: SOCKWQ_ASYNC_WAITDATA optimizations"; removing both fixed the hangs.

Note that hangs appear without any fs/nfs/unlink.c modifications being
there.  When the hang happens it affects NFS traffic; ssh session still
works fine until it steps on a filesystem operation on NFS (i.e. you
can use builtins, access procfs, etc.)
