[PATCH 4/4] NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release

public inbox for linux-nfs@vger.kernel.org

* [PATCH 4/4] NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page()
  2014-09-16  5:31 [PATCH 0/4] Remove possible deadlocks in nfs_release_page() NeilBrown
@ 2014-09-16  5:31 ` NeilBrown
  2014-09-16 22:04   ` Trond Myklebust
  0 siblings, 1 reply; 10+ messages in thread
From: NeilBrown @ 2014-09-16  5:31 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Trond Myklebust, Ingo Molnar
  Cc: linux-fsdevel, linux-mm, linux-nfs, linux-kernel, Jeff Layton

Now that nfs_release_page() doesn't block indefinitely, other deadlock
avoidance mechanisms aren't needed.
 - it doesn't hurt for kswapd to block occasionally.  If it doesn't
   want to block it would clear __GFP_WAIT.  The current_is_kswapd()
   was only added to avoid deadlocks and we have a new approach for
   that.
 - memory allocation in the SUNRPC layer can very rarely try to
   ->releasepage() a page it is trying to handle.  The deadlock
   is removed as nfs_release_page() doesn't block indefinitely.

So we don't need to set PF_FSTRANS for sunrpc network operations any
more.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfs/file.c                   |   16 +++++++---------
 net/sunrpc/sched.c              |    2 --
 net/sunrpc/xprtrdma/transport.c |    2 --
 net/sunrpc/xprtsock.c           |   10 ----------
 4 files changed, 7 insertions(+), 23 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 8d74983417af..5949ca37cd18 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -469,18 +469,16 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
 	dfprintk(PAGECACHE, "NFS: release_page(%p)\n", page);
 
 	/* Always try to initiate a 'commit' if relevant, but only
-	 * wait for it if __GFP_WAIT is set and the calling process is
-	 * allowed to block.  Even then, only wait 1 second.  Waiting
-	 * indefinitely can cause deadlocks when the NFS server is on
-	 * this machine, and there is no particular need to wait
-	 * extensively here.  A short wait has the benefit that
-	 * someone else can worry about the freezer.
+	 * wait for it if __GFP_WAIT is set.  Even then, only wait 1
+	 * second.  Waiting indefinitely can cause deadlocks when the
+	 * NFS server is on this machine, when a new TCP connection is
+	 * needed and in other rare cases.  There is no particular
+	 * need to wait extensively here.  A short wait has the
+	 * benefit that someone else can worry about the freezer.
 	 */
 	if (mapping) {
 		nfs_commit_inode(mapping->host, 0);
-		if ((gfp & __GFP_WAIT) &&
-		    !current_is_kswapd() &&
-		    !(current->flags & PF_FSTRANS))
+		if ((gfp & __GFP_WAIT))
 			wait_on_page_bit_killable_timeout(page, PG_private,
 							  HZ);
 	}
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 9358c79fd589..fe3441abdbe5 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -821,9 +821,7 @@ void rpc_execute(struct rpc_task *task)
 
 static void rpc_async_schedule(struct work_struct *work)
 {
-	current->flags |= PF_FSTRANS;
 	__rpc_execute(container_of(work, struct rpc_task, u.tk_work));
-	current->flags &= ~PF_FSTRANS;
 }
 
 /**
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 2faac4940563..6a4615dd0261 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -205,7 +205,6 @@ xprt_rdma_connect_worker(struct work_struct *work)
 	struct rpc_xprt *xprt = &r_xprt->xprt;
 	int rc = 0;
 
-	current->flags |= PF_FSTRANS;
 	xprt_clear_connected(xprt);
 
 	dprintk("RPC:       %s: %sconnect\n", __func__,
@@ -216,7 +215,6 @@ xprt_rdma_connect_worker(struct work_struct *work)
 
 	dprintk("RPC:       %s: exit\n", __func__);
 	xprt_clear_connecting(xprt);
-	current->flags &= ~PF_FSTRANS;
 }
 
 /*
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 43cd89eacfab..4707c0c8568b 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1927,8 +1927,6 @@ static int xs_local_setup_socket(struct sock_xprt *transport)
 	struct socket *sock;
 	int status = -EIO;
 
-	current->flags |= PF_FSTRANS;
-
 	clear_bit(XPRT_CONNECTION_ABORT, &xprt->state);
 	status = __sock_create(xprt->xprt_net, AF_LOCAL,
 					SOCK_STREAM, 0, &sock, 1);
@@ -1968,7 +1966,6 @@ static int xs_local_setup_socket(struct sock_xprt *transport)
 out:
 	xprt_clear_connecting(xprt);
 	xprt_wake_pending_tasks(xprt, status);
-	current->flags &= ~PF_FSTRANS;
 	return status;
 }
 
@@ -2071,8 +2068,6 @@ static void xs_udp_setup_socket(struct work_struct *work)
 	struct socket *sock = transport->sock;
 	int status = -EIO;
 
-	current->flags |= PF_FSTRANS;
-
 	/* Start by resetting any existing state */
 	xs_reset_transport(transport);
 	sock = xs_create_sock(xprt, transport,
@@ -2092,7 +2087,6 @@ static void xs_udp_setup_socket(struct work_struct *work)
 out:
 	xprt_clear_connecting(xprt);
 	xprt_wake_pending_tasks(xprt, status);
-	current->flags &= ~PF_FSTRANS;
 }
 
 /*
@@ -2229,8 +2223,6 @@ static void xs_tcp_setup_socket(struct work_struct *work)
 	struct rpc_xprt *xprt = &transport->xprt;
 	int status = -EIO;
 
-	current->flags |= PF_FSTRANS;
-
 	if (!sock) {
 		clear_bit(XPRT_CONNECTION_ABORT, &xprt->state);
 		sock = xs_create_sock(xprt, transport,
@@ -2276,7 +2268,6 @@ static void xs_tcp_setup_socket(struct work_struct *work)
 	case -EINPROGRESS:
 	case -EALREADY:
 		xprt_clear_connecting(xprt);
-		current->flags &= ~PF_FSTRANS;
 		return;
 	case -EINVAL:
 		/* Happens, for instance, if the user specified a link
@@ -2294,7 +2285,6 @@ out_eagain:
 out:
 	xprt_clear_connecting(xprt);
 	xprt_wake_pending_tasks(xprt, status);
-	current->flags &= ~PF_FSTRANS;
 }
 
 /**




* Re: [PATCH 4/4] NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page()
  2014-09-16  5:31 ` [PATCH 4/4] NFS/SUNRPC: Remove other deadlock-avoidance mechanisms " NeilBrown
@ 2014-09-16 22:04   ` Trond Myklebust
  2014-09-17  1:10     ` NeilBrown
  0 siblings, 1 reply; 10+ messages in thread
From: Trond Myklebust @ 2014-09-16 22:04 UTC (permalink / raw)
  To: NeilBrown
  Cc: Peter Zijlstra, Andrew Morton, Ingo Molnar, Devel FS Linux,
	linux-mm, Linux NFS Mailing List, Linux Kernel mailing list,
	Jeff Layton

Hi Neil,

On Tue, Sep 16, 2014 at 1:31 AM, NeilBrown <neilb@suse.de> wrote:
> Now that nfs_release_page() doesn't block indefinitely, other deadlock
> avoidance mechanisms aren't needed.
>  - it doesn't hurt for kswapd to block occasionally.  If it doesn't
>    want to block it would clear __GFP_WAIT.  The current_is_kswapd()
>    was only added to avoid deadlocks and we have a new approach for
>    that.
>  - memory allocation in the SUNRPC layer can very rarely try to
>    ->releasepage() a page it is trying to handle.  The deadlock
>    is removed as nfs_release_page() doesn't block indefinitely.
>
> So we don't need to set PF_FSTRANS for sunrpc network operations any
> more.

Jeff Layton and I had a little discussion about this earlier today.
The issue that Jeff raised was that these 1 second waits, although
they will eventually complete, can nevertheless have a cumulative
large effect if, say, the reason why we're not making progress is that
we're being called as part of a socket reconnect attempt in
xs_tcp_setup_socket().

In that case, any attempts to call nfs_release_page() on pages that
need to use that socket, will result in a 1 second wait, and no
progress in satisfying the allocation attempt.

Our conclusion was that we still need the PF_FSTRANS in order to deal
with that case, where we need to actually circumvent the new wait in
order to guarantee progress on the task of allocating and connecting
the new socket.

Comments?

Cheers
  Trond

-- 
Trond Myklebust

Linux NFS client maintainer, PrimaryData

trond.myklebust@primarydata.com


* Re: [PATCH 4/4] NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page()
  2014-09-16 22:04   ` Trond Myklebust
@ 2014-09-17  1:10     ` NeilBrown
  2014-09-17  1:32       ` Trond Myklebust
  0 siblings, 1 reply; 10+ messages in thread
From: NeilBrown @ 2014-09-17  1:10 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Peter Zijlstra, Andrew Morton, Ingo Molnar, Devel FS Linux,
	linux-mm, Linux NFS Mailing List, Linux Kernel mailing list,
	Jeff Layton

On Tue, 16 Sep 2014 18:04:55 -0400 Trond Myklebust
<trond.myklebust@primarydata.com> wrote:

> Hi Neil,
> 
> On Tue, Sep 16, 2014 at 1:31 AM, NeilBrown <neilb@suse.de> wrote:
> > Now that nfs_release_page() doesn't block indefinitely, other deadlock
> > avoidance mechanisms aren't needed.
> >  - it doesn't hurt for kswapd to block occasionally.  If it doesn't
> >    want to block it would clear __GFP_WAIT.  The current_is_kswapd()
> >    was only added to avoid deadlocks and we have a new approach for
> >    that.
> >  - memory allocation in the SUNRPC layer can very rarely try to
> >    ->releasepage() a page it is trying to handle.  The deadlock
> >    is removed as nfs_release_page() doesn't block indefinitely.
> >
> > So we don't need to set PF_FSTRANS for sunrpc network operations any
> > more.
> 
> Jeff Layton and I had a little discussion about this earlier today.
> The issue that Jeff raised was that these 1 second waits, although
> they will eventually complete, can nevertheless have a cumulative
> large effect if, say, the reason why we're not making progress is that
> we're being called as part of a socket reconnect attempt in
> xs_tcp_setup_socket().
> 
> In that case, any attempts to call nfs_release_page() on pages that
> need to use that socket, will result in a 1 second wait, and no
> progress in satisfying the allocation attempt.
> 
> Our conclusion was that we still need the PF_FSTRANS in order to deal
> with that case, where we need to actually circumvent the new wait in
> order to guarantee progress on the task of allocating and connecting
> the new socket.
> 
> Comments?

This is the one weak point in the patch that had occurred to me.
What if shrink_page_list() gets a list of pages all in the same NFS file?  It
will then spend one second on each of those pages...
It will typically only do 32 pages at a time (I think), but that could still
be rather long.
When I was testing with only one large NFS file, and lots of dirty anon pages
to create the required pressure, I didn't see any evidence of extensive
delays, though it is possible that I didn't look in the right place.

My general feeling is that these deadlocks are very rare and an occasional one
or two second pause is a small price to pay - a price you would be unlikely
to even notice.

However ... something else occurs to me.  We could use the bdi congestion
markers to guide the timeout.
When the wait for PG_private times out, or when a connection re-establishment
is required (and maybe other similar times) we could set_bdi_congested().
Then in nfs_release_page() we could completely avoid the wait if
bdi_write_congested().

The congestion setting should encourage vmscan away from the filesystem so it
won't keep calling nfs_release_page() which is a bonus.
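
As a rough illustration of the idea above, here is a userspace model (not
kernel code) of a reclaim batch in which every page belongs to a file whose
COMMIT never completes.  All sim_* names are hypothetical stand-ins: the
boolean flag models the bdi write-congested bit, private_set models
PG_private, and may_wait models (gfp & __GFP_WAIT).  It only counts how many
one-second timeouts a batch would accumulate with and without the flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-server congestion bit. */
struct sim_bdi {
	bool write_congested;
};

/* Hypothetical stand-in for a page; private_set models PG_private. */
struct sim_page {
	bool private_set;
};

#define SIM_TIMEOUT_SECS 1

/*
 * Model one release attempt and return the seconds spent waiting.
 * Assumption: the commit never completes, so a wait on a page that
 * still has private_set always runs to the full timeout.
 */
static unsigned int sim_release_page(struct sim_bdi *bdi,
				     const struct sim_page *page,
				     bool may_wait, bool use_congestion)
{
	unsigned int waited = 0;

	/* Skip the wait entirely when the bdi is already marked congested. */
	if (may_wait && !(use_congestion && bdi->write_congested)) {
		if (page->private_set)
			waited = SIM_TIMEOUT_SECS;
	}
	/* Page still busy after the attempt: mark the bdi congested. */
	if (use_congestion && page->private_set)
		bdi->write_congested = true;
	return waited;
}

/* Sum the waits over a batch of pages that share one bdi. */
static unsigned int sim_reclaim_batch(const struct sim_page *pages, size_t n,
				      bool may_wait, bool use_congestion)
{
	struct sim_bdi bdi = { .write_congested = false };
	unsigned int total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += sim_release_page(&bdi, &pages[i], may_wait,
					  use_congestion);
	return total;
}
```

In this model a batch of 32 stuck pages costs 32 seconds without the flag and
a single one-second wait with it, since only the first timeout is paid before
the flag short-circuits the rest.  It says nothing about when the flag gets
cleared again, which the real change has to handle separately.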

Setting bdi_congestion from the RPC layer might be awkward from a layering
perspective, but probably isn't necessary.

Would the following allay your concerns?  The change to
nfs_inode_remove_request ensures that any congestion is removed when a
'commit' completes.

We certainly could keep the PF_FSTRANS setting in the SUNRPC layer - that was
why it was a separate patch.  It would be nice to find a uniform solution
though.

Thanks,
NeilBrown



diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 5949ca37cd18..bc674ad250ce 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -477,10 +477,15 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
 	 * benefit that someone else can worry about the freezer.
 	 */
 	if (mapping) {
+		struct nfs_server *nfss = NFS_SERVER(mapping->host);
 		nfs_commit_inode(mapping->host, 0);
-		if ((gfp & __GFP_WAIT))
+		if ((gfp & __GFP_WAIT) &&
+		    !bdi_write_congested(&nfss->backing_dev_info))
 			wait_on_page_bit_killable_timeout(page, PG_private,
 							  HZ);
+		if (PagePrivate(page))
+			set_bdi_congested(&nfss->backing_dev_info,
+					  BLK_RW_ASYNC);
 	}
 	/* If PagePrivate() is set, then the page is not freeable */
 	if (PagePrivate(page))
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 700e7a865e6d..3ab122e92c9d 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -726,6 +726,7 @@ static void nfs_inode_remove_request(struct nfs_page *req)
 	struct inode *inode = req->wb_context->dentry->d_inode;
 	struct nfs_inode *nfsi = NFS_I(inode);
 	struct nfs_page *head;
+	struct nfs_server *nfss = NFS_SERVER(inode);
 
 	if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
 		head = req->wb_head;
@@ -742,6 +743,9 @@ static void nfs_inode_remove_request(struct nfs_page *req)
 		spin_unlock(&inode->i_lock);
 	}
 
+	if (atomic_long_read(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
+		clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
+
 	if (test_and_clear_bit(PG_INODE_REF, &req->wb_flags))
 		nfs_release_request(req);
 	else


* Re: [PATCH 4/4] NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page()
  2014-09-17  1:10     ` NeilBrown
@ 2014-09-17  1:32       ` Trond Myklebust
  2014-09-17  3:12         ` NeilBrown
  0 siblings, 1 reply; 10+ messages in thread
From: Trond Myklebust @ 2014-09-17  1:32 UTC (permalink / raw)
  To: NeilBrown
  Cc: Peter Zijlstra, Andrew Morton, Ingo Molnar, Devel FS Linux,
	linux-mm, Linux NFS Mailing List, Linux Kernel mailing list,
	Jeff Layton

On Tue, Sep 16, 2014 at 9:10 PM, NeilBrown <neilb@suse.de> wrote:
>
> However ... something else occurs to me.  We could use the bdi congestion
> markers to guide the timeout.
> When the wait for PG_private times out, or when a connection re-establishment
> is required (and maybe other similar times) we could set_bdi_congested().
> Then in nfs_release_page() we could completely avoid the wait if
> bdi_write_congested().
>
> The congestion setting should encourage vmscan away from the filesystem so it
> won't keep calling nfs_release_page() which is a bonus.
>
> Setting bdi_congestion from the RPC layer might be awkward from a layering
> perspective, but probably isn't necessary.
>
> Would the following allay your concerns?  The change to
> nfs_inode_remove_request ensures that any congestion is removed when a
> 'commit' completes.
>
> We certainly could keep the PF_FSTRANS setting in the SUNRPC layer - that was
> why it was a separate patch.  It would be nice to find a uniform solution
> though.
>
> Thanks,
> NeilBrown
>
>
>
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index 5949ca37cd18..bc674ad250ce 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -477,10 +477,15 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
>          * benefit that someone else can worry about the freezer.
>          */
>         if (mapping) {
> +               struct nfs_server *nfss = NFS_SERVER(mapping->host);
>                 nfs_commit_inode(mapping->host, 0);
> -               if ((gfp & __GFP_WAIT))
> +               if ((gfp & __GFP_WAIT) &&
> +                   !bdi_write_congested(&nfss->backing_dev_info))
>                         wait_on_page_bit_killable_timeout(page, PG_private,
>                                                           HZ);
> +               if (PagePrivate(page))
> +                       set_bdi_congested(&nfss->backing_dev_info,
> +                                         BLK_RW_ASYNC);
>         }
>         /* If PagePrivate() is set, then the page is not freeable */
>         if (PagePrivate(page))
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index 700e7a865e6d..3ab122e92c9d 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -726,6 +726,7 @@ static void nfs_inode_remove_request(struct nfs_page *req)
>         struct inode *inode = req->wb_context->dentry->d_inode;
>         struct nfs_inode *nfsi = NFS_I(inode);
>         struct nfs_page *head;
> +       struct nfs_server *nfss = NFS_SERVER(inode);
>
>         if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
>                 head = req->wb_head;
> @@ -742,6 +743,9 @@ static void nfs_inode_remove_request(struct nfs_page *req)
>                 spin_unlock(&inode->i_lock);
>         }
>
> +       if (atomic_long_read(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
> +               clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);

Hmm.... We already have this equivalent functionality in
nfs_end_page_writeback(), so adding it to nfs_inode_remove_request()
is just causing duplication as far as the stable writeback path is
concerned. How about adding it to nfs_commit_release_pages() instead?

Otherwise, yes, the above does indeed look as if it has merit. Have
you got a good test?

-- 
Trond Myklebust

Linux NFS client maintainer, PrimaryData

trond.myklebust@primarydata.com


* Re: [PATCH 4/4] NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page()
  2014-09-17  1:32       ` Trond Myklebust
@ 2014-09-17  3:12         ` NeilBrown
  0 siblings, 0 replies; 10+ messages in thread
From: NeilBrown @ 2014-09-17  3:12 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Peter Zijlstra, Andrew Morton, Ingo Molnar, Devel FS Linux,
	linux-mm, Linux NFS Mailing List, Linux Kernel mailing list,
	Jeff Layton

On Tue, 16 Sep 2014 21:32:43 -0400 Trond Myklebust
<trond.myklebust@primarydata.com> wrote:

> On Tue, Sep 16, 2014 at 9:10 PM, NeilBrown <neilb@suse.de> wrote:
> >
> > However ... something else occurs to me.  We could use the bdi congestion
> > markers to guide the timeout.
> > When the wait for PG_private times out, or when a connection re-establishment
> > is required (and maybe other similar times) we could set_bdi_congested().
> > Then in nfs_release_page() we could completely avoid the wait if
> > bdi_write_congested().
> >
> > The congestion setting should encourage vmscan away from the filesystem so it
> > won't keep calling nfs_release_page() which is a bonus.
> >
> > Setting bdi_congestion from the RPC layer might be awkward from a layering
> > perspective, but probably isn't necessary.
> >
> > Would the following allay your concerns?  The change to
> > nfs_inode_remove_request ensures that any congestion is removed when a
> > 'commit' completes.
> >
> > We certainly could keep the PF_FSTRANS setting in the SUNRPC layer - that was
> > why it was a separate patch.  It would be nice to find a uniform solution
> > though.
> >
> > Thanks,
> > NeilBrown
> >
> >
> >
> > diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> > index 5949ca37cd18..bc674ad250ce 100644
> > --- a/fs/nfs/file.c
> > +++ b/fs/nfs/file.c
> > @@ -477,10 +477,15 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
> >          * benefit that someone else can worry about the freezer.
> >          */
> >         if (mapping) {
> > +               struct nfs_server *nfss = NFS_SERVER(mapping->host);
> >                 nfs_commit_inode(mapping->host, 0);
> > -               if ((gfp & __GFP_WAIT))
> > +               if ((gfp & __GFP_WAIT) &&
> > +                   !bdi_write_congested(&nfss->backing_dev_info))
> >                         wait_on_page_bit_killable_timeout(page, PG_private,
> >                                                           HZ);
> > +               if (PagePrivate(page))
> > +                       set_bdi_congested(&nfss->backing_dev_info,
> > +                                         BLK_RW_ASYNC);
> >         }
> >         /* If PagePrivate() is set, then the page is not freeable */
> >         if (PagePrivate(page))
> > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > index 700e7a865e6d..3ab122e92c9d 100644
> > --- a/fs/nfs/write.c
> > +++ b/fs/nfs/write.c
> > @@ -726,6 +726,7 @@ static void nfs_inode_remove_request(struct nfs_page *req)
> >         struct inode *inode = req->wb_context->dentry->d_inode;
> >         struct nfs_inode *nfsi = NFS_I(inode);
> >         struct nfs_page *head;
> > +       struct nfs_server *nfss = NFS_SERVER(inode);
> >
> >         if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
> >                 head = req->wb_head;
> > @@ -742,6 +743,9 @@ static void nfs_inode_remove_request(struct nfs_page *req)
> >                 spin_unlock(&inode->i_lock);
> >         }
> >
> > +       if (atomic_long_read(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
> > +               clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
> 
> Hmm.... We already have this equivalent functionality in
> nfs_end_page_writeback(), so adding it to nfs_inode_remove_request()
> is just causing duplication as far as the stable writeback path is
> concerned. How about adding it to nfs_commit_release_pages() instead?
> 
> Otherwise, yes, the above does indeed look as if it has merit. Have
> you got a good test?
> 

Altered patch below.  I'll post a proper one after some testing.

For testing I create a memory pressure load with:
=============================
#!/bin/bash

umount /mnt/ramdisk
umount /mnt/ramdisk
mount -t tmpfs -o size=4G none /mnt/ramdisk
#swapoff -a

i=0
while [ $i -le 10000 ]; do
        i=$(($i+1))
        dd if=/dev/zero of=/mnt/ramdisk/testdata.dd bs=1M count=6500
        date
done
==============================

Where the '4G' matches memory size, and then write out to an NFS file with:

=================================
#!/bin/bash

umount /mnt2 /mnt3
umount /mnt2 /mnt3
mount /dev/sdd /mnt2
exportfs -avu
exportfs -av
mount $* 127.0.0.1:/mnt2 /mnt3
for j in {1..100}; do
    i=1
    while [ $i -le 10000 ]; do
        echo "Step $i"
        date +%H:%M:%S
        i=$(($i+1))
        zcat /boot/vmlinux-3.13.3-1-desktop.gz | uuencode -
        date +%H:%M:%S
    done | dd of=/mnt3/testdat.file bs=1M
done
====================================

Pick your own way to create a large file of random data .. though
probably /dev/zero would do.

With both those going for a few hours the current kernel will deadlock.
With my patches it doesn't.
I'll see if I can come up with some way to measure maximum delay in
try_to_free_pages() and see how the 'congestion' change affects that.
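
One simple shape such a measurement could take is a running maximum over
per-call elapsed times.  The sketch below is a userspace illustration only,
not the instrumentation actually used: the struct and function names are
hypothetical, and the caller is assumed to supply monotonic timestamps in
nanoseconds taken just before and just after the call of interest.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tracker: longest observed call and number of samples. */
struct max_delay_tracker {
	uint64_t max_ns;
	uint64_t calls;
};

/*
 * Record one call that started at start_ns and ended at end_ns.
 * Assumes both values come from the same monotonic clock, so that
 * end_ns >= start_ns.
 */
static void max_delay_record(struct max_delay_tracker *t,
			     uint64_t start_ns, uint64_t end_ns)
{
	uint64_t elapsed = end_ns - start_ns;

	t->calls++;
	if (elapsed > t->max_ns)
		t->max_ns = elapsed;
}
```

Comparing the recorded maximum across runs with and without the congestion
change would give the kind of figure being asked about.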

Thanks,
NeilBrown


----------------------
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 5949ca37cd18..bc674ad250ce 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -477,10 +477,15 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
 	 * benefit that someone else can worry about the freezer.
 	 */
 	if (mapping) {
+		struct nfs_server *nfss = NFS_SERVER(mapping->host);
 		nfs_commit_inode(mapping->host, 0);
-		if ((gfp & __GFP_WAIT))
+		if ((gfp & __GFP_WAIT) &&
+		    !bdi_write_congested(&nfss->backing_dev_info))
 			wait_on_page_bit_killable_timeout(page, PG_private,
 							  HZ);
+		if (PagePrivate(page))
+			set_bdi_congested(&nfss->backing_dev_info,
+					  BLK_RW_ASYNC);
 	}
 	/* If PagePrivate() is set, then the page is not freeable */
 	if (PagePrivate(page))
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 700e7a865e6d..8d4aae9d977a 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1641,6 +1641,7 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
 	struct nfs_page	*req;
 	int status = data->task.tk_status;
 	struct nfs_commit_info cinfo;
+	struct nfs_server *nfss;
 
 	while (!list_empty(&data->pages)) {
 		req = nfs_list_entry(data->pages.next);
@@ -1674,6 +1675,10 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
 	next:
 		nfs_unlock_and_release_request(req);
 	}
+	nfss = NFS_SERVER(data->inode);
+	if (atomic_long_read(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
+		clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
+
 	nfs_init_cinfo(&cinfo, data->inode, data->dreq);
 	if (atomic_dec_and_test(&cinfo.mds->rpcs_out))
 		nfs_commit_clear_lock(NFS_I(data->inode));



* [PATCH 3/4] NFS: avoid deadlocks with loop-back mounted NFS filesystems.
  2014-09-18  6:03 [PATCH 0/4] Remove possible deadlocks in nfs_release_page() - V2 NeilBrown
  2014-09-18  6:03 ` [PATCH 4/4] NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page() NeilBrown
@ 2014-09-18  6:03 ` NeilBrown
  2014-09-18 12:01   ` Jeff Layton
  1 sibling, 1 reply; 10+ messages in thread
From: NeilBrown @ 2014-09-18  6:03 UTC (permalink / raw)
  To: Trond Myklebust, Jeff Layton; +Cc: linux-nfs

Support for loop-back mounted NFS filesystems is useful when NFS is
used to access shared storage in a high-availability cluster.

If the node running the NFS server fails, some other node can mount the
filesystem and start providing NFS service.  If that node already had
the filesystem NFS mounted, it will now have it loop-back mounted.

nfsd can suffer a deadlock when allocating memory and entering direct
reclaim.
While direct reclaim does not write to the NFS filesystem it can send
and wait for a COMMIT through nfs_release_page().

This patch modifies nfs_release_page() to wait a limited time for the
commit to complete - one second.  If the commit doesn't complete
in this time, nfs_release_page() will fail.  This means it might now
fail in some cases where it wouldn't before.  These cases are only
when 'gfp' includes '__GFP_WAIT'.

nfs_release_page() is only called by try_to_release_page(), and that
can only be called on an NFS page with required 'gfp' flags from
 - page_cache_pipe_buf_steal() in splice.c
 - shrink_page_list() in vmscan.c
 - invalidate_inode_pages2_range() in truncate.c

The first two handle failure quite safely.  The last is only called
after ->launder_page() has been called, and that will have waited
for the commit to finish already.

So aborting if the commit takes longer than 1 second is perfectly safe.

If nfs_release_page() is called on a sequence of pages which are all
in the same file which is blocked on COMMIT, each page could
contribute a 1 second delay which could become excessive.  I have
seen delays of as much as 208 seconds.

To keep the delay to one second, the bdi is marked as write-congested
if the commit didn't finish.  Once it does finish, the
write-congested flag will be cleared.

With this, the longest total delay in try_to_free_pages that I have
seen is under 3 seconds.  With no waiting in nfs_release_page at all
I have seen delays of nearly 1.5 seconds.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfs/file.c  |   30 ++++++++++++++++++++----------
 fs/nfs/write.c |    7 +++++++
 2 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 524dd80d1898..febba950d8a6 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -468,17 +468,27 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
 
 	dfprintk(PAGECACHE, "NFS: release_page(%p)\n", page);
 
-	/* Only do I/O if gfp is a superset of GFP_KERNEL, and we're not
-	 * doing this memory reclaim for a fs-related allocation.
+	/* Always try to initiate a 'commit' if relevant, but only
+	 * wait for it if __GFP_WAIT is set and the calling process is
+	 * allowed to block.  Even then, only wait 1 second and only
+	 * if the 'bdi' is not congested.
+	 * Waiting indefinitely can cause deadlocks when the NFS
+	 * server is on this machine, and there is no particular need
+	 * to wait extensively here.  A short wait has the benefit
+	 * that someone else can worry about the freezer.
 	 */
-	if (mapping && (gfp & GFP_KERNEL) == GFP_KERNEL &&
-	    !(current->flags & PF_FSTRANS)) {
-		int how = FLUSH_SYNC;
-
-		/* Don't let kswapd deadlock waiting for OOM RPC calls */
-		if (current_is_kswapd())
-			how = 0;
-		nfs_commit_inode(mapping->host, how);
+	if (mapping) {
+		struct nfs_server *nfss = NFS_SERVER(mapping->host);
+		nfs_commit_inode(mapping->host, 0);
+		if ((gfp & __GFP_WAIT) &&
+		    !current_is_kswapd() &&
+		    !(current->flags & PF_FSTRANS) &&
+		    !bdi_write_congested(&nfss->backing_dev_info))
+			wait_on_page_bit_killable_timeout(page, PG_private,
+							  HZ);
+		if (PagePrivate(page))
+			set_bdi_congested(&nfss->backing_dev_info,
+					  BLK_RW_ASYNC);
 	}
 	/* If PagePrivate() is set, then the page is not freeable */
 	if (PagePrivate(page))
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 175d5d073ccf..3066c7fcb565 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -731,6 +731,8 @@ static void nfs_inode_remove_request(struct nfs_page *req)
 		if (likely(!PageSwapCache(head->wb_page))) {
 			set_page_private(head->wb_page, 0);
 			ClearPagePrivate(head->wb_page);
+			smp_mb__after_atomic();
+			wake_up_page(head->wb_page, PG_private);
 			clear_bit(PG_MAPPED, &head->wb_flags);
 		}
 		nfsi->npages--;
@@ -1636,6 +1638,7 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
 	struct nfs_page	*req;
 	int status = data->task.tk_status;
 	struct nfs_commit_info cinfo;
+	struct nfs_server *nfss;
 
 	while (!list_empty(&data->pages)) {
 		req = nfs_list_entry(data->pages.next);
@@ -1669,6 +1672,10 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
 	next:
 		nfs_unlock_and_release_request(req);
 	}
+	nfss = NFS_SERVER(data->inode);
+	if (atomic_long_read(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
+		clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
+
 	nfs_init_cinfo(&cinfo, data->inode, data->dreq);
 	if (atomic_dec_and_test(&cinfo.mds->rpcs_out))
 		nfs_commit_clear_lock(NFS_I(data->inode));




* [PATCH 4/4] NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page()
  2014-09-18  6:03 [PATCH 0/4] Remove possible deadlocks in nfs_release_page() - V2 NeilBrown
@ 2014-09-18  6:03 ` NeilBrown
  2014-09-18  6:03 ` [PATCH 3/4] NFS: avoid deadlocks with loop-back mounted NFS filesystems NeilBrown
  1 sibling, 0 replies; 10+ messages in thread
From: NeilBrown @ 2014-09-18  6:03 UTC (permalink / raw)
  To: Trond Myklebust, Jeff Layton; +Cc: linux-nfs

Now that nfs_release_page() doesn't block indefinitely, other deadlock
avoidance mechanisms aren't needed.
 - it doesn't hurt for kswapd to block occasionally.  If it doesn't
   want to block it would clear __GFP_WAIT.  The current_is_kswapd()
   was only added to avoid deadlocks and we have a new approach for
   that.
 - memory allocation in the SUNRPC layer can very rarely try to
   ->releasepage() a page it is trying to handle.  The deadlock
   is removed as nfs_release_page() doesn't block indefinitely.

So we don't need to set PF_FSTRANS for sunrpc network operations any
more.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfs/file.c                   |   14 ++++++--------
 net/sunrpc/sched.c              |    2 --
 net/sunrpc/xprtrdma/transport.c |    2 --
 net/sunrpc/xprtsock.c           |   10 ----------
 4 files changed, 6 insertions(+), 22 deletions(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index febba950d8a6..3c032b1f1b75 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -469,20 +469,18 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
 	dfprintk(PAGECACHE, "NFS: release_page(%p)\n", page);
 
 	/* Always try to initiate a 'commit' if relevant, but only
-	 * wait for it if __GFP_WAIT is set and the calling process is
-	 * allowed to block.  Even then, only wait 1 second and only
-	 * if the 'bdi' is not congested.
+	 * wait for it if __GFP_WAIT is set.  Even then, only wait 1
+	 * second and only if the 'bdi' is not congested.
 	 * Waiting indefinitely can cause deadlocks when the NFS
-	 * server is on this machine, and there is no particular need
-	 * to wait extensively here.  A short wait has the benefit
-	 * that someone else can worry about the freezer.
+	 * server is on this machine, when a new TCP connection is
+	 * needed and in other rare cases.  There is no particular
+	 * need to wait extensively here.  A short wait has the
+	 * benefit that someone else can worry about the freezer.
 	 */
 	if (mapping) {
 		struct nfs_server *nfss = NFS_SERVER(mapping->host);
 		nfs_commit_inode(mapping->host, 0);
 		if ((gfp & __GFP_WAIT) &&
-		    !current_is_kswapd() &&
-		    !(current->flags & PF_FSTRANS) &&
 		    !bdi_write_congested(&nfss->backing_dev_info))
 			wait_on_page_bit_killable_timeout(page, PG_private,
 							  HZ);
diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 9358c79fd589..fe3441abdbe5 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -821,9 +821,7 @@ void rpc_execute(struct rpc_task *task)
 
 static void rpc_async_schedule(struct work_struct *work)
 {
-	current->flags |= PF_FSTRANS;
 	__rpc_execute(container_of(work, struct rpc_task, u.tk_work));
-	current->flags &= ~PF_FSTRANS;
 }
 
 /**
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 2faac4940563..6a4615dd0261 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -205,7 +205,6 @@ xprt_rdma_connect_worker(struct work_struct *work)
 	struct rpc_xprt *xprt = &r_xprt->xprt;
 	int rc = 0;
 
-	current->flags |= PF_FSTRANS;
 	xprt_clear_connected(xprt);
 
 	dprintk("RPC:       %s: %sconnect\n", __func__,
@@ -216,7 +215,6 @@ xprt_rdma_connect_worker(struct work_struct *work)
 
 	dprintk("RPC:       %s: exit\n", __func__);
 	xprt_clear_connecting(xprt);
-	current->flags &= ~PF_FSTRANS;
 }
 
 /*
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 43cd89eacfab..4707c0c8568b 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1927,8 +1927,6 @@ static int xs_local_setup_socket(struct sock_xprt *transport)
 	struct socket *sock;
 	int status = -EIO;
 
-	current->flags |= PF_FSTRANS;
-
 	clear_bit(XPRT_CONNECTION_ABORT, &xprt->state);
 	status = __sock_create(xprt->xprt_net, AF_LOCAL,
 					SOCK_STREAM, 0, &sock, 1);
@@ -1968,7 +1966,6 @@ static int xs_local_setup_socket(struct sock_xprt *transport)
 out:
 	xprt_clear_connecting(xprt);
 	xprt_wake_pending_tasks(xprt, status);
-	current->flags &= ~PF_FSTRANS;
 	return status;
 }
 
@@ -2071,8 +2068,6 @@ static void xs_udp_setup_socket(struct work_struct *work)
 	struct socket *sock = transport->sock;
 	int status = -EIO;
 
-	current->flags |= PF_FSTRANS;
-
 	/* Start by resetting any existing state */
 	xs_reset_transport(transport);
 	sock = xs_create_sock(xprt, transport,
@@ -2092,7 +2087,6 @@ static void xs_udp_setup_socket(struct work_struct *work)
 out:
 	xprt_clear_connecting(xprt);
 	xprt_wake_pending_tasks(xprt, status);
-	current->flags &= ~PF_FSTRANS;
 }
 
 /*
@@ -2229,8 +2223,6 @@ static void xs_tcp_setup_socket(struct work_struct *work)
 	struct rpc_xprt *xprt = &transport->xprt;
 	int status = -EIO;
 
-	current->flags |= PF_FSTRANS;
-
 	if (!sock) {
 		clear_bit(XPRT_CONNECTION_ABORT, &xprt->state);
 		sock = xs_create_sock(xprt, transport,
@@ -2276,7 +2268,6 @@ static void xs_tcp_setup_socket(struct work_struct *work)
 	case -EINPROGRESS:
 	case -EALREADY:
 		xprt_clear_connecting(xprt);
-		current->flags &= ~PF_FSTRANS;
 		return;
 	case -EINVAL:
 		/* Happens, for instance, if the user specified a link
@@ -2294,7 +2285,6 @@ out_eagain:
 out:
 	xprt_clear_connecting(xprt);
 	xprt_wake_pending_tasks(xprt, status);
-	current->flags &= ~PF_FSTRANS;
 }
 
 /**



* [PATCH 0/4] Remove possible deadlocks in nfs_release_page() - V2
@ 2014-09-18  6:03 NeilBrown
  2014-09-18  6:03 ` [PATCH 4/4] NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page() NeilBrown
  2014-09-18  6:03 ` [PATCH 3/4] NFS: avoid deadlocks with loop-back mounted NFS filesystems NeilBrown
  0 siblings, 2 replies; 10+ messages in thread
From: NeilBrown @ 2014-09-18  6:03 UTC (permalink / raw)
  To: Trond Myklebust, Jeff Layton; +Cc: linux-nfs

These two patches are updated versions of the last two patches of this
series.  They include the use of congestion to avoid excessive
waiting.

(I'm not resending 1/4 and 2/4; they are unchanged.)

Without the congestion check, I've seen wait times in
try_to_free_pages as long as 208 seconds.
With no waiting at all in nfs_release_page() I've seen wait times as long
as 1.4 seconds.
With the 1 second wait, I've seen 2 seconds.
These numbers will vary based on numerous factors, but they do seem
to suggest that 1 second is a good ball-park number.

NeilBrown

---

NeilBrown (2):
      NFS: avoid deadlocks with loop-back mounted NFS filesystems.
      NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page()

 fs/nfs/file.c                   |   28 ++++++++++++++++++----------
 fs/nfs/write.c                  |    7 +++++++
 net/sunrpc/sched.c              |    2 --
 net/sunrpc/xprtrdma/transport.c |    2 --
 net/sunrpc/xprtsock.c           |   10 ----------
 5 files changed, 25 insertions(+), 24 deletions(-)

-- 
Signature

* Re: [PATCH 3/4] NFS: avoid deadlocks with loop-back mounted NFS filesystems.
  2014-09-18  6:03 ` [PATCH 3/4] NFS: avoid deadlocks with loop-back mounted NFS filesystems NeilBrown
@ 2014-09-18 12:01   ` Jeff Layton
  2014-09-22  1:37     ` NeilBrown
  0 siblings, 1 reply; 10+ messages in thread
From: Jeff Layton @ 2014-09-18 12:01 UTC (permalink / raw)
  To: NeilBrown; +Cc: Trond Myklebust, Jeff Layton, linux-nfs

On Thu, 18 Sep 2014 16:03:17 +1000
NeilBrown <neilb@suse.de> wrote:

> Support for loop-back mounted NFS filesystems is useful when NFS is
> used to access shared storage in a high-availability cluster.
> 
> If the node running the NFS server fails, some other node can mount the
> filesystem and start providing NFS service.  If that node already had
> the filesystem NFS mounted, it will now have it loop-back mounted.
> 
> nfsd can suffer a deadlock when allocating memory and entering direct
> reclaim.
> While direct reclaim does not write to the NFS filesystem it can send
> and wait for a COMMIT through nfs_release_page().
> 
> This patch modifies nfs_release_page() to wait a limited time for the
> commit to complete - one second.  If the commit doesn't complete
> in this time, nfs_release_page() will fail.  This means it might now
> fail in some cases where it wouldn't before.  These cases are only
> when 'gfp' includes '__GFP_WAIT'.
> 
> nfs_release_page() is only called by try_to_release_page(), and that
> can only be called on an NFS page with required 'gfp' flags from
>  - page_cache_pipe_buf_steal() in splice.c
>  - shrink_page_list() in vmscan.c
>  - invalidate_inode_pages2_range() in truncate.c
> 
> The first two handle failure quite safely.  The last is only called
> after ->launder_page() has been called, and that will have waited
> for the commit to finish already.
> 
> So aborting if the commit takes longer than 1 second is perfectly safe.
> 
> If nfs_release_page() is called on a sequence of pages which are all
> in the same file which is blocked on COMMIT, each page could
> contribute a 1 second delay which could become excessive.  I have
> seen delays of as much as 208 seconds.
> 
> To keep the delay to one second, the bdi is marked as write-congested
> if the commit didn't finish.  Once it does finish, the
> write-congested flag will be cleared.
> 
> With this, the longest total delay in try_to_free_pages that I have
> seen is under 3 seconds.  With no waiting in nfs_release_page at all
> I have seen delays of nearly 1.5 seconds.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/nfs/file.c  |   30 ++++++++++++++++++++----------
>  fs/nfs/write.c |    7 +++++++
>  2 files changed, 27 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index 524dd80d1898..febba950d8a6 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -468,17 +468,27 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
>  
>  	dfprintk(PAGECACHE, "NFS: release_page(%p)\n", page);
>  
> -	/* Only do I/O if gfp is a superset of GFP_KERNEL, and we're not
> -	 * doing this memory reclaim for a fs-related allocation.
> +	/* Always try to initiate a 'commit' if relevant, but only
> +	 * wait for it if __GFP_WAIT is set and the calling process is
> +	 * allowed to block.  Even then, only wait 1 second and only
> +	 * if the 'bdi' is not congested.
> +	 * Waiting indefinitely can cause deadlocks when the NFS
> +	 * server is on this machine, and there is no particular need
> +	 * to wait extensively here.  A short wait has the benefit
> +	 * that someone else can worry about the freezer.
>  	 */
> -	if (mapping && (gfp & GFP_KERNEL) == GFP_KERNEL &&
> -	    !(current->flags & PF_FSTRANS)) {
> -		int how = FLUSH_SYNC;
> -
> -		/* Don't let kswapd deadlock waiting for OOM RPC calls */
> -		if (current_is_kswapd())
> -			how = 0;
> -		nfs_commit_inode(mapping->host, how);
> +	if (mapping) {
> +		struct nfs_server *nfss = NFS_SERVER(mapping->host);
> +		nfs_commit_inode(mapping->host, 0);
> +		if ((gfp & __GFP_WAIT) &&
> +		    !current_is_kswapd() &&
> +		    !(current->flags & PF_FSTRANS) &&
> +		    !bdi_write_congested(&nfss->backing_dev_info))
> +			wait_on_page_bit_killable_timeout(page, PG_private,
> +							  HZ);
> +		if (PagePrivate(page))
> +			set_bdi_congested(&nfss->backing_dev_info,
> +					  BLK_RW_ASYNC);

I've never had a great feel for the BDI congestion stuff, but won't
this have some unintended effects?

For instance, suppose the VM decides to try to free this page and
passes in a gfp mask that doesn't contain __GFP_WAIT. We issue the
COMMIT, but don't wait for it. The COMMIT is actually going to go
reasonably fast, but we now set the BDI congested because we didn't
wait for it to occur.

That in turn causes writeout for other inodes on this BDI to get
throttled even though there really is no congestion. It just looks that
way due to how releasepage got called.

Am I making mountains out of molehills here?

>  	}
>  	/* If PagePrivate() is set, then the page is not freeable */
>  	if (PagePrivate(page))
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index 175d5d073ccf..3066c7fcb565 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -731,6 +731,8 @@ static void nfs_inode_remove_request(struct nfs_page *req)
>  		if (likely(!PageSwapCache(head->wb_page))) {
>  			set_page_private(head->wb_page, 0);
>  			ClearPagePrivate(head->wb_page);
> +			smp_mb__after_atomic();
> +			wake_up_page(head->wb_page, PG_private);
>  			clear_bit(PG_MAPPED, &head->wb_flags);
>  		}
>  		nfsi->npages--;
> @@ -1636,6 +1638,7 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
>  	struct nfs_page	*req;
>  	int status = data->task.tk_status;
>  	struct nfs_commit_info cinfo;
> +	struct nfs_server *nfss;
>  
>  	while (!list_empty(&data->pages)) {
>  		req = nfs_list_entry(data->pages.next);
> @@ -1669,6 +1672,10 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
>  	next:
>  		nfs_unlock_and_release_request(req);
>  	}
> +	nfss = NFS_SERVER(data->inode);
> +	if (atomic_long_read(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
> +		clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
> +
>  	nfs_init_cinfo(&cinfo, data->inode, data->dreq);
>  	if (atomic_dec_and_test(&cinfo.mds->rpcs_out))
>  		nfs_commit_clear_lock(NFS_I(data->inode));
> 
> 


-- 
Jeff Layton <jlayton@primarydata.com>

* Re: [PATCH 3/4] NFS: avoid deadlocks with loop-back mounted NFS filesystems.
  2014-09-18 12:01   ` Jeff Layton
@ 2014-09-22  1:37     ` NeilBrown
  0 siblings, 0 replies; 10+ messages in thread
From: NeilBrown @ 2014-09-22  1:37 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Trond Myklebust, linux-nfs

On Thu, 18 Sep 2014 08:01:07 -0400 Jeff Layton <jeff.layton@primarydata.com>
wrote:

> On Thu, 18 Sep 2014 16:03:17 +1000
> NeilBrown <neilb@suse.de> wrote:
> 
> > Support for loop-back mounted NFS filesystems is useful when NFS is
> > used to access shared storage in a high-availability cluster.
> > 
> > If the node running the NFS server fails, some other node can mount the
> > filesystem and start providing NFS service.  If that node already had
> > the filesystem NFS mounted, it will now have it loop-back mounted.
> > 
> > nfsd can suffer a deadlock when allocating memory and entering direct
> > reclaim.
> > While direct reclaim does not write to the NFS filesystem it can send
> > and wait for a COMMIT through nfs_release_page().
> > 
> > This patch modifies nfs_release_page() to wait a limited time for the
> > commit to complete - one second.  If the commit doesn't complete
> > in this time, nfs_release_page() will fail.  This means it might now
> > fail in some cases where it wouldn't before.  These cases are only
> > when 'gfp' includes '__GFP_WAIT'.
> > 
> > nfs_release_page() is only called by try_to_release_page(), and that
> > can only be called on an NFS page with required 'gfp' flags from
> >  - page_cache_pipe_buf_steal() in splice.c
> >  - shrink_page_list() in vmscan.c
> >  - invalidate_inode_pages2_range() in truncate.c
> > 
> > The first two handle failure quite safely.  The last is only called
> > after ->launder_page() has been called, and that will have waited
> > for the commit to finish already.
> > 
> > So aborting if the commit takes longer than 1 second is perfectly safe.
> > 
> > If nfs_release_page() is called on a sequence of pages which are all
> > in the same file which is blocked on COMMIT, each page could
> > contribute a 1 second delay which could become excessive.  I have
> > seen delays of as much as 208 seconds.
> > 
> > To keep the delay to one second, the bdi is marked as write-congested
> > if the commit didn't finish.  Once it does finish, the
> > write-congested flag will be cleared.
> > 
> > With this, the longest total delay in try_to_free_pages that I have
> > seen is under 3 seconds.  With no waiting in nfs_release_page at all
> > I have seen delays of nearly 1.5 seconds.
> > 
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  fs/nfs/file.c  |   30 ++++++++++++++++++++----------
> >  fs/nfs/write.c |    7 +++++++
> >  2 files changed, 27 insertions(+), 10 deletions(-)
> > 
> > diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> > index 524dd80d1898..febba950d8a6 100644
> > --- a/fs/nfs/file.c
> > +++ b/fs/nfs/file.c
> > @@ -468,17 +468,27 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
> >  
> >  	dfprintk(PAGECACHE, "NFS: release_page(%p)\n", page);
> >  
> > -	/* Only do I/O if gfp is a superset of GFP_KERNEL, and we're not
> > -	 * doing this memory reclaim for a fs-related allocation.
> > +	/* Always try to initiate a 'commit' if relevant, but only
> > +	 * wait for it if __GFP_WAIT is set and the calling process is
> > +	 * allowed to block.  Even then, only wait 1 second and only
> > +	 * if the 'bdi' is not congested.
> > +	 * Waiting indefinitely can cause deadlocks when the NFS
> > +	 * server is on this machine, and there is no particular need
> > +	 * to wait extensively here.  A short wait has the benefit
> > +	 * that someone else can worry about the freezer.
> >  	 */
> > -	if (mapping && (gfp & GFP_KERNEL) == GFP_KERNEL &&
> > -	    !(current->flags & PF_FSTRANS)) {
> > -		int how = FLUSH_SYNC;
> > -
> > -		/* Don't let kswapd deadlock waiting for OOM RPC calls */
> > -		if (current_is_kswapd())
> > -			how = 0;
> > -		nfs_commit_inode(mapping->host, how);
> > +	if (mapping) {
> > +		struct nfs_server *nfss = NFS_SERVER(mapping->host);
> > +		nfs_commit_inode(mapping->host, 0);
> > +		if ((gfp & __GFP_WAIT) &&
> > +		    !current_is_kswapd() &&
> > +		    !(current->flags & PF_FSTRANS) &&
> > +		    !bdi_write_congested(&nfss->backing_dev_info))
> > +			wait_on_page_bit_killable_timeout(page, PG_private,
> > +							  HZ);
> > +		if (PagePrivate(page))
> > +			set_bdi_congested(&nfss->backing_dev_info,
> > +					  BLK_RW_ASYNC);
> 
> I've never had a great feel for the BDI congestion stuff, but won't
> this have some unintended effects?
> 
> For instance, suppose the VM decides to try to free this page and
> passes in a gfp mask that doesn't contain __GFP_WAIT. We issue the
> COMMIT, but don't wait for it. The COMMIT is actually going to go
> reasonably fast, but we now set the BDI congested because we didn't
> wait for it to occur.
> 
> That in turn causes writeout for other inodes on this BDI to get
> throttled even though there really is no congestion. It just looks that
> way due to how releasepage got called.
> 
> Am I making mountains out of molehills here?

Excellent molehill - thanks :-)

I was being lazy.  The 'if (PagePrivate())' should really be inside the
other if statement with the wait_on_page_bit...().  I've moved it there.
Once I get an Ack for the mm bits I'll repost it all.

Thanks,
NeilBrown

(Molehills are worse for your suspension than mountains!)


> 
> >  	}
> >  	/* If PagePrivate() is set, then the page is not freeable */
> >  	if (PagePrivate(page))
> > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > index 175d5d073ccf..3066c7fcb565 100644
> > --- a/fs/nfs/write.c
> > +++ b/fs/nfs/write.c
> > @@ -731,6 +731,8 @@ static void nfs_inode_remove_request(struct nfs_page *req)
> >  		if (likely(!PageSwapCache(head->wb_page))) {
> >  			set_page_private(head->wb_page, 0);
> >  			ClearPagePrivate(head->wb_page);
> > +			smp_mb__after_atomic();
> > +			wake_up_page(head->wb_page, PG_private);
> >  			clear_bit(PG_MAPPED, &head->wb_flags);
> >  		}
> >  		nfsi->npages--;
> > @@ -1636,6 +1638,7 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
> >  	struct nfs_page	*req;
> >  	int status = data->task.tk_status;
> >  	struct nfs_commit_info cinfo;
> > +	struct nfs_server *nfss;
> >  
> >  	while (!list_empty(&data->pages)) {
> >  		req = nfs_list_entry(data->pages.next);
> > @@ -1669,6 +1672,10 @@ static void nfs_commit_release_pages(struct nfs_commit_data *data)
> >  	next:
> >  		nfs_unlock_and_release_request(req);
> >  	}
> > +	nfss = NFS_SERVER(data->inode);
> > +	if (atomic_long_read(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
> > +		clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
> > +
> >  	nfs_init_cinfo(&cinfo, data->inode, data->dreq);
> >  	if (atomic_dec_and_test(&cinfo.mds->rpcs_out))
> >  		nfs_commit_clear_lock(NFS_I(data->inode));
> > 
> > 
> 
> 


Thread overview: 10+ messages
2014-09-18  6:03 [PATCH 0/4] Remove possible deadlocks in nfs_release_page() - V2 NeilBrown
2014-09-18  6:03 ` [PATCH 4/4] NFS/SUNRPC: Remove other deadlock-avoidance mechanisms in nfs_release_page() NeilBrown
2014-09-18  6:03 ` [PATCH 3/4] NFS: avoid deadlocks with loop-back mounted NFS filesystems NeilBrown
2014-09-18 12:01   ` Jeff Layton
2014-09-22  1:37     ` NeilBrown
  -- strict thread matches above, loose matches on Subject: below --
2014-09-16  5:31 [PATCH 0/4] Remove possible deadlocks in nfs_release_page() NeilBrown
2014-09-16  5:31 ` [PATCH 4/4] NFS/SUNRPC: Remove other deadlock-avoidance mechanisms " NeilBrown
2014-09-16 22:04   ` Trond Myklebust
2014-09-17  1:10     ` NeilBrown
2014-09-17  1:32       ` Trond Myklebust
2014-09-17  3:12         ` NeilBrown
