netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH/RFC 00/19] Support loop-back NFS mounts
@ 2014-04-16  4:03 NeilBrown
  2014-04-16  4:03 ` [PATCH 05/19] SUNRPC: track whether a request is coming from a loop-back interface NeilBrown
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: NeilBrown @ 2014-04-16  4:03 UTC (permalink / raw)
  To: linux-mm, linux-nfs, linux-kernel
  Cc: xfs, Peter Zijlstra, Ingo Molnar, Ming Lei, netdev

Loop-back NFS mounts are when the NFS client and server run on the
same host.

The use-case for this is a high availability cluster with shared
storage.  The shared filesystem is mounted on any one machine and
NFS-mounted on the others.
If the nfs server fails, some other node will take over that service,
and then it will have a loop-back NFS mount which needs to keep
working.

This patch set addresses the "keep working" bit and specifically
addresses deadlocks and livelocks.
Allowing the fail-over itself to be deadlock free is a separate
challenge for another day.

The short description of how this works is:

deadlocks:
  - Elevate PF_FSTRANS to apply globally instead of just in NFS and XFS.
    PF_FSTRANS disables __GFP_NS in the same way that PF_MEMALLOC_NOIO
    disables __GFP_IO.
  - Set PF_FSTRANS in nfsd when handling requests related to
    memory reclaim, or requests which could block requests related
    to memory reclaim.
  - Use lockdep to find all consequent deadlocks from some other
    thread allocating memory while holding a lock that nfsd might
    want.
  - Fix those other deadlocks by setting PF_FSTRANS or using GFP_NOFS
    as appropriate.

livelocks:
  - identify throttling during reclaim and bypass it when
    PF_LESS_THROTTLE is set
  - only set PF_LESS_THROTTLE for nfsd when handling write requests
    from the local host.

The last 12 patches address various deadlocks due to locking chains.
11 were found by lockdep, 2 by testing.  There is a reasonable chance
that there are more, I just need to exercise more code while
testing....

There is one issue that lockdep reports which I haven't fixed (I've
just hacked the code out for my testing).  That issue relates to
freeze_super().
I may not be interpreting the lockdep reports perfectly, but I think
they are basically saying that if I were to freeze a filesystem that
was exported to the local host, then we could end up deadlocking.
This is to be expected.  The NFS filesystem would need to be frozen
first.  I don't know how to tell lockdep that I know that is a problem
and I don't want to be warned about it.  Suggestions welcome.
Until this is addressed I cannot really ask others to test the code
with lockdep enabled.

There are more subsidiary places that I needed to add PF_FSTRANS than
I would have liked.  The thought keeps crossing my mind that maybe we
can get rid of __GFP_FS and require that memory reclaim never ever
block on a filesystem.  Then most of these patches go away.

Now that writeback doesn't happen from reclaim (but from kswapd) much
of the calls from reclaim to FS are gone.
The ->releasepage call is the only one that I *know* causes me
problems so I'd like to just say that that must never block.  I don't
really understand the consequences of that though.
There are a couple of other places where __GFP_FS is used and I'd need
to carefully analyze those.  But if someone just said "no, that is
impossible", I could be happy and stick with the current approach....

I've cc:ed Peter Zijlstra and Ingo Molnar only on the lockdep-related
patches, Ming Lei only on the PF_MEMALLOC_NOIO related patches,
and net-dev only on the network-related patches.
There are probably other people I should CC.  Apologies if I missed you.
I'll ensure better coverage if the nfs/mm/xfs people are reasonably happy.

Comments, criticisms, etc most welcome.

Thanks,
NeilBrown


---

NeilBrown (19):
      Promote current_{set,restore}_flags_nested from xfs to global.
      lockdep: lockdep_set_current_reclaim_state should save old value
      lockdep: improve scenario messages for RECLAIM_FS errors.
      Make effect of PF_FSTRANS to disable __GFP_FS universal.
      SUNRPC: track whether a request is coming from a loop-back interface.
      nfsd: set PF_FSTRANS for nfsd threads.
      nfsd and VM: use PF_LESS_THROTTLE to avoid throttle in shrink_inactive_list.
      Set PF_FSTRANS while write_cache_pages calls ->writepage
      XFS: ensure xfs_file_*_read cannot deadlock in memory allocation.
      NET: set PF_FSTRANS while holding sk_lock
      FS: set PF_FSTRANS while holding mmap_sem in exec.c
      NET: set PF_FSTRANS while holding rtnl_lock
      MM: set PF_FSTRANS while allocating per-cpu memory to avoid deadlock.
      driver core: set PF_FSTRANS while holding gdp_mutex
      nfsd: set PF_FSTRANS when client_mutex is held.
      VFS: use GFP_NOFS rather than GFP_KERNEL in __d_alloc.
      VFS: set PF_FSTRANS while namespace_sem is held.
      nfsd: set PF_FSTRANS during nfsd4_do_callback_rpc.
      XFS: set PF_FSTRANS while ilock is held in xfs_free_eofblocks


 drivers/base/core.c             |    3 ++
 drivers/base/power/runtime.c    |    6 ++---
 drivers/block/nbd.c             |    6 ++---
 drivers/md/dm-bufio.c           |    6 ++---
 drivers/md/dm-ioctl.c           |    6 ++---
 drivers/mtd/nand/nandsim.c      |   28 ++++++---------------
 drivers/scsi/iscsi_tcp.c        |    6 ++---
 drivers/usb/core/hub.c          |    6 ++---
 fs/dcache.c                     |    4 ++-
 fs/exec.c                       |    6 +++++
 fs/fs-writeback.c               |    5 ++--
 fs/namespace.c                  |    4 +++
 fs/nfs/file.c                   |    3 +-
 fs/nfsd/nfs4callback.c          |    5 ++++
 fs/nfsd/nfs4state.c             |    3 ++
 fs/nfsd/nfssvc.c                |   24 ++++++++++++++----
 fs/nfsd/vfs.c                   |    6 +++++
 fs/xfs/kmem.h                   |    2 --
 fs/xfs/xfs_aops.c               |    7 -----
 fs/xfs/xfs_bmap_util.c          |    4 +++
 fs/xfs/xfs_file.c               |   12 +++++++++
 fs/xfs/xfs_linux.h              |    7 -----
 include/linux/lockdep.h         |    8 +++---
 include/linux/sched.h           |   32 +++++++++---------------
 include/linux/sunrpc/svc.h      |    2 ++
 include/linux/sunrpc/svc_xprt.h |    1 +
 include/net/sock.h              |    1 +
 kernel/locking/lockdep.c        |   51 ++++++++++++++++++++++++++++-----------
 kernel/softirq.c                |    6 ++---
 mm/migrate.c                    |    9 +++----
 mm/page-writeback.c             |    3 ++
 mm/page_alloc.c                 |   18 ++++++++------
 mm/percpu.c                     |    4 +++
 mm/slab.c                       |    2 ++
 mm/slob.c                       |    2 ++
 mm/slub.c                       |    1 +
 mm/vmscan.c                     |   31 +++++++++++++++---------
 net/core/dev.c                  |    6 ++---
 net/core/rtnetlink.c            |    9 ++++++-
 net/core/sock.c                 |    8 ++++--
 net/sunrpc/sched.c              |    5 ++--
 net/sunrpc/svc.c                |    6 +++++
 net/sunrpc/svcsock.c            |   10 ++++++++
 net/sunrpc/xprtrdma/transport.c |    5 ++--
 net/sunrpc/xprtsock.c           |   17 ++++++++-----
 45 files changed, 247 insertions(+), 149 deletions(-)

-- 
Signature

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 05/19] SUNRPC: track whether a request is coming from a loop-back interface.
  2014-04-16  4:03 [PATCH/RFC 00/19] Support loop-back NFS mounts NeilBrown
@ 2014-04-16  4:03 ` NeilBrown
  2014-04-16 14:47   ` Jeff Layton
  2014-04-16  4:03 ` [PATCH 12/19] NET: set PF_FSTRANS while holding rtnl_lock NeilBrown
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: NeilBrown @ 2014-04-16  4:03 UTC (permalink / raw)
  To: linux-mm, linux-nfs, linux-kernel; +Cc: xfs, netdev

If an incoming NFS request is coming from the local host, then
nfsd will need to perform some special handling.  So detect that
possibility and make the source visible in rq_local.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 include/linux/sunrpc/svc.h      |    1 +
 include/linux/sunrpc/svc_xprt.h |    1 +
 net/sunrpc/svcsock.c            |   10 ++++++++++
 3 files changed, 12 insertions(+)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 04e763221246..a0dbbd1e00e9 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -254,6 +254,7 @@ struct svc_rqst {
 	u32			rq_prot;	/* IP protocol */
 	unsigned short
 				rq_secure  : 1;	/* secure port */
+	unsigned short		rq_local   : 1;	/* local request */
 
 	void *			rq_argp;	/* decoded arguments */
 	void *			rq_resp;	/* xdr'd results */
diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
index b05963f09ebf..b99bdfb0fcf9 100644
--- a/include/linux/sunrpc/svc_xprt.h
+++ b/include/linux/sunrpc/svc_xprt.h
@@ -63,6 +63,7 @@ struct svc_xprt {
 #define	XPT_DETACHED	10		/* detached from tempsocks list */
 #define XPT_LISTENER	11		/* listening endpoint */
 #define XPT_CACHE_AUTH	12		/* cache auth info */
+#define XPT_LOCAL	13		/* connection from loopback interface */
 
 	struct svc_serv		*xpt_server;	/* service for transport */
 	atomic_t    	    	xpt_reserved;	/* space on outq that is rsvd */
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index b6e59f0a9475..193115fe968c 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -811,6 +811,7 @@ static struct svc_xprt *svc_tcp_accept(struct svc_xprt *xprt)
 	struct socket	*newsock;
 	struct svc_sock	*newsvsk;
 	int		err, slen;
+	struct dst_entry *dst;
 	RPC_IFDEBUG(char buf[RPC_MAX_ADDRBUFLEN]);
 
 	dprintk("svc: tcp_accept %p sock %p\n", svsk, sock);
@@ -867,6 +868,14 @@ static struct svc_xprt *svc_tcp_accept(struct svc_xprt *xprt)
 	}
 	svc_xprt_set_local(&newsvsk->sk_xprt, sin, slen);
 
+	clear_bit(XPT_LOCAL, &newsvsk->sk_xprt.xpt_flags);
+	rcu_read_lock();
+	dst = rcu_dereference(newsock->sk->sk_dst_cache);
+	if (dst && dst->dev &&
+	    (dst->dev->features & NETIF_F_LOOPBACK))
+		set_bit(XPT_LOCAL, &newsvsk->sk_xprt.xpt_flags);
+	rcu_read_unlock();
+
 	if (serv->sv_stats)
 		serv->sv_stats->nettcpconn++;
 
@@ -1112,6 +1121,7 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 
 	rqstp->rq_xprt_ctxt   = NULL;
 	rqstp->rq_prot	      = IPPROTO_TCP;
+	rqstp->rq_local	      = !!test_bit(XPT_LOCAL, &svsk->sk_xprt.xpt_flags);
 
 	p = (__be32 *)rqstp->rq_arg.head[0].iov_base;
 	calldir = p[1];


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 10/19] NET: set PF_FSTRANS while holding sk_lock
  2014-04-16  4:03 [PATCH/RFC 00/19] Support loop-back NFS mounts NeilBrown
  2014-04-16  4:03 ` [PATCH 05/19] SUNRPC: track whether a request is coming from a loop-back interface NeilBrown
  2014-04-16  4:03 ` [PATCH 12/19] NET: set PF_FSTRANS while holding rtnl_lock NeilBrown
@ 2014-04-16  4:03 ` NeilBrown
       [not found]   ` <20140416040336.10604.96000.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
  2014-04-16 14:42 ` [PATCH/RFC 00/19] Support loop-back NFS mounts Jeff Layton
  3 siblings, 1 reply; 15+ messages in thread
From: NeilBrown @ 2014-04-16  4:03 UTC (permalink / raw)
  To: linux-mm, linux-nfs, linux-kernel; +Cc: xfs, netdev

sk_lock can be taken while reclaiming memory (in nfsd for loop-back
NFS mounts, and presumably in nfs), and memory can be allocated while
holding sk_lock, at least via:

 inet_listen -> inet_csk_listen_start ->reqsk_queue_alloc

So to avoid deadlocks, always set PF_FSTRANS while holding sk_lock.

This deadlock was found by lockdep.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 include/net/sock.h |    1 +
 net/core/sock.c    |    2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index b9586a137cad..27c355637e44 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -324,6 +324,7 @@ struct sock {
 #define sk_v6_rcv_saddr	__sk_common.skc_v6_rcv_saddr
 
 	socket_lock_t		sk_lock;
+	unsigned int		sk_pflags; /* process flags before taking lock */
 	struct sk_buff_head	sk_receive_queue;
 	/*
 	 * The backlog queue is special, it is always used with
diff --git a/net/core/sock.c b/net/core/sock.c
index cf9bd24e4099..8bc677ef072e 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2341,6 +2341,7 @@ void lock_sock_nested(struct sock *sk, int subclass)
 	/*
 	 * The sk_lock has mutex_lock() semantics here:
 	 */
+	current_set_flags_nested(&sk->sk_pflags, PF_FSTRANS);
 	mutex_acquire(&sk->sk_lock.dep_map, subclass, 0, _RET_IP_);
 	local_bh_enable();
 }
@@ -2352,6 +2353,7 @@ void release_sock(struct sock *sk)
 	 * The sk_lock has mutex_unlock() semantics:
 	 */
 	mutex_release(&sk->sk_lock.dep_map, 1, _RET_IP_);
+	current_restore_flags_nested(&sk->sk_pflags, PF_FSTRANS);
 
 	spin_lock_bh(&sk->sk_lock.slock);
 	if (sk->sk_backlog.tail)


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 12/19] NET: set PF_FSTRANS while holding rtnl_lock
  2014-04-16  4:03 [PATCH/RFC 00/19] Support loop-back NFS mounts NeilBrown
  2014-04-16  4:03 ` [PATCH 05/19] SUNRPC: track whether a request is coming from a loop-back interface NeilBrown
@ 2014-04-16  4:03 ` NeilBrown
  2014-04-16  4:03 ` [PATCH 10/19] NET: set PF_FSTRANS while holding sk_lock NeilBrown
  2014-04-16 14:42 ` [PATCH/RFC 00/19] Support loop-back NFS mounts Jeff Layton
  3 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2014-04-16  4:03 UTC (permalink / raw)
  To: linux-mm, linux-nfs, linux-kernel; +Cc: xfs, netdev

As rtnl_mutex can be taken while holding sk_lock, and sk_lock can be
taken while performing memory reclaim (at least when loop-back NFS is
active), any memory allocation under rtnl_mutex must avoid __GFP_FS,
which is most easily done by setting PF_MEMALLOC.


        CPU0                    CPU1
        ----                    ----
   lock(rtnl_mutex);
                                lock(sk_lock-AF_INET);
                                lock(rtnl_mutex);
   <Memory allocation/reclaim>
     lock(sk_lock-AF_INET);

  *** DEADLOCK ***

1/ rtnl_mutex is taken while holding sk_lock:

    [<ffffffff81abb442>] rtnl_lock+0x12/0x20
    [<ffffffff81b28c3a>] ip_mc_leave_group+0x2a/0x160
    [<ffffffff81aec70b>] do_ip_setsockopt.isra.18+0x96b/0xed0
    [<ffffffff81aecc97>] ip_setsockopt+0x27/0x90
    [<ffffffff81b151c6>] udp_setsockopt+0x16/0x30
    [<ffffffff81a9144f>] sock_common_setsockopt+0xf/0x20
    [<ffffffff81a907de>] SyS_setsockopt+0x5e/0xc0

2/ memory is allocated under rtnl_mutex:
    [<ffffffff8166eb41>] kobject_set_name_vargs+0x21/0x70
    [<ffffffff81840d92>] dev_set_name+0x42/0x50
    [<ffffffff81ac5e97>] netdev_register_kobject+0x57/0x130
    [<ffffffff81aaf574>] register_netdevice+0x354/0x550
    [<ffffffff81aaf785>] register_netdev+0x15/0x30


Signed-off-by: NeilBrown <neilb@suse.de>
---
 net/core/rtnetlink.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 120eecc0f5a4..6870211e93a6 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -61,15 +61,18 @@ struct rtnl_link {
 };
 
 static DEFINE_MUTEX(rtnl_mutex);
+static int rtnl_pflags;
 
 void rtnl_lock(void)
 {
 	mutex_lock(&rtnl_mutex);
+	current_set_flags_nested(&rtnl_pflags, PF_FSTRANS);
 }
 EXPORT_SYMBOL(rtnl_lock);
 
 void __rtnl_unlock(void)
 {
+	current_restore_flags_nested(&rtnl_pflags, PF_FSTRANS);
 	mutex_unlock(&rtnl_mutex);
 }
 
@@ -82,7 +85,11 @@ EXPORT_SYMBOL(rtnl_unlock);
 
 int rtnl_trylock(void)
 {
-	return mutex_trylock(&rtnl_mutex);
+	if (mutex_trylock(&rtnl_mutex)) {
+		current_set_flags_nested(&rtnl_pflags, PF_FSTRANS);
+		return 1;
+	}
+	return 0;
 }
 EXPORT_SYMBOL(rtnl_trylock);
 


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 10/19] NET: set PF_FSTRANS while holding sk_lock
       [not found]   ` <20140416040336.10604.96000.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
@ 2014-04-16  5:13     ` Eric Dumazet
       [not found]       ` <1397625226.4222.113.camel-XN9IlZ5yJG9HTL0Zs8A6p/gx64E7kk8eUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
  2014-04-16 13:00       ` David Miller
  0 siblings, 2 replies; 15+ messages in thread
From: Eric Dumazet @ 2014-04-16  5:13 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, xfs-VZNHf3L845pBDgjK7y7TUQ,
	netdev-u79uwXL29TY76Z2rM5mHXA

On Wed, 2014-04-16 at 14:03 +1000, NeilBrown wrote:
> sk_lock can be taken while reclaiming memory (in nfsd for loop-back
> NFS mounts, and presumably in nfs), and memory can be allocated while
> holding sk_lock, at least via:
> 
>  inet_listen -> inet_csk_listen_start ->reqsk_queue_alloc
> 
> So to avoid deadlocks, always set PF_FSTRANS while holding sk_lock.
> 
> This deadlock was found by lockdep.

Wow, this is adding expensive stuff in fast path, only for nfsd :(

BTW, why should the current->flags should be saved on a socket field,
and not a current->save_flags. This really looks a thread property, not
a socket one.

Why nfsd could not have PF_FSTRANS in its current->flags ?

For applications handling millions of sockets, this makes a difference.


--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 10/19] NET: set PF_FSTRANS while holding sk_lock
       [not found]       ` <1397625226.4222.113.camel-XN9IlZ5yJG9HTL0Zs8A6p/gx64E7kk8eUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
@ 2014-04-16  5:47         ` NeilBrown
  0 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2014-04-16  5:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, xfs-VZNHf3L845pBDgjK7y7TUQ,
	netdev-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 1922 bytes --]

On Tue, 15 Apr 2014 22:13:46 -0700 Eric Dumazet <eric.dumazet-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
wrote:

> On Wed, 2014-04-16 at 14:03 +1000, NeilBrown wrote:
> > sk_lock can be taken while reclaiming memory (in nfsd for loop-back
> > NFS mounts, and presumably in nfs), and memory can be allocated while
> > holding sk_lock, at least via:
> > 
> >  inet_listen -> inet_csk_listen_start ->reqsk_queue_alloc
> > 
> > So to avoid deadlocks, always set PF_FSTRANS while holding sk_lock.
> > 
> > This deadlock was found by lockdep.
> 
> Wow, this is adding expensive stuff in fast path, only for nfsd :(

Yes, this was probably one part that I was least comfortable about.

> 
> BTW, why should the current->flags should be saved on a socket field,
> and not a current->save_flags. This really looks a thread property, not
> a socket one.
> 
> Why nfsd could not have PF_FSTRANS in its current->flags ?

nfsd does have PF_FSTRANS set in current->flags.  But some other processes
might not.

If any process takes sk_lock, allocates memory, and then blocks in reclaim it
could be waiting for nfsd.  If nfsd waits for that sk_lock, it would cause a
deadlock.

Thinking a bit more carefully .... I suspect that any socket that nfsd
created would only ever be locked by nfsd.  If that is the case then the
problem can be resolved entirely within nfsd.  We would need to tell lockdep
that there are two sorts of sk_locks, those which nfsd uses and all the
rest.  That might get a little messy, but wouldn't impact performance.

Is it justified to assume that sockets created by nfsd threads would only
ever be locked by nfsd threads (and interrupts, which won't be allocating
memory so don't matter), or might there be locked by other threads - e.g for
'netstat -a' etc??


> 
> For applications handling millions of sockets, this makes a difference.
> 

Thanks,
NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 10/19] NET: set PF_FSTRANS while holding sk_lock
  2014-04-16  5:13     ` Eric Dumazet
       [not found]       ` <1397625226.4222.113.camel-XN9IlZ5yJG9HTL0Zs8A6p/gx64E7kk8eUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
@ 2014-04-16 13:00       ` David Miller
  2014-04-17  2:38         ` NeilBrown
  1 sibling, 1 reply; 15+ messages in thread
From: David Miller @ 2014-04-16 13:00 UTC (permalink / raw)
  To: eric.dumazet; +Cc: neilb, linux-mm, linux-nfs, linux-kernel, xfs, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 15 Apr 2014 22:13:46 -0700

> For applications handling millions of sockets, this makes a difference.

Indeed, this really is not acceptable.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH/RFC 00/19] Support loop-back NFS mounts
  2014-04-16  4:03 [PATCH/RFC 00/19] Support loop-back NFS mounts NeilBrown
                   ` (2 preceding siblings ...)
  2014-04-16  4:03 ` [PATCH 10/19] NET: set PF_FSTRANS while holding sk_lock NeilBrown
@ 2014-04-16 14:42 ` Jeff Layton
       [not found]   ` <20140416104207.75b044e8-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
  3 siblings, 1 reply; 15+ messages in thread
From: Jeff Layton @ 2014-04-16 14:42 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-nfs, Peter Zijlstra, netdev, Ming Lei, linux-kernel, xfs,
	linux-mm, Ingo Molnar

On Wed, 16 Apr 2014 14:03:35 +1000
NeilBrown <neilb@suse.de> wrote:

> Loop-back NFS mounts are when the NFS client and server run on the
> same host.
> 
> The use-case for this is a high availability cluster with shared
> storage.  The shared filesystem is mounted on any one machine and
> NFS-mounted on the others.
> If the nfs server fails, some other node will take over that service,
> and then it will have a loop-back NFS mount which needs to keep
> working.
> 
> This patch set addresses the "keep working" bit and specifically
> addresses deadlocks and livelocks.
> Allowing the fail-over itself to be deadlock free is a separate
> challenge for another day.
> 
> The short description of how this works is:
> 
> deadlocks:
>   - Elevate PF_FSTRANS to apply globally instead of just in NFS and XFS.
>     PF_FSTRANS disables __GFP_NS in the same way that PF_MEMALLOC_NOIO
>     disables __GFP_IO.
>   - Set PF_FSTRANS in nfsd when handling requests related to
>     memory reclaim, or requests which could block requests related
>     to memory reclaim.
>   - Use lockdep to find all consequent deadlocks from some other
>     thread allocating memory while holding a lock that nfsd might
>     want.
>   - Fix those other deadlocks by setting PF_FSTRANS or using GFP_NOFS
>     as appropriate.
> 
> livelocks:
>   - identify throttling during reclaim and bypass it when
>     PF_LESS_THROTTLE is set
>   - only set PF_LESS_THROTTLE for nfsd when handling write requests
>     from the local host.
> 
> The last 12 patches address various deadlocks due to locking chains.
> 11 were found by lockdep, 2 by testing.  There is a reasonable chance
> that there are more, I just need to exercise more code while
> testing....
> 
> There is one issue that lockdep reports which I haven't fixed (I've
> just hacked the code out for my testing).  That issue relates to
> freeze_super().
> I may not be interpreting the lockdep reports perfectly, but I think
> they are basically saying that if I were to freeze a filesystem that
> was exported to the local host, then we could end up deadlocking.
> This is to be expected.  The NFS filesystem would need to be frozen
> first.  I don't know how to tell lockdep that I know that is a problem
> and I don't want to be warned about it.  Suggestions welcome.
> Until this is addressed I cannot really ask others to test the code
> with lockdep enabled.
> 
> There are more subsidiary places that I needed to add PF_FSTRANS than
> I would have liked.  The thought keeps crossing my mind that maybe we
> can get rid of __GFP_FS and require that memory reclaim never ever
> block on a filesystem.  Then most of these patches go away.
> 
> Now that writeback doesn't happen from reclaim (but from kswapd) much
> of the calls from reclaim to FS are gone.
> The ->releasepage call is the only one that I *know* causes me
> problems so I'd like to just say that that must never block.  I don't
> really understand the consequences of that though.
> There are a couple of other places where __GFP_FS is used and I'd need
> to carefully analyze those.  But if someone just said "no, that is
> impossible", I could be happy and stick with the current approach....
> 
> I've cc:ed Peter Zijlstra and Ingo Molnar only on the lockdep-related
> patches, Ming Lei only on the PF_MEMALLOC_NOIO related patches,
> and net-dev only on the network-related patches.
> There are probably other people I should CC.  Apologies if I missed you.
> I'll ensure better coverage if the nfs/mm/xfs people are reasonably happy.
> 
> Comments, criticisms, etc most welcome.
> 
> Thanks,
> NeilBrown
> 

I've only given this a once-over, but the basic concept seems a bit
flawed. IIUC, the basic idea is to disallow allocations done in knfsd
threads context from doing fs-based reclaim.

This seems very heavy-handed, and like it could cause problems on a
busy NFS server. Those sorts of servers are likely to have a lot of
data in pagecache and thus we generally want to allow them to do do
writeback when memory is tight.

It's generally acceptable for knfsd to recurse into local filesystem
code for writeback. What you want to avoid in this situation is reclaim
on NFS filesystems that happen to be from knfsd on the same box.

If you really want to fix this, what may make more sense is trying to
plumb that information down more granularly. Maybe GFP_NONETFS and/or
PF_NETFSTRANS flags?

> 
> ---
> 
> NeilBrown (19):
>       Promote current_{set,restore}_flags_nested from xfs to global.
>       lockdep: lockdep_set_current_reclaim_state should save old value
>       lockdep: improve scenario messages for RECLAIM_FS errors.
>       Make effect of PF_FSTRANS to disable __GFP_FS universal.
>       SUNRPC: track whether a request is coming from a loop-back interface.
>       nfsd: set PF_FSTRANS for nfsd threads.
>       nfsd and VM: use PF_LESS_THROTTLE to avoid throttle in shrink_inactive_list.
>       Set PF_FSTRANS while write_cache_pages calls ->writepage
>       XFS: ensure xfs_file_*_read cannot deadlock in memory allocation.
>       NET: set PF_FSTRANS while holding sk_lock
>       FS: set PF_FSTRANS while holding mmap_sem in exec.c
>       NET: set PF_FSTRANS while holding rtnl_lock
>       MM: set PF_FSTRANS while allocating per-cpu memory to avoid deadlock.
>       driver core: set PF_FSTRANS while holding gdp_mutex
>       nfsd: set PF_FSTRANS when client_mutex is held.
>       VFS: use GFP_NOFS rather than GFP_KERNEL in __d_alloc.
>       VFS: set PF_FSTRANS while namespace_sem is held.
>       nfsd: set PF_FSTRANS during nfsd4_do_callback_rpc.
>       XFS: set PF_FSTRANS while ilock is held in xfs_free_eofblocks
> 
> 
>  drivers/base/core.c             |    3 ++
>  drivers/base/power/runtime.c    |    6 ++---
>  drivers/block/nbd.c             |    6 ++---
>  drivers/md/dm-bufio.c           |    6 ++---
>  drivers/md/dm-ioctl.c           |    6 ++---
>  drivers/mtd/nand/nandsim.c      |   28 ++++++---------------
>  drivers/scsi/iscsi_tcp.c        |    6 ++---
>  drivers/usb/core/hub.c          |    6 ++---
>  fs/dcache.c                     |    4 ++-
>  fs/exec.c                       |    6 +++++
>  fs/fs-writeback.c               |    5 ++--
>  fs/namespace.c                  |    4 +++
>  fs/nfs/file.c                   |    3 +-
>  fs/nfsd/nfs4callback.c          |    5 ++++
>  fs/nfsd/nfs4state.c             |    3 ++
>  fs/nfsd/nfssvc.c                |   24 ++++++++++++++----
>  fs/nfsd/vfs.c                   |    6 +++++
>  fs/xfs/kmem.h                   |    2 --
>  fs/xfs/xfs_aops.c               |    7 -----
>  fs/xfs/xfs_bmap_util.c          |    4 +++
>  fs/xfs/xfs_file.c               |   12 +++++++++
>  fs/xfs/xfs_linux.h              |    7 -----
>  include/linux/lockdep.h         |    8 +++---
>  include/linux/sched.h           |   32 +++++++++---------------
>  include/linux/sunrpc/svc.h      |    2 ++
>  include/linux/sunrpc/svc_xprt.h |    1 +
>  include/net/sock.h              |    1 +
>  kernel/locking/lockdep.c        |   51 ++++++++++++++++++++++++++++-----------
>  kernel/softirq.c                |    6 ++---
>  mm/migrate.c                    |    9 +++----
>  mm/page-writeback.c             |    3 ++
>  mm/page_alloc.c                 |   18 ++++++++------
>  mm/percpu.c                     |    4 +++
>  mm/slab.c                       |    2 ++
>  mm/slob.c                       |    2 ++
>  mm/slub.c                       |    1 +
>  mm/vmscan.c                     |   31 +++++++++++++++---------
>  net/core/dev.c                  |    6 ++---
>  net/core/rtnetlink.c            |    9 ++++++-
>  net/core/sock.c                 |    8 ++++--
>  net/sunrpc/sched.c              |    5 ++--
>  net/sunrpc/svc.c                |    6 +++++
>  net/sunrpc/svcsock.c            |   10 ++++++++
>  net/sunrpc/xprtrdma/transport.c |    5 ++--
>  net/sunrpc/xprtsock.c           |   17 ++++++++-----
>  45 files changed, 247 insertions(+), 149 deletions(-)
> 


-- 
Jeff Layton <jlayton@redhat.com>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 05/19] SUNRPC: track whether a request is coming from a loop-back interface.
  2014-04-16  4:03 ` [PATCH 05/19] SUNRPC: track whether a request is coming from a loop-back interface NeilBrown
@ 2014-04-16 14:47   ` Jeff Layton
  2014-04-16 23:25     ` NeilBrown
  0 siblings, 1 reply; 15+ messages in thread
From: Jeff Layton @ 2014-04-16 14:47 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-mm, linux-nfs, linux-kernel, xfs, netdev

On Wed, 16 Apr 2014 14:03:36 +1000
NeilBrown <neilb@suse.de> wrote:

> If an incoming NFS request is coming from the local host, then
> nfsd will need to perform some special handling.  So detect that
> possibility and make the source visible in rq_local.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  include/linux/sunrpc/svc.h      |    1 +
>  include/linux/sunrpc/svc_xprt.h |    1 +
>  net/sunrpc/svcsock.c            |   10 ++++++++++
>  3 files changed, 12 insertions(+)
> 
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index 04e763221246..a0dbbd1e00e9 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -254,6 +254,7 @@ struct svc_rqst {
>  	u32			rq_prot;	/* IP protocol */
>  	unsigned short
>  				rq_secure  : 1;	/* secure port */
> +	unsigned short		rq_local   : 1;	/* local request */
>  
>  	void *			rq_argp;	/* decoded arguments */
>  	void *			rq_resp;	/* xdr'd results */
> diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
> index b05963f09ebf..b99bdfb0fcf9 100644
> --- a/include/linux/sunrpc/svc_xprt.h
> +++ b/include/linux/sunrpc/svc_xprt.h
> @@ -63,6 +63,7 @@ struct svc_xprt {
>  #define	XPT_DETACHED	10		/* detached from tempsocks list */
>  #define XPT_LISTENER	11		/* listening endpoint */
>  #define XPT_CACHE_AUTH	12		/* cache auth info */
> +#define XPT_LOCAL	13		/* connection from loopback interface */
>  
>  	struct svc_serv		*xpt_server;	/* service for transport */
>  	atomic_t    	    	xpt_reserved;	/* space on outq that is rsvd */
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index b6e59f0a9475..193115fe968c 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -811,6 +811,7 @@ static struct svc_xprt *svc_tcp_accept(struct svc_xprt *xprt)
>  	struct socket	*newsock;
>  	struct svc_sock	*newsvsk;
>  	int		err, slen;
> +	struct dst_entry *dst;
>  	RPC_IFDEBUG(char buf[RPC_MAX_ADDRBUFLEN]);
>  
>  	dprintk("svc: tcp_accept %p sock %p\n", svsk, sock);
> @@ -867,6 +868,14 @@ static struct svc_xprt *svc_tcp_accept(struct svc_xprt *xprt)
>  	}
>  	svc_xprt_set_local(&newsvsk->sk_xprt, sin, slen);
>  
> +	clear_bit(XPT_LOCAL, &newsvsk->sk_xprt.xpt_flags);
> +	rcu_read_lock();
> +	dst = rcu_dereference(newsock->sk->sk_dst_cache);
> +	if (dst && dst->dev &&
> +	    (dst->dev->features & NETIF_F_LOOPBACK))
> +		set_bit(XPT_LOCAL, &newsvsk->sk_xprt.xpt_flags);
> +	rcu_read_unlock();
> +

In the use-case you describe, the client is unlikely to have mounted
"localhost", but is more likely to be mounting a floating address in
the cluster.

Will this flag end up being set in such a situation? It looks like
NETIF_F_LOOPBACK isn't set on all adapters -- mostly on "lo" and a few
others that support MAC-LOOPBACK.


>  	if (serv->sv_stats)
>  		serv->sv_stats->nettcpconn++;
>  
> @@ -1112,6 +1121,7 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
>  
>  	rqstp->rq_xprt_ctxt   = NULL;
>  	rqstp->rq_prot	      = IPPROTO_TCP;
> +	rqstp->rq_local	      = !!test_bit(XPT_LOCAL, &svsk->sk_xprt.xpt_flags);
>  
>  	p = (__be32 *)rqstp->rq_arg.head[0].iov_base;
>  	calldir = p[1];
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Jeff Layton <jlayton@poochiereds.net>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 05/19] SUNRPC: track whether a request is coming from a loop-back interface.
  2014-04-16 14:47   ` Jeff Layton
@ 2014-04-16 23:25     ` NeilBrown
  0 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2014-04-16 23:25 UTC (permalink / raw)
  To: Jeff Layton; +Cc: linux-mm, linux-nfs, linux-kernel, xfs, netdev

[-- Attachment #1: Type: text/plain, Size: 4281 bytes --]

On Wed, 16 Apr 2014 10:47:02 -0400 Jeff Layton <jlayton@poochiereds.net>
wrote:

> On Wed, 16 Apr 2014 14:03:36 +1000
> NeilBrown <neilb@suse.de> wrote:
> 
> > If an incoming NFS request is coming from the local host, then
> > nfsd will need to perform some special handling.  So detect that
> > possibility and make the source visible in rq_local.
> > 
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  include/linux/sunrpc/svc.h      |    1 +
> >  include/linux/sunrpc/svc_xprt.h |    1 +
> >  net/sunrpc/svcsock.c            |   10 ++++++++++
> >  3 files changed, 12 insertions(+)
> > 
> > diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> > index 04e763221246..a0dbbd1e00e9 100644
> > --- a/include/linux/sunrpc/svc.h
> > +++ b/include/linux/sunrpc/svc.h
> > @@ -254,6 +254,7 @@ struct svc_rqst {
> >  	u32			rq_prot;	/* IP protocol */
> >  	unsigned short
> >  				rq_secure  : 1;	/* secure port */
> > +	unsigned short		rq_local   : 1;	/* local request */
> >  
> >  	void *			rq_argp;	/* decoded arguments */
> >  	void *			rq_resp;	/* xdr'd results */
> > diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
> > index b05963f09ebf..b99bdfb0fcf9 100644
> > --- a/include/linux/sunrpc/svc_xprt.h
> > +++ b/include/linux/sunrpc/svc_xprt.h
> > @@ -63,6 +63,7 @@ struct svc_xprt {
> >  #define	XPT_DETACHED	10		/* detached from tempsocks list */
> >  #define XPT_LISTENER	11		/* listening endpoint */
> >  #define XPT_CACHE_AUTH	12		/* cache auth info */
> > +#define XPT_LOCAL	13		/* connection from loopback interface */
> >  
> >  	struct svc_serv		*xpt_server;	/* service for transport */
> >  	atomic_t    	    	xpt_reserved;	/* space on outq that is rsvd */
> > diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> > index b6e59f0a9475..193115fe968c 100644
> > --- a/net/sunrpc/svcsock.c
> > +++ b/net/sunrpc/svcsock.c
> > @@ -811,6 +811,7 @@ static struct svc_xprt *svc_tcp_accept(struct svc_xprt *xprt)
> >  	struct socket	*newsock;
> >  	struct svc_sock	*newsvsk;
> >  	int		err, slen;
> > +	struct dst_entry *dst;
> >  	RPC_IFDEBUG(char buf[RPC_MAX_ADDRBUFLEN]);
> >  
> >  	dprintk("svc: tcp_accept %p sock %p\n", svsk, sock);
> > @@ -867,6 +868,14 @@ static struct svc_xprt *svc_tcp_accept(struct svc_xprt *xprt)
> >  	}
> >  	svc_xprt_set_local(&newsvsk->sk_xprt, sin, slen);
> >  
> > +	clear_bit(XPT_LOCAL, &newsvsk->sk_xprt.xpt_flags);
> > +	rcu_read_lock();
> > +	dst = rcu_dereference(newsock->sk->sk_dst_cache);
> > +	if (dst && dst->dev &&
> > +	    (dst->dev->features & NETIF_F_LOOPBACK))
> > +		set_bit(XPT_LOCAL, &newsvsk->sk_xprt.xpt_flags);
> > +	rcu_read_unlock();
> > +
> 
> In the use-case you describe, the client is unlikely to have mounted
> "localhost", but is more likely to be mounting a floating address in
> the cluster.
> 
> Will this flag end up being set in such a situation? It looks like
> NETIF_F_LOOPBACK isn't set on all adapters -- mostly on "lo" and a few
> others that support MAC-LOOPBACK.

I was hoping someone on net-dev would have commented if it didn't work - I
probably should have explicitly asked.

My testing certainly suggests that this works if I use any local address to
perform the mount, not just 127.0.0.1.
I don't know enough about routing and neighbours and the dst cache to be
certain but my shallow understanding was always that traffic to any local
address was pseudo-routed through the lo interface and would never get
anywhere near e.g. the eth0 interface.

Can any network people confirm?

Thanks,
NeilBrown


> 
> 
> >  	if (serv->sv_stats)
> >  		serv->sv_stats->nettcpconn++;
> >  
> > @@ -1112,6 +1121,7 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
> >  
> >  	rqstp->rq_xprt_ctxt   = NULL;
> >  	rqstp->rq_prot	      = IPPROTO_TCP;
> > +	rqstp->rq_local	      = !!test_bit(XPT_LOCAL, &svsk->sk_xprt.xpt_flags);
> >  
> >  	p = (__be32 *)rqstp->rq_arg.head[0].iov_base;
> >  	calldir = p[1];
> > 
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH/RFC 00/19] Support loop-back NFS mounts
       [not found]   ` <20140416104207.75b044e8-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
@ 2014-04-17  0:20     ` NeilBrown
  2014-04-17  1:27       ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: NeilBrown @ 2014-04-17  0:20 UTC (permalink / raw)
  To: Jeff Layton
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, xfs-VZNHf3L845pBDgjK7y7TUQ,
	Peter Zijlstra, Ingo Molnar, Ming Lei,
	netdev-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 3316 bytes --]

On Wed, 16 Apr 2014 10:42:07 -0400 Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:

> On Wed, 16 Apr 2014 14:03:35 +1000
> NeilBrown <neilb-l3A5Bk7waGM@public.gmane.org> wrote:
> 

> > Comments, criticisms, etc most welcome.
> > 
> > Thanks,
> > NeilBrown
> > 
> 
> I've only given this a once-over, but the basic concept seems a bit
> flawed. IIUC, the basic idea is to disallow allocations done in knfsd
> threads context from doing fs-based reclaim.
> 
> This seems very heavy-handed, and like it could cause problems on a
> busy NFS server. Those sorts of servers are likely to have a lot of
> data in pagecache and thus we generally want to allow them to do do
> writeback when memory is tight.
> 
> It's generally acceptable for knfsd to recurse into local filesystem
> code for writeback. What you want to avoid in this situation is reclaim
> on NFS filesystems that happen to be from knfsd on the same box.
> 
> If you really want to fix this, what may make more sense is trying to
> plumb that information down more granularly. Maybe GFP_NONETFS and/or
> PF_NETFSTRANS flags?

Hi Jeff,
 a few clarifications first:

 1/ These changes probably won't affect a "busy NFS server" at all.  The
    PF_FSTRANS flag only get set in nfsd when it sees a request from the local
    host.  Most busy NFS servers would never see that, and so would never set
    PF_FSTRANS.

 2/ Setting PF_FSTRANS does not affect where writeback is done.  Direct
    reclaim hasn't performed filesystem writeback since 3.2, it is all done
    by kswapd (I think direct reclaim still writes to swap sometimes).
    The main effects of setting PF_FSTRANS (as modified by this page set)
    are:
      - when reclaim calls ->releasepage  __GFP_FS is not set in the gfp_t arg
      - various caches like dcache, icache etc are not shrunk from
        direct reclaim
    There are other effects, but I'm less clear on exactly what they mean.

A flag specific to network filesystems might make sense, but I don't think it
would solve all the deadlocks.

A good example is the deadlock with the flush-* threads.
flush-* will lock a page, and  then call ->writepage.  If ->writepage
allocates memory it can enter reclaim, call ->releasepage on NFS, and block
waiting for a COMMIT to complete.
The COMMIT might already be running, performing fsync on that same file that
flush-* is flushing.  It locks each page in turn.  When it  gets to the page
that flush-* has locked, it will deadlock.

xfs_vm_writepage does allocate memory with __GFP_FS set
   xfs_vm_writepage -> xfs_setfilesize_trans_alloc -> xfs_trans_alloc ->
   _xfs_trans_allo

and I have had this deadlock happen.  To avoid this we need flush-* to ensure
that no memory allocation blocks on NFS.  We could set a PF_NETFSTRANS there,
but as that code really has nothing to do with networks it would seem an odd
place to put a network-fs-specific flag.

In general, if nfsd is allowed to block on local filesystem, and local
filesystem is allowed to block on NFS, then a deadlock can happen.
We would need a clear hierarchy

   __GFP_NETFS > __GFP_FS > __GFP_IO

for it to work.  I'm not sure the extra level really helps a lot and it would
be a lot of churn.


Thanks,
NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH/RFC 00/19] Support loop-back NFS mounts
  2014-04-17  0:20     ` NeilBrown
@ 2014-04-17  1:27       ` Dave Chinner
  2014-04-17  1:50         ` NeilBrown
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2014-04-17  1:27 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jeff Layton, linux-nfs, Peter Zijlstra, netdev, Ming Lei,
	linux-kernel, xfs, linux-mm, Ingo Molnar

On Thu, Apr 17, 2014 at 10:20:48AM +1000, NeilBrown wrote:
> A good example is the deadlock with the flush-* threads.
> flush-* will lock a page, and  then call ->writepage.  If ->writepage
> allocates memory it can enter reclaim, call ->releasepage on NFS, and block
> waiting for a COMMIT to complete.
> The COMMIT might already be running, performing fsync on that same file that
> flush-* is flushing.  It locks each page in turn.  When it  gets to the page
> that flush-* has locked, it will deadlock.

It's nfs_release_page() again....

> In general, if nfsd is allowed to block on local filesystem, and local
> filesystem is allowed to block on NFS, then a deadlock can happen.
> We would need a clear hierarchy
> 
>    __GFP_NETFS > __GFP_FS > __GFP_IO
> 
> for it to work.  I'm not sure the extra level really helps a lot and it would
> be a lot of churn.

I think you are looking at this the wrong way - it's not the other
filesystems that have to avoid memory reclaim recursion, it's the
NFS client mount that is on loopback that needs to avoid recursion.

IMO, the fix should be that the NFS client cannot block on messages sent to the NFSD
on the same host during memory reclaim. That is, nfs_release_page()
cannot send commit messages to the server if the server is on
localhost. Instead, it just tells memory reclaim that it can't
reclaim that page.

If nfs_release_page() no longer blocks in memory reclaim, and all
these nfsd-gets-blocked-in-GFP_KERNEL-memory-allocation recursion
problems go away. Do the same for all the other memory reclaim
operations in the NFS client, and you've got a solution that should
work without needing to walk all over the rest of the kernel....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH/RFC 00/19] Support loop-back NFS mounts
  2014-04-17  1:27       ` Dave Chinner
@ 2014-04-17  1:50         ` NeilBrown
  2014-04-17  4:23           ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: NeilBrown @ 2014-04-17  1:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jeff Layton, linux-nfs-u79uwXL29TY76Z2rM5mHXA, Peter Zijlstra,
	netdev-u79uwXL29TY76Z2rM5mHXA, Ming Lei,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, xfs-VZNHf3L845pBDgjK7y7TUQ,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Ingo Molnar

[-- Attachment #1: Type: text/plain, Size: 2596 bytes --]

On Thu, 17 Apr 2014 11:27:39 +1000 Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> wrote:

> On Thu, Apr 17, 2014 at 10:20:48AM +1000, NeilBrown wrote:
> > A good example is the deadlock with the flush-* threads.
> > flush-* will lock a page, and  then call ->writepage.  If ->writepage
> > allocates memory it can enter reclaim, call ->releasepage on NFS, and block
> > waiting for a COMMIT to complete.
> > The COMMIT might already be running, performing fsync on that same file that
> > flush-* is flushing.  It locks each page in turn.  When it  gets to the page
> > that flush-* has locked, it will deadlock.
> 
> It's nfs_release_page() again....
> 
> > In general, if nfsd is allowed to block on local filesystem, and local
> > filesystem is allowed to block on NFS, then a deadlock can happen.
> > We would need a clear hierarchy
> > 
> >    __GFP_NETFS > __GFP_FS > __GFP_IO
> > 
> > for it to work.  I'm not sure the extra level really helps a lot and it would
> > be a lot of churn.
> 
> I think you are looking at this the wrong way - it's not the other
> filesystems that have to avoid memory reclaim recursion, it's the
> NFS client mount that is on loopback that needs to avoid recursion.
> 
> IMO, the fix should be that the NFS client cannot block on messages sent to the NFSD
> on the same host during memory reclaim. That is, nfs_release_page()
> cannot send commit messages to the server if the server is on
> localhost. Instead, it just tells memory reclaim that it can't
> reclaim that page.
> 
> If nfs_release_page() no longer blocks in memory reclaim, and all
> these nfsd-gets-blocked-in-GFP_KERNEL-memory-allocation recursion
> problems go away. Do the same for all the other memory reclaim
> operations in the NFS client, and you've got a solution that should
> work without needing to walk all over the rest of the kernel....

Maybe.
It is nfs_release_page() today. I wonder if it could be other things another
day.  I want to be sure I have a solution that really makes sense.

However ... the thing that nfs_release_page is doing it sending a COMMIT to
tell the server to flush to stable storage.  It does that so that if the
server crashes, then the client can re-send.
Of course when it is a loop-back mount the client is the server so the COMMIT
is completely pointless.  If the client notices that it is sending a COMMIT
to itself, it can simply assume a positive reply.

You are right, that would make the patch set a lot less intrusive.  I'll give
it some serious thought - thanks.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 10/19] NET: set PF_FSTRANS while holding sk_lock
  2014-04-16 13:00       ` David Miller
@ 2014-04-17  2:38         ` NeilBrown
  0 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2014-04-17  2:38 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, linux-mm, linux-nfs, linux-kernel, xfs, netdev

[-- Attachment #1: Type: text/plain, Size: 757 bytes --]

On Wed, 16 Apr 2014 09:00:02 -0400 (EDT) David Miller <davem@davemloft.net>
wrote:

> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 15 Apr 2014 22:13:46 -0700
> 
> > For applications handling millions of sockets, this makes a difference.
> 
> Indeed, this really is not acceptable.

As you say...
I've just discovered that I can get rid of the lockdep message (and hence
presumably the deadlock risk) with a well placed:

		newsock->sk->sk_allocation = GFP_NOFS;

which surprised me as it seemed to be an explicit GFP_KERNEL allocation that
was mentioned in the lockdep trace.  Obviously these traces require quite
some sophistication to understand.

So - thanks for the feedback, patch can be ignored.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH/RFC 00/19] Support loop-back NFS mounts
  2014-04-17  1:50         ` NeilBrown
@ 2014-04-17  4:23           ` Dave Chinner
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2014-04-17  4:23 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jeff Layton, linux-nfs, Peter Zijlstra, netdev, Ming Lei,
	linux-kernel, xfs, linux-mm, Ingo Molnar

On Thu, Apr 17, 2014 at 11:50:18AM +1000, NeilBrown wrote:
> On Thu, 17 Apr 2014 11:27:39 +1000 Dave Chinner <david@fromorbit.com> wrote:
> 
> > On Thu, Apr 17, 2014 at 10:20:48AM +1000, NeilBrown wrote:
> > > A good example is the deadlock with the flush-* threads.
> > > flush-* will lock a page, and  then call ->writepage.  If ->writepage
> > > allocates memory it can enter reclaim, call ->releasepage on NFS, and block
> > > waiting for a COMMIT to complete.
> > > The COMMIT might already be running, performing fsync on that same file that
> > > flush-* is flushing.  It locks each page in turn.  When it  gets to the page
> > > that flush-* has locked, it will deadlock.
> > 
> > It's nfs_release_page() again....
> > 
> > > In general, if nfsd is allowed to block on local filesystem, and local
> > > filesystem is allowed to block on NFS, then a deadlock can happen.
> > > We would need a clear hierarchy
> > > 
> > >    __GFP_NETFS > __GFP_FS > __GFP_IO
> > > 
> > > for it to work.  I'm not sure the extra level really helps a lot and it would
> > > be a lot of churn.
> > 
> > I think you are looking at this the wrong way - it's not the other
> > filesystems that have to avoid memory reclaim recursion, it's the
> > NFS client mount that is on loopback that needs to avoid recursion.
> > 
> > IMO, the fix should be that the NFS client cannot block on messages sent to the NFSD
> > on the same host during memory reclaim. That is, nfs_release_page()
> > cannot send commit messages to the server if the server is on
> > localhost. Instead, it just tells memory reclaim that it can't
> > reclaim that page.
> > 
> > If nfs_release_page() no longer blocks in memory reclaim, and all
> > these nfsd-gets-blocked-in-GFP_KERNEL-memory-allocation recursion
> > problems go away. Do the same for all the other memory reclaim
> > operations in the NFS client, and you've got a solution that should
> > work without needing to walk all over the rest of the kernel....
> 
> Maybe.
> It is nfs_release_page() today. I wonder if it could be other things another
> day.  I want to be sure I have a solution that really makes sense.

There could be other things, but in the absence of those things,
I don't think that adding another layer to memory reclaim
dependencies for this niche corner case makes a lot of sense. ;)

> However ... the thing that nfs_release_page is doing it sending a COMMIT to
> tell the server to flush to stable storage.  It does that so that if the
> server crashes, then the client can re-send.
> Of course when it is a loop-back mount the client is the server so the COMMIT
> is completely pointless.  If the client notices that it is sending a COMMIT
> to itself, it can simply assume a positive reply.

Yes, that's very true. You might have to treat ->writepage
specially, too, if that can block, say, on the number of outstanding
requests that can be sent to the server.

> You are right, that would make the patch set a lot less intrusive.  I'll give
> it some serious thought - thanks.

No worries. :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2014-04-17  4:23 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-04-16  4:03 [PATCH/RFC 00/19] Support loop-back NFS mounts NeilBrown
2014-04-16  4:03 ` [PATCH 05/19] SUNRPC: track whether a request is coming from a loop-back interface NeilBrown
2014-04-16 14:47   ` Jeff Layton
2014-04-16 23:25     ` NeilBrown
2014-04-16  4:03 ` [PATCH 12/19] NET: set PF_FSTRANS while holding rtnl_lock NeilBrown
2014-04-16  4:03 ` [PATCH 10/19] NET: set PF_FSTRANS while holding sk_lock NeilBrown
     [not found]   ` <20140416040336.10604.96000.stgit-wvvUuzkyo1EYVZTmpyfIwg@public.gmane.org>
2014-04-16  5:13     ` Eric Dumazet
     [not found]       ` <1397625226.4222.113.camel-XN9IlZ5yJG9HTL0Zs8A6p/gx64E7kk8eUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
2014-04-16  5:47         ` NeilBrown
2014-04-16 13:00       ` David Miller
2014-04-17  2:38         ` NeilBrown
2014-04-16 14:42 ` [PATCH/RFC 00/19] Support loop-back NFS mounts Jeff Layton
     [not found]   ` <20140416104207.75b044e8-9yPaYZwiELC+kQycOl6kW4xkIHaj4LzF@public.gmane.org>
2014-04-17  0:20     ` NeilBrown
2014-04-17  1:27       ` Dave Chinner
2014-04-17  1:50         ` NeilBrown
2014-04-17  4:23           ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).