* [PATCH 000 of 11] knfsd: Introduction - Make knfsd more NUMA-aware
@ 2006-07-31 0:41 NeilBrown
2006-07-31 0:41 ` [PATCH 001 of 11] knfsd: move tempsock aging to a timer NeilBrown
` (10 more replies)
0 siblings, 11 replies; 25+ messages in thread
From: NeilBrown @ 2006-07-31 0:41 UTC (permalink / raw)
To: Andrew Morton; +Cc: nfs, linux-kernel
Following are 11 patches from Greg Banks which combine to make knfsd
more NUMA-aware. They reduce contention on 'global' data structures and
create some data structures that can be node-local.
knfsd threads are bound to a particular node, and the thread to handle
a new request is chosen from the threads that are attached to the node
that received the interrupt.
The distribution of threads across nodes can be controlled by a new
file in the 'nfsd' filesystem, though the default approach of an even
spread is probably fine for most sites.
Some (old) numbers that show the efficacy of these patches:
N == number of NICs == number of CPUs == number of clients.
Number of NUMA nodes == N/2
  N    Throughput, MiB/s     CPU usage, % (max = N*100)
        Before    After        Before    After
 ---    ------    -----        ------    -----
  4      312       435          350       228
  6      500       656          501       418
  8      562       804          690       589
[PATCH 001 of 11] knfsd: move tempsock aging to a timer
[PATCH 002 of 11] knfsd: convert sk_inuse to atomic_t
[PATCH 003 of 11] knfsd: use new lock for svc_sock deferred list
[PATCH 004 of 11] knfsd: convert sk_reserved to atomic_t
[PATCH 005 of 11] knfsd: test and set SK_BUSY atomically
[PATCH 006 of 11] knfsd: split svc_serv into pools
[PATCH 007 of 11] knfsd: add svc_get
[PATCH 008 of 11] knfsd: add svc_set_num_threads
[PATCH 009 of 11] knfsd: use svc_set_num_threads to manage threads in knfsd
[PATCH 010 of 11] knfsd: make rpc threads pools numa aware
[PATCH 011 of 11] knfsd: allow admin to set nthreads per node
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 001 of 11] knfsd: move tempsock aging to a timer
2006-07-31 0:41 [PATCH 000 of 11] knfsd: Introduction - Make knfsd more NUMA-aware NeilBrown
@ 2006-07-31 0:41 ` NeilBrown
2006-07-31 0:41 ` [PATCH 002 of 11] knfsd: convert sk_inuse to atomic_t NeilBrown
` (9 subsequent siblings)
10 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2006-07-31 0:41 UTC (permalink / raw)
To: Andrew Morton; +Cc: nfs, linux-kernel
From: Greg Banks <gnb@melbourne.sgi.com>
knfsd: Move the aging of RPC/TCP connection sockets from the main
svc_recv() loop to a timer which uses a mark-and-sweep algorithm
every 6 minutes. This reduces the amount of work that needs to be
done in the main RPC loop and the length of time we need to hold
the (effectively global) svc_serv->sv_lock.
Signed-off-by: Greg Banks <gnb@melbourne.sgi.com>
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./include/linux/sunrpc/svc.h | 1
./include/linux/sunrpc/svcsock.h | 2
./net/sunrpc/svc.c | 3 +
./net/sunrpc/svcsock.c | 96 ++++++++++++++++++++++++++++-----------
4 files changed, 76 insertions(+), 26 deletions(-)
diff .prev/include/linux/sunrpc/svc.h ./include/linux/sunrpc/svc.h
--- .prev/include/linux/sunrpc/svc.h 2006-07-28 11:54:16.000000000 +1000
+++ ./include/linux/sunrpc/svc.h 2006-07-31 09:56:44.000000000 +1000
@@ -40,6 +40,7 @@ struct svc_serv {
struct list_head sv_permsocks; /* all permanent sockets */
struct list_head sv_tempsocks; /* all temporary sockets */
int sv_tmpcnt; /* count of temporary sockets */
+ struct timer_list sv_temptimer; /* timer for aging temporary sockets */
char * sv_name; /* service name */
diff .prev/include/linux/sunrpc/svcsock.h ./include/linux/sunrpc/svcsock.h
--- .prev/include/linux/sunrpc/svcsock.h 2006-07-28 11:54:40.000000000 +1000
+++ ./include/linux/sunrpc/svcsock.h 2006-07-31 09:56:44.000000000 +1000
@@ -31,6 +31,8 @@ struct svc_sock {
#define SK_DEAD 6 /* socket closed */
#define SK_CHNGBUF 7 /* need to change snd/rcv buffer sizes */
#define SK_DEFERRED 8 /* request on sk_deferred */
+#define SK_OLD 9 /* used for temp socket aging mark+sweep */
+#define SK_DETACHED 10 /* detached from tempsocks list */
int sk_reserved; /* space on outq that is reserved */
diff .prev/net/sunrpc/svc.c ./net/sunrpc/svc.c
--- .prev/net/sunrpc/svc.c 2006-07-28 11:53:47.000000000 +1000
+++ ./net/sunrpc/svc.c 2006-07-31 09:56:44.000000000 +1000
@@ -59,6 +59,7 @@ svc_create(struct svc_program *prog, uns
INIT_LIST_HEAD(&serv->sv_sockets);
INIT_LIST_HEAD(&serv->sv_tempsocks);
INIT_LIST_HEAD(&serv->sv_permsocks);
+ init_timer(&serv->sv_temptimer);
spin_lock_init(&serv->sv_lock);
/* Remove any stale portmap registrations */
@@ -87,6 +88,8 @@ svc_destroy(struct svc_serv *serv)
} else
printk("svc_destroy: no threads for serv=%p!\n", serv);
+ del_timer_sync(&serv->sv_temptimer);
+
while (!list_empty(&serv->sv_tempsocks)) {
svsk = list_entry(serv->sv_tempsocks.next,
struct svc_sock,
diff .prev/net/sunrpc/svcsock.c ./net/sunrpc/svcsock.c
--- .prev/net/sunrpc/svcsock.c 2006-07-28 11:55:15.000000000 +1000
+++ ./net/sunrpc/svcsock.c 2006-07-31 09:56:44.000000000 +1000
@@ -74,6 +74,13 @@ static struct svc_deferred_req *svc_defe
static int svc_deferred_recv(struct svc_rqst *rqstp);
static struct cache_deferred_req *svc_defer(struct cache_req *req);
+/* apparently the "standard" is that clients close
+ * idle connections after 5 minutes, servers after
+ * 6 minutes
+ * http://www.connectathon.org/talks96/nfstcp.pdf
+ */
+static int svc_conn_age_period = 6*60;
+
/*
* Queue up an idle server thread. Must have serv->sv_lock held.
* Note: this is really a stack rather than a queue, so that we only
@@ -1230,24 +1237,7 @@ svc_recv(struct svc_rqst *rqstp, long ti
return -EINTR;
spin_lock_bh(&serv->sv_lock);
- if (!list_empty(&serv->sv_tempsocks)) {
- svsk = list_entry(serv->sv_tempsocks.next,
- struct svc_sock, sk_list);
- /* apparently the "standard" is that clients close
- * idle connections after 5 minutes, servers after
- * 6 minutes
- * http://www.connectathon.org/talks96/nfstcp.pdf
- */
- if (get_seconds() - svsk->sk_lastrecv < 6*60
- || test_bit(SK_BUSY, &svsk->sk_flags))
- svsk = NULL;
- }
- if (svsk) {
- set_bit(SK_BUSY, &svsk->sk_flags);
- set_bit(SK_CLOSE, &svsk->sk_flags);
- rqstp->rq_sock = svsk;
- svsk->sk_inuse++;
- } else if ((svsk = svc_sock_dequeue(serv)) != NULL) {
+ if ((svsk = svc_sock_dequeue(serv)) != NULL) {
rqstp->rq_sock = svsk;
svsk->sk_inuse++;
rqstp->rq_reserved = serv->sv_bufsz;
@@ -1292,13 +1282,7 @@ svc_recv(struct svc_rqst *rqstp, long ti
return -EAGAIN;
}
svsk->sk_lastrecv = get_seconds();
- if (test_bit(SK_TEMP, &svsk->sk_flags)) {
- /* push active sockets to end of list */
- spin_lock_bh(&serv->sv_lock);
- if (!list_empty(&svsk->sk_list))
- list_move_tail(&svsk->sk_list, &serv->sv_tempsocks);
- spin_unlock_bh(&serv->sv_lock);
- }
+ clear_bit(SK_OLD, &svsk->sk_flags);
rqstp->rq_secure = ntohs(rqstp->rq_addr.sin_port) < 1024;
rqstp->rq_chandle.defer = svc_defer;
@@ -1358,6 +1342,58 @@ svc_send(struct svc_rqst *rqstp)
}
/*
+ * Timer function to close old temporary sockets, using
+ * a mark-and-sweep algorithm.
+ */
+static void
+svc_age_temp_sockets(unsigned long closure)
+{
+ struct svc_serv *serv = (struct svc_serv *)closure;
+ struct svc_sock *svsk;
+ struct list_head *le, *next;
+ LIST_HEAD(to_be_aged);
+
+ dprintk("svc_age_temp_sockets\n");
+
+ if (!spin_trylock_bh(&serv->sv_lock)) {
+ /* busy, try again 1 sec later */
+ dprintk("svc_age_temp_sockets: busy\n");
+ mod_timer(&serv->sv_temptimer, jiffies + HZ);
+ return;
+ }
+
+ list_for_each_safe(le, next, &serv->sv_tempsocks) {
+ svsk = list_entry(le, struct svc_sock, sk_list);
+
+ if (!test_and_set_bit(SK_OLD, &svsk->sk_flags))
+ continue;
+ if (svsk->sk_inuse || test_bit(SK_BUSY, &svsk->sk_flags))
+ continue;
+ svsk->sk_inuse++;
+ list_move(le, &to_be_aged);
+ set_bit(SK_CLOSE, &svsk->sk_flags);
+ set_bit(SK_DETACHED, &svsk->sk_flags);
+ }
+ spin_unlock_bh(&serv->sv_lock);
+
+ while (!list_empty(&to_be_aged)) {
+ le = to_be_aged.next;
+ /* fiddling the sk_list node is safe 'cos we're SK_DETACHED */
+ list_del_init(le);
+ svsk = list_entry(le, struct svc_sock, sk_list);
+
+ dprintk("queuing svsk %p for closing, %lu seconds old\n",
+ svsk, get_seconds() - svsk->sk_lastrecv);
+
+ /* a thread will dequeue and close it soon */
+ svc_sock_enqueue(svsk);
+ svc_sock_put(svsk);
+ }
+
+ mod_timer(&serv->sv_temptimer, jiffies + svc_conn_age_period * HZ);
+}
+
+/*
* Initialize socket for RPC use and create svc_sock struct
* XXX: May want to setsockopt SO_SNDBUF and SO_RCVBUF.
*/
@@ -1410,6 +1446,13 @@ svc_setup_socket(struct svc_serv *serv,
set_bit(SK_TEMP, &svsk->sk_flags);
list_add(&svsk->sk_list, &serv->sv_tempsocks);
serv->sv_tmpcnt++;
+ if (serv->sv_temptimer.function == NULL) {
+ /* setup timer to age temp sockets */
+ serv->sv_temptimer.function = svc_age_temp_sockets;
+ serv->sv_temptimer.data = (unsigned long)serv;
+ serv->sv_temptimer.expires = jiffies + svc_conn_age_period * HZ;
+ add_timer(&serv->sv_temptimer);
+ }
} else {
clear_bit(SK_TEMP, &svsk->sk_flags);
list_add(&svsk->sk_list, &serv->sv_permsocks);
@@ -1525,7 +1568,8 @@ svc_delete_socket(struct svc_sock *svsk)
spin_lock_bh(&serv->sv_lock);
- list_del_init(&svsk->sk_list);
+ if (!test_and_set_bit(SK_DETACHED, &svsk->sk_flags))
+ list_del_init(&svsk->sk_list);
list_del_init(&svsk->sk_ready);
if (!test_and_set_bit(SK_DEAD, &svsk->sk_flags))
if (test_bit(SK_TEMP, &svsk->sk_flags))
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 002 of 11] knfsd: convert sk_inuse to atomic_t
2006-07-31 0:41 [PATCH 000 of 11] knfsd: Introduction - Make knfsd more NUMA-aware NeilBrown
2006-07-31 0:41 ` [PATCH 001 of 11] knfsd: move tempsock aging to a timer NeilBrown
@ 2006-07-31 0:41 ` NeilBrown
2006-07-31 0:41 ` [PATCH 003 of 11] knfsd: use new lock for svc_sock deferred list NeilBrown
` (8 subsequent siblings)
10 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2006-07-31 0:41 UTC (permalink / raw)
To: Andrew Morton; +Cc: nfs, linux-kernel
From: Greg Banks <gnb@melbourne.sgi.com>
knfsd: Convert the svc_sock->sk_inuse counter from an int protected
by svc_serv->sv_lock, to an atomic. This reduces the number of places
we need to take the (effectively global) svc_serv->sv_lock.
Signed-off-by: Greg Banks <gnb@melbourne.sgi.com>
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./include/linux/sunrpc/svcsock.h | 2 +-
./net/sunrpc/svcsock.c | 29 +++++++++++------------------
2 files changed, 12 insertions(+), 19 deletions(-)
diff .prev/include/linux/sunrpc/svcsock.h ./include/linux/sunrpc/svcsock.h
--- .prev/include/linux/sunrpc/svcsock.h 2006-07-31 09:56:44.000000000 +1000
+++ ./include/linux/sunrpc/svcsock.h 2006-07-31 09:58:07.000000000 +1000
@@ -21,7 +21,7 @@ struct svc_sock {
struct sock * sk_sk; /* INET layer */
struct svc_serv * sk_server; /* service for this socket */
- unsigned int sk_inuse; /* use count */
+ atomic_t sk_inuse; /* use count */
unsigned long sk_flags;
#define SK_BUSY 0 /* enqueued/receiving */
#define SK_CONN 1 /* conn pending */
diff .prev/net/sunrpc/svcsock.c ./net/sunrpc/svcsock.c
--- .prev/net/sunrpc/svcsock.c 2006-07-31 09:56:44.000000000 +1000
+++ ./net/sunrpc/svcsock.c 2006-07-31 09:58:07.000000000 +1000
@@ -206,7 +206,7 @@ svc_sock_enqueue(struct svc_sock *svsk)
"svc_sock_enqueue: server %p, rq_sock=%p!\n",
rqstp, rqstp->rq_sock);
rqstp->rq_sock = svsk;
- svsk->sk_inuse++;
+ atomic_inc(&svsk->sk_inuse);
rqstp->rq_reserved = serv->sv_bufsz;
svsk->sk_reserved += rqstp->rq_reserved;
wake_up(&rqstp->rq_wait);
@@ -235,7 +235,7 @@ svc_sock_dequeue(struct svc_serv *serv)
list_del_init(&svsk->sk_ready);
dprintk("svc: socket %p dequeued, inuse=%d\n",
- svsk->sk_sk, svsk->sk_inuse);
+ svsk->sk_sk, atomic_read(&svsk->sk_inuse));
return svsk;
}
@@ -285,17 +285,11 @@ void svc_reserve(struct svc_rqst *rqstp,
static inline void
svc_sock_put(struct svc_sock *svsk)
{
- struct svc_serv *serv = svsk->sk_server;
-
- spin_lock_bh(&serv->sv_lock);
- if (!--(svsk->sk_inuse) && test_bit(SK_DEAD, &svsk->sk_flags)) {
- spin_unlock_bh(&serv->sv_lock);
+ if (atomic_dec_and_test(&svsk->sk_inuse) && test_bit(SK_DEAD, &svsk->sk_flags)) {
dprintk("svc: releasing dead socket\n");
sock_release(svsk->sk_sock);
kfree(svsk);
}
- else
- spin_unlock_bh(&serv->sv_lock);
}
static void
@@ -907,7 +901,7 @@ svc_tcp_accept(struct svc_sock *svsk)
struct svc_sock,
sk_list);
set_bit(SK_CLOSE, &svsk->sk_flags);
- svsk->sk_inuse ++;
+ atomic_inc(&svsk->sk_inuse);
}
spin_unlock_bh(&serv->sv_lock);
@@ -1239,7 +1233,7 @@ svc_recv(struct svc_rqst *rqstp, long ti
spin_lock_bh(&serv->sv_lock);
if ((svsk = svc_sock_dequeue(serv)) != NULL) {
rqstp->rq_sock = svsk;
- svsk->sk_inuse++;
+ atomic_inc(&svsk->sk_inuse);
rqstp->rq_reserved = serv->sv_bufsz;
svsk->sk_reserved += rqstp->rq_reserved;
} else {
@@ -1271,7 +1265,7 @@ svc_recv(struct svc_rqst *rqstp, long ti
spin_unlock_bh(&serv->sv_lock);
dprintk("svc: server %p, socket %p, inuse=%d\n",
- rqstp, svsk, svsk->sk_inuse);
+ rqstp, svsk, atomic_read(&svsk->sk_inuse));
len = svsk->sk_recvfrom(rqstp);
dprintk("svc: got len=%d\n", len);
@@ -1367,9 +1361,9 @@ svc_age_temp_sockets(unsigned long closu
if (!test_and_set_bit(SK_OLD, &svsk->sk_flags))
continue;
- if (svsk->sk_inuse || test_bit(SK_BUSY, &svsk->sk_flags))
+ if (atomic_read(&svsk->sk_inuse) || test_bit(SK_BUSY, &svsk->sk_flags))
continue;
- svsk->sk_inuse++;
+ atomic_inc(&svsk->sk_inuse);
list_move(le, &to_be_aged);
set_bit(SK_CLOSE, &svsk->sk_flags);
set_bit(SK_DETACHED, &svsk->sk_flags);
@@ -1430,6 +1424,7 @@ svc_setup_socket(struct svc_serv *serv,
svsk->sk_odata = inet->sk_data_ready;
svsk->sk_owspace = inet->sk_write_space;
svsk->sk_server = serv;
+ atomic_set(&svsk->sk_inuse, 0);
svsk->sk_lastrecv = get_seconds();
INIT_LIST_HEAD(&svsk->sk_deferred);
INIT_LIST_HEAD(&svsk->sk_ready);
@@ -1575,7 +1570,7 @@ svc_delete_socket(struct svc_sock *svsk)
if (test_bit(SK_TEMP, &svsk->sk_flags))
serv->sv_tmpcnt--;
- if (!svsk->sk_inuse) {
+ if (!atomic_read(&svsk->sk_inuse)) {
spin_unlock_bh(&serv->sv_lock);
if (svsk->sk_sock->file)
sockfd_put(svsk->sk_sock);
@@ -1656,10 +1651,8 @@ svc_defer(struct cache_req *req)
dr->argslen = rqstp->rq_arg.len >> 2;
memcpy(dr->args, rqstp->rq_arg.head[0].iov_base-skip, dr->argslen<<2);
}
- spin_lock_bh(&rqstp->rq_server->sv_lock);
- rqstp->rq_sock->sk_inuse++;
+ atomic_inc(&rqstp->rq_sock->sk_inuse);
dr->svsk = rqstp->rq_sock;
- spin_unlock_bh(&rqstp->rq_server->sv_lock);
dr->handle.revisit = svc_revisit;
return &dr->handle;
* [PATCH 003 of 11] knfsd: use new lock for svc_sock deferred list
2006-07-31 0:41 [PATCH 000 of 11] knfsd: Introduction - Make knfsd more NUMA-aware NeilBrown
2006-07-31 0:41 ` [PATCH 001 of 11] knfsd: move tempsock aging to a timer NeilBrown
2006-07-31 0:41 ` [PATCH 002 of 11] knfsd: convert sk_inuse to atomic_t NeilBrown
@ 2006-07-31 0:41 ` NeilBrown
2006-07-31 0:42 ` [PATCH 004 of 11] knfsd: convert sk_reserved to atomic_t NeilBrown
` (7 subsequent siblings)
10 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2006-07-31 0:41 UTC (permalink / raw)
To: Andrew Morton; +Cc: nfs, linux-kernel
From: Greg Banks <gnb@melbourne.sgi.com>
knfsd: Protect the svc_sock->sk_deferred list with a new
lock svc_sock->sk_defer_lock instead of svc_serv->sv_lock.
Using the more fine-grained lock reduces the number of places
we need to take the svc_serv lock.
Signed-off-by: Greg Banks <gnb@melbourne.sgi.com>
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./include/linux/sunrpc/svcsock.h | 1 +
./net/sunrpc/svcsock.c | 12 ++++++------
2 files changed, 7 insertions(+), 6 deletions(-)
diff .prev/include/linux/sunrpc/svcsock.h ./include/linux/sunrpc/svcsock.h
--- .prev/include/linux/sunrpc/svcsock.h 2006-07-31 09:58:07.000000000 +1000
+++ ./include/linux/sunrpc/svcsock.h 2006-07-31 10:00:30.000000000 +1000
@@ -36,6 +36,7 @@ struct svc_sock {
int sk_reserved; /* space on outq that is reserved */
+ spinlock_t sk_defer_lock; /* protects sk_deferred */
struct list_head sk_deferred; /* deferred requests that need to
* be revisted */
struct mutex sk_mutex; /* to serialize sending data */
diff .prev/net/sunrpc/svcsock.c ./net/sunrpc/svcsock.c
--- .prev/net/sunrpc/svcsock.c 2006-07-31 09:58:07.000000000 +1000
+++ ./net/sunrpc/svcsock.c 2006-07-31 10:00:30.000000000 +1000
@@ -47,6 +47,7 @@
/* SMP locking strategy:
*
* svc_serv->sv_lock protects most stuff for that service.
+ * svc_sock->sk_defer_lock protects the svc_sock->sk_deferred list
*
* Some flags can be set to certain values at any time
* providing that certain rules are followed:
@@ -1426,6 +1427,7 @@ svc_setup_socket(struct svc_serv *serv,
svsk->sk_server = serv;
atomic_set(&svsk->sk_inuse, 0);
svsk->sk_lastrecv = get_seconds();
+ spin_lock_init(&svsk->sk_defer_lock);
INIT_LIST_HEAD(&svsk->sk_deferred);
INIT_LIST_HEAD(&svsk->sk_ready);
mutex_init(&svsk->sk_mutex);
@@ -1606,7 +1608,6 @@ svc_makesock(struct svc_serv *serv, int
static void svc_revisit(struct cache_deferred_req *dreq, int too_many)
{
struct svc_deferred_req *dr = container_of(dreq, struct svc_deferred_req, handle);
- struct svc_serv *serv = dreq->owner;
struct svc_sock *svsk;
if (too_many) {
@@ -1617,9 +1618,9 @@ static void svc_revisit(struct cache_def
dprintk("revisit queued\n");
svsk = dr->svsk;
dr->svsk = NULL;
- spin_lock_bh(&serv->sv_lock);
+ spin_lock_bh(&svsk->sk_defer_lock);
list_add(&dr->handle.recent, &svsk->sk_deferred);
- spin_unlock_bh(&serv->sv_lock);
+ spin_unlock_bh(&svsk->sk_defer_lock);
set_bit(SK_DEFERRED, &svsk->sk_flags);
svc_sock_enqueue(svsk);
svc_sock_put(svsk);
@@ -1679,11 +1680,10 @@ static int svc_deferred_recv(struct svc_
static struct svc_deferred_req *svc_deferred_dequeue(struct svc_sock *svsk)
{
struct svc_deferred_req *dr = NULL;
- struct svc_serv *serv = svsk->sk_server;
if (!test_bit(SK_DEFERRED, &svsk->sk_flags))
return NULL;
- spin_lock_bh(&serv->sv_lock);
+ spin_lock_bh(&svsk->sk_defer_lock);
clear_bit(SK_DEFERRED, &svsk->sk_flags);
if (!list_empty(&svsk->sk_deferred)) {
dr = list_entry(svsk->sk_deferred.next,
@@ -1692,6 +1692,6 @@ static struct svc_deferred_req *svc_defe
list_del_init(&dr->handle.recent);
set_bit(SK_DEFERRED, &svsk->sk_flags);
}
- spin_unlock_bh(&serv->sv_lock);
+ spin_unlock_bh(&svsk->sk_defer_lock);
return dr;
}
* [PATCH 004 of 11] knfsd: convert sk_reserved to atomic_t
2006-07-31 0:41 [PATCH 000 of 11] knfsd: Introduction - Make knfsd more NUMA-aware NeilBrown
` (2 preceding siblings ...)
2006-07-31 0:41 ` [PATCH 003 of 11] knfsd: use new lock for svc_sock deferred list NeilBrown
@ 2006-07-31 0:42 ` NeilBrown
2006-07-31 0:42 ` [PATCH 005 of 11] knfsd: test and set SK_BUSY atomically NeilBrown
` (6 subsequent siblings)
10 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2006-07-31 0:42 UTC (permalink / raw)
To: Andrew Morton; +Cc: nfs, linux-kernel
From: Greg Banks <gnb@melbourne.sgi.com>
knfsd: Convert the svc_sock->sk_reserved variable from an int protected
by svc_serv->sv_lock, to an atomic. This reduces (by 1) the number of
places we need to take the (effectively global) svc_serv->sv_lock.
Signed-off-by: Greg Banks <gnb@melbourne.sgi.com>
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./include/linux/sunrpc/svcsock.h | 2 +-
./net/sunrpc/svcsock.c | 12 +++++-------
2 files changed, 6 insertions(+), 8 deletions(-)
diff .prev/include/linux/sunrpc/svcsock.h ./include/linux/sunrpc/svcsock.h
--- .prev/include/linux/sunrpc/svcsock.h 2006-07-31 10:00:30.000000000 +1000
+++ ./include/linux/sunrpc/svcsock.h 2006-07-31 10:00:44.000000000 +1000
@@ -34,7 +34,7 @@ struct svc_sock {
#define SK_OLD 9 /* used for temp socket aging mark+sweep */
#define SK_DETACHED 10 /* detached from tempsocks list */
- int sk_reserved; /* space on outq that is reserved */
+ atomic_t sk_reserved; /* space on outq that is reserved */
spinlock_t sk_defer_lock; /* protects sk_deferred */
struct list_head sk_deferred; /* deferred requests that need to
diff .prev/net/sunrpc/svcsock.c ./net/sunrpc/svcsock.c
--- .prev/net/sunrpc/svcsock.c 2006-07-31 10:00:30.000000000 +1000
+++ ./net/sunrpc/svcsock.c 2006-07-31 10:00:44.000000000 +1000
@@ -177,13 +177,13 @@ svc_sock_enqueue(struct svc_sock *svsk)
}
set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
- if (((svsk->sk_reserved + serv->sv_bufsz)*2
+ if (((atomic_read(&svsk->sk_reserved) + serv->sv_bufsz)*2
> svc_sock_wspace(svsk))
&& !test_bit(SK_CLOSE, &svsk->sk_flags)
&& !test_bit(SK_CONN, &svsk->sk_flags)) {
/* Don't enqueue while not enough space for reply */
dprintk("svc: socket %p no space, %d*2 > %ld, not enqueued\n",
- svsk->sk_sk, svsk->sk_reserved+serv->sv_bufsz,
+ svsk->sk_sk, atomic_read(&svsk->sk_reserved)+serv->sv_bufsz,
svc_sock_wspace(svsk));
goto out_unlock;
}
@@ -209,7 +209,7 @@ svc_sock_enqueue(struct svc_sock *svsk)
rqstp->rq_sock = svsk;
atomic_inc(&svsk->sk_inuse);
rqstp->rq_reserved = serv->sv_bufsz;
- svsk->sk_reserved += rqstp->rq_reserved;
+ atomic_add(rqstp->rq_reserved, &svsk->sk_reserved);
wake_up(&rqstp->rq_wait);
} else {
dprintk("svc: socket %p put into queue\n", svsk->sk_sk);
@@ -271,10 +271,8 @@ void svc_reserve(struct svc_rqst *rqstp,
if (space < rqstp->rq_reserved) {
struct svc_sock *svsk = rqstp->rq_sock;
- spin_lock_bh(&svsk->sk_server->sv_lock);
- svsk->sk_reserved -= (rqstp->rq_reserved - space);
+ atomic_sub((rqstp->rq_reserved - space), &svsk->sk_reserved);
rqstp->rq_reserved = space;
- spin_unlock_bh(&svsk->sk_server->sv_lock);
svc_sock_enqueue(svsk);
}
@@ -1236,7 +1234,7 @@ svc_recv(struct svc_rqst *rqstp, long ti
rqstp->rq_sock = svsk;
atomic_inc(&svsk->sk_inuse);
rqstp->rq_reserved = serv->sv_bufsz;
- svsk->sk_reserved += rqstp->rq_reserved;
+ atomic_add(rqstp->rq_reserved, &svsk->sk_reserved);
} else {
/* No data pending. Go to sleep */
svc_serv_enqueue(serv, rqstp);
* [PATCH 005 of 11] knfsd: test and set SK_BUSY atomically
2006-07-31 0:41 [PATCH 000 of 11] knfsd: Introduction - Make knfsd more NUMA-aware NeilBrown
` (3 preceding siblings ...)
2006-07-31 0:42 ` [PATCH 004 of 11] knfsd: convert sk_reserved to atomic_t NeilBrown
@ 2006-07-31 0:42 ` NeilBrown
2006-07-31 0:42 ` [PATCH 006 of 11] knfsd: split svc_serv into pools NeilBrown
` (5 subsequent siblings)
10 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2006-07-31 0:42 UTC (permalink / raw)
To: Andrew Morton; +Cc: nfs, linux-kernel
From: Greg Banks <gnb@melbourne.sgi.com>
knfsd: The SK_BUSY bit in svc_sock->sk_flags ensures that we do not
attempt to enqueue a socket twice. Currently, setting and clearing
the bit is protected by svc_serv->sv_lock. As I intend to reduce the
data that the lock protects so it's not held when svc_sock_enqueue()
tests and sets SK_BUSY, that test and set needs to be atomic.
Signed-off-by: Greg Banks <gnb@melbourne.sgi.com>
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./net/sunrpc/svcsock.c | 20 ++++++++++----------
1 file changed, 10 insertions(+), 10 deletions(-)
diff .prev/net/sunrpc/svcsock.c ./net/sunrpc/svcsock.c
--- .prev/net/sunrpc/svcsock.c 2006-07-31 10:00:44.000000000 +1000
+++ ./net/sunrpc/svcsock.c 2006-07-31 10:01:07.000000000 +1000
@@ -46,14 +46,13 @@
/* SMP locking strategy:
*
- * svc_serv->sv_lock protects most stuff for that service.
+ * svc_serv->sv_lock protects most stuff for that service.
* svc_sock->sk_defer_lock protects the svc_sock->sk_deferred list
+ * svc_sock->sk_flags.SK_BUSY prevents a svc_sock being enqueued multiply.
*
* Some flags can be set to certain values at any time
* providing that certain rules are followed:
*
- * SK_BUSY can be set to 0 at any time.
- * svc_sock_enqueue must be called afterwards
* SK_CONN, SK_DATA, can be set or cleared at any time.
* after a set, svc_sock_enqueue must be called.
* after a clear, the socket must be read/accepted
@@ -170,8 +169,13 @@ svc_sock_enqueue(struct svc_sock *svsk)
goto out_unlock;
}
- if (test_bit(SK_BUSY, &svsk->sk_flags)) {
- /* Don't enqueue socket while daemon is receiving */
+ /* Mark socket as busy. It will remain in this state until the
+ * server has processed all pending data and put the socket back
+ * on the idle list. We update SK_BUSY atomically because
+ * it also guards against trying to enqueue the svc_sock twice.
+ */
+ if (test_and_set_bit(SK_BUSY, &svsk->sk_flags)) {
+ /* Don't enqueue socket while already enqueued */
dprintk("svc: socket %p busy, not enqueued\n", svsk->sk_sk);
goto out_unlock;
}
@@ -185,15 +189,11 @@ svc_sock_enqueue(struct svc_sock *svsk)
dprintk("svc: socket %p no space, %d*2 > %ld, not enqueued\n",
svsk->sk_sk, atomic_read(&svsk->sk_reserved)+serv->sv_bufsz,
svc_sock_wspace(svsk));
+ clear_bit(SK_BUSY, &svsk->sk_flags);
goto out_unlock;
}
clear_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
- /* Mark socket as busy. It will remain in this state until the
- * server has processed all pending data and put the socket back
- * on the idle list.
- */
- set_bit(SK_BUSY, &svsk->sk_flags);
if (!list_empty(&serv->sv_threads)) {
rqstp = list_entry(serv->sv_threads.next,
* [PATCH 006 of 11] knfsd: split svc_serv into pools
2006-07-31 0:41 [PATCH 000 of 11] knfsd: Introduction - Make knfsd more NUMA-aware NeilBrown
` (4 preceding siblings ...)
2006-07-31 0:42 ` [PATCH 005 of 11] knfsd: test and set SK_BUSY atomically NeilBrown
@ 2006-07-31 0:42 ` NeilBrown
2006-07-31 0:42 ` [PATCH 007 of 11] knfsd: add svc_get NeilBrown
` (4 subsequent siblings)
10 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2006-07-31 0:42 UTC (permalink / raw)
To: Andrew Morton; +Cc: nfs, linux-kernel
From: Greg Banks <gnb@melbourne.sgi.com>
knfsd: Split out the list of idle threads and pending sockets from
svc_serv into a new svc_pool structure, and allocate a fixed number
(in this patch, 1) of pools per svc_serv. The new structure contains
a lock which takes over several of the duties of svc_serv->sv_lock,
which is now relegated to protecting only sv_tempsocks, sv_permsocks,
and sv_tmpcnt in svc_serv.
The point is to move the hottest fields out of svc_serv and into
svc_pool, allowing a following patch to arrange for a svc_pool per
NUMA node or per CPU. This is a major step towards making the NFS
server NUMA-friendly.
Signed-off-by: Greg Banks <gnb@melbourne.sgi.com>
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./include/linux/sunrpc/svc.h | 25 +++++++
./include/linux/sunrpc/svcsock.h | 1
./net/sunrpc/svc.c | 56 +++++++++++++++--
./net/sunrpc/svcsock.c | 123 +++++++++++++++++++++++++--------------
4 files changed, 152 insertions(+), 53 deletions(-)
diff .prev/include/linux/sunrpc/svc.h ./include/linux/sunrpc/svc.h
--- .prev/include/linux/sunrpc/svc.h 2006-07-31 10:19:58.000000000 +1000
+++ ./include/linux/sunrpc/svc.h 2006-07-31 10:14:54.000000000 +1000
@@ -17,6 +17,25 @@
#include <linux/wait.h>
#include <linux/mm.h>
+
+/*
+ *
+ * RPC service thread pool.
+ *
+ * Pool of threads and temporary sockets. Generally there is only
+ * a single one of these per RPC service, but on NUMA machines those
+ * services that can benefit from it (i.e. nfs but not lockd) will
+ * have one pool per NUMA node. This optimisation reduces cross-
+ * node traffic on multi-node NUMA NFS servers.
+ */
+struct svc_pool {
+ unsigned int sp_id; /* pool id; also node id on NUMA */
+ spinlock_t sp_lock; /* protects all fields */
+ struct list_head sp_threads; /* idle server threads */
+ struct list_head sp_sockets; /* pending sockets */
+ unsigned int sp_nrthreads; /* # of threads in pool */
+} ____cacheline_aligned_in_smp;
+
/*
* RPC service.
*
@@ -28,8 +47,6 @@
* We currently do not support more than one RPC program per daemon.
*/
struct svc_serv {
- struct list_head sv_threads; /* idle server threads */
- struct list_head sv_sockets; /* pending sockets */
struct svc_program * sv_program; /* RPC program */
struct svc_stat * sv_stats; /* RPC statistics */
spinlock_t sv_lock;
@@ -44,6 +61,9 @@ struct svc_serv {
char * sv_name; /* service name */
+ unsigned int sv_nrpools; /* number of thread pools */
+ struct svc_pool * sv_pools; /* array of thread pools */
+
void (*sv_shutdown)(struct svc_serv *serv);
/* Callback to use when last thread
* exits.
@@ -121,6 +141,7 @@ struct svc_rqst {
int rq_addrlen;
struct svc_serv * rq_server; /* RPC service definition */
+ struct svc_pool * rq_pool; /* thread pool */
struct svc_procedure * rq_procinfo; /* procedure info */
struct auth_ops * rq_authop; /* authentication flavour */
struct svc_cred rq_cred; /* auth info */
diff .prev/include/linux/sunrpc/svcsock.h ./include/linux/sunrpc/svcsock.h
--- .prev/include/linux/sunrpc/svcsock.h 2006-07-31 10:19:58.000000000 +1000
+++ ./include/linux/sunrpc/svcsock.h 2006-07-31 10:14:54.000000000 +1000
@@ -20,6 +20,7 @@ struct svc_sock {
struct socket * sk_sock; /* berkeley socket layer */
struct sock * sk_sk; /* INET layer */
+ struct svc_pool * sk_pool; /* current pool iff queued */
struct svc_serv * sk_server; /* service for this socket */
atomic_t sk_inuse; /* use count */
unsigned long sk_flags;
diff .prev/net/sunrpc/svc.c ./net/sunrpc/svc.c
--- .prev/net/sunrpc/svc.c 2006-07-31 10:19:58.000000000 +1000
+++ ./net/sunrpc/svc.c 2006-07-31 10:20:33.000000000 +1000
@@ -32,6 +32,7 @@ svc_create(struct svc_program *prog, uns
struct svc_serv *serv;
int vers;
unsigned int xdrsize;
+ unsigned int i;
if (!(serv = kzalloc(sizeof(*serv), GFP_KERNEL)))
return NULL;
@@ -55,13 +56,33 @@ svc_create(struct svc_program *prog, uns
prog = prog->pg_next;
}
serv->sv_xdrsize = xdrsize;
- INIT_LIST_HEAD(&serv->sv_threads);
- INIT_LIST_HEAD(&serv->sv_sockets);
INIT_LIST_HEAD(&serv->sv_tempsocks);
INIT_LIST_HEAD(&serv->sv_permsocks);
init_timer(&serv->sv_temptimer);
spin_lock_init(&serv->sv_lock);
+ serv->sv_nrpools = 1;
+ serv->sv_pools =
+ kcalloc(sizeof(struct svc_pool), serv->sv_nrpools,
+ GFP_KERNEL);
+ if (!serv->sv_pools) {
+ kfree(serv);
+ return NULL;
+ }
+
+ for (i = 0 ; i < serv->sv_nrpools ; i++) {
+ struct svc_pool *pool = &serv->sv_pools[i];
+
+ dprintk("initialising pool %u for %s\n",
+ i, serv->sv_name);
+
+ pool->sp_id = i;
+ INIT_LIST_HEAD(&pool->sp_threads);
+ INIT_LIST_HEAD(&pool->sp_sockets);
+ spin_lock_init(&pool->sp_lock);
+ }
+
+
/* Remove any stale portmap registrations */
svc_register(serv, 0, 0);
@@ -69,7 +90,7 @@ svc_create(struct svc_program *prog, uns
}
/*
- * Destroy an RPC service
+ * Destroy an RPC service. Should be called with the BKL held
*/
void
svc_destroy(struct svc_serv *serv)
@@ -110,6 +131,7 @@ svc_destroy(struct svc_serv *serv)
/* Unregister service with the portmapper */
svc_register(serv, 0, 0);
+ kfree(serv->sv_pools);
kfree(serv);
}
@@ -158,10 +180,11 @@ svc_release_buffer(struct svc_rqst *rqst
}
/*
- * Create a server thread
+ * Create a thread in the given pool. Caller must hold BKL.
*/
-int
-svc_create_thread(svc_thread_fn func, struct svc_serv *serv)
+static int
+__svc_create_thread(svc_thread_fn func, struct svc_serv *serv,
+ struct svc_pool *pool)
{
struct svc_rqst *rqstp;
int error = -ENOMEM;
@@ -178,7 +201,11 @@ svc_create_thread(svc_thread_fn func, st
goto out_thread;
serv->sv_nrthreads++;
+ spin_lock_bh(&pool->sp_lock);
+ pool->sp_nrthreads++;
+ spin_unlock_bh(&pool->sp_lock);
rqstp->rq_server = serv;
+ rqstp->rq_pool = pool;
error = kernel_thread((int (*)(void *)) func, rqstp, 0);
if (error < 0)
goto out_thread;
@@ -193,17 +220,32 @@ out_thread:
}
/*
- * Destroy an RPC server thread
+ * Create a thread in the default pool. Caller must hold BKL.
+ */
+int
+svc_create_thread(svc_thread_fn func, struct svc_serv *serv)
+{
+ return __svc_create_thread(func, serv, &serv->sv_pools[0]);
+}
+
+/*
+ * Called from a server thread as it's exiting. Caller must hold BKL.
*/
void
svc_exit_thread(struct svc_rqst *rqstp)
{
struct svc_serv *serv = rqstp->rq_server;
+ struct svc_pool *pool = rqstp->rq_pool;
svc_release_buffer(rqstp);
kfree(rqstp->rq_resp);
kfree(rqstp->rq_argp);
kfree(rqstp->rq_auth_data);
+
+ spin_lock_bh(&pool->sp_lock);
+ pool->sp_nrthreads--;
+ spin_unlock_bh(&pool->sp_lock);
+
kfree(rqstp);
/* Release the server */
diff .prev/net/sunrpc/svcsock.c ./net/sunrpc/svcsock.c
--- .prev/net/sunrpc/svcsock.c 2006-07-31 10:19:58.000000000 +1000
+++ ./net/sunrpc/svcsock.c 2006-07-31 10:15:26.000000000 +1000
@@ -46,7 +46,10 @@
/* SMP locking strategy:
*
- * svc_serv->sv_lock protects most stuff for that service.
+ * svc_pool->sp_lock protects most of the fields of that pool.
+ * svc_serv->sv_lock protects sv_tempsocks, sv_permsocks, sv_tmpcnt.
+ * when both need to be taken (rare), svc_serv->sv_lock is first.
+ * BKL protects svc_serv->sv_nrthreads, svc_pool->sp_nrthreads
* svc_sock->sk_defer_lock protects the svc_sock->sk_deferred list
* svc_sock->sk_flags.SK_BUSY prevents a svc_sock being enqueued multiply.
*
@@ -82,22 +85,22 @@ static struct cache_deferred_req *svc_de
static int svc_conn_age_period = 6*60;
/*
- * Queue up an idle server thread. Must have serv->sv_lock held.
+ * Queue up an idle server thread. Must have pool->sp_lock held.
* Note: this is really a stack rather than a queue, so that we only
- * use as many different threads as we need, and the rest don't polute
+ * use as many different threads as we need, and the rest don't pollute
* the cache.
*/
static inline void
-svc_serv_enqueue(struct svc_serv *serv, struct svc_rqst *rqstp)
+svc_thread_enqueue(struct svc_pool *pool, struct svc_rqst *rqstp)
{
- list_add(&rqstp->rq_list, &serv->sv_threads);
+ list_add(&rqstp->rq_list, &pool->sp_threads);
}
/*
- * Dequeue an nfsd thread. Must have serv->sv_lock held.
+ * Dequeue an nfsd thread. Must have pool->sp_lock held.
*/
static inline void
-svc_serv_dequeue(struct svc_serv *serv, struct svc_rqst *rqstp)
+svc_thread_dequeue(struct svc_pool *pool, struct svc_rqst *rqstp)
{
list_del(&rqstp->rq_list);
}
@@ -148,6 +151,7 @@ static void
svc_sock_enqueue(struct svc_sock *svsk)
{
struct svc_serv *serv = svsk->sk_server;
+ struct svc_pool *pool = &serv->sv_pools[0];
struct svc_rqst *rqstp;
if (!(svsk->sk_flags &
@@ -156,10 +160,10 @@ svc_sock_enqueue(struct svc_sock *svsk)
if (test_bit(SK_DEAD, &svsk->sk_flags))
return;
- spin_lock_bh(&serv->sv_lock);
+ spin_lock_bh(&pool->sp_lock);
- if (!list_empty(&serv->sv_threads) &&
- !list_empty(&serv->sv_sockets))
+ if (!list_empty(&pool->sp_threads) &&
+ !list_empty(&pool->sp_sockets))
printk(KERN_ERR
"svc_sock_enqueue: threads and sockets both waiting??\n");
@@ -179,6 +183,8 @@ svc_sock_enqueue(struct svc_sock *svsk)
dprintk("svc: socket %p busy, not enqueued\n", svsk->sk_sk);
goto out_unlock;
}
+ BUG_ON(svsk->sk_pool != NULL);
+ svsk->sk_pool = pool;
set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
if (((atomic_read(&svsk->sk_reserved) + serv->sv_bufsz)*2
@@ -189,19 +195,20 @@ svc_sock_enqueue(struct svc_sock *svsk)
dprintk("svc: socket %p no space, %d*2 > %ld, not enqueued\n",
svsk->sk_sk, atomic_read(&svsk->sk_reserved)+serv->sv_bufsz,
svc_sock_wspace(svsk));
+ svsk->sk_pool = NULL;
clear_bit(SK_BUSY, &svsk->sk_flags);
goto out_unlock;
}
clear_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
- if (!list_empty(&serv->sv_threads)) {
- rqstp = list_entry(serv->sv_threads.next,
+ if (!list_empty(&pool->sp_threads)) {
+ rqstp = list_entry(pool->sp_threads.next,
struct svc_rqst,
rq_list);
dprintk("svc: socket %p served by daemon %p\n",
svsk->sk_sk, rqstp);
- svc_serv_dequeue(serv, rqstp);
+ svc_thread_dequeue(pool, rqstp);
if (rqstp->rq_sock)
printk(KERN_ERR
"svc_sock_enqueue: server %p, rq_sock=%p!\n",
@@ -210,28 +217,30 @@ svc_sock_enqueue(struct svc_sock *svsk)
atomic_inc(&svsk->sk_inuse);
rqstp->rq_reserved = serv->sv_bufsz;
atomic_add(rqstp->rq_reserved, &svsk->sk_reserved);
+ BUG_ON(svsk->sk_pool != pool);
wake_up(&rqstp->rq_wait);
} else {
dprintk("svc: socket %p put into queue\n", svsk->sk_sk);
- list_add_tail(&svsk->sk_ready, &serv->sv_sockets);
+ list_add_tail(&svsk->sk_ready, &pool->sp_sockets);
+ BUG_ON(svsk->sk_pool != pool);
}
out_unlock:
- spin_unlock_bh(&serv->sv_lock);
+ spin_unlock_bh(&pool->sp_lock);
}
/*
- * Dequeue the first socket. Must be called with the serv->sv_lock held.
+ * Dequeue the first socket. Must be called with the pool->sp_lock held.
*/
static inline struct svc_sock *
-svc_sock_dequeue(struct svc_serv *serv)
+svc_sock_dequeue(struct svc_pool *pool)
{
struct svc_sock *svsk;
- if (list_empty(&serv->sv_sockets))
+ if (list_empty(&pool->sp_sockets))
return NULL;
- svsk = list_entry(serv->sv_sockets.next,
+ svsk = list_entry(pool->sp_sockets.next,
struct svc_sock, sk_ready);
list_del_init(&svsk->sk_ready);
@@ -250,6 +259,7 @@ svc_sock_dequeue(struct svc_serv *serv)
static inline void
svc_sock_received(struct svc_sock *svsk)
{
+ svsk->sk_pool = NULL;
clear_bit(SK_BUSY, &svsk->sk_flags);
svc_sock_enqueue(svsk);
}
@@ -322,25 +332,33 @@ svc_sock_release(struct svc_rqst *rqstp)
/*
* External function to wake up a server waiting for data
+ * This really only makes sense for services like lockd
+ * which have exactly one thread anyway.
*/
void
svc_wake_up(struct svc_serv *serv)
{
struct svc_rqst *rqstp;
+ unsigned int i;
+ struct svc_pool *pool;
- spin_lock_bh(&serv->sv_lock);
- if (!list_empty(&serv->sv_threads)) {
- rqstp = list_entry(serv->sv_threads.next,
- struct svc_rqst,
- rq_list);
- dprintk("svc: daemon %p woken up.\n", rqstp);
- /*
- svc_serv_dequeue(serv, rqstp);
- rqstp->rq_sock = NULL;
- */
- wake_up(&rqstp->rq_wait);
+ for (i = 0 ; i < serv->sv_nrpools ; i++) {
+ pool = &serv->sv_pools[i];
+
+ spin_lock_bh(&pool->sp_lock);
+ if (!list_empty(&pool->sp_threads)) {
+ rqstp = list_entry(pool->sp_threads.next,
+ struct svc_rqst,
+ rq_list);
+ dprintk("svc: daemon %p woken up.\n", rqstp);
+ /*
+ svc_thread_dequeue(pool, rqstp);
+ rqstp->rq_sock = NULL;
+ */
+ wake_up(&rqstp->rq_wait);
+ }
+ spin_unlock_bh(&pool->sp_lock);
}
- spin_unlock_bh(&serv->sv_lock);
}
/*
@@ -606,7 +624,10 @@ svc_udp_recvfrom(struct svc_rqst *rqstp)
/* udp sockets need large rcvbuf as all pending
* requests are still in that buffer. sndbuf must
* also be large enough that there is enough space
- * for one reply per thread.
+ * for one reply per thread. We count all threads
+ * rather than threads in a particular pool, which
+ * provides an upper bound on the number of threads
+ * which will access the socket.
*/
svc_sock_setbufsize(svsk->sk_sock,
(serv->sv_nrthreads+3) * serv->sv_bufsz,
@@ -958,6 +979,11 @@ svc_tcp_recvfrom(struct svc_rqst *rqstp)
/* sndbuf needs to have room for one request
* per thread, otherwise we can stall even when the
* network isn't a bottleneck.
+ *
+ * We count all threads rather than threads in a
+ * particular pool, which provides an upper bound
+ * on the number of threads which will access the socket.
+ *
* rcvbuf just needs to be able to hold a few requests.
* Normally they will be removed from the queue
* as soon as a complete request arrives.
@@ -1173,13 +1199,16 @@ svc_sock_update_bufs(struct svc_serv *se
}
/*
- * Receive the next request on any socket.
+ * Receive the next request on any socket. This code is carefully
+ * organised not to touch any cachelines in the shared svc_serv
+ * structure, only cachelines in the local svc_pool.
*/
int
svc_recv(struct svc_rqst *rqstp, long timeout)
{
struct svc_sock *svsk =NULL;
struct svc_serv *serv = rqstp->rq_server;
+ struct svc_pool *pool = rqstp->rq_pool;
int len;
int pages;
struct xdr_buf *arg;
@@ -1229,15 +1258,15 @@ svc_recv(struct svc_rqst *rqstp, long ti
if (signalled())
return -EINTR;
- spin_lock_bh(&serv->sv_lock);
- if ((svsk = svc_sock_dequeue(serv)) != NULL) {
+ spin_lock_bh(&pool->sp_lock);
+ if ((svsk = svc_sock_dequeue(pool)) != NULL) {
rqstp->rq_sock = svsk;
atomic_inc(&svsk->sk_inuse);
rqstp->rq_reserved = serv->sv_bufsz;
atomic_add(rqstp->rq_reserved, &svsk->sk_reserved);
} else {
/* No data pending. Go to sleep */
- svc_serv_enqueue(serv, rqstp);
+ svc_thread_enqueue(pool, rqstp);
/*
* We have to be able to interrupt this wait
@@ -1245,26 +1274,26 @@ svc_recv(struct svc_rqst *rqstp, long ti
*/
set_current_state(TASK_INTERRUPTIBLE);
add_wait_queue(&rqstp->rq_wait, &wait);
- spin_unlock_bh(&serv->sv_lock);
+ spin_unlock_bh(&pool->sp_lock);
schedule_timeout(timeout);
try_to_freeze();
- spin_lock_bh(&serv->sv_lock);
+ spin_lock_bh(&pool->sp_lock);
remove_wait_queue(&rqstp->rq_wait, &wait);
if (!(svsk = rqstp->rq_sock)) {
- svc_serv_dequeue(serv, rqstp);
- spin_unlock_bh(&serv->sv_lock);
+ svc_thread_dequeue(pool, rqstp);
+ spin_unlock_bh(&pool->sp_lock);
dprintk("svc: server %p, no data yet\n", rqstp);
return signalled()? -EINTR : -EAGAIN;
}
}
- spin_unlock_bh(&serv->sv_lock);
+ spin_unlock_bh(&pool->sp_lock);
- dprintk("svc: server %p, socket %p, inuse=%d\n",
- rqstp, svsk, atomic_read(&svsk->sk_inuse));
+ dprintk("svc: server %p, pool %u, socket %p, inuse=%d\n",
+ rqstp, pool->sp_id, svsk, atomic_read(&svsk->sk_inuse));
len = svsk->sk_recvfrom(rqstp);
dprintk("svc: got len=%d\n", len);
@@ -1565,7 +1594,13 @@ svc_delete_socket(struct svc_sock *svsk)
if (!test_and_set_bit(SK_DETACHED, &svsk->sk_flags))
list_del_init(&svsk->sk_list);
- list_del_init(&svsk->sk_ready);
+ /*
+ * We used to delete the svc_sock from whichever list
+ * its sk_ready node was on, but we don't actually
+ * need to. This is because the only time we're called
+ * while still attached to a queue, the queue itself
+ * is about to be destroyed (in svc_destroy).
+ */
if (!test_and_set_bit(SK_DEAD, &svsk->sk_flags))
if (test_bit(SK_TEMP, &svsk->sk_flags))
serv->sv_tmpcnt--;
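An aside on the idle-thread list this patch renames: as the comment in svc_thread_enqueue() notes, it is deliberately a stack rather than a queue (list_add() pushes at the head, svc_sock_enqueue() takes sp_threads.next), so the most recently idled thread, the one with the warmest cache, is woken first. A toy userspace model of that LIFO behaviour, with invented names and an array in place of the kernel list_head machinery:

```c
#include <assert.h>

#define TOY_MAX_THREADS 8

/* Toy model of a svc_pool's idle-thread list.  The "top" of the
 * stack is sp_threads[sp_nidle - 1]; pushing and popping at the same
 * end means the most recently idled thread is reused first. */
struct toy_pool {
	int sp_threads[TOY_MAX_THREADS];	/* idle thread ids */
	int sp_nidle;				/* number currently idle */
};

/* Mirrors svc_thread_enqueue(): push an idling thread. */
static void toy_thread_enqueue(struct toy_pool *pool, int thread_id)
{
	pool->sp_threads[pool->sp_nidle++] = thread_id;
}

/* Mirrors the sp_threads.next pick in svc_sock_enqueue(): pop the
 * most recently pushed (cache-warm) thread, or -1 if none idle. */
static int toy_thread_dequeue(struct toy_pool *pool)
{
	return pool->sp_nidle ? pool->sp_threads[--pool->sp_nidle] : -1;
}
```

If threads 1, 2 and 3 go idle in that order, thread 3 is handed the next socket; threads 1 and 2 stay parked and don't pollute the cache.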
* [PATCH 007 of 11] knfsd: add svc_get
2006-07-31 0:41 [PATCH 000 of 11] knfsd: Introduction - Make knfsd more NUMA-aware NeilBrown
` (5 preceding siblings ...)
2006-07-31 0:42 ` [PATCH 006 of 11] knfsd: split svc_serv into pools NeilBrown
@ 2006-07-31 0:42 ` NeilBrown
2006-07-31 4:05 ` Andrew Morton
2006-07-31 0:42 ` [PATCH 008 of 11] knfsd: add svc_set_num_threads NeilBrown
` (3 subsequent siblings)
10 siblings, 1 reply; 25+ messages in thread
From: NeilBrown @ 2006-07-31 0:42 UTC (permalink / raw)
To: Andrew Morton; +Cc: nfs, linux-kernel
From: Greg Banks <gnb@melbourne.sgi.com>
knfsd: add svc_get() for those occasions when we need to temporarily
bump up svc_serv->sv_nrthreads as a pseudo refcount.
Signed-off-by: Greg Banks <gnb@melbourne.sgi.com>
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./fs/nfsd/nfssvc.c | 2 +-
./include/linux/sunrpc/svc.h | 11 +++++++++++
2 files changed, 12 insertions(+), 1 deletion(-)
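The convention this patch documents (sv_nrthreads doubling as a reference count that svc_destroy() drops) can be sketched as a toy userspace model; the struct and helper names below mirror the patch but are invented for illustration, not kernel code:

```c
#include <assert.h>

/* Toy model of the sv_nrthreads pseudo-refcount. */
struct toy_serv {
	int sv_nrthreads;	/* thread count, doubles as a refcount */
};

/* Mirrors svc_get(): bump the count around operations that would
 * otherwise race with the last thread exiting. */
static void toy_svc_get(struct toy_serv *serv)
{
	serv->sv_nrthreads++;
}

/* Mirrors svc_destroy()'s refcount handling.  Returns 1 if this was
 * the last reference and the serv would be freed, else 0. */
static int toy_svc_destroy(struct toy_serv *serv)
{
	if (--serv->sv_nrthreads > 0)
		return 0;	/* still referenced elsewhere */
	/* a real svc_destroy() would free sv_pools and serv here */
	return 1;
}
```

This is exactly the pattern nfsd_create_serv() relies on above: an existing nfsd_serv gets svc_get() instead of a bare increment, and the matching svc_destroy() later only tears the service down when the count reaches zero.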
diff .prev/fs/nfsd/nfssvc.c ./fs/nfsd/nfssvc.c
--- .prev/fs/nfsd/nfssvc.c 2006-07-28 15:01:38.000000000 +1000
+++ ./fs/nfsd/nfssvc.c 2006-07-31 10:20:49.000000000 +1000
@@ -200,7 +200,7 @@ int nfsd_create_serv(void)
int err = 0;
lock_kernel();
if (nfsd_serv) {
- nfsd_serv->sv_nrthreads++;
+ svc_get(nfsd_serv);
unlock_kernel();
return 0;
}
diff .prev/include/linux/sunrpc/svc.h ./include/linux/sunrpc/svc.h
--- .prev/include/linux/sunrpc/svc.h 2006-07-31 10:14:54.000000000 +1000
+++ ./include/linux/sunrpc/svc.h 2006-07-31 10:20:49.000000000 +1000
@@ -71,6 +71,17 @@ struct svc_serv {
};
/*
+ * We use sv_nrthreads as a reference count. svc_destroy() drops
+ * this refcount, so we need to bump it up around operations that
+ * change the number of threads. Horrible, but there it is.
+ * Should be called with the BKL held.
+ */
+static inline void svc_get(struct svc_serv *serv)
+{
+ serv->sv_nrthreads++;
+}
+
+/*
* Maximum payload size supported by a kernel RPC server.
* This is used to determine the max number of pages nfsd is
* willing to return in a single READ operation.
* [PATCH 008 of 11] knfsd: add svc_set_num_threads
2006-07-31 0:41 [PATCH 000 of 11] knfsd: Introduction - Make knfsd more NUMA-aware NeilBrown
` (6 preceding siblings ...)
2006-07-31 0:42 ` [PATCH 007 of 11] knfsd: add svc_get NeilBrown
@ 2006-07-31 0:42 ` NeilBrown
2006-07-31 4:11 ` Andrew Morton
2006-07-31 0:42 ` [PATCH 009 of 11] knfsd: use svc_set_num_threads to manage threads in knfsd NeilBrown
` (2 subsequent siblings)
10 siblings, 1 reply; 25+ messages in thread
From: NeilBrown @ 2006-07-31 0:42 UTC (permalink / raw)
To: Andrew Morton; +Cc: nfs, linux-kernel
From: Greg Banks <gnb@melbourne.sgi.com>
knfsd: Currently knfsd keeps its own list of all nfsd threads
in nfssvc.c; add a new way of managing the list of all threads
in a svc_serv. Add svc_create_pooled() to allow creation of
a svc_serv whose threads are managed by the sunrpc code. Add
svc_set_num_threads() to manage the number of threads in a service,
either per-pool or globally across the service.
Signed-off-by: Greg Banks <gnb@melbourne.sgi.com>
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./include/linux/sunrpc/svc.h | 21 ++++--
./net/sunrpc/sunrpc_syms.c | 2
./net/sunrpc/svc.c | 139 ++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 154 insertions(+), 8 deletions(-)
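The core arithmetic in svc_set_num_threads() below (target minus current gives a signed create/kill count) and the round-robin pool choice in choose_pool() can be modelled in a few lines of plain userspace C; the names here are invented for illustration and are not the kernel API:

```c
#include <assert.h>

/* Signed delta between the requested and current thread counts:
 * positive means create that many threads, negative means kill
 * that many (cf. the nrservs adjustment in svc_set_num_threads). */
static int toy_thread_delta(int target, int current_threads)
{
	return target - current_threads;
}

/* Round-robin pool choice used when no specific pool is given,
 * mirroring choose_pool(): a caller-held counter is advanced on
 * each call so successive new threads land in successive pools. */
static unsigned int toy_choose_pool(unsigned int *state, unsigned int nrpools)
{
	return (*state)++ % nrpools;
}
```

So with 3 pools, a run of creations visits pools 0, 1, 2, 0, ... in turn, which is how an even spread across pools falls out without any per-pool bookkeeping in the caller.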
diff .prev/include/linux/sunrpc/svc.h ./include/linux/sunrpc/svc.h
--- .prev/include/linux/sunrpc/svc.h 2006-07-31 10:29:38.000000000 +1000
+++ ./include/linux/sunrpc/svc.h 2006-07-31 10:30:30.000000000 +1000
@@ -17,6 +17,10 @@
#include <linux/wait.h>
#include <linux/mm.h>
+/*
+ * This is the RPC server thread function prototype
+ */
+typedef void (*svc_thread_fn)(struct svc_rqst *);
/*
*
@@ -34,6 +38,7 @@ struct svc_pool {
struct list_head sp_threads; /* idle server threads */
struct list_head sp_sockets; /* pending sockets */
unsigned int sp_nrthreads; /* # of threads in pool */
+ struct list_head sp_all_threads; /* all server threads */
} ____cacheline_aligned_in_smp;
/*
@@ -68,6 +73,11 @@ struct svc_serv {
/* Callback to use when last thread
* exits.
*/
+
+ struct module * sv_module; /* optional module to count when
+ * adding threads */
+ svc_thread_fn sv_function; /* main function for threads */
+ int sv_kill_signal; /* signal to kill threads */
};
/*
@@ -147,6 +157,7 @@ static inline void svc_putu32(struct kve
*/
struct svc_rqst {
struct list_head rq_list; /* idle list */
+ struct list_head rq_all; /* all threads list */
struct svc_sock * rq_sock; /* socket */
struct sockaddr_in rq_addr; /* peer address */
int rq_addrlen;
@@ -201,6 +212,7 @@ struct svc_rqst {
* to prevent encrypting page
* cache pages */
wait_queue_head_t rq_wait; /* synchronization */
+ struct task_struct *rq_task; /* service thread */
};
/*
@@ -342,17 +354,16 @@ struct svc_procedure {
};
/*
- * This is the RPC server thread function prototype
- */
-typedef void (*svc_thread_fn)(struct svc_rqst *);
-
-/*
* Function prototypes.
*/
struct svc_serv * svc_create(struct svc_program *, unsigned int,
void (*shutdown)(struct svc_serv*));
int svc_create_thread(svc_thread_fn, struct svc_serv *);
void svc_exit_thread(struct svc_rqst *);
+struct svc_serv * svc_create_pooled(struct svc_program *, unsigned int,
+ void (*shutdown)(struct svc_serv*),
+ svc_thread_fn, int sig, struct module *);
+int svc_set_num_threads(struct svc_serv *, struct svc_pool *, int);
void svc_destroy(struct svc_serv *);
int svc_process(struct svc_rqst *);
int svc_register(struct svc_serv *, int, unsigned short);
diff .prev/net/sunrpc/sunrpc_syms.c ./net/sunrpc/sunrpc_syms.c
--- .prev/net/sunrpc/sunrpc_syms.c 2006-07-31 10:29:38.000000000 +1000
+++ ./net/sunrpc/sunrpc_syms.c 2006-07-31 10:30:30.000000000 +1000
@@ -73,6 +73,8 @@ EXPORT_SYMBOL(put_rpccred);
/* RPC server stuff */
EXPORT_SYMBOL(svc_create);
EXPORT_SYMBOL(svc_create_thread);
+EXPORT_SYMBOL(svc_create_pooled);
+EXPORT_SYMBOL(svc_set_num_threads);
EXPORT_SYMBOL(svc_exit_thread);
EXPORT_SYMBOL(svc_destroy);
EXPORT_SYMBOL(svc_drop);
diff .prev/net/sunrpc/svc.c ./net/sunrpc/svc.c
--- .prev/net/sunrpc/svc.c 2006-07-31 10:29:38.000000000 +1000
+++ ./net/sunrpc/svc.c 2006-07-31 10:30:30.000000000 +1000
@@ -12,6 +12,8 @@
#include <linux/net.h>
#include <linux/in.h>
#include <linux/mm.h>
+#include <linux/interrupt.h>
+#include <linux/module.h>
#include <linux/sunrpc/types.h>
#include <linux/sunrpc/xdr.h>
@@ -25,8 +27,8 @@
/*
* Create an RPC service
*/
-struct svc_serv *
-svc_create(struct svc_program *prog, unsigned int bufsize,
+static struct svc_serv *
+__svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
void (*shutdown)(struct svc_serv *serv))
{
struct svc_serv *serv;
@@ -61,7 +63,7 @@ svc_create(struct svc_program *prog, uns
init_timer(&serv->sv_temptimer);
spin_lock_init(&serv->sv_lock);
- serv->sv_nrpools = 1;
+ serv->sv_nrpools = npools;
serv->sv_pools =
kcalloc(sizeof(struct svc_pool), serv->sv_nrpools,
GFP_KERNEL);
@@ -79,6 +81,7 @@ svc_create(struct svc_program *prog, uns
pool->sp_id = i;
INIT_LIST_HEAD(&pool->sp_threads);
INIT_LIST_HEAD(&pool->sp_sockets);
+ INIT_LIST_HEAD(&pool->sp_all_threads);
spin_lock_init(&pool->sp_lock);
}
@@ -89,6 +92,31 @@ svc_create(struct svc_program *prog, uns
return serv;
}
+struct svc_serv *
+svc_create(struct svc_program *prog, unsigned int bufsize,
+ void (*shutdown)(struct svc_serv *serv))
+{
+ return __svc_create(prog, bufsize, /*npools*/1, shutdown);
+}
+
+struct svc_serv *
+svc_create_pooled(struct svc_program *prog, unsigned int bufsize,
+ void (*shutdown)(struct svc_serv *serv),
+ svc_thread_fn func, int sig, struct module *mod)
+{
+ struct svc_serv *serv;
+
+ serv = __svc_create(prog, bufsize, /*npools*/1, shutdown);
+
+ if (serv != NULL) {
+ serv->sv_function = func;
+ serv->sv_kill_signal = sig;
+ serv->sv_module = mod;
+ }
+
+ return serv;
+}
+
/*
* Destroy an RPC service. Should be called with the BKL held
*/
@@ -203,6 +231,7 @@ __svc_create_thread(svc_thread_fn func,
serv->sv_nrthreads++;
spin_lock_bh(&pool->sp_lock);
pool->sp_nrthreads++;
+ list_add(&rqstp->rq_all, &pool->sp_all_threads);
spin_unlock_bh(&pool->sp_lock);
rqstp->rq_server = serv;
rqstp->rq_pool = pool;
@@ -229,6 +258,109 @@ svc_create_thread(svc_thread_fn func, st
}
/*
+ * Choose a pool in which to create a new thread, for svc_set_num_threads
+ */
+static inline struct svc_pool *
+choose_pool(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
+{
+ if (pool != NULL)
+ return pool;
+
+ return &serv->sv_pools[(*state)++ % serv->sv_nrpools];
+}
+
+/*
+ * Choose a thread to kill, for svc_set_num_threads
+ */
+static inline struct task_struct *
+choose_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
+{
+ unsigned int i;
+ struct task_struct *task = NULL;
+
+ if (pool != NULL) {
+ spin_lock_bh(&pool->sp_lock);
+ } else {
+ /* choose a pool in round-robin fashion */
+ for (i = 0 ; i < serv->sv_nrpools ; i++) {
+ pool = &serv->sv_pools[--(*state) % serv->sv_nrpools];
+ spin_lock_bh(&pool->sp_lock);
+ if (!list_empty(&pool->sp_all_threads))
+ goto found_pool;
+ spin_unlock_bh(&pool->sp_lock);
+ }
+ return NULL;
+ }
+
+found_pool:
+ if (!list_empty(&pool->sp_all_threads)) {
+ struct svc_rqst *rqstp;
+
+ /*
+ * Remove from the pool->sp_all_threads list
+ * so we don't try to kill it again.
+ */
+ rqstp = list_entry(pool->sp_all_threads.next, struct svc_rqst, rq_all);
+ list_del_init(&rqstp->rq_all);
+ task = rqstp->rq_task;
+ }
+ spin_unlock_bh(&pool->sp_lock);
+
+ return task;
+}
+
+/*
+ * Create or destroy enough new threads to make the number
+ * of threads the given number. If `pool' is non-NULL, applies
+ * only to threads in that pool, otherwise round-robins between
+ * all pools. Must be called with a svc_get() reference and
+ * the BKL held.
+ *
+ * Destroying threads relies on the service threads filling in
+ * rqstp->rq_task, which only the nfs ones do. Assumes the serv
+ * has been created using svc_create_pooled().
+ *
+ * Based on code that used to be in nfsd_svc() but tweaked
+ * to be pool-aware.
+ */
+int
+svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
+{
+ struct task_struct *victim;
+ int error = 0;
+ unsigned int state = serv->sv_nrthreads-1;
+
+ if (pool == NULL) {
+ /* The -1 assumes caller has done a svc_get() */
+ nrservs -= (serv->sv_nrthreads-1);
+ } else {
+ spin_lock_bh(&pool->sp_lock);
+ nrservs -= pool->sp_nrthreads;
+ spin_unlock_bh(&pool->sp_lock);
+ }
+
+ /* create new threads */
+ while (nrservs > 0) {
+ nrservs--;
+ __module_get(serv->sv_module);
+ error = __svc_create_thread(serv->sv_function, serv,
+ choose_pool(serv, pool, &state));
+ if (error < 0) {
+ module_put(serv->sv_module);
+ break;
+ }
+ }
+ /* destroy old threads */
+ while (nrservs < 0 &&
+ (victim = choose_victim(serv, pool, &state)) != NULL) {
+ send_sig(serv->sv_kill_signal, victim, 1);
+ nrservs++;
+ }
+
+ return error;
+}
+
+/*
* Called from a server thread as it's exiting. Caller must hold BKL.
*/
void
@@ -244,6 +376,7 @@ svc_exit_thread(struct svc_rqst *rqstp)
spin_lock_bh(&pool->sp_lock);
pool->sp_nrthreads--;
+ list_del(&rqstp->rq_all);
spin_unlock_bh(&pool->sp_lock);
kfree(rqstp);
* [PATCH 009 of 11] knfsd: use svc_set_num_threads to manage threads in knfsd
2006-07-31 0:41 [PATCH 000 of 11] knfsd: Introduction - Make knfsd more NUMA-aware NeilBrown
` (7 preceding siblings ...)
2006-07-31 0:42 ` [PATCH 008 of 11] knfsd: add svc_set_num_threads NeilBrown
@ 2006-07-31 0:42 ` NeilBrown
2006-07-31 0:42 ` [PATCH 010 of 11] knfsd: make rpc threads pools numa aware NeilBrown
2006-07-31 0:42 ` [PATCH 011 of 11] knfsd: allow admin to set nthreads per node NeilBrown
10 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2006-07-31 0:42 UTC (permalink / raw)
To: Andrew Morton; +Cc: nfs, linux-kernel
From: Greg Banks <gnb@melbourne.sgi.com>
knfsd: Replace the existing list of all nfsd threads with new
code using svc_create_pooled().
Signed-off-by: Greg Banks <gnb@melbourne.sgi.com>
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./fs/nfsd/nfssvc.c | 36 +++++-------------------------------
1 file changed, 5 insertions(+), 31 deletions(-)
diff .prev/fs/nfsd/nfssvc.c ./fs/nfsd/nfssvc.c
--- .prev/fs/nfsd/nfssvc.c 2006-07-31 10:29:38.000000000 +1000
+++ ./fs/nfsd/nfssvc.c 2006-07-31 10:31:08.000000000 +1000
@@ -57,12 +57,6 @@ static atomic_t nfsd_busy;
static unsigned long nfsd_last_call;
static DEFINE_SPINLOCK(nfsd_call_lock);
-struct nfsd_list {
- struct list_head list;
- struct task_struct *task;
-};
-static struct list_head nfsd_list = LIST_HEAD_INIT(nfsd_list);
-
#if defined(CONFIG_NFSD_V2_ACL) || defined(CONFIG_NFSD_V3_ACL)
static struct svc_stat nfsd_acl_svcstats;
static struct svc_version * nfsd_acl_version[] = {
@@ -206,8 +200,9 @@ int nfsd_create_serv(void)
}
atomic_set(&nfsd_busy, 0);
- nfsd_serv = svc_create(&nfsd_program, NFSD_BUFSIZE,
- nfsd_last_thread);
+ nfsd_serv = svc_create_pooled(&nfsd_program, NFSD_BUFSIZE,
+ nfsd_last_thread,
+ nfsd, SIG_NOCLEAN, THIS_MODULE);
if (nfsd_serv == NULL)
err = -ENOMEM;
unlock_kernel();
@@ -247,7 +242,6 @@ int
nfsd_svc(unsigned short port, int nrservs)
{
int error;
- struct list_head *victim;
lock_kernel();
dprintk("nfsd: creating service\n");
@@ -275,24 +269,7 @@ nfsd_svc(unsigned short port, int nrserv
if (error)
goto failure;
- nrservs -= (nfsd_serv->sv_nrthreads-1);
- while (nrservs > 0) {
- nrservs--;
- __module_get(THIS_MODULE);
- error = svc_create_thread(nfsd, nfsd_serv);
- if (error < 0) {
- module_put(THIS_MODULE);
- break;
- }
- }
- victim = nfsd_list.next;
- while (nrservs < 0 && victim != &nfsd_list) {
- struct nfsd_list *nl =
- list_entry(victim,struct nfsd_list, list);
- victim = victim->next;
- send_sig(SIG_NOCLEAN, nl->task, 1);
- nrservs++;
- }
+ error = svc_set_num_threads(nfsd_serv, NULL, nrservs);
failure:
svc_destroy(nfsd_serv); /* Release server */
out:
@@ -329,7 +306,6 @@ nfsd(struct svc_rqst *rqstp)
{
struct fs_struct *fsp;
int err;
- struct nfsd_list me;
sigset_t shutdown_mask, allowed_mask;
/* Lock module and set up kernel thread */
@@ -353,8 +329,7 @@ nfsd(struct svc_rqst *rqstp)
nfsdstats.th_cnt++;
- me.task = current;
- list_add(&me.list, &nfsd_list);
+ rqstp->rq_task = current;
unlock_kernel();
@@ -413,7 +388,6 @@ nfsd(struct svc_rqst *rqstp)
lock_kernel();
- list_del(&me.list);
nfsdstats.th_cnt --;
out:
* [PATCH 010 of 11] knfsd: make rpc threads pools numa aware
2006-07-31 0:41 [PATCH 000 of 11] knfsd: Introduction - Make knfsd more NUMA-aware NeilBrown
` (8 preceding siblings ...)
2006-07-31 0:42 ` [PATCH 009 of 11] knfsd: use svc_set_num_threads to manage threads in knfsd NeilBrown
@ 2006-07-31 0:42 ` NeilBrown
2006-07-31 4:14 ` Andrew Morton
2006-08-06 9:47 ` Andrew Morton
2006-07-31 0:42 ` [PATCH 011 of 11] knfsd: allow admin to set nthreads per node NeilBrown
10 siblings, 2 replies; 25+ messages in thread
From: NeilBrown @ 2006-07-31 0:42 UTC (permalink / raw)
To: Andrew Morton; +Cc: nfs, linux-kernel
From: Greg Banks <gnb@melbourne.sgi.com>
knfsd: Actually implement multiple pools. On NUMA machines, allocate
a svc_pool per NUMA node; on SMP a svc_pool per CPU; otherwise a single
global pool. Enqueue sockets on the svc_pool corresponding to the CPU
on which the socket bh is run (i.e. the NIC interrupt CPU). Threads
have their cpu mask set to limit them to the CPUs in the svc_pool that
owns them.
This is the patch that allows an Altix to scale NFS traffic linearly
beyond 4 CPUs and 4 NICs.
Incorporates changes and feedback from Neil Brown, Trond Myklebust,
and Christoph Hellwig.
Signed-off-by: Greg Banks <gnb@melbourne.sgi.com>
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./include/linux/sunrpc/svc.h | 1
./net/sunrpc/svc.c | 246 ++++++++++++++++++++++++++++++++++++++++++-
./net/sunrpc/svcsock.c | 7 +
3 files changed, 252 insertions(+), 2 deletions(-)
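The cpu-to-pool mapping this patch builds can be sketched in userspace. The mode names and the role of the to_pool[] array follow the patch, but the 4-cpu, 2-node topology and all identifiers below are made up for illustration:

```c
#include <assert.h>

/* Toy counterparts of the SVC_POOL_* mapping modes. */
enum { TOY_POOL_GLOBAL, TOY_POOL_PERCPU, TOY_POOL_PERNODE };

/* Pretend topology: cpus 0-1 on node 0, cpus 2-3 on node 1. */
static unsigned int toy_cpu_to_node(unsigned int cpu)
{
	return cpu / 2;
}

/* Mirrors svc_pool_for_cpu(): index to_pool[] by cpu or by node
 * depending on mode; in global mode everything maps to pool 0.
 * The final modulo guards against a stale map, as in the patch. */
static unsigned int
toy_pool_for_cpu(int mode, const unsigned int *to_pool,
		 unsigned int cpu, unsigned int nrpools)
{
	unsigned int pidx = 0;

	switch (mode) {
	case TOY_POOL_PERCPU:
		pidx = to_pool[cpu];
		break;
	case TOY_POOL_PERNODE:
		pidx = to_pool[toy_cpu_to_node(cpu)];
		break;
	}
	return pidx % nrpools;
}
```

With a per-node map of {0, 1}, an interrupt taken on cpu 3 enqueues the socket on node 1's pool, so only threads bound to that node's cpus will service it; that is the node-locality the throughput numbers in patch 000 come from.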
diff .prev/include/linux/sunrpc/svc.h ./include/linux/sunrpc/svc.h
--- .prev/include/linux/sunrpc/svc.h 2006-07-31 10:30:30.000000000 +1000
+++ ./include/linux/sunrpc/svc.h 2006-07-31 10:31:12.000000000 +1000
@@ -369,5 +369,6 @@ int svc_process(struct svc_rqst *);
int svc_register(struct svc_serv *, int, unsigned short);
void svc_wake_up(struct svc_serv *);
void svc_reserve(struct svc_rqst *rqstp, int space);
+struct svc_pool * svc_pool_for_cpu(struct svc_serv *serv, int cpu);
#endif /* SUNRPC_SVC_H */
diff .prev/net/sunrpc/svc.c ./net/sunrpc/svc.c
--- .prev/net/sunrpc/svc.c 2006-07-31 10:30:30.000000000 +1000
+++ ./net/sunrpc/svc.c 2006-07-31 10:31:30.000000000 +1000
@@ -4,6 +4,10 @@
* High-level RPC service routines
*
* Copyright (C) 1995, 1996 Olaf Kirch <okir@monad.swb.de>
+ *
+ * Multiple thread pools and NUMAisation
+ * Copyright (c) 2006 Silicon Graphics, Inc.
+ * by Greg Banks <gnb@melbourne.sgi.com>
*/
#include <linux/linkage.h>
@@ -25,6 +29,233 @@
#define RPC_PARANOIA 1
/*
+ * Mode for mapping cpus to pools.
+ */
+enum {
+ SVC_POOL_NONE = -1, /* uninitialised, choose one of the others */
+ SVC_POOL_GLOBAL, /* no mapping, just a single global pool
+ * (legacy & UP mode) */
+ SVC_POOL_PERCPU, /* one pool per cpu */
+ SVC_POOL_PERNODE /* one pool per numa node */
+};
+
+/*
+ * Structure for mapping cpus to pools and vice versa.
+ * Setup once during sunrpc initialisation.
+ */
+static struct svc_pool_map {
+ int mode; /* Note: int not enum to avoid
+ * warnings about "enumeration value
+ * not handled in switch" */
+ unsigned int npools;
+ unsigned int *pool_to; /* maps pool id to cpu or node */
+ unsigned int *to_pool; /* maps cpu or node to pool id */
+} svc_pool_map = {
+ .mode = SVC_POOL_NONE
+};
+
+
+/*
+ * Detect best pool mapping mode heuristically,
+ * according to the machine's topology.
+ */
+static int
+svc_pool_map_choose_mode(void)
+{
+ unsigned int node;
+
+ if (num_online_nodes() > 1) {
+ /*
+ * Actually have multiple NUMA nodes,
+ * so split pools on NUMA node boundaries
+ */
+ return SVC_POOL_PERNODE;
+ }
+
+ node = any_online_node(node_online_map);
+ if (nr_cpus_node(node) > 2) {
+ /*
+ * Non-trivial SMP, or CONFIG_NUMA on
+ * non-NUMA hardware, e.g. with a generic
+ * x86_64 kernel on Xeons. In this case we
+ * want to divide the pools on cpu boundaries.
+ */
+ return SVC_POOL_PERCPU;
+ }
+
+ /* default: one global pool */
+ return SVC_POOL_GLOBAL;
+}
+
+/*
+ * Allocate the to_pool[] and pool_to[] arrays.
+ * Returns 0 on success or an errno.
+ */
+static int
+svc_pool_map_alloc_arrays(struct svc_pool_map *m, unsigned int maxpools)
+{
+ m->to_pool = kcalloc(maxpools, sizeof(unsigned int), GFP_KERNEL);
+ if (!m->to_pool)
+ goto fail;
+ m->pool_to = kcalloc(maxpools, sizeof(unsigned int), GFP_KERNEL);
+ if (!m->pool_to)
+ goto fail_free;
+
+ return 0;
+
+fail_free:
+ kfree(m->to_pool);
+fail:
+ return -ENOMEM;
+}
+
+/*
+ * Initialise the pool map for SVC_POOL_PERCPU mode.
+ * Returns number of pools or <0 on error.
+ */
+static int
+svc_pool_map_init_percpu(struct svc_pool_map *m)
+{
+ unsigned int maxpools = num_possible_cpus();
+ unsigned int pidx = 0;
+ unsigned int cpu;
+ int err;
+
+ err = svc_pool_map_alloc_arrays(m, maxpools);
+ if (err)
+ return err;
+
+ for_each_online_cpu(cpu) {
+ BUG_ON(pidx > maxpools);
+ m->to_pool[cpu] = pidx;
+ m->pool_to[pidx] = cpu;
+ pidx++;
+ }
+ /* cpus brought online later all get mapped to pool0, sorry */
+
+ return pidx;
+}
+
+
+/*
+ * Initialise the pool map for SVC_POOL_PERNODE mode.
+ * Returns number of pools or <0 on error.
+ */
+static int
+svc_pool_map_init_pernode(struct svc_pool_map *m)
+{
+ unsigned int maxpools = num_possible_nodes();
+ unsigned int pidx = 0;
+ unsigned int node;
+ int err;
+
+ err = svc_pool_map_alloc_arrays(m, maxpools);
+ if (err)
+ return err;
+
+ for_each_node_with_cpus(node) {
+ /* some architectures (e.g. SN2) have cpuless nodes */
+ BUG_ON(pidx > maxpools);
+ m->to_pool[node] = pidx;
+ m->pool_to[pidx] = node;
+ pidx++;
+ }
+ /* nodes brought online later all get mapped to pool0, sorry */
+
+ return pidx;
+}
+
+
+/*
+ * Build the global map of cpus to pools and vice versa.
+ */
+static unsigned int
+svc_pool_map_init(void)
+{
+ struct svc_pool_map *m = &svc_pool_map;
+ int npools = -1;
+
+ if (m->mode != SVC_POOL_NONE)
+ return m->npools;
+
+ m->mode = svc_pool_map_choose_mode();
+
+ switch (m->mode) {
+ case SVC_POOL_PERCPU:
+ npools = svc_pool_map_init_percpu(m);
+ break;
+ case SVC_POOL_PERNODE:
+ npools = svc_pool_map_init_pernode(m);
+ break;
+ }
+
+ if (npools < 0) {
+ /* default, or memory allocation failure */
+ npools = 1;
+ m->mode = SVC_POOL_GLOBAL;
+ }
+ m->npools = npools;
+
+ return m->npools;
+}
+
+/*
+ * Set the current thread's cpus_allowed mask so that it
+ * will only run on cpus in the given pool.
+ *
+ * Returns 1 and fills in oldmask iff a cpumask was applied.
+ */
+static inline int
+svc_pool_map_set_cpumask(unsigned int pidx, cpumask_t *oldmask)
+{
+ struct svc_pool_map *m = &svc_pool_map;
+ unsigned int node; /* or cpu */
+
+ BUG_ON(m->mode == SVC_POOL_NONE);
+
+ switch (m->mode)
+ {
+ default:
+ return 0;
+ case SVC_POOL_PERCPU:
+ node = m->pool_to[pidx];
+ *oldmask = current->cpus_allowed;
+ set_cpus_allowed(current, cpumask_of_cpu(node));
+ return 1;
+ case SVC_POOL_PERNODE:
+ node = m->pool_to[pidx];
+ *oldmask = current->cpus_allowed;
+ set_cpus_allowed(current, node_to_cpumask(node));
+ return 1;
+ }
+}
+
+/*
+ * Use the mapping mode to choose a pool for a given CPU.
+ * Used when enqueueing an incoming RPC. Always returns
+ * a non-NULL pool pointer.
+ */
+struct svc_pool *
+svc_pool_for_cpu(struct svc_serv *serv, int cpu)
+{
+ struct svc_pool_map *m = &svc_pool_map;
+ unsigned int pidx = 0;
+
+ BUG_ON(m->mode == SVC_POOL_NONE);
+
+ switch (m->mode) {
+ case SVC_POOL_PERCPU:
+ pidx = m->to_pool[cpu];
+ break;
+ case SVC_POOL_PERNODE:
+ pidx = m->to_pool[cpu_to_node(cpu)];
+ break;
+ }
+ return &serv->sv_pools[pidx % serv->sv_nrpools];
+}
+
+
+/*
* Create an RPC service
*/
static struct svc_serv *
@@ -105,8 +336,9 @@ svc_create_pooled(struct svc_program *pr
svc_thread_fn func, int sig, struct module *mod)
{
struct svc_serv *serv;
+ unsigned int npools = svc_pool_map_init();
- serv = __svc_create(prog, bufsize, /*npools*/1, shutdown);
+ serv = __svc_create(prog, bufsize, npools, shutdown);
if (serv != NULL) {
serv->sv_function = func;
@@ -209,6 +441,8 @@ svc_release_buffer(struct svc_rqst *rqst
/*
* Create a thread in the given pool. Caller must hold BKL.
+ * On a NUMA or SMP machine, with a multi-pool serv, the thread
+ * will be restricted to run on the cpus belonging to the pool.
*/
static int
__svc_create_thread(svc_thread_fn func, struct svc_serv *serv,
@@ -216,6 +450,8 @@ __svc_create_thread(svc_thread_fn func,
{
struct svc_rqst *rqstp;
int error = -ENOMEM;
+ int have_oldmask = 0;
+ cpumask_t oldmask;
rqstp = kzalloc(sizeof(*rqstp), GFP_KERNEL);
if (!rqstp)
@@ -235,7 +471,15 @@ __svc_create_thread(svc_thread_fn func,
spin_unlock_bh(&pool->sp_lock);
rqstp->rq_server = serv;
rqstp->rq_pool = pool;
+
+ if (serv->sv_nrpools > 1)
+ have_oldmask = svc_pool_map_set_cpumask(pool->sp_id, &oldmask);
+
error = kernel_thread((int (*)(void *)) func, rqstp, 0);
+
+ if (have_oldmask)
+ set_cpus_allowed(current, oldmask);
+
if (error < 0)
goto out_thread;
svc_sock_update_bufs(serv);
diff .prev/net/sunrpc/svcsock.c ./net/sunrpc/svcsock.c
--- .prev/net/sunrpc/svcsock.c 2006-07-31 10:29:38.000000000 +1000
+++ ./net/sunrpc/svcsock.c 2006-07-31 10:31:12.000000000 +1000
@@ -151,8 +151,9 @@ static void
svc_sock_enqueue(struct svc_sock *svsk)
{
struct svc_serv *serv = svsk->sk_server;
- struct svc_pool *pool = &serv->sv_pools[0];
+ struct svc_pool *pool;
struct svc_rqst *rqstp;
+ int cpu;
if (!(svsk->sk_flags &
( (1<<SK_CONN)|(1<<SK_DATA)|(1<<SK_CLOSE)|(1<<SK_DEFERRED)) ))
@@ -160,6 +161,10 @@ svc_sock_enqueue(struct svc_sock *svsk)
if (test_bit(SK_DEAD, &svsk->sk_flags))
return;
+ cpu = get_cpu();
+ pool = svc_pool_for_cpu(svsk->sk_server, cpu);
+ put_cpu();
+
spin_lock_bh(&pool->sp_lock);
if (!list_empty(&pool->sp_threads) &&
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 011 of 11] knfsd: allow admin to set nthreads per node
2006-07-31 0:41 [PATCH 000 of 11] knfsd: Introduction - Make knfsd more NUMA-aware NeilBrown
` (9 preceding siblings ...)
2006-07-31 0:42 ` [PATCH 010 of 11] knfsd: make rpc threads pools numa aware NeilBrown
@ 2006-07-31 0:42 ` NeilBrown
10 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2006-07-31 0:42 UTC (permalink / raw)
To: Andrew Morton; +Cc: nfs, linux-kernel
From: Greg Banks <gnb@melbourne.sgi.com>
knfsd: Add /proc/fs/nfsd/pool_threads which allows the sysadmin (or
a userspace daemon) to read and change the number of nfsd threads
in each pool. The format is a list of space-separated integers,
one per pool.
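As a sketch of that format (user-space illustration only, not kernel code; the helper name is hypothetical), a buffer in the shape pool_threads expects can be built like this:

```c
#include <stdio.h>

/* Format per-pool thread counts the way pool_threads expects:
 * space-separated integers, one per pool, newline-terminated. */
static int format_pool_threads(char *buf, size_t size,
                               const int *nthreads, int npools)
{
	size_t off = 0;
	int i;

	for (i = 0; i < npools; i++) {
		int n = snprintf(buf + off, size - off, "%d%c",
				 nthreads[i],
				 i == npools - 1 ? '\n' : ' ');
		if (n < 0 || (size_t)n >= size - off)
			return -1;	/* buffer too small */
		off += n;
	}
	return (int)off;
}
```

Writing such a buffer to /proc/fs/nfsd/pool_threads is then an ordinary open/write from the daemon.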
Signed-off-by: Greg Banks <gnb@melbourne.sgi.com>
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./fs/nfsd/nfsctl.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++
./fs/nfsd/nfssvc.c | 74 +++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 144 insertions(+)
diff .prev/fs/nfsd/nfsctl.c ./fs/nfsd/nfsctl.c
--- .prev/fs/nfsd/nfsctl.c 2006-07-31 10:29:37.000000000 +1000
+++ ./fs/nfsd/nfsctl.c 2006-07-31 10:31:46.000000000 +1000
@@ -54,6 +54,7 @@ enum {
NFSD_List,
NFSD_Fh,
NFSD_Threads,
+ NFSD_Pool_Threads,
NFSD_Versions,
NFSD_Ports,
/*
@@ -78,6 +79,7 @@ static ssize_t write_getfd(struct file *
static ssize_t write_getfs(struct file *file, char *buf, size_t size);
static ssize_t write_filehandle(struct file *file, char *buf, size_t size);
static ssize_t write_threads(struct file *file, char *buf, size_t size);
+static ssize_t write_pool_threads(struct file *file, char *buf, size_t size);
static ssize_t write_versions(struct file *file, char *buf, size_t size);
static ssize_t write_ports(struct file *file, char *buf, size_t size);
#ifdef CONFIG_NFSD_V4
@@ -95,6 +97,7 @@ static ssize_t (*write_op[])(struct file
[NFSD_Getfs] = write_getfs,
[NFSD_Fh] = write_filehandle,
[NFSD_Threads] = write_threads,
+ [NFSD_Pool_Threads] = write_pool_threads,
[NFSD_Versions] = write_versions,
[NFSD_Ports] = write_ports,
#ifdef CONFIG_NFSD_V4
@@ -363,6 +366,72 @@ static ssize_t write_threads(struct file
return strlen(buf);
}
+extern int nfsd_nrpools(void);
+extern int nfsd_get_nrthreads(int n, int *);
+extern int nfsd_set_nrthreads(int n, int *);
+
+static ssize_t write_pool_threads(struct file *file, char *buf, size_t size)
+{
+ /* If size > 0, look for an array of numbers of threads per pool,
+ * apply them, then write out the number of threads per pool as
+ * the reply.
+ */
+ char *mesg = buf;
+ int i;
+ int rv;
+ int len;
+ int npools = nfsd_nrpools();
+ int *nthreads;
+
+ if (npools == 0) {
+ /*
+ * NFS is shut down. The admin can start it by
+ * writing to the threads file but NOT the pool_threads
+ * file, sorry. Report zero threads.
+ */
+ strcpy(buf, "0\n");
+ return strlen(buf);
+ }
+
+ nthreads = kcalloc(npools, sizeof(int), GFP_KERNEL);
+ if (nthreads == NULL)
+ return -ENOMEM;
+
+ if (size > 0) {
+ for (i = 0 ; i < npools ; i++) {
+ rv = get_int(&mesg, &nthreads[i]);
+ if (rv == -ENOENT)
+ break; /* fewer numbers than pools */
+ if (rv)
+ goto out_free; /* syntax error */
+ rv = -EINVAL;
+ if (nthreads[i] < 0)
+ goto out_free;
+ }
+ rv = nfsd_set_nrthreads(i, nthreads);
+ if (rv)
+ goto out_free;
+ }
+
+ rv = nfsd_get_nrthreads(npools, nthreads);
+ if (rv)
+ goto out_free;
+
+ mesg = buf;
+ size = SIMPLE_TRANSACTION_LIMIT;
+ for (i = 0 ; i < npools && size > 0 ; i++) {
+ snprintf(mesg, size, "%d%c", nthreads[i], (i == npools-1 ? '\n' : ' '));
+ len = strlen(mesg);
+ size -= len;
+ mesg += len;
+ }
+
+ rv = mesg - buf;
+
+out_free:
+ kfree(nthreads);
+ return rv;
+}
+
static ssize_t write_versions(struct file *file, char *buf, size_t size)
{
/*
@@ -544,6 +613,7 @@ static int nfsd_fill_super(struct super_
[NFSD_List] = {"exports", &exports_operations, S_IRUGO},
[NFSD_Fh] = {"filehandle", &transaction_ops, S_IWUSR|S_IRUSR},
[NFSD_Threads] = {"threads", &transaction_ops, S_IWUSR|S_IRUSR},
+ [NFSD_Pool_Threads] = {"pool_threads", &transaction_ops, S_IWUSR|S_IRUSR},
[NFSD_Versions] = {"versions", &transaction_ops, S_IWUSR|S_IRUSR},
[NFSD_Ports] = {"portlist", &transaction_ops, S_IWUSR|S_IRUGO},
#ifdef CONFIG_NFSD_V4
diff .prev/fs/nfsd/nfssvc.c ./fs/nfsd/nfssvc.c
--- .prev/fs/nfsd/nfssvc.c 2006-07-31 10:31:08.000000000 +1000
+++ ./fs/nfsd/nfssvc.c 2006-07-31 10:31:46.000000000 +1000
@@ -238,6 +238,80 @@ static int nfsd_init_socks(int port)
return 0;
}
+int nfsd_nrpools(void)
+{
+ if (nfsd_serv == NULL)
+ return 0;
+ else
+ return nfsd_serv->sv_nrpools;
+}
+
+int nfsd_get_nrthreads(int n, int *nthreads)
+{
+ int i = 0;
+
+ if (nfsd_serv != NULL) {
+ for (i = 0 ; i < nfsd_serv->sv_nrpools && i < n ; i++)
+ nthreads[i] = nfsd_serv->sv_pools[i].sp_nrthreads;
+ }
+
+ return 0;
+}
+
+int nfsd_set_nrthreads(int n, int *nthreads)
+{
+ int i = 0;
+ int tot = 0;
+ int err = 0;
+
+ if (nfsd_serv == NULL || n <= 0)
+ return 0;
+
+ if (n > nfsd_serv->sv_nrpools)
+ n = nfsd_serv->sv_nrpools;
+
+ /* enforce a global maximum number of threads */
+ tot = 0;
+ for (i = 0 ; i < n ; i++) {
+ if (nthreads[i] > NFSD_MAXSERVS)
+ nthreads[i] = NFSD_MAXSERVS;
+ tot += nthreads[i];
+ }
+ if (tot > NFSD_MAXSERVS) {
+ /* total too large: scale down requested numbers */
+ for (i = 0 ; i < n && tot > 0 ; i++) {
+ int new = nthreads[i] * NFSD_MAXSERVS / tot;
+ tot -= (nthreads[i] - new);
+ nthreads[i] = new;
+ }
+ for (i = 0 ; i < n && tot > 0 ; i++) {
+ nthreads[i]--;
+ tot--;
+ }
+ }
+
+ /*
+ * There must always be a thread in pool 0; the admin
+ * can't shut down NFS completely using pool_threads.
+ */
+ if (nthreads[0] == 0)
+ nthreads[0] = 1;
+
+ /* apply the new numbers */
+ lock_kernel();
+ svc_get(nfsd_serv);
+ for (i = 0 ; i < n ; i++) {
+ err = svc_set_num_threads(nfsd_serv, &nfsd_serv->sv_pools[i],
+ nthreads[i]);
+ if (err)
+ break;
+ }
+ svc_destroy(nfsd_serv);
+ unlock_kernel();
+
+ return err;
+}
+
int
nfsd_svc(unsigned short port, int nrservs)
{
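The scale-down arithmetic in nfsd_set_nrthreads() above can be exercised in isolation. A user-space sketch (the maximum is a parameter here, where the kernel uses NFSD_MAXSERVS; `newv` stands in for the kernel's `new`):

```c
/* Mirror of the nfsd_set_nrthreads() scale-down: clamp each pool's
 * request to the global maximum, scale proportionally if the total
 * is still too large, then decrement once per pool to absorb
 * rounding.  Pool 0 always keeps at least one thread. */
static void scale_pool_threads(int *nthreads, int n, int maxservs)
{
	int i, tot = 0;

	for (i = 0; i < n; i++) {
		if (nthreads[i] > maxservs)
			nthreads[i] = maxservs;
		tot += nthreads[i];
	}
	if (tot > maxservs) {
		/* total too large: scale down requested numbers */
		for (i = 0; i < n && tot > 0; i++) {
			int newv = nthreads[i] * maxservs / tot;
			tot -= (nthreads[i] - newv);
			nthreads[i] = newv;
		}
		for (i = 0; i < n && tot > 0; i++) {
			nthreads[i]--;
			tot--;
		}
	}
	if (nthreads[0] == 0)
		nthreads[0] = 1;
}
```

Note the scaling is approximate: `tot` shrinks as the first loop proceeds, so later pools are scaled against a smaller running total rather than the original sum.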
* Re: [PATCH 007 of 11] knfsd: add svc_get
2006-07-31 0:42 ` [PATCH 007 of 11] knfsd: add svc_get NeilBrown
@ 2006-07-31 4:05 ` Andrew Morton
2006-07-31 4:16 ` Neil Brown
0 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2006-07-31 4:05 UTC (permalink / raw)
To: NeilBrown; +Cc: nfs, linux-kernel
On Mon, 31 Jul 2006 10:42:19 +1000
NeilBrown <neilb@suse.de> wrote:
> /*
> + * We use sv_nrthreads as a reference count. svc_destroy() drops
> + * this refcount, so we need to bump it up around operations that
> + * change the number of threads. Horrible, but there it is.
> + * Should be called with the BKL held.
> + */
> +static inline void svc_get(struct svc_serv *serv)
> +{
> + serv->sv_nrthreads++;
> +}
It's a bit odd for a numa-scalability patch to be increasing our dependency
upon lock_kernel()...
* Re: [PATCH 008 of 11] knfsd: add svc_set_num_threads
2006-07-31 0:42 ` [PATCH 008 of 11] knfsd: add svc_set_num_threads NeilBrown
@ 2006-07-31 4:11 ` Andrew Morton
2006-07-31 4:24 ` [NFS] " Neil Brown
0 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2006-07-31 4:11 UTC (permalink / raw)
To: NeilBrown; +Cc: nfs, linux-kernel
On Mon, 31 Jul 2006 10:42:23 +1000
NeilBrown <neilb@suse.de> wrote:
> + /* destroy old threads */
> + while (nrservs < 0 &&
> + (victim = choose_victim(serv, pool, &state)) != NULL) {
> + send_sig(serv->sv_kill_signal, victim, 1);
> + nrservs++;
Using signals to communicate with kernel threads is rather baroque - we
have a range of less klunky ways of controlling kernel threads in-kernel.
The containers guys are going through converting lots of these things over
to the kthread API - I believe it's a requirement for containerisation.
nfsd/rpc is going to be one of the hard ones to convert, but it's going to
happen.
* Re: [PATCH 010 of 11] knfsd: make rpc threads pools numa aware
2006-07-31 0:42 ` [PATCH 010 of 11] knfsd: make rpc threads pools numa aware NeilBrown
@ 2006-07-31 4:14 ` Andrew Morton
2006-07-31 4:36 ` Neil Brown
2006-08-06 9:47 ` Andrew Morton
1 sibling, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2006-07-31 4:14 UTC (permalink / raw)
To: NeilBrown; +Cc: nfs, linux-kernel
On Mon, 31 Jul 2006 10:42:34 +1000
NeilBrown <neilb@suse.de> wrote:
> +static int
> +svc_pool_map_init_percpu(struct svc_pool_map *m)
> +{
> + unsigned int maxpools = num_possible_cpus();
> + unsigned int pidx = 0;
> + unsigned int cpu;
> + int err;
> +
> + err = svc_pool_map_alloc_arrays(m, maxpools);
> + if (err)
> + return err;
> +
> + for_each_online_cpu(cpu) {
> + BUG_ON(pidx > maxpools);
> + m->to_pool[cpu] = pidx;
> + m->pool_to[pidx] = cpu;
> + pidx++;
> + }
That isn't right - it assumes that cpu_possible_map is not sparse. If it
is sparse, we allocate undersized pools and then overindex them.
* Re: [PATCH 007 of 11] knfsd: add svc_get
2006-07-31 4:05 ` Andrew Morton
@ 2006-07-31 4:16 ` Neil Brown
0 siblings, 0 replies; 25+ messages in thread
From: Neil Brown @ 2006-07-31 4:16 UTC (permalink / raw)
To: Andrew Morton; +Cc: nfs, linux-kernel
On Sunday July 30, akpm@osdl.org wrote:
> On Mon, 31 Jul 2006 10:42:19 +1000
> NeilBrown <neilb@suse.de> wrote:
>
> > /*
> > + * We use sv_nrthreads as a reference count. svc_destroy() drops
> > + * this refcount, so we need to bump it up around operations that
> > + * change the number of threads. Horrible, but there it is.
> > + * Should be called with the BKL held.
> > + */
> > +static inline void svc_get(struct svc_serv *serv)
> > +{
> > + serv->sv_nrthreads++;
> > +}
>
> It's a bit odd for a numa-scalability patch to be increasing our dependency
> upon lock_kernel()...
It just looks odd.
This patch doesn't change generated code - it just puts a
post-increment into an inline function so that it can be documented
and well understood.
We don't change the number of threads very often (mainly at bootup and
shutdown) so having that under the BKL isn't a big cost.
NeilBrown
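The refcount convention under discussion reduces to something like this simplified model (not the kernel code; the model's svc_destroy() only reports when the last reference is dropped, it does not free anything):

```c
/* Simplified model of the sv_nrthreads-as-refcount convention:
 * each thread (and each in-flight admin operation) holds one
 * reference; dropping the last one is what frees the serv. */
struct svc_serv_model {
	int sv_nrthreads;
};

static void svc_get_model(struct svc_serv_model *serv)
{
	serv->sv_nrthreads++;
}

/* returns 1 when the last reference was dropped */
static int svc_destroy_model(struct svc_serv_model *serv)
{
	return --serv->sv_nrthreads == 0;
}
```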
* Re: [NFS] [PATCH 008 of 11] knfsd: add svc_set_num_threads
2006-07-31 4:11 ` Andrew Morton
@ 2006-07-31 4:24 ` Neil Brown
0 siblings, 0 replies; 25+ messages in thread
From: Neil Brown @ 2006-07-31 4:24 UTC (permalink / raw)
To: Andrew Morton; +Cc: nfs, linux-kernel
On Sunday July 30, akpm@osdl.org wrote:
> On Mon, 31 Jul 2006 10:42:23 +1000
> NeilBrown <neilb@suse.de> wrote:
>
> > + /* destroy old threads */
> > + while (nrservs < 0 &&
> > + (victim = choose_victim(serv, pool, &state)) != NULL) {
> > + send_sig(serv->sv_kill_signal, victim, 1);
> > + nrservs++;
>
> Using signals to communicate with kernel threads is rather baroque - we
> have a range of less klunky ways of controlling kernel threads in-kernel.
True.
>
> The containers guys are going through converting lots of these things over
> to the kthread API - I believe it's a requirement for containerisation.
>
Yes - hch has suggested that this needs to be done. I had a try and
there were enough complications that I decided to leave it until
after Greg's NUMA stuff.
> nfsd/rpc is going to be one of the hard ones to convert, but it's going to
> happen.
yep.
NeilBrown
* Re: [PATCH 010 of 11] knfsd: make rpc threads pools numa aware
2006-07-31 4:14 ` Andrew Morton
@ 2006-07-31 4:36 ` Neil Brown
2006-07-31 4:42 ` [NFS] " Greg Banks
0 siblings, 1 reply; 25+ messages in thread
From: Neil Brown @ 2006-07-31 4:36 UTC (permalink / raw)
To: Andrew Morton; +Cc: nfs, linux-kernel
On Sunday July 30, akpm@osdl.org wrote:
> On Mon, 31 Jul 2006 10:42:34 +1000
> NeilBrown <neilb@suse.de> wrote:
>
> > +static int
> > +svc_pool_map_init_percpu(struct svc_pool_map *m)
> > +{
> > + unsigned int maxpools = num_possible_cpus();
> > + unsigned int pidx = 0;
> > + unsigned int cpu;
> > + int err;
> > +
> > + err = svc_pool_map_alloc_arrays(m, maxpools);
> > + if (err)
> > + return err;
> > +
> > + for_each_online_cpu(cpu) {
> > + BUG_ON(pidx > maxpools);
> > + m->to_pool[cpu] = pidx;
> > + m->pool_to[pidx] = cpu;
> > + pidx++;
> > + }
>
> That isn't right - it assumes that cpu_possible_map is not sparse. If it
> is sparse, we allocate undersized pools and then overindex them.
I don't think so.
At this point we are largely counting the number of online cpus
(in pidx (pool index) - this is returned). The two-way mapping
to_pool and pool_to provides a mapping between the possible-sparse cpu
list and a dense list of pool indexes.
If further cpus come on line they will be automatically included in
pool-0. (as to_pool[n] will still be zero).
Does that make it at all clearer?
Thanks,
NeilBrown
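The two-way mapping described above can be modelled in a few lines (a toy, not the kernel's svc_pool_map; the array size is an illustrative assumption):

```c
/* Toy model of the two-way mapping between a possibly-sparse set
 * of online cpu ids and a dense list of pool indexes. */
#define MAX_ID 16

static unsigned int to_pool[MAX_ID];	/* cpu id -> pool index */
static unsigned int pool_to[MAX_ID];	/* pool   -> cpu id     */

/* returns the number of pools, one per online cpu */
static unsigned int map_init(const int *online, int ncpus)
{
	unsigned int pidx = 0;
	int i;

	for (i = 0; i < ncpus; i++) {
		to_pool[online[i]] = pidx;
		pool_to[pidx] = online[i];
		pidx++;
	}
	return pidx;
}
```

A cpu that comes online later still maps to pool 0, because to_pool[] stays zero for any id the init loop never visited.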
* Re: [NFS] [PATCH 010 of 11] knfsd: make rpc threads pools numa aware
2006-07-31 4:36 ` Neil Brown
@ 2006-07-31 4:42 ` Greg Banks
2006-07-31 5:54 ` Greg Banks
0 siblings, 1 reply; 25+ messages in thread
From: Greg Banks @ 2006-07-31 4:42 UTC (permalink / raw)
To: Neil Brown
Cc: Andrew Morton, Linux NFS Mailing List, Linux Kernel Mailing List
On Mon, 2006-07-31 at 14:36, Neil Brown wrote:
> On Sunday July 30, akpm@osdl.org wrote:
> > On Mon, 31 Jul 2006 10:42:34 +1000
> > NeilBrown <neilb@suse.de> wrote:
> >
> > > +static int
> > > +svc_pool_map_init_percpu(struct svc_pool_map *m)
> > > +{
> > > + unsigned int maxpools = num_possible_cpus();
> > > + unsigned int pidx = 0;
> > > + unsigned int cpu;
> > > + int err;
> > > +
> > > + err = svc_pool_map_alloc_arrays(m, maxpools);
> > > + if (err)
> > > + return err;
> > > +
> > > + for_each_online_cpu(cpu) {
> > > + BUG_ON(pidx > maxpools);
> > > + m->to_pool[cpu] = pidx;
> > > + m->pool_to[pidx] = cpu;
> > > + pidx++;
> > > + }
> >
> > That isn't right - it assumes that cpu_possible_map is not sparse. If it
> > is sparse, we allocate undersized pools and then overindex them.
>
> I don't think so.
>
> At this point we are largely counting the number of online cpus
> (in pidx (pool index) - this is returned). The two-way mapping
> to_pool and pool_to provides a mapping between the possible-sparse cpu
> list and a dense list of pool indexes.
>
> If further cpus come on line they will be automatically included in
> pool-0. (as to_pool[n] will still be zero).
>
> Does that make it at all clearer?
Umm, I think Andrew's right, num_possible_cpus() should be NR_CPUS.
If there's a value of `cpu' > num_possible_cpus(), which would happen
if the cpu numbers weren't contiguous, then m->to_pool[] will be
overflowed. My bad, I didn't even consider the case of non-contiguous
CPU numbers and none of the machines available for testing had that
property.
Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
* Re: [NFS] [PATCH 010 of 11] knfsd: make rpc threads pools numa aware
2006-07-31 4:42 ` [NFS] " Greg Banks
@ 2006-07-31 5:54 ` Greg Banks
2006-08-01 4:43 ` Andrew Morton
0 siblings, 1 reply; 25+ messages in thread
From: Greg Banks @ 2006-07-31 5:54 UTC (permalink / raw)
To: Neil Brown
Cc: Andrew Morton, Linux NFS Mailing List, Linux Kernel Mailing List
On Mon, 2006-07-31 at 14:42, Greg Banks wrote:
> On Mon, 2006-07-31 at 14:36, Neil Brown wrote:
> > On Sunday July 30, akpm@osdl.org wrote:
> > > On Mon, 31 Jul 2006 10:42:34 +1000
> > > NeilBrown <neilb@suse.de> wrote:
> > >
> > > > +static int
> > > > +svc_pool_map_init_percpu(struct svc_pool_map *m)
> > > > +{
> > > > + unsigned int maxpools = num_possible_cpus();
> > > > + unsigned int pidx = 0;
> > > > + unsigned int cpu;
> > > > + int err;
> > > > +
> > > > +
> > >
> > > That isn't right - it assumes that cpu_possible_map is not sparse. If it
> > > is sparse, we allocate undersized pools and then overindex them.
>
> Umm, I think Andrew's right, num_possible_cpus() should be NR_CPUS.
How about this version of the patch? It replaces num_possible_cpus()
with highest_possible_processor_id()+1 and similarly for nodes.
--
knfsd: Actually implement multiple pools. On NUMA machines, allocate
a svc_pool per NUMA node; on SMP a svc_pool per CPU; otherwise a single
global pool. Enqueue sockets on the svc_pool corresponding to the CPU
on which the socket bh is run (i.e. the NIC interrupt CPU). Threads
have their cpu mask set to limit them to the CPUs in the svc_pool that
owns them.
This is the patch that allows an Altix to scale NFS traffic linearly
beyond 4 CPUs and 4 NICs.
Incorporates changes and feedback from Neil Brown, Trond Myklebust,
Christoph Hellwig and Andrew Morton.
Signed-off-by: Greg Banks <gnb@melbourne.sgi.com>
---
include/linux/sunrpc/svc.h | 1
net/sunrpc/svc.c | 258 +++++++++++++++++++++++++++++++++-
net/sunrpc/svcsock.c | 7
3 files changed, 264 insertions(+), 2 deletions(-)
Index: linus-git/net/sunrpc/svc.c
===================================================================
--- linus-git.orig/net/sunrpc/svc.c 2006-07-26 16:50:25.802187916 +1000
+++ linus-git/net/sunrpc/svc.c 2006-07-31 15:08:59.999800051 +1000
@@ -4,6 +4,10 @@
* High-level RPC service routines
*
* Copyright (C) 1995, 1996 Olaf Kirch <okir@monad.swb.de>
+ *
+ * Multiple thread pools and NUMAisation
+ * Copyright (c) 2006 Silicon Graphics, Inc.
+ * by Greg Banks <gnb@melbourne.sgi.com>
*/
#include <linux/linkage.h>
@@ -25,6 +29,245 @@
#define RPC_PARANOIA 1
/*
+ * Mode for mapping cpus to pools.
+ */
+enum {
+ SVC_POOL_NONE = -1, /* uninitialised, choose one of the others */
+ SVC_POOL_GLOBAL, /* no mapping, just a single global pool
+ * (legacy & UP mode) */
+ SVC_POOL_PERCPU, /* one pool per cpu */
+ SVC_POOL_PERNODE /* one pool per numa node */
+};
+
+/*
+ * Structure for mapping cpus to pools and vice versa.
+ * Setup once during sunrpc initialisation.
+ */
+static struct svc_pool_map {
+ int mode; /* Note: int not enum to avoid
+ * warnings about "enumeration value
+ * not handled in switch" */
+ unsigned int npools;
+ unsigned int *pool_to; /* maps pool id to cpu or node */
+ unsigned int *to_pool; /* maps cpu or node to pool id */
+} svc_pool_map = {
+ .mode = SVC_POOL_NONE
+};
+
+
+/*
+ * Detect best pool mapping mode heuristically,
+ * according to the machine's topology.
+ */
+static int
+svc_pool_map_choose_mode(void)
+{
+ unsigned int node;
+
+ if (num_online_nodes() > 1) {
+ /*
+ * Actually have multiple NUMA nodes,
+ * so split pools on NUMA node boundaries
+ */
+ return SVC_POOL_PERNODE;
+ }
+
+ node = any_online_node(node_online_map);
+ if (nr_cpus_node(node) > 2) {
+ /*
+ * Non-trivial SMP, or CONFIG_NUMA on
+ * non-NUMA hardware, e.g. with a generic
+ * x86_64 kernel on Xeons. In this case we
+ * want to divide the pools on cpu boundaries.
+ */
+ return SVC_POOL_PERCPU;
+ }
+
+ /* default: one global pool */
+ return SVC_POOL_GLOBAL;
+}
+
+/*
+ * Allocate the to_pool[] and pool_to[] arrays.
+ * Returns 0 on success or an errno.
+ */
+static int
+svc_pool_map_alloc_arrays(struct svc_pool_map *m, unsigned int maxpools)
+{
+ m->to_pool = kcalloc(maxpools, sizeof(unsigned int), GFP_KERNEL);
+ if (!m->to_pool)
+ goto fail;
+ m->pool_to = kcalloc(maxpools, sizeof(unsigned int), GFP_KERNEL);
+ if (!m->pool_to)
+ goto fail_free;
+
+ return 0;
+
+fail_free:
+ kfree(m->to_pool);
+fail:
+ return -ENOMEM;
+}
+
+/*
+ * Initialise the pool map for SVC_POOL_PERCPU mode.
+ * Returns number of pools or <0 on error.
+ */
+static int
+svc_pool_map_init_percpu(struct svc_pool_map *m)
+{
+ unsigned int maxpools = highest_possible_processor_id()+1;
+ unsigned int pidx = 0;
+ unsigned int cpu;
+ int err;
+
+ err = svc_pool_map_alloc_arrays(m, maxpools);
+ if (err)
+ return err;
+
+ for_each_online_cpu(cpu) {
+ BUG_ON(pidx > maxpools);
+ m->to_pool[cpu] = pidx;
+ m->pool_to[pidx] = cpu;
+ pidx++;
+ }
+ /* cpus brought online later all get mapped to pool0, sorry */
+
+ return pidx;
+};
+
+static int
+highest_possible_node_id(void)
+{
+ unsigned int node;
+ unsigned int highest = 0;
+
+ for_each_node(node)
+ highest = node;
+
+ return highest;
+}
+
+
+/*
+ * Initialise the pool map for SVC_POOL_PERNODE mode.
+ * Returns number of pools or <0 on error.
+ */
+static int
+svc_pool_map_init_pernode(struct svc_pool_map *m)
+{
+ unsigned int maxpools = highest_possible_node_id()+1;
+ unsigned int pidx = 0;
+ unsigned int node;
+ int err;
+
+ err = svc_pool_map_alloc_arrays(m, maxpools);
+ if (err)
+ return err;
+
+ for_each_node_with_cpus(node) {
+ /* some architectures (e.g. SN2) have cpuless nodes */
+ BUG_ON(pidx > maxpools);
+ m->to_pool[node] = pidx;
+ m->pool_to[pidx] = node;
+ pidx++;
+ }
+ /* nodes brought online later all get mapped to pool0, sorry */
+
+ return pidx;
+}
+
+
+/*
+ * Build the global map of cpus to pools and vice versa.
+ */
+static unsigned int
+svc_pool_map_init(void)
+{
+ struct svc_pool_map *m = &svc_pool_map;
+ int npools = -1;
+
+ if (m->mode != SVC_POOL_NONE)
+ return m->npools;
+
+ m->mode = svc_pool_map_choose_mode();
+
+ switch (m->mode) {
+ case SVC_POOL_PERCPU:
+ npools = svc_pool_map_init_percpu(m);
+ break;
+ case SVC_POOL_PERNODE:
+ npools = svc_pool_map_init_pernode(m);
+ break;
+ }
+
+ if (npools < 0) {
+ /* default, or memory allocation failure */
+ npools = 1;
+ m->mode = SVC_POOL_GLOBAL;
+ }
+ m->npools = npools;
+
+ return m->npools;
+}
+
+/*
+ * Set the current thread's cpus_allowed mask so that it
+ * will only run on cpus in the given pool.
+ *
+ * Returns 1 and fills in oldmask iff a cpumask was applied.
+ */
+static inline int
+svc_pool_map_set_cpumask(unsigned int pidx, cpumask_t *oldmask)
+{
+ struct svc_pool_map *m = &svc_pool_map;
+ unsigned int node; /* or cpu */
+
+ BUG_ON(m->mode == SVC_POOL_NONE);
+
+ switch (m->mode)
+ {
+ default:
+ return 0;
+ case SVC_POOL_PERCPU:
+ node = m->pool_to[pidx];
+ *oldmask = current->cpus_allowed;
+ set_cpus_allowed(current, cpumask_of_cpu(node));
+ return 1;
+ case SVC_POOL_PERNODE:
+ node = m->pool_to[pidx];
+ *oldmask = current->cpus_allowed;
+ set_cpus_allowed(current, node_to_cpumask(node));
+ return 1;
+ }
+}
+
+/*
+ * Use the mapping mode to choose a pool for a given CPU.
+ * Used when enqueueing an incoming RPC. Always returns
+ * a non-NULL pool pointer.
+ */
+struct svc_pool *
+svc_pool_for_cpu(struct svc_serv *serv, int cpu)
+{
+ struct svc_pool_map *m = &svc_pool_map;
+ unsigned int pidx = 0;
+
+ BUG_ON(m->mode == SVC_POOL_NONE);
+
+ switch (m->mode) {
+ case SVC_POOL_PERCPU:
+ pidx = m->to_pool[cpu];
+ break;
+ case SVC_POOL_PERNODE:
+ pidx = m->to_pool[cpu_to_node(cpu)];
+ break;
+ }
+ return &serv->sv_pools[pidx % serv->sv_nrpools];
+}
+
+
+/*
* Create an RPC service
*/
static struct svc_serv *
@@ -103,8 +346,9 @@ svc_create_pooled(struct svc_program *pr
svc_thread_fn func, int sig, struct module *mod)
{
struct svc_serv *serv;
+ unsigned int npools = svc_pool_map_init();
- serv = __svc_create(prog, bufsize, shutdown, /*npools*/1);
+ serv = __svc_create(prog, bufsize, shutdown, npools);
if (serv != NULL) {
serv->sv_function = func;
@@ -207,12 +451,16 @@ svc_release_buffer(struct svc_rqst *rqst
/*
* Create a thread in the given pool. Caller must hold BKL.
+ * On a NUMA or SMP machine, with a multi-pool serv, the thread
+ * will be restricted to run on the cpus belonging to the pool.
*/
static int
__svc_create_thread(svc_thread_fn func, struct svc_serv *serv, struct svc_pool *pool)
{
struct svc_rqst *rqstp;
int error = -ENOMEM;
+ int have_oldmask = 0;
+ cpumask_t oldmask;
rqstp = kzalloc(sizeof(*rqstp), GFP_KERNEL);
if (!rqstp)
@@ -232,7 +480,15 @@ __svc_create_thread(svc_thread_fn func,
spin_unlock_bh(&pool->sp_lock);
rqstp->rq_server = serv;
rqstp->rq_pool = pool;
+
+ if (serv->sv_nrpools > 1)
+ have_oldmask = svc_pool_map_set_cpumask(pool->sp_id, &oldmask);
+
error = kernel_thread((int (*)(void *)) func, rqstp, 0);
+
+ if (have_oldmask)
+ set_cpus_allowed(current, oldmask);
+
if (error < 0)
goto out_thread;
svc_sock_update_bufs(serv);
Index: linus-git/net/sunrpc/svcsock.c
===================================================================
--- linus-git.orig/net/sunrpc/svcsock.c 2006-07-26 16:47:46.018822248 +1000
+++ linus-git/net/sunrpc/svcsock.c 2006-07-31 15:00:27.710150626 +1000
@@ -151,8 +151,9 @@ static void
svc_sock_enqueue(struct svc_sock *svsk)
{
struct svc_serv *serv = svsk->sk_server;
- struct svc_pool *pool = &serv->sv_pools[0];
+ struct svc_pool *pool;
struct svc_rqst *rqstp;
+ int cpu;
if (!(svsk->sk_flags &
( (1<<SK_CONN)|(1<<SK_DATA)|(1<<SK_CLOSE)|(1<<SK_DEFERRED)) ))
@@ -160,6 +161,10 @@ svc_sock_enqueue(struct svc_sock *svsk)
if (test_bit(SK_DEAD, &svsk->sk_flags))
return;
+ cpu = get_cpu();
+ pool = svc_pool_for_cpu(svsk->sk_server, cpu);
+ put_cpu();
+
spin_lock_bh(&pool->sp_lock);
if (!list_empty(&pool->sp_threads) &&
Index: linus-git/include/linux/sunrpc/svc.h
===================================================================
--- linus-git.orig/include/linux/sunrpc/svc.h 2006-07-26 16:49:47.615119695 +1000
+++ linus-git/include/linux/sunrpc/svc.h 2006-07-31 15:00:28.946000862 +1000
@@ -368,5 +368,6 @@ int svc_process(struct svc_serv *, s
int svc_register(struct svc_serv *, int, unsigned short);
void svc_wake_up(struct svc_serv *);
void svc_reserve(struct svc_rqst *rqstp, int space);
+struct svc_pool * svc_pool_for_cpu(struct svc_serv *serv, int cpu);
#endif /* SUNRPC_SVC_H */
Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
* Re: [NFS] [PATCH 010 of 11] knfsd: make rpc threads pools numa aware
2006-07-31 5:54 ` Greg Banks
@ 2006-08-01 4:43 ` Andrew Morton
2006-08-01 5:22 ` Greg Banks
0 siblings, 1 reply; 25+ messages in thread
From: Andrew Morton @ 2006-08-01 4:43 UTC (permalink / raw)
To: Greg Banks; +Cc: neilb, nfs, linux-kernel
On Mon, 31 Jul 2006 15:54:57 +1000
Greg Banks <gnb@melbourne.sgi.com> wrote:
> On Mon, 2006-07-31 at 14:42, Greg Banks wrote:
> > On Mon, 2006-07-31 at 14:36, Neil Brown wrote:
> > > On Sunday July 30, akpm@osdl.org wrote:
> > > > On Mon, 31 Jul 2006 10:42:34 +1000
> > > > NeilBrown <neilb@suse.de> wrote:
> > > >
> > > > > +static int
> > > > > +svc_pool_map_init_percpu(struct svc_pool_map *m)
> > > > > +{
> > > > > + unsigned int maxpools = num_possible_cpus();
> > > > > + unsigned int pidx = 0;
> > > > > + unsigned int cpu;
> > > > > + int err;
> > > > > +
> > > > > +
> > > >
> > > > That isn't right - it assumes that cpu_possible_map is not sparse. If it
> > > > is sparse, we allocate undersized pools and then overindex them.
> >
> > Umm, I think Andrew's right, num_possible_cpus() should be NR_CPUS.
>
> How about this version of the patch? It replaces num_possible_cpus()
> with highest_possible_processor_id()+1 and similarly for nodes.
> --
>
> knfsd: Actually implement multiple pools. On NUMA machines, allocate
> a svc_pool per NUMA node; on SMP a svc_pool per CPU; otherwise a single
> global pool. Enqueue sockets on the svc_pool corresponding to the CPU
> on which the socket bh is run (i.e. the NIC interrupt CPU). Threads
> have their cpu mask set to limit them to the CPUs in the svc_pool that
> owns them.
>
> This is the patch that allows an Altix to scale NFS traffic linearly
> beyond 4 CPUs and 4 NICs.
>
> Incorporates changes and feedback from Neil Brown, Trond Myklebust,
> Christoph Hellwig and Andrew Morton.
>
Something has gone rather wrong here.
> - serv = __svc_create(prog, bufsize, shutdown, /*npools*/1);
> + serv = __svc_create(prog, bufsize, shutdown, npools);
__svc_create() is:
__svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
void (*shutdown)(struct svc_serv *serv))
so heaven knows what tree you're patching.
Incremental patches really are preferred. So we can see what people are
monkeying with ;)
After fixing the rejects and cleaning a few things up, your proposed change
amounts to:
--- a/net/sunrpc/svc.c~knfsd-make-rpc-threads-pools-numa-aware-fix
+++ a/net/sunrpc/svc.c
@@ -116,7 +116,7 @@ fail:
static int
svc_pool_map_init_percpu(struct svc_pool_map *m)
{
- unsigned int maxpools = num_possible_cpus();
+ unsigned int maxpools = highest_possible_processor_id() + 1;
unsigned int pidx = 0;
unsigned int cpu;
int err;
@@ -136,6 +136,18 @@ svc_pool_map_init_percpu(struct svc_pool
return pidx;
};
+static int
+highest_possible_node_id(void)
+{
+ unsigned int node;
+ unsigned int highest = 0;
+
+ for_each_node(node)
+ highest = node;
+
+ return highest;
+}
+
/*
* Initialise the pool map for SVC_POOL_PERNODE mode.
@@ -144,7 +156,7 @@ svc_pool_map_init_percpu(struct svc_pool
static int
svc_pool_map_init_pernode(struct svc_pool_map *m)
{
- unsigned int maxpools = num_possible_nodes();
+ unsigned int maxpools = highest_possible_node_id() + 1;
unsigned int pidx = 0;
unsigned int node;
int err;
_
Which shouldn't have compiled, due to the missing forward declaration. And
I'd be surprised if it worked very well with CONFIG_NUMA=n. And it's
naughty to be sneaking general library functions into the sunrpc code
anyway.
Please,
- Write a standalone patch which adds highest_possible_node_id() to
lib/cpumask.c(?)
Make sure it's inside #ifdef CONFIG_NUMA
Remember to export it to modules.
Provide a !CONFIG_NUMA version in include/linux/nodemask.h which just
returns constant zero.
Consider doing something more efficient than the for_each_node() loop.
Although I'm not sure what that would be, given that we don't have
find_last_bit().
- Provide an incremental patch against
knfsd-make-rpc-threads-pools-numa-aware.patch which utilises
highest_possible_node_id().
A replacement patch will be grudgingly accepted, but I'll only go and
turn it into an incremental one, so you can't hide ;)
- Test it real good. Modular, non-modular, NUMA, non-NUMA, !SMP.
Thanks.
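What's being asked for might look roughly like this (a sketch over a plain bitmap, purely illustrative; the kernel's nodemask_t API and the real lib/ placement differ):

```c
/* Sketch of a highest_possible_node_id() over a plain bitmap.
 * The real helper would live in lib/ behind #ifdef CONFIG_NUMA,
 * with a !CONFIG_NUMA stub returning constant zero. */
static int highest_node_id(unsigned long possible_mask)
{
	int node, highest = 0;

	for (node = 0; node < (int)(8 * sizeof(possible_mask)); node++)
		if (possible_mask & (1UL << node))
			highest = node;
	return highest;
}
```

A more efficient version would scan from the top bit down, which is essentially the missing find_last_bit() mentioned above.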
* Re: [NFS] [PATCH 010 of 11] knfsd: make rpc threads pools numa aware
2006-08-01 4:43 ` Andrew Morton
@ 2006-08-01 5:22 ` Greg Banks
0 siblings, 0 replies; 25+ messages in thread
From: Greg Banks @ 2006-08-01 5:22 UTC (permalink / raw)
To: Andrew Morton
Cc: Neil Brown, Linux NFS Mailing List, Linux Kernel Mailing List
On Tue, 2006-08-01 at 14:43, Andrew Morton wrote:
> On Mon, 31 Jul 2006 15:54:57 +1000
> Greg Banks <gnb@melbourne.sgi.com> wrote:
>
> > On Mon, 2006-07-31 at 14:42, Greg Banks wrote:
> > > On Mon, 2006-07-31 at 14:36, Neil Brown wrote:
> > > > On Sunday July 30, akpm@osdl.org wrote:
> > > > > On Mon, 31 Jul 2006 10:42:34 +1000
> > > > > NeilBrown <neilb@suse.de> wrote:
> > > > >
> > > > > > +static int
> > > > > > +svc_pool_map_init_percpu(struct svc_pool_map *m)
> > > > > > +{
> > > > > > + unsigned int maxpools = num_possible_cpus();
> > > > > > + unsigned int pidx = 0;
> > > > > > + unsigned int cpu;
> > > > > > + int err;
> > > > > > +
> > > > > > +
> > > > >
> > > > > That isn't right - it assumes that cpu_possible_map is not sparse. If it
> > > > > is sparse, we allocate undersized pools and then overindex them.
> > >
> > > Umm, I think Andrew's right, num_possible_cpus() should be NR_CPUS.
> >
> > How about this version of the patch? It replaces num_possible_cpus()
> > with highest_possible_processor_id()+1 and similarly for nodes.
> > --
> >
> > knfsd: Actually implement multiple pools. On NUMA machines, allocate
> > a svc_pool per NUMA node; on SMP a svc_pool per CPU; otherwise a single
> > global pool. Enqueue sockets on the svc_pool corresponding to the CPU
> > on which the socket bh is run (i.e. the NIC interrupt CPU). Threads
> > have their cpu mask set to limit them to the CPUs in the svc_pool that
> > owns them.
> >
> > This is the patch that allows an Altix to scale NFS traffic linearly
> > beyond 4 CPUs and 4 NICs.
> >
> > Incorporates changes and feedback from Neil Brown, Trond Myklebust,
> > Christoph Hellwig and Andrew Morton.
> >
>
> Something has gone rather wrong here.
>
> > - serv = __svc_create(prog, bufsize, shutdown, /*npools*/1);
> > + serv = __svc_create(prog, bufsize, shutdown, npools);
>
> __svc_create() is:
>
> __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
> void (*shutdown)(struct svc_serv *serv))
>
> so heaven knows what tree you're patching.
Hmm. Neil introduced the `shutdown' argument a few days ago, and I
thought I had his patches in my quilt tree, but apparently I screwed
something up. Sorry about that.
> Incremental patches really are preferred. So we can see what people are
> monkeying with ;)
Righto.
> After fixing the rejects and cleaning a few things up, your proposed change
> amounts to:
>
> --- a/net/sunrpc/svc.c~knfsd-make-rpc-threads-pools-numa-aware-fix
> +++ a/net/sunrpc/svc.c
> @@ -116,7 +116,7 @@ fail:
> static int
> svc_pool_map_init_percpu(struct svc_pool_map *m)
> {
> - unsigned int maxpools = num_possible_cpus();
> + unsigned int maxpools = highest_possible_processor_id() + 1;
> unsigned int pidx = 0;
> unsigned int cpu;
> int err;
> @@ -136,6 +136,18 @@ svc_pool_map_init_percpu(struct svc_pool
> return pidx;
> };
>
> +static int
> +highest_possible_node_id(void)
> +{
> + unsigned int node;
> + unsigned int highest = 0;
> +
> + for_each_node(node)
> + highest = node;
> +
> + return highest;
> +}
> +
>
> /*
> * Initialise the pool map for SVC_POOL_PERNODE mode.
> @@ -144,7 +156,7 @@ svc_pool_map_init_percpu(struct svc_pool
> static int
> svc_pool_map_init_pernode(struct svc_pool_map *m)
> {
> - unsigned int maxpools = num_possible_nodes();
> + unsigned int maxpools = highest_possible_node_id() + 1;
> unsigned int pidx = 0;
> unsigned int node;
> int err;
> _
>
>
Yes.
> Which shouldn't have compiled, due to the missing forward declaration.
? It compiled, booted, and ran traffic.
gnb@chook 1965> quilt push 2
Applying patch knfsd-make-pools-numa-aware-5
patching file net/sunrpc/svc.c
patching file net/sunrpc/svcsock.c
patching file include/linux/sunrpc/svc.h
Applying patch knfsd-allow-admin-to-set-nthreads-per-node-2
patching file fs/nfsd/nfsctl.c
patching file fs/nfsd/nfssvc.c
Now at patch knfsd-allow-admin-to-set-nthreads-per-node-2
gnb@chook 1966> make compressed modules
CHK include/linux/version.h
CHK include/linux/utsrelease.h
CHK include/linux/compile.h
CC arch/ia64/ia32/sys_ia32.o
LD arch/ia64/ia32/built-in.o
CC fs/compat.o
CC fs/nfsctl.o
CC [M] fs/lockd/clntlock.o
CC [M] fs/lockd/clntproc.o
CC [M] fs/lockd/host.o
CC [M] fs/lockd/svc.o
CC [M] fs/lockd/svclock.o
CC [M] fs/lockd/svcshare.o
CC [M] fs/lockd/svcproc.o
CC [M] fs/lockd/svcsubs.o
CC [M] fs/lockd/mon.o
CC [M] fs/lockd/xdr.o
CC [M] fs/lockd/xdr4.o
CC [M] fs/lockd/svc4proc.o
LD [M] fs/lockd/lockd.o
CC [M] fs/nfs/callback.o
CC [M] fs/nfs/callback_xdr.o
LD [M] fs/nfs/nfs.o
CC [M] fs/nfsd/nfssvc.o
CC [M] fs/nfsd/nfsctl.o
CC [M] fs/nfsd/nfsproc.o
CC [M] fs/nfsd/nfsfh.o
CC [M] fs/nfsd/vfs.o
CC [M] fs/nfsd/export.o
CC [M] fs/nfsd/auth.o
CC [M] fs/nfsd/lockd.o
CC [M] fs/nfsd/nfscache.o
CC [M] fs/nfsd/nfsxdr.o
CC [M] fs/nfsd/stats.o
CC [M] fs/nfsd/nfs2acl.o
CC [M] fs/nfsd/nfs3proc.o
CC [M] fs/nfsd/nfs3xdr.o
CC [M] fs/nfsd/nfs3acl.o
CC [M] fs/nfsd/nfs4proc.o
CC [M] fs/nfsd/nfs4xdr.o
CC [M] fs/nfsd/nfs4state.o
CC [M] fs/nfsd/nfs4idmap.o
CC [M] fs/nfsd/nfs4callback.o
CC [M] fs/nfsd/nfs4recover.o
LD [M] fs/nfsd/nfsd.o
LD fs/built-in.o
CC [M] net/sunrpc/svc.o
CC [M] net/sunrpc/svcsock.o
CC [M] net/sunrpc/svcauth.o
CC [M] net/sunrpc/svcauth_unix.o
CC [M] net/sunrpc/sunrpc_syms.o
CC [M] net/sunrpc/stats.o
LD [M] net/sunrpc/sunrpc.o
CC [M] net/sunrpc/auth_gss/auth_gss.o
CC [M] net/sunrpc/auth_gss/gss_mech_switch.o
CC [M] net/sunrpc/auth_gss/svcauth_gss.o
CC [M] net/sunrpc/auth_gss/gss_krb5_crypto.o
CC [M] net/sunrpc/auth_gss/gss_krb5_mech.o
CC [M] net/sunrpc/auth_gss/gss_krb5_seal.o
CC [M] net/sunrpc/auth_gss/gss_krb5_unseal.o
CC [M] net/sunrpc/auth_gss/gss_krb5_seqnum.o
CC [M] net/sunrpc/auth_gss/gss_krb5_wrap.o
CC [M] net/sunrpc/auth_gss/gss_spkm3_mech.o
CC [M] net/sunrpc/auth_gss/gss_spkm3_seal.o
CC [M] net/sunrpc/auth_gss/gss_spkm3_unseal.o
CC [M] net/sunrpc/auth_gss/gss_spkm3_token.o
LD [M] net/sunrpc/auth_gss/auth_rpcgss.o
LD [M] net/sunrpc/auth_gss/rpcsec_gss_krb5.o
LD [M] net/sunrpc/auth_gss/rpcsec_gss_spkm3.o
GEN .version
CHK include/linux/compile.h
UPD include/linux/compile.h
CC init/version.o
LD init/built-in.o
LD .tmp_vmlinux1
KSYM .tmp_kallsyms1.S
AS .tmp_kallsyms1.o
LD .tmp_vmlinux2
KSYM .tmp_kallsyms2.S
AS .tmp_kallsyms2.o
LD vmlinux
SYSMAP System.map
SYSMAP .tmp_System.map
OBJCOPY arch/ia64/hp/sim/boot/vmlinux.bin
GZIP arch/ia64/hp/sim/boot/vmlinux.gz
LN vmlinux.gz
Kernel: vmlinux.gz is ready
Building modules, stage 2.
MODPOST
WARNING: drivers/atm/fore_200e.o - Section mismatch: reference to .init.data:_fore200e_pca_fw_data from .data.rel.ro between 'fore200e_bus' (at offset 0x20) and 'fore200e_ops'
WARNING: drivers/atm/fore_200e.o - Section mismatch: reference to .init.data:_fore200e_pca_fw_size from .data.rel.ro between 'fore200e_bus' (at offset 0x28) and 'fore200e_ops'
WARNING: drivers/atm/fore_200e.o - Section mismatch: reference to .init.text:fore200e_pca_prom_read from .data.rel.ro between 'fore200e_bus' (at offset 0x90) and 'fore200e_ops'
WARNING: drivers/atm/he.o - Section mismatch: reference to .init.text:he_init_irq from .text between 'he_start' (at offset 0x2ca2) and 'he_irq_handler'
WARNING: drivers/atm/he.o - Section mismatch: reference to .init.text:he_init_cs_block_rcm from .text between 'he_start' (at offset 0x81c2) and 'he_irq_handler'
WARNING: drivers/atm/he.o - Section mismatch: reference to .init.text: from .text between 'he_start' (at offset 0x81e2) and 'he_irq_handler'
WARNING: drivers/atm/he.o - Section mismatch: reference to .init.text:he_init_group from .text between 'he_start' (at offset 0x82a2) and 'he_irq_handler'
WARNING: drivers/block/cpqarray.o - Section mismatch: reference to .init.text:cpqarray_init_one from .data.rel.local between 'cpqarray_pci_driver' (at offset 0x20) and 'smart1_access'
WARNING: drivers/net/dgrs.o - Section mismatch: reference to .init.text:dgrs_pci_probe from .data.rel.local after 'dgrs_pci_driver' (at offset 0x20)
WARNING: drivers/net/tulip/de2104x.o - Section mismatch: reference to .init.text:de_init_one from .data.rel.local after 'de_driver' (at offset 0x20)
WARNING: drivers/net/tulip/de2104x.o - Section mismatch: reference to .exit.text:de_remove_one from .data.rel.local after 'de_driver' (at offset 0x28)
WARNING: drivers/usb/host/isp116x-hcd.o - Section mismatch: reference to .init.text:isp116x_probe from .data.rel.local between 'isp116x_driver' (at offset 0x0) and 'isp116x_hc_driver'
WARNING: drivers/video/aty/atyfb.o - Section mismatch: reference to .init.text:aty_get_pll_ct from .data.rel.ro after 'aty_pll_ct' (at offset 0x18)
WARNING: drivers/video/aty/atyfb.o - Section mismatch: reference to .init.text:aty_init_pll_ct from .data.rel.ro after 'aty_pll_ct' (at offset 0x20)
WARNING: drivers/video/nvidia/nvidiafb.o - Section mismatch: reference to .exit.text:nvidiafb_remove from .data.rel.local after 'nvidiafb_driver' (at offset 0x28)
WARNING: drivers/video/riva/rivafb.o - Section mismatch: reference to .exit.text:rivafb_remove from .data.rel.local after 'rivafb_driver' (at offset 0x28)
WARNING: sound/drivers/snd-dummy.o - Section mismatch: reference to .init.text:snd_dummy_probe from .data.rel.local between 'snd_dummy_driver' (at offset 0x0) and 'snd_dummy_controls'
WARNING: sound/drivers/snd-mtpav.o - Section mismatch: reference to .init.text:snd_mtpav_probe from .data.rel.local between 'snd_mtpav_driver' (at offset 0x0) and 'snd_mtpav_input'
WARNING: sound/drivers/snd-serial-u16550.o - Section mismatch: reference to .init.text:snd_serial_probe from .data.rel.local between 'snd_serial_driver' (at offset 0x0) and 'ops.15504'
WARNING: sound/drivers/snd-virmidi.o - Section mismatch: reference to .init.text:snd_virmidi_probe from .data.rel.local after 'snd_virmidi_driver' (at offset 0x0)
LD [M] fs/lockd/lockd.ko
LD [M] fs/nfs/nfs.ko
LD [M] fs/nfsd/nfsd.ko
LD [M] net/sunrpc/auth_gss/auth_rpcgss.ko
LD [M] net/sunrpc/auth_gss/rpcsec_gss_krb5.ko
LD [M] net/sunrpc/auth_gss/rpcsec_gss_spkm3.ko
CC net/sunrpc/sunrpc.mod.o
LD [M] net/sunrpc/sunrpc.ko
gnb@chook 1967> gcc --version
gcc (GCC) 4.1.0 (SUSE Linux)
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> And
> I'd be surprised if it worked very well with CONFIG_NUMA=n.
Why? The only new code which is sensitive to CONFIG_NUMA is
for_each_node(node) and that has a definition for !NUMA that
appears (by inspection) to do the right thing.
> And it's
> naughty to be sneaking general library functions into the sunrpc code
> anyway.
Fair enough, my bad.
>
> Please,
>
> - Write a standalone patch which adds highest_possible_node_id() to
> lib/cpumask.c(?)
>
> Make sure it's inside #ifdef CONFIG_NUMA
>
> Remember to export it to modules.
>
> Provide a !CONFIG_NUMA version in include/linux/nodemask.h which just
> returns constant zero.
Will do.
> Consider doing something more efficient than the for_each_node() loop.
> Although I'm not sure what that would be, given that we don't have
> find_last_bit().
That code is a copy-n-paste of highest_possible_processor_id().
> - Provide an incremental patch against
> knfsd-make-rpc-threads-pools-numa-aware.patch which utilises
> highest_possible_node_id().
>
> A replacement patch will be grudgingly accepted, but I'll only go and
> turn it into an incremental one, so you can't hide ;)
Ok, an incremental patch it is then. Let it not be said I
don't take a hint the fourth time.
> - Test it real good. Modular, non-modular, NUMA, non-NUMA, !SMP.
>
Right.
Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
* Re: [PATCH 010 of 11] knfsd: make rpc threads pools numa aware
2006-07-31 0:42 ` [PATCH 010 of 11] knfsd: make rpc threads pools numa aware NeilBrown
2006-07-31 4:14 ` Andrew Morton
@ 2006-08-06 9:47 ` Andrew Morton
2006-08-07 3:16 ` Greg Banks
2006-08-07 11:25 ` Greg Banks
1 sibling, 2 replies; 25+ messages in thread
From: Andrew Morton @ 2006-08-06 9:47 UTC (permalink / raw)
To: NeilBrown; +Cc: nfs, linux-kernel, Greg Banks
On Mon, 31 Jul 2006 10:42:34 +1000
NeilBrown <neilb@suse.de> wrote:
> knfsd: Actually implement multiple pools. On NUMA machines, allocate
> a svc_pool per NUMA node; on SMP a svc_pool per CPU; otherwise a single
> global pool. Enqueue sockets on the svc_pool corresponding to the CPU
> on which the socket bh is run (i.e. the NIC interrupt CPU). Threads
> have their cpu mask set to limit them to the CPUs in the svc_pool that
> owns them.
>
> This is the patch that allows an Altix to scale NFS traffic linearly
> beyond 4 CPUs and 4 NICs.
>
> Incorporates changes and feedback from Neil Brown, Trond Myklebust,
> and Christoph Hellwig.
This makes the NFS client go BUG. Simple nfsv3 workload (ie: mount, read
stuff). Uniproc, FC5.
+ BUG_ON(m->mode == SVC_POOL_NONE);
kernel BUG at net/sunrpc/svc.c:244!
invalid opcode: 0000 [#1]
4K_STACKS
last sysfs file: /class/net/eth1/flags
Modules linked in: nfs lockd nfs_acl ipw2200 sonypi autofs4 hidp l2cap bluetooth sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink xt_tcpudp iptable_filter ip_tables x_tables video sony_acpi sbs i2c_ec button battery asus_acpi ac nvram ohci1394 ieee1394 ehci_hcd uhci_hcd sg joydev snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device ieee80211 snd_pcm_oss snd_mixer_oss ieee80211_crypt snd_pcm snd_timer snd i2c_i801 soundcore i2c_core piix snd_page_alloc pcspkr generic ext3 jbd ide_disk ide_core
CPU: 0
EIP: 0060:[<f8d6d308>] Not tainted VLI
EFLAGS: 00210246 (2.6.18-rc3-mm1 #21)
EIP is at svc_pool_for_cpu+0xc/0x43 [sunrpc]
eax: ffffffff ebx: f59a75c0 ecx: f59a76c0 edx: 00000000
esi: f59cbc20 edi: f59a75c0 ebp: f582d5c0 esp: f59cbb98
ds: 007b es: 007b ss: 0068
Process mount (pid: 2599, ti=f59cb000 task=f59a5550 task.ti=f59cb000)
Stack: f8d6e506 f59cbbb0 00000000 00200282 00000014 00000006 f59a76c0 f59a75c0
f59cbc20 f59a75c0 f582d5c0 f8d6ea83 00200286 c0270376 f582d5c0 f59a76c0
f59a75c0 c02dbbc0 00000006 f59a76c0 f5956c80 f8d6ebd7 00000001 00000000
Call Trace:
[<f8d6e506>] svc_sock_enqueue+0x33/0x294 [sunrpc]
[<f8d6ea83>] svc_setup_socket+0x31c/0x326 [sunrpc]
[<c0270376>] release_sock+0xc/0x83
[<f8d6ebd7>] svc_makesock+0x14a/0x185 [sunrpc]
[<f8ca3b10>] make_socks+0x72/0xae [lockd]
[<f8ca3bce>] lockd_up+0x82/0xd9 [lockd]
[<c01169a6>] __wake_up+0x11/0x1a
[<f9227743>] nfs_start_lockd+0x26/0x43 [nfs]
[<f9228264>] nfs_create_server+0x1dc/0x3da [nfs]
[<c02c4298>] wait_for_completion+0x70/0x99
[<c0116293>] default_wake_function+0x0/0xc
[<c0124918>] call_usermodehelper_keys+0xc4/0xd3
[<f922e348>] nfs_get_sb+0x398/0x3b4 [nfs]
[<c0124927>] __call_usermodehelper+0x0/0x43
[<c0158d68>] vfs_kern_mount+0x83/0xf6
[<c0158e1d>] do_kern_mount+0x2d/0x3e
[<c016a8ac>] do_mount+0x5b2/0x625
[<c019facb>] task_has_capability+0x56/0x5e
[<c029479e>] inet_bind_bucket_create+0x11/0x3c
[<c0295e57>] inet_csk_get_port+0x196/0x1a0
[<c0270376>] release_sock+0xc/0x83
[<c02add33>] inet_bind+0x1c6/0x1d0
[<c01397fe>] handle_IRQ_event+0x23/0x49
[<c013ec5e>] __alloc_pages+0x5e/0x28d
[<c01034d2>] common_interrupt+0x1a/0x20
[<c0169815>] copy_mount_options+0x26/0x109
[<c016a991>] sys_mount+0x72/0xa4
[<c0102b4b>] syscall_call+0x7/0xb
Code: 31 c0 eb 15 8b 40 10 89 d1 c1 e9 02 8b 50 1c 8d 41 02 89 42 04 8d 44 8b 08 5a 59 5b c3 90 90 89 c1 a1 88 86 d8 f8 83 f8 ff 75 0a <0f> 0b f4 00 2c 6f d7 f8 eb 0a 83 f8 01 74 09 83 f8 02 74 0e 31
EIP: [<f8d6d308>] svc_pool_for_cpu+0xc/0x43 [sunrpc] SS:ESP 0068:f59cbb98
* Re: [PATCH 010 of 11] knfsd: make rpc threads pools numa aware
2006-08-06 9:47 ` Andrew Morton
@ 2006-08-07 3:16 ` Greg Banks
2006-08-07 11:25 ` Greg Banks
1 sibling, 0 replies; 25+ messages in thread
From: Greg Banks @ 2006-08-07 3:16 UTC (permalink / raw)
To: Andrew Morton
Cc: Neil Brown, Linux NFS Mailing List, Linux Kernel Mailing List
On Sun, 2006-08-06 at 19:47, Andrew Morton wrote:
> On Mon, 31 Jul 2006 10:42:34 +1000
> NeilBrown <neilb@suse.de> wrote:
>
> > knfsd: Actually implement multiple pools. On NUMA machines, allocate
> > a svc_pool per NUMA node; on SMP a svc_pool per CPU; otherwise a single
> > global pool. Enqueue sockets on the svc_pool corresponding to the CPU
> > on which the socket bh is run (i.e. the NIC interrupt CPU). Threads
> > have their cpu mask set to limit them to the CPUs in the svc_pool that
> > owns them.
> >
> > This is the patch that allows an Altix to scale NFS traffic linearly
> > beyond 4 CPUs and 4 NICs.
> >
> > Incorporates changes and feedback from Neil Brown, Trond Myklebust,
> > and Christoph Hellwig.
>
> This makes the NFS client go BUG. Simple nfsv3 workload (ie: mount, read
> stuff). Uniproc, FC5.
>
> + BUG_ON(m->mode == SVC_POOL_NONE);
Aha, I see what I b0rked up. On the client, lockd starts an RPC
service via the old svc_create() interface, which avoids calling
svc_pool_map_init(). When the first NLM callback arrives,
svc_sock_enqueue() calls svc_pool_for_cpu() which BUGs out because
the map is not initialised. The BUG_ON() was introduced in one
of the rewrites in response to review feedback in the last few
days; previously the code was simpler and would trivially return
pool 0, which is the right thing to do in this case. The bug was
hidden on my test machines because they have SLES userspaces,
where lockd is broken because both the kernel and userspace think
the other one is doing the rpc.statd functionality.
A simple patch should fix this, coming up as soon as I can find
a non-SLES machine and run some client tests.
Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
* Re: [PATCH 010 of 11] knfsd: make rpc threads pools numa aware
2006-08-06 9:47 ` Andrew Morton
2006-08-07 3:16 ` Greg Banks
@ 2006-08-07 11:25 ` Greg Banks
1 sibling, 0 replies; 25+ messages in thread
From: Greg Banks @ 2006-08-07 11:25 UTC (permalink / raw)
To: Andrew Morton
Cc: Neil Brown, Linux NFS Mailing List, Linux Kernel Mailing List
On Sun, 2006-08-06 at 19:47, Andrew Morton wrote:
> On Mon, 31 Jul 2006 10:42:34 +1000
> NeilBrown <neilb@suse.de> wrote:
>
> > knfsd: Actually implement multiple pools. On NUMA machines, allocate
> > a svc_pool per NUMA node; on SMP a svc_pool per CPU; otherwise a single
> > global pool. Enqueue sockets on the svc_pool corresponding to the CPU
> > on which the socket bh is run (i.e. the NIC interrupt CPU). Threads
> > have their cpu mask set to limit them to the CPUs in the svc_pool that
> > owns them.
> >
> > This is the patch that allows an Altix to scale NFS traffic linearly
> > beyond 4 CPUs and 4 NICs.
> >
> > Incorporates changes and feedback from Neil Brown, Trond Myklebust,
> > and Christoph Hellwig.
>
> This makes the NFS client go BUG. Simple nfsv3 workload (ie: mount, read
> stuff). Uniproc, FC5.
>
> + BUG_ON(m->mode == SVC_POOL_NONE);
>
Reproduced on RHAS4; this patch fixes it for me.
--
knfsd: Fix a regression on an NFS client where mounting an
NFS filesystem trips a spurious BUG_ON() in the server code.
Tested using cthon04 lock tests on RHAS4-U2 userspace.
Signed-off-by: Greg Banks <gnb@melbourne.sgi.com>
---
net/sunrpc/svc.c | 11 ++++++++++-
1 files changed, 10 insertions(+), 1 deletion(-)
Index: linux-2.6.18-rc2/net/sunrpc/svc.c
===================================================================
--- linux-2.6.18-rc2.orig/net/sunrpc/svc.c
+++ linux-2.6.18-rc2/net/sunrpc/svc.c
@@ -211,6 +211,11 @@ svc_pool_map_set_cpumask(unsigned int pi
struct svc_pool_map *m = &svc_pool_map;
unsigned int node; /* or cpu */
+ /*
+ * The caller checks for sv_nrpools > 1, which
+ * implies that we've been initialized and the
+ * map mode is not NONE.
+ */
BUG_ON(m->mode == SVC_POOL_NONE);
switch (m->mode)
@@ -241,7 +246,11 @@ svc_pool_for_cpu(struct svc_serv *serv,
struct svc_pool_map *m = &svc_pool_map;
unsigned int pidx = 0;
- BUG_ON(m->mode == SVC_POOL_NONE);
+ /*
+ * SVC_POOL_NONE happens in a pure client when
+ * lockd is brought up, so silently treat it the
+ * same as SVC_POOL_GLOBAL.
+ */
switch (m->mode) {
case SVC_POOL_PERCPU:
Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
end of thread, other threads:[~2006-08-07 11:25 UTC | newest]
Thread overview: 25+ messages
2006-07-31 0:41 [PATCH 000 of 11] knfsd: Introduction - Make knfsd more NUMA-aware NeilBrown
2006-07-31 0:41 ` [PATCH 001 of 11] knfsd: move tempsock aging to a timer NeilBrown
2006-07-31 0:41 ` [PATCH 002 of 11] knfsd: convert sk_inuse to atomic_t NeilBrown
2006-07-31 0:41 ` [PATCH 003 of 11] knfsd: use new lock for svc_sock deferred list NeilBrown
2006-07-31 0:42 ` [PATCH 004 of 11] knfsd: convert sk_reserved to atomic_t NeilBrown
2006-07-31 0:42 ` [PATCH 005 of 11] knfsd: test and set SK_BUSY atomically NeilBrown
2006-07-31 0:42 ` [PATCH 006 of 11] knfsd: split svc_serv into pools NeilBrown
2006-07-31 0:42 ` [PATCH 007 of 11] knfsd: add svc_get NeilBrown
2006-07-31 4:05 ` Andrew Morton
2006-07-31 4:16 ` Neil Brown
2006-07-31 0:42 ` [PATCH 008 of 11] knfsd: add svc_set_num_threads NeilBrown
2006-07-31 4:11 ` Andrew Morton
2006-07-31 4:24 ` [NFS] " Neil Brown
2006-07-31 0:42 ` [PATCH 009 of 11] knfsd: use svc_set_num_threads to manage threads in knfsd NeilBrown
2006-07-31 0:42 ` [PATCH 010 of 11] knfsd: make rpc threads pools numa aware NeilBrown
2006-07-31 4:14 ` Andrew Morton
2006-07-31 4:36 ` Neil Brown
2006-07-31 4:42 ` [NFS] " Greg Banks
2006-07-31 5:54 ` Greg Banks
2006-08-01 4:43 ` Andrew Morton
2006-08-01 5:22 ` Greg Banks
2006-08-06 9:47 ` Andrew Morton
2006-08-07 3:16 ` Greg Banks
2006-08-07 11:25 ` Greg Banks
2006-07-31 0:42 ` [PATCH 011 of 11] knfsd: allow admin to set nthreads per node NeilBrown