public inbox for linux-nfs@vger.kernel.org
* [patch 0/3] First tranche of SGI Enhanced NFS patches
@ 2009-01-13 10:26 Greg Banks
  2009-01-13 10:26 ` [patch 1/3] knfsd: remove the nfsd thread busy histogram Greg Banks
                   ` (3 more replies)
  0 siblings, 4 replies; 25+ messages in thread
From: Greg Banks @ 2009-01-13 10:26 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Linux NFS ML

G'day,

This is the first tranche of patches from SGI's Enhanced NFS product.
These are forward-ported (and sometimes updated) versions of patches
which have been shipping in SGI's NAS server products since around
2006.  Testing after porting consisted of running cthon04.  The
patches in this particular group were posted before, in August 2006.

-- 
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [patch 1/3] knfsd: remove the nfsd thread busy histogram
  2009-01-13 10:26 [patch 0/3] First tranche of SGI Enhanced NFS patches Greg Banks
@ 2009-01-13 10:26 ` Greg Banks
  2009-01-13 16:41   ` Chuck Lever
  2009-01-13 10:26 ` [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages Greg Banks
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 25+ messages in thread
From: Greg Banks @ 2009-01-13 10:26 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Linux NFS ML

Stop gathering the data that feeds the 'th' line in /proc/net/rpc/nfsd
because the questionable data provided is not worth the scalability
impact of calculating it.  Instead, always report zeroes.  The current
approach suffers from three major issues:

1. update_thread_usage() increments buckets by call service
   time or call arrival time...in jiffies.  On lightly loaded
   machines, call service times are usually < 1 jiffy; on
   heavily loaded machines call arrival times will be << 1 jiffy.
   So a large portion of the updates to the buckets are rounded
   down to zero, and the histogram is undercounting.

2. As seen previously on the nfs mailing list, the format in which
   the histogram is presented is cryptic, difficult to explain,
   and difficult to use.

3. Updating the histogram requires taking a global spinlock and
   dirtying the global variables nfsd_last_call, nfsd_busy, and
   nfsdstats *twice* on every RPC call, which is a significant
   scaling limitation.

Testing on a 4 CPU 4 NIC Altix using 4 IRIX clients each doing
1K streaming reads at full line rate, shows the stats update code
(inlined into nfsd()) takes about 1.7% of each CPU.  This patch drops
the contribution from nfsd() into the profile noise.

This patch is a forward-ported version of knfsd-remove-nfsd-threadstats
which has been shipping in the SGI "Enhanced NFS" product since 2006.
In that time, exactly one customer has noticed that the threadstats
were missing.  It has been previously posted:

http://article.gmane.org/gmane.linux.nfs/10376

and more recently requested to be posted again.

Signed-off-by: Greg Banks <gnb@sgi.com>
---

 fs/nfsd/nfssvc.c |   28 ----------------------------
 1 file changed, 28 deletions(-)

Index: bfields/fs/nfsd/nfssvc.c
===================================================================
--- bfields.orig/fs/nfsd/nfssvc.c
+++ bfields/fs/nfsd/nfssvc.c
@@ -40,9 +40,6 @@
 extern struct svc_program	nfsd_program;
 static int			nfsd(void *vrqstp);
 struct timeval			nfssvc_boot;
-static atomic_t			nfsd_busy;
-static unsigned long		nfsd_last_call;
-static DEFINE_SPINLOCK(nfsd_call_lock);
 
 /*
  * nfsd_mutex protects nfsd_serv -- both the pointer itself and the members
@@ -227,7 +224,6 @@ int nfsd_create_serv(void)
 			nfsd_max_blksize /= 2;
 	}
 
-	atomic_set(&nfsd_busy, 0);
 	nfsd_serv = svc_create_pooled(&nfsd_program, nfsd_max_blksize,
 				      AF_INET,
 				      nfsd_last_thread, nfsd, THIS_MODULE);
@@ -376,26 +372,6 @@ nfsd_svc(unsigned short port, int nrserv
 	return error;
 }
 
-static inline void
-update_thread_usage(int busy_threads)
-{
-	unsigned long prev_call;
-	unsigned long diff;
-	int decile;
-
-	spin_lock(&nfsd_call_lock);
-	prev_call = nfsd_last_call;
-	nfsd_last_call = jiffies;
-	decile = busy_threads*10/nfsdstats.th_cnt;
-	if (decile>0 && decile <= 10) {
-		diff = nfsd_last_call - prev_call;
-		if ( (nfsdstats.th_usage[decile-1] += diff) >= NFSD_USAGE_WRAP)
-			nfsdstats.th_usage[decile-1] -= NFSD_USAGE_WRAP;
-		if (decile == 10)
-			nfsdstats.th_fullcnt++;
-	}
-	spin_unlock(&nfsd_call_lock);
-}
 
 /*
  * This is the NFS server kernel thread
@@ -464,8 +440,6 @@ nfsd(void *vrqstp)
 			continue;
 		}
 
-		update_thread_usage(atomic_read(&nfsd_busy));
-		atomic_inc(&nfsd_busy);
 
 		/* Lock the export hash tables for reading. */
 		exp_readlock();
@@ -474,8 +448,6 @@ nfsd(void *vrqstp)
 
 		/* Unlock export hash tables */
 		exp_readunlock();
-		update_thread_usage(atomic_read(&nfsd_busy));
-		atomic_dec(&nfsd_busy);
 	}
 
 	/* Clear signals before calling svc_exit_thread() */

--
-- 
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages
  2009-01-13 10:26 [patch 0/3] First tranche of SGI Enhanced NFS patches Greg Banks
  2009-01-13 10:26 ` [patch 1/3] knfsd: remove the nfsd thread busy histogram Greg Banks
@ 2009-01-13 10:26 ` Greg Banks
  2009-01-13 14:33   ` Peter Staubach
  2009-02-11 23:10   ` J. Bruce Fields
  2009-01-13 10:26 ` [patch 3/3] knfsd: add file to export stats about nfsd pools Greg Banks
  2009-02-09  5:24 ` [patch 0/3] First tranche of SGI Enhanced NFS patches Greg Banks
  3 siblings, 2 replies; 25+ messages in thread
From: Greg Banks @ 2009-01-13 10:26 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Linux NFS ML

Avoid overloading the CPU scheduler with enormous load averages
when handling high call-rate NFS loads.  When the knfsd bottom half
is made aware of an incoming call by the socket layer, it tries to
choose an nfsd thread and wake it up.  As long as there are idle
threads, one will be woken up.

If there are a lot of nfsd threads (a sensible configuration when
the server is disk-bound or is running an HSM), there will be many
more nfsd threads than CPUs to run them.  Under a high call-rate
low service-time workload, the result is that almost every nfsd is
runnable, but only a handful are actually able to run.  This situation
causes two significant problems:

1. The CPU scheduler takes over 10% of each CPU, which is robbing
   the nfsd threads of valuable CPU time.

2. At a high enough load, the nfsd threads starve userspace threads
   of CPU time, to the point where daemons like portmap and rpc.mountd
   do not schedule for tens of seconds at a time.  Clients attempting
to mount an NFS filesystem time out at the very first step (opening
   a TCP connection to portmap) because portmap cannot wake up from
   select() and call accept() in time.

Disclaimer: these effects were observed on a SLES9 kernel, modern
kernels' schedulers may behave more gracefully.

The solution is simple: keep in each svc_pool a counter of the number
of threads which have been woken but have not yet run, and do not wake
any more if that count reaches an arbitrary small threshold.

Testing was on a 4 CPU 4 NIC Altix using 4 IRIX clients, each with 16
synthetic client threads simulating an rsync (i.e. recursive directory
listing) workload reading from an i386 RH9 install image (161480
regular files in 10841 directories) on the server.  That tree is small
enough to fit in the server's RAM so no disk traffic was involved.
This setup gives a sustained call rate in excess of 60000 calls/sec
before being CPU-bound on the server.  The server was running 128 nfsds.

Profiling showed schedule() taking 6.7% of every CPU, and __wake_up()
taking 5.2%.  This patch drops those contributions to 3.0% and 2.2%.
Load average was over 120 before the patch, and 20.9 after.

This patch is a forward-ported version of knfsd-avoid-nfsd-overload
which has been shipping in the SGI "Enhanced NFS" product since 2006.
It has been posted before:

http://article.gmane.org/gmane.linux.nfs/10374

Signed-off-by: Greg Banks <gnb@sgi.com>
---

 include/linux/sunrpc/svc.h |    2 ++
 net/sunrpc/svc_xprt.c      |   25 ++++++++++++++++++-------
 2 files changed, 20 insertions(+), 7 deletions(-)

Index: bfields/include/linux/sunrpc/svc.h
===================================================================
--- bfields.orig/include/linux/sunrpc/svc.h
+++ bfields/include/linux/sunrpc/svc.h
@@ -41,6 +41,7 @@ struct svc_pool {
 	struct list_head	sp_sockets;	/* pending sockets */
 	unsigned int		sp_nrthreads;	/* # of threads in pool */
 	struct list_head	sp_all_threads;	/* all server threads */
+	int			sp_nwaking;	/* number of threads woken but not yet active */
 } ____cacheline_aligned_in_smp;
 
 /*
@@ -264,6 +265,7 @@ struct svc_rqst {
 						 * cache pages */
 	wait_queue_head_t	rq_wait;	/* synchronization */
 	struct task_struct	*rq_task;	/* service thread */
+	int			rq_waking;	/* 1 if thread is being woken */
 };
 
 /*
Index: bfields/net/sunrpc/svc_xprt.c
===================================================================
--- bfields.orig/net/sunrpc/svc_xprt.c
+++ bfields/net/sunrpc/svc_xprt.c
@@ -14,6 +14,8 @@
 
 #define RPCDBG_FACILITY	RPCDBG_SVCXPRT
 
+#define SVC_MAX_WAKING 5
+
 static struct svc_deferred_req *svc_deferred_dequeue(struct svc_xprt *xprt);
 static int svc_deferred_recv(struct svc_rqst *rqstp);
 static struct cache_deferred_req *svc_defer(struct cache_req *req);
@@ -298,6 +300,7 @@ void svc_xprt_enqueue(struct svc_xprt *x
 	struct svc_pool *pool;
 	struct svc_rqst	*rqstp;
 	int cpu;
+	int thread_avail;
 
 	if (!(xprt->xpt_flags &
 	      ((1<<XPT_CONN)|(1<<XPT_DATA)|(1<<XPT_CLOSE)|(1<<XPT_DEFERRED))))
@@ -309,12 +312,6 @@ void svc_xprt_enqueue(struct svc_xprt *x
 
 	spin_lock_bh(&pool->sp_lock);
 
-	if (!list_empty(&pool->sp_threads) &&
-	    !list_empty(&pool->sp_sockets))
-		printk(KERN_ERR
-		       "svc_xprt_enqueue: "
-		       "threads and transports both waiting??\n");
-
 	if (test_bit(XPT_DEAD, &xprt->xpt_flags)) {
 		/* Don't enqueue dead transports */
 		dprintk("svc: transport %p is dead, not enqueued\n", xprt);
@@ -353,7 +350,14 @@ void svc_xprt_enqueue(struct svc_xprt *x
 	}
 
  process:
-	if (!list_empty(&pool->sp_threads)) {
+	/* Work out whether threads are available */
+	thread_avail = !list_empty(&pool->sp_threads);	/* threads are asleep */
+	if (pool->sp_nwaking >= SVC_MAX_WAKING) {
+		/* too many threads are runnable and trying to wake up */
+		thread_avail = 0;
+	}
+
+	if (thread_avail) {
 		rqstp = list_entry(pool->sp_threads.next,
 				   struct svc_rqst,
 				   rq_list);
@@ -368,6 +372,8 @@ void svc_xprt_enqueue(struct svc_xprt *x
 		svc_xprt_get(xprt);
 		rqstp->rq_reserved = serv->sv_max_mesg;
 		atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
+		rqstp->rq_waking = 1;
+		pool->sp_nwaking++;
 		BUG_ON(xprt->xpt_pool != pool);
 		wake_up(&rqstp->rq_wait);
 	} else {
@@ -633,6 +639,11 @@ int svc_recv(struct svc_rqst *rqstp, lon
 		return -EINTR;
 
 	spin_lock_bh(&pool->sp_lock);
+	if (rqstp->rq_waking) {
+		rqstp->rq_waking = 0;
+		pool->sp_nwaking--;
+		BUG_ON(pool->sp_nwaking < 0);
+	}
 	xprt = svc_xprt_dequeue(pool);
 	if (xprt) {
 		rqstp->rq_xprt = xprt;

--
-- 
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [patch 3/3] knfsd: add file to export stats about nfsd pools
  2009-01-13 10:26 [patch 0/3] First tranche of SGI Enhanced NFS patches Greg Banks
  2009-01-13 10:26 ` [patch 1/3] knfsd: remove the nfsd thread busy histogram Greg Banks
  2009-01-13 10:26 ` [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages Greg Banks
@ 2009-01-13 10:26 ` Greg Banks
  2009-02-12 17:11   ` J. Bruce Fields
  2009-02-09  5:24 ` [patch 0/3] First tranche of SGI Enhanced NFS patches Greg Banks
  3 siblings, 1 reply; 25+ messages in thread
From: Greg Banks @ 2009-01-13 10:26 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Linux NFS ML, Harshula Jayasuriya

Add /proc/fs/nfsd/pool_stats to export to userspace various
statistics about the operation of rpc server thread pools.

This patch is based on a forward-ported version of
knfsd-add-pool-thread-stats which has been shipping in the SGI
"Enhanced NFS" product since 2006 and which was previously
posted:

http://article.gmane.org/gmane.linux.nfs/10375

It has also been updated thus:

 * moved EXPORT_SYMBOL() to near the function it exports
 * made the new struct seq_operations const
 * used SEQ_START_TOKEN instead of ((void *)1)
 * merged fix from SGI PV 990526 "sunrpc: use dprintk instead of
   printk in svc_pool_stats_*()" by Harshula Jayasuriya.
 * merged fix from SGI PV 964001 "Crash reading pool_stats before
   nfsds are started".

Signed-off-by: Greg Banks <gnb@sgi.com>
Signed-off-by: Harshula Jayasuriya <harshula@sgi.com>
---

 fs/nfsd/nfsctl.c           |   12 ++++
 fs/nfsd/nfssvc.c           |    7 ++
 include/linux/sunrpc/svc.h |   11 +++
 net/sunrpc/svc_xprt.c      |  100 +++++++++++++++++++++++++++++++++-
 4 files changed, 129 insertions(+), 1 deletion(-)

Index: bfields/fs/nfsd/nfsctl.c
===================================================================
--- bfields.orig/fs/nfsd/nfsctl.c
+++ bfields/fs/nfsd/nfsctl.c
@@ -60,6 +60,7 @@ enum {
 	NFSD_FO_UnlockFS,
 	NFSD_Threads,
 	NFSD_Pool_Threads,
+	NFSD_Pool_Stats,
 	NFSD_Versions,
 	NFSD_Ports,
 	NFSD_MaxBlkSize,
@@ -172,6 +173,16 @@ static const struct file_operations expo
 	.owner		= THIS_MODULE,
 };
 
+extern int nfsd_pool_stats_open(struct inode *inode, struct file *file);
+
+static struct file_operations pool_stats_operations = {
+	.open		= nfsd_pool_stats_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+	.owner		= THIS_MODULE,
+};
+
 /*----------------------------------------------------------------------------*/
 /*
  * payload - write methods
@@ -1246,6 +1257,7 @@ static int nfsd_fill_super(struct super_
 		[NFSD_Fh] = {"filehandle", &transaction_ops, S_IWUSR|S_IRUSR},
 		[NFSD_Threads] = {"threads", &transaction_ops, S_IWUSR|S_IRUSR},
 		[NFSD_Pool_Threads] = {"pool_threads", &transaction_ops, S_IWUSR|S_IRUSR},
+		[NFSD_Pool_Stats] = {"pool_stats", &pool_stats_operations, S_IRUGO},
 		[NFSD_Versions] = {"versions", &transaction_ops, S_IWUSR|S_IRUSR},
 		[NFSD_Ports] = {"portlist", &transaction_ops, S_IWUSR|S_IRUGO},
 		[NFSD_MaxBlkSize] = {"max_block_size", &transaction_ops, S_IWUSR|S_IRUGO},
Index: bfields/fs/nfsd/nfssvc.c
===================================================================
--- bfields.orig/fs/nfsd/nfssvc.c
+++ bfields/fs/nfsd/nfssvc.c
@@ -546,3 +546,10 @@ nfsd_dispatch(struct svc_rqst *rqstp, __
 	nfsd_cache_update(rqstp, proc->pc_cachetype, statp + 1);
 	return 1;
 }
+
+int nfsd_pool_stats_open(struct inode *inode, struct file *file)
+{
+	if (nfsd_serv == NULL)
+		return -ENODEV;
+	return svc_pool_stats_open(nfsd_serv, file);
+}
Index: bfields/include/linux/sunrpc/svc.h
===================================================================
--- bfields.orig/include/linux/sunrpc/svc.h
+++ bfields/include/linux/sunrpc/svc.h
@@ -24,6 +24,15 @@
  */
 typedef int		(*svc_thread_fn)(void *);
 
+/* statistics for svc_pool structures */
+struct svc_pool_stats {
+	unsigned long	packets;
+	unsigned long	sockets_queued;
+	unsigned long	threads_woken;
+	unsigned long	overloads_avoided;
+	unsigned long	threads_timedout;
+};
+
 /*
  *
  * RPC service thread pool.
@@ -42,6 +51,7 @@ struct svc_pool {
 	unsigned int		sp_nrthreads;	/* # of threads in pool */
 	struct list_head	sp_all_threads;	/* all server threads */
 	int			sp_nwaking;	/* number of threads woken but not yet active */
+	struct svc_pool_stats	sp_stats;	/* statistics on pool operation */
 } ____cacheline_aligned_in_smp;
 
 /*
@@ -396,6 +406,7 @@ struct svc_serv *  svc_create_pooled(str
 			sa_family_t, void (*shutdown)(struct svc_serv *),
 			svc_thread_fn, struct module *);
 int		   svc_set_num_threads(struct svc_serv *, struct svc_pool *, int);
+int		   svc_pool_stats_open(struct svc_serv *serv, struct file *file);
 void		   svc_destroy(struct svc_serv *);
 int		   svc_process(struct svc_rqst *);
 int		   svc_register(const struct svc_serv *, const unsigned short,
Index: bfields/net/sunrpc/svc_xprt.c
===================================================================
--- bfields.orig/net/sunrpc/svc_xprt.c
+++ bfields/net/sunrpc/svc_xprt.c
@@ -318,6 +318,8 @@ void svc_xprt_enqueue(struct svc_xprt *x
 		goto out_unlock;
 	}
 
+	pool->sp_stats.packets++;
+
 	/* Mark transport as busy. It will remain in this state until
 	 * the provider calls svc_xprt_received. We update XPT_BUSY
 	 * atomically because it also guards against trying to enqueue
@@ -355,6 +357,7 @@ void svc_xprt_enqueue(struct svc_xprt *x
 	if (pool->sp_nwaking >= SVC_MAX_WAKING) {
 		/* too many threads are runnable and trying to wake up */
 		thread_avail = 0;
+		pool->sp_stats.overloads_avoided++;
 	}
 
 	if (thread_avail) {
@@ -374,11 +377,13 @@ void svc_xprt_enqueue(struct svc_xprt *x
 		atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
 		rqstp->rq_waking = 1;
 		pool->sp_nwaking++;
+		pool->sp_stats.threads_woken++;
 		BUG_ON(xprt->xpt_pool != pool);
 		wake_up(&rqstp->rq_wait);
 	} else {
 		dprintk("svc: transport %p put into queue\n", xprt);
 		list_add_tail(&xprt->xpt_ready, &pool->sp_sockets);
+		pool->sp_stats.sockets_queued++;
 		BUG_ON(xprt->xpt_pool != pool);
 	}
 
@@ -591,6 +596,7 @@ int svc_recv(struct svc_rqst *rqstp, lon
 	int			pages;
 	struct xdr_buf		*arg;
 	DECLARE_WAITQUEUE(wait, current);
+	long			time_left;
 
 	dprintk("svc: server %p waiting for data (to = %ld)\n",
 		rqstp, timeout);
@@ -676,12 +682,14 @@ int svc_recv(struct svc_rqst *rqstp, lon
 		add_wait_queue(&rqstp->rq_wait, &wait);
 		spin_unlock_bh(&pool->sp_lock);
 
-		schedule_timeout(timeout);
+		time_left = schedule_timeout(timeout);
 
 		try_to_freeze();
 
 		spin_lock_bh(&pool->sp_lock);
 		remove_wait_queue(&rqstp->rq_wait, &wait);
+		if (!time_left)
+			pool->sp_stats.threads_timedout++;
 
 		xprt = rqstp->rq_xprt;
 		if (!xprt) {
@@ -1103,3 +1111,93 @@ int svc_xprt_names(struct svc_serv *serv
 	return totlen;
 }
 EXPORT_SYMBOL_GPL(svc_xprt_names);
+
+
+/*----------------------------------------------------------------------------*/
+
+static void *svc_pool_stats_start(struct seq_file *m, loff_t *pos)
+{
+	unsigned int pidx = (unsigned int)*pos;
+	struct svc_serv *serv = m->private;
+
+	dprintk("svc_pool_stats_start, *pidx=%u\n", pidx);
+
+	lock_kernel();
+	/* bump up the pseudo refcount while traversing */
+	svc_get(serv);
+	unlock_kernel();
+
+	if (!pidx)
+		return SEQ_START_TOKEN;
+	return (pidx > serv->sv_nrpools ? NULL : &serv->sv_pools[pidx-1]);
+}
+
+static void *svc_pool_stats_next(struct seq_file *m, void *p, loff_t *pos)
+{
+	struct svc_pool *pool = p;
+	struct svc_serv *serv = m->private;
+
+	dprintk("svc_pool_stats_next, *pos=%llu\n", *pos);
+
+	if (p == SEQ_START_TOKEN) {
+		pool = &serv->sv_pools[0];
+	} else {
+		unsigned int pidx = (pool - &serv->sv_pools[0]);
+		if (pidx < serv->sv_nrpools-1)
+			pool = &serv->sv_pools[pidx+1];
+		else
+			pool = NULL;
+	}
+	++*pos;
+	return pool;
+}
+
+static void svc_pool_stats_stop(struct seq_file *m, void *p)
+{
+	struct svc_serv *serv = m->private;
+
+	lock_kernel();
+	/* this function really, really should have been called svc_put() */
+	svc_destroy(serv);
+	unlock_kernel();
+}
+
+static int svc_pool_stats_show(struct seq_file *m, void *p)
+{
+	struct svc_pool *pool = p;
+
+	if (p == SEQ_START_TOKEN) {
+		seq_puts(m, "# pool packets-arrived sockets-enqueued threads-woken overloads-avoided threads-timedout\n");
+		return 0;
+	}
+
+	seq_printf(m, "%u %lu %lu %lu %lu %lu\n",
+		pool->sp_id,
+		pool->sp_stats.packets,
+		pool->sp_stats.sockets_queued,
+		pool->sp_stats.threads_woken,
+		pool->sp_stats.overloads_avoided,
+		pool->sp_stats.threads_timedout);
+
+	return 0;
+}
+
+static const struct seq_operations svc_pool_stats_seq_ops = {
+	.start	= svc_pool_stats_start,
+	.next	= svc_pool_stats_next,
+	.stop	= svc_pool_stats_stop,
+	.show	= svc_pool_stats_show,
+};
+
+int svc_pool_stats_open(struct svc_serv *serv, struct file *file)
+{
+	int err;
+
+	err = seq_open(file, &svc_pool_stats_seq_ops);
+	if (!err)
+		((struct seq_file *) file->private_data)->private = serv;
+	return err;
+}
+EXPORT_SYMBOL(svc_pool_stats_open);
+
+/*----------------------------------------------------------------------------*/

--
-- 
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages
  2009-01-13 10:26 ` [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages Greg Banks
@ 2009-01-13 14:33   ` Peter Staubach
  2009-01-13 22:15     ` Greg Banks
  2009-02-11 23:10   ` J. Bruce Fields
  1 sibling, 1 reply; 25+ messages in thread
From: Peter Staubach @ 2009-01-13 14:33 UTC (permalink / raw)
  To: Greg Banks; +Cc: J. Bruce Fields, Linux NFS ML

Greg Banks wrote:
> Avoid overloading the CPU scheduler with enormous load averages
> when handling high call-rate NFS loads.  When the knfsd bottom half
> is made aware of an incoming call by the socket layer, it tries to
> choose an nfsd thread and wake it up.  As long as there are idle
> threads, one will be woken up.
>
> If there are a lot of nfsd threads (a sensible configuration when
> the server is disk-bound or is running an HSM), there will be many
> more nfsd threads than CPUs to run them.  Under a high call-rate
> low service-time workload, the result is that almost every nfsd is
> runnable, but only a handful are actually able to run.  This situation
> causes two significant problems:
>
> 1. The CPU scheduler takes over 10% of each CPU, which is robbing
>    the nfsd threads of valuable CPU time.
>
> 2. At a high enough load, the nfsd threads starve userspace threads
>    of CPU time, to the point where daemons like portmap and rpc.mountd
>    do not schedule for tens of seconds at a time.  Clients attempting
> to mount an NFS filesystem time out at the very first step (opening
>    a TCP connection to portmap) because portmap cannot wake up from
>    select() and call accept() in time.
>
> Disclaimer: these effects were observed on a SLES9 kernel, modern
> kernels' schedulers may behave more gracefully.
>
> The solution is simple: keep in each svc_pool a counter of the number
> of threads which have been woken but have not yet run, and do not wake
> any more if that count reaches an arbitrary small threshold.
>
> Testing was on a 4 CPU 4 NIC Altix using 4 IRIX clients, each with 16
> synthetic client threads simulating an rsync (i.e. recursive directory
> listing) workload reading from an i386 RH9 install image (161480
> regular files in 10841 directories) on the server.  That tree is small
> enough to fit in the server's RAM so no disk traffic was involved.
> This setup gives a sustained call rate in excess of 60000 calls/sec
> before being CPU-bound on the server.  The server was running 128 nfsds.
>
> Profiling showed schedule() taking 6.7% of every CPU, and __wake_up()
> taking 5.2%.  This patch drops those contributions to 3.0% and 2.2%.
> Load average was over 120 before the patch, and 20.9 after.
>
> This patch is a forward-ported version of knfsd-avoid-nfsd-overload
> which has been shipping in the SGI "Enhanced NFS" product since 2006.
> It has been posted before:
>
> http://article.gmane.org/gmane.linux.nfs/10374
>
> Signed-off-by: Greg Banks <gnb@sgi.com>
> ---

Have you measured the impact of these changes for something
like SpecSFS?

    Thanx...

       ps

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 1/3] knfsd: remove the nfsd thread busy histogram
  2009-01-13 10:26 ` [patch 1/3] knfsd: remove the nfsd thread busy histogram Greg Banks
@ 2009-01-13 16:41   ` Chuck Lever
  2009-01-13 22:50     ` Greg Banks
  0 siblings, 1 reply; 25+ messages in thread
From: Chuck Lever @ 2009-01-13 16:41 UTC (permalink / raw)
  To: Greg Banks; +Cc: J. Bruce Fields, Linux NFS ML

On Jan 13, 2009, at 5:26 AM, Greg Banks wrote:
> Stop gathering the data that feeds the 'th' line in /proc/net/rpc/nfsd
> because the questionable data provided is not worth the scalability
> impact of calculating it.  Instead, always report zeroes.  The current
> approach suffers from three major issues:
>
> 1. update_thread_usage() increments buckets by call service
>   time or call arrival time...in jiffies.  On lightly loaded
>   machines, call service times are usually < 1 jiffy; on
>   heavily loaded machines call arrival times will be << 1 jiffy.
>   So a large portion of the updates to the buckets are rounded
>   down to zero, and the histogram is undercounting.

Use ktime_get_real() instead.  This is what the network layer uses.   
You could even steal the start timestamp from the first skbuff in an  
incoming RPC request.

This problem is made worse on "server" configurations and in virtual  
guests which may still use HZ=100, though with tickless HZ this is a  
less frequently seen configuration.

> 2. As seen previously on the nfs mailing list, the format in which
>   the histogram is presented is cryptic, difficult to explain,
>   and difficult to use.

A user space script similar to mountstats that interprets these  
metrics might help here.

> 3. Updating the histogram requires taking a global spinlock and
>   dirtying the global variables nfsd_last_call, nfsd_busy, and
>   nfsdstats *twice* on every RPC call, which is a significant
>   scaling limitation.

You might fix this by making the global variables into per-CPU  
variables, then totaling the per-CPU variables only at presentation  
time (ie when someone cats /proc/net/rpc/nfsd).  That would make the  
collection logic lockless.

> Testing on a 4 CPU 4 NIC Altix using 4 IRIX clients each doing
> 1K streaming reads at full line rate, shows the stats update code
> (inlined into nfsd()) takes about 1.7% of each CPU.  This patch drops
> the contribution from nfsd() into the profile noise.
>
> This patch is a forward-ported version of knfsd-remove-nfsd-threadstats
> which has been shipping in the SGI "Enhanced NFS" product since 2006.
> In that time, exactly one customer has noticed that the threadstats
> were missing.

Yeah.  The real issue here is deciding whether these stats are useful  
or not; if not, can they be made useable?

>  It has been previously posted:
>
> http://article.gmane.org/gmane.linux.nfs/10376
>
> and more recently requested to be posted again.
>
> Signed-off-by: Greg Banks <gnb@sgi.com>
> ---
>
> fs/nfsd/nfssvc.c |   28 ----------------------------
> 1 file changed, 28 deletions(-)
>
> Index: bfields/fs/nfsd/nfssvc.c
> ===================================================================
> --- bfields.orig/fs/nfsd/nfssvc.c
> +++ bfields/fs/nfsd/nfssvc.c
> @@ -40,9 +40,6 @@
> extern struct svc_program	nfsd_program;
> static int			nfsd(void *vrqstp);
> struct timeval			nfssvc_boot;
> -static atomic_t			nfsd_busy;
> -static unsigned long		nfsd_last_call;
> -static DEFINE_SPINLOCK(nfsd_call_lock);
>
> /*
>  * nfsd_mutex protects nfsd_serv -- both the pointer itself and the  
> members
> @@ -227,7 +224,6 @@ int nfsd_create_serv(void)
> 			nfsd_max_blksize /= 2;
> 	}
>
> -	atomic_set(&nfsd_busy, 0);
> 	nfsd_serv = svc_create_pooled(&nfsd_program, nfsd_max_blksize,
> 				      AF_INET,
> 				      nfsd_last_thread, nfsd, THIS_MODULE);
> @@ -376,26 +372,6 @@ nfsd_svc(unsigned short port, int nrserv
> 	return error;
> }
>
> -static inline void
> -update_thread_usage(int busy_threads)
> -{
> -	unsigned long prev_call;
> -	unsigned long diff;
> -	int decile;
> -
> -	spin_lock(&nfsd_call_lock);
> -	prev_call = nfsd_last_call;
> -	nfsd_last_call = jiffies;
> -	decile = busy_threads*10/nfsdstats.th_cnt;
> -	if (decile>0 && decile <= 10) {
> -		diff = nfsd_last_call - prev_call;
> -		if ( (nfsdstats.th_usage[decile-1] += diff) >= NFSD_USAGE_WRAP)
> -			nfsdstats.th_usage[decile-1] -= NFSD_USAGE_WRAP;
> -		if (decile == 10)
> -			nfsdstats.th_fullcnt++;
> -	}
> -	spin_unlock(&nfsd_call_lock);
> -}
>
> /*
>  * This is the NFS server kernel thread
> @@ -464,8 +440,6 @@ nfsd(void *vrqstp)
> 			continue;
> 		}
>
> -		update_thread_usage(atomic_read(&nfsd_busy));
> -		atomic_inc(&nfsd_busy);
>
> 		/* Lock the export hash tables for reading. */
> 		exp_readlock();
> @@ -474,8 +448,6 @@ nfsd(void *vrqstp)
>
> 		/* Unlock export hash tables */
> 		exp_readunlock();
> -		update_thread_usage(atomic_read(&nfsd_busy));
> -		atomic_dec(&nfsd_busy);
> 	}
>
> 	/* Clear signals before calling svc_exit_thread() */
>

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages
  2009-01-13 14:33   ` Peter Staubach
@ 2009-01-13 22:15     ` Greg Banks
       [not found]       ` <496D1294.1060407-cP1dWloDopni96+mSzHFpQC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 25+ messages in thread
From: Greg Banks @ 2009-01-13 22:15 UTC (permalink / raw)
  To: Peter Staubach; +Cc: J. Bruce Fields, Linux NFS ML

Peter Staubach wrote:
> Greg Banks wrote:
>> [...]
>> Testing was on a 4 CPU 4 NIC Altix using 4 IRIX clients, each with 16
>> synthetic client threads simulating an rsync (i.e. recursive directory
>> listing) workload[...]
>>
>> Profiling showed schedule() taking 6.7% of every CPU, and __wake_up()
>> taking 5.2%.  This patch drops those contributions to 3.0% and 2.2%.
>> Load average was over 120 before the patch, and 20.9 after.
>> [...]
>
> Have you measured the impact of these changes for something
> like SpecSFS?

Not individually.  This patch was part of some work I did in late
2005/early 2006 which was aimed at improving NFS server performance in
general.  I do know that the server's SpecSFS numbers jumped by a factor
of somewhere over 2x, from embarrassingly bad to publishable, when
SpecSFS was re-run after that work.  However at the time I did not have
the ability to run SpecSFS myself; it was run by a separate group of
people who had dedicated hardware and experience.  So I can't tell what
contribution this particular patch made to the overall SpecSFS
improvements.  Sorry.

-- 
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages
       [not found]       ` <496D1294.1060407-cP1dWloDopni96+mSzHFpQC/G2K4zDHf@public.gmane.org>
@ 2009-01-13 22:35         ` Peter Staubach
  2009-01-13 23:04           ` Greg Banks
  0 siblings, 1 reply; 25+ messages in thread
From: Peter Staubach @ 2009-01-13 22:35 UTC (permalink / raw)
  To: Greg Banks; +Cc: J. Bruce Fields, Linux NFS ML

Greg Banks wrote:
> Peter Staubach wrote:
>   
>> Greg Banks wrote:
>>     
>>> [...]
>>> Testing was on a 4 CPU 4 NIC Altix using 4 IRIX clients, each with 16
>>> synthetic client threads simulating an rsync (i.e. recursive directory
>>> listing) workload[...]
>>>
>>> Profiling showed schedule() taking 6.7% of every CPU, and __wake_up()
>>> taking 5.2%.  This patch drops those contributions to 3.0% and 2.2%.
>>> Load average was over 120 before the patch, and 20.9 after.
>>> [...]
>>>       
>> Have you measured the impact of these changes for something
>> like SpecSFS?
>>     
>
> Not individually.  This patch was part of some work I did in late
> 2005/early 2006 which was aimed at improving NFS server performance in
> general.  I do know that the server's SpecSFS numbers jumped by a factor
> of somewhere over 2x, from embarrassingly bad to publishable, when
> SpecSFS was re-run after that work.  However at the time I did not have
> the ability to run SpecSFS myself, it was run by a separate group of
> people who had dedicated hardware and experience.  So I can't tell what
> contribution this particular patch made to the overall SpecSFS
> improvements.  Sorry.

That does sound promising though.  :-)

It would be interesting to get some better information regarding
some of the measurable performance ramifications such as SFS
though.  The Linux NFS server has not had much attention paid to
it and I suspect that it could use some work in the performance
area.

    Thanx...

       ps

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 1/3] knfsd: remove the nfsd thread busy histogram
  2009-01-13 16:41   ` Chuck Lever
@ 2009-01-13 22:50     ` Greg Banks
       [not found]       ` <496D1ACC.7070106-cP1dWloDopni96+mSzHFpQC/G2K4zDHf@public.gmane.org>
  0 siblings, 1 reply; 25+ messages in thread
From: Greg Banks @ 2009-01-13 22:50 UTC (permalink / raw)
  To: Chuck Lever; +Cc: J. Bruce Fields, Linux NFS ML

Chuck Lever wrote:
> On Jan 13, 2009, at 5:26 AM, Greg Banks wrote:
>> Stop gathering the data that feeds the 'th' line in /proc/net/rpc/nfsd
>> because the questionable data provided is not worth the scalability
>> impact of calculating it.  Instead, always report zeroes.  The current
>> approach suffers from three major issues:
>>
>> 1. update_thread_usage() increments buckets by call service
>>   time or call arrival time...in jiffies.  On lightly loaded
>>   machines, call service times are usually < 1 jiffy; on
>>   heavily loaded machines call arrival times will be << 1 jiffy.
>>   So a large portion of the updates to the buckets are rounded
>>   down to zero, and the histogram is undercounting.
>
> Use ktime_get_real() instead.  This is what the network layer uses.
IIRC that wasn't available when I wrote the patch (2.6.5 kernel in SLES9
in late 2005).  I haven't looked at it again since.

Later, I looked at gathering better statistics on thread usage, and I
investigated real time clock (more precisely, monotonic clock)
implementations in Linux and came to the sad conclusion that there was
no API I could call that would be both accurate and efficient on two or
more platforms, so I gave up.  The HPET hardware timer looked promising
for a while, but it turned out that a 32b kernel used a global spinlock
to access the 64b HPET registers, which created the same scalability
problem I was trying to fix.  Things may have improved since then.

If we had such a clock though, the solution is very simple.  Each nfsd
maintains two new 64b counters of nanoseconds spent in each of two
states "busy" and "idle", where "idle" is asleep waiting for a call and
"busy" is everything else.  These are maintained in svc_recv(). 
Interfaces are provided for userspace to read an aggregation of these
counters, per-pool and globally.  Userspace rate-converts the counters;
the rate of increase of the two counters tells you both how many threads
there are and how much actual demand on thread time there is.  This is
how I did it in Irix (SGI Origin machines had a global distributed
monotonic clock in hardware).
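The two-counter scheme described above can be sketched in userspace C. All names here are hypothetical illustrations, not the actual SGI implementation; the real counters would be maintained inside svc_recv() against a kernel monotonic clock:

```c
#include <stdint.h>

/* Hypothetical per-thread accounting for the scheme sketched above: two
 * 64-bit nanosecond counters, "busy" and "idle".  Timestamps are passed
 * in explicitly so the logic is independent of any particular clock. */
struct nfsd_time_stats {
	uint64_t busy_ns;	/* time spent handling calls */
	uint64_t idle_ns;	/* time spent asleep waiting for a call */
};

/* Thread is about to sleep at 'now': the interval since 'last' was busy.
 * Returns 'now' as the new reference point. */
uint64_t nfsd_mark_sleep(struct nfsd_time_stats *st,
			 uint64_t last, uint64_t now)
{
	st->busy_ns += now - last;
	return now;
}

/* Thread just woke at 'now': the interval since 'last' was idle. */
uint64_t nfsd_mark_wake(struct nfsd_time_stats *st,
			uint64_t last, uint64_t now)
{
	st->idle_ns += now - last;
	return now;
}
```

Userspace would then rate-convert: over a sampling interval, busy_ns / (busy_ns + idle_ns) gives the average fraction of thread time under demand, and aggregating per-pool answers the "do I need more nfsds?" question directly.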

>   You could even steal the start timestamp from the first skbuff in an
> incoming RPC request.

This would help if we had skbuffs: NFS/RDMA doesn't.
>
> This problem is made worse on "server" configurations and in virtual
> guests which may still use HZ=100, though with tickless HZ this is a
> less frequently seen configuration.

Indeed.
>
>> 2. As seen previously on the nfs mailing list, the format in which
>>   the histogram is presented is cryptic, difficult to explain,
>>   and difficult to use.
>
> A user space script similar to mountstats that interprets these
> metrics might help here.

The formatting in the pseudofile isn't the entire problem.  The problem
is translating the "thread usage histogram" information there into an
answer to the actual question the sysadmin wants, which is "should I
configure more nfsds?"

>
>> 3. Updating the histogram requires taking a global spinlock and
>>   dirtying the global variables nfsd_last_call, nfsd_busy, and
>>   nfsdstats *twice* on every RPC call, which is a significant
>>   scaling limitation.
>
> You might fix this by making the global variables into per-CPU
> variables, then totaling the per-CPU variables only at presentation
> time (ie when someone cats /proc/net/rpc/nfsd).  That would make the
> collection logic lockless.
This is how I fixed some of the other server stats in later patches. 
IIRC that approach doesn't work for the thread usage histogram because
it's scaled as it's gathered by a potentially time-varying global number
so the on-demand totalling might not give correct results.  Also, in the
presence of multiple thread pools any thread usage information should be
per-pool not global.  At the time I wrote this patch I concluded that I
couldn't make the gathering scale and still preserve the exact semantics
of the data gathered.
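For counters that are *not* scaled as they are gathered, Chuck's per-CPU suggestion looks roughly like this in a userspace sketch (a fixed array stands in for real per-CPU variables; a kernel version would use DEFINE_PER_CPU and for_each_possible_cpu, and this deliberately ignores the scaling-at-gather-time problem described above):

```c
#include <stdint.h>

#define NR_CPUS 4	/* illustrative stand-in for the real CPU count */

/* One counter slot per CPU: the hot path touches only the local slot,
 * so no lock is needed; readers sum the slots at presentation time. */
struct pcpu_stat {
	uint64_t count[NR_CPUS];
};

void pcpu_stat_inc(struct pcpu_stat *s, int cpu)
{
	s->count[cpu]++;	/* lockless: each CPU writes only its own slot */
}

uint64_t pcpu_stat_read(const struct pcpu_stat *s)
{
	uint64_t total = 0;
	int i;

	for (i = 0; i < NR_CPUS; i++)
		total += s->count[i];	/* summed only when /proc is read */
	return total;
}
```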

>
>>
>
> Yeah.  The real issue here is deciding whether these stats are useful
> or not; 
In my experience, not.

> if not, can they be made useable?
A different form of the data could certainly be made useful.

-- 
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages
  2009-01-13 22:35         ` Peter Staubach
@ 2009-01-13 23:04           ` Greg Banks
  0 siblings, 0 replies; 25+ messages in thread
From: Greg Banks @ 2009-01-13 23:04 UTC (permalink / raw)
  To: Peter Staubach; +Cc: J. Bruce Fields, Linux NFS ML

Peter Staubach wrote:
> Greg Banks wrote:
>
> It would be interesting to get some better information regarding
> some of the measurable performance ramifications such as SFS
> though.  
My NFS server work got SpecSFS to the condition of being disk subsystem
bound instead of CPU bound, at which point it was the XFS folks' problem.
> The Linux NFS server has not had much attention paid to
> it 
It's had attention paid to it, just not yet published :-(

-- 
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 0/3] First tranche of SGI Enhanced NFS patches
  2009-01-13 10:26 [patch 0/3] First tranche of SGI Enhanced NFS patches Greg Banks
                   ` (2 preceding siblings ...)
  2009-01-13 10:26 ` [patch 3/3] knfsd: add file to export stats about nfsd pools Greg Banks
@ 2009-02-09  5:24 ` Greg Banks
  2009-02-09 20:47   ` J. Bruce Fields
  3 siblings, 1 reply; 25+ messages in thread
From: Greg Banks @ 2009-02-09  5:24 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Linux NFS ML

Greg Banks wrote:
> G'day,
>
> This is the first tranche of patches from SGI's Enhanced NFS product.
>
>   
http://marc.info/?l=linux-nfs&m=123184242909159&w=2
http://marc.info/?l=linux-nfs&m=123184242709153&w=2
http://marc.info/?l=linux-nfs&m=123184242809157&w=2

Bruce, any word on these?  I don't seem to have any specific review
items that I need to pay attention to with these patches, and I don't
see them in your for-2.6.30 branch, so can I get an ack or a nack or
feedback on things that need fixing?

-- 
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 0/3] First tranche of SGI Enhanced NFS patches
  2009-02-09  5:24 ` [patch 0/3] First tranche of SGI Enhanced NFS patches Greg Banks
@ 2009-02-09 20:47   ` J. Bruce Fields
  2009-02-09 23:26     ` Greg Banks
  0 siblings, 1 reply; 25+ messages in thread
From: J. Bruce Fields @ 2009-02-09 20:47 UTC (permalink / raw)
  To: Greg Banks; +Cc: Linux NFS ML

On Mon, Feb 09, 2009 at 04:24:27PM +1100, Greg Banks wrote:
> Bruce, any word on these?  I don't seem to have any specific review
> items that I need to pay attention to with these patches, and I don't
> see them in your for-2.6.30 branch, so can I get an ack or a nack or
> feedback on things that need fixing?

Sorry, that came around the time of the citi compromise, so I just
registered that it had gotten some responses, figured it'd probably be
resent, and filed it away....

(And, by the way, if anyone's waiting for me to respond to email from
the last month--you might want to resend.  The longer version:

We now believe that password-logging ssh and sshd were installed on citi
machines as early as November.  We got reports of ssh scanning in
December and January, but just took down the misbehaving machines.  In
mid-January we finally realized the problem was serious, disconnected
ourselves from the internet completely, took everything on our local
network offline (including our main mail server and linux-nfs.org), then
brought our external connection back up and slowly reconnected machines
to our local network as we audited and/or rebuilt them as appropriate.

To be cautious, I also did the same for my personal machines (including
my personal mail server), though I didn't have specific evidence they'd
been compromised.

The upshot is: there were a few days when mail wasn't getting through at
all, and I know at least some was never delivered.  When it did get
through, I wasn't necessarily able to pay it much attention.  So besides
just a sob-story, this is a request that people ping me if I haven't
responded to something I should have lately.)

--b.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 0/3] First tranche of SGI Enhanced NFS patches
  2009-02-09 20:47   ` J. Bruce Fields
@ 2009-02-09 23:26     ` Greg Banks
  0 siblings, 0 replies; 25+ messages in thread
From: Greg Banks @ 2009-02-09 23:26 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Linux NFS ML

J. Bruce Fields wrote:
> On Mon, Feb 09, 2009 at 04:24:27PM +1100, Greg Banks wrote:
>   
>> Bruce, any word on these?  I don't seem to have any specific review
>> items that I need to pay attention to with these patches, and I don't
>> see them in your for-2.6.30 branch, so can I get an ack or a nack or
>> feedback on things that need fixing?
>>     
>
> Sorry, that came around the time of the citi compromise, so I just
> registered that it had gotten some responses, figured it'd probably be
> resent, and filed it away....
>   
Aha.  The conversations didn't result in any specific feedback items or
improvements that I can see, unless I'm misunderstanding what people
said.  So I don't have any newer versions of the patches to send.  Do
you want me to resend anyway?
>
> We now believe that password-logging ssh and sshd were installed on citi
> machines as early as November. [...]

Ouch.  Well, that explains the linux-nfs.org outages.


-- 
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 1/3] knfsd: remove the nfsd thread busy histogram
       [not found]       ` <496D1ACC.7070106-cP1dWloDopni96+mSzHFpQC/G2K4zDHf@public.gmane.org>
@ 2009-02-11 21:59         ` J. Bruce Fields
  0 siblings, 0 replies; 25+ messages in thread
From: J. Bruce Fields @ 2009-02-11 21:59 UTC (permalink / raw)
  To: Greg Banks; +Cc: Chuck Lever, Linux NFS ML

On Wed, Jan 14, 2009 at 09:50:52AM +1100, Greg Banks wrote:
> Chuck Lever wrote:
> > On Jan 13, 2009, at 5:26 AM, Greg Banks wrote:
> >> Stop gathering the data that feeds the 'th' line in /proc/net/rpc/nfsd
> >> because the questionable data provided is not worth the scalability
> >> impact of calculating it.  Instead, always report zeroes.  The current
> >> approach suffers from three major issues:
> >>
> >> 1. update_thread_usage() increments buckets by call service
> >>   time or call arrival time...in jiffies.  On lightly loaded
> >>   machines, call service times are usually < 1 jiffy; on
> >>   heavily loaded machines call arrival times will be << 1 jiffy.
> >>   So a large portion of the updates to the buckets are rounded
> >>   down to zero, and the histogram is undercounting.
> >
> > Use ktime_get_real() instead.  This is what the network layer uses.
> IIRC that wasn't available when I wrote the patch (2.6.5 kernel in SLES9
> in late 2005).  I haven't looked at it again since.
> 
> Later, I looked at gathering better statistics on thread usage, and I
> investigated real time clock (more precisely, monotonic clock)
> implementations in Linux and came to the sad conclusion that there was
> no API I could call that would be both accurate and efficient on two or
> more platforms, so I gave up.  The HPET hardware timer looked promising
> for a while, but it turned out that a 32b kernel used a global spinlock
> to access the 64b HPET registers, which created the same scalability
> problem I was trying to fix.  Things may have improved since then.
> 
> If we had such a clock though, the solution is very simple.  Each nfsd
> maintains two new 64b counters of nanoseconds spent in each of two
> states "busy" and "idle", where "idle" is asleep waiting for a call and
> "busy" is everything else.  These are maintained in svc_recv(). 
> Interfaces are provided for userspace to read an aggregation of these
> counters, per-pool and globally.  Userspace rate-converts the counters;
> the rate of increase of the two counters tells you both how many threads
> there are and how much actual demand on thread time there is.  This is
> how I did it in Irix (SGI Origin machines had a global distributed
> monotonic clock in hardware).

Is it worth looking into this again now?

> >   You could even steal the start timestamp from the first skbuff in an
> > incoming RPC request.
> 
> This would help if we had skbuffs: NFS/RDMA doesn't.
> >
> > This problem is made worse on "server" configurations and in virtual
> > guests which may still use HZ=100, though with tickless HZ this is a
> > less frequently seen configuration.
> 
> Indeed.
> >
> >> 2. As seen previously on the nfs mailing list, the format in which
> >>   the histogram is presented is cryptic, difficult to explain,
> >>   and difficult to use.
> >
> > A user space script similar to mountstats that interprets these
> > metrics might help here.
> 
> The formatting in the pseudofile isn't the entire problem.  The problem
> is translating the "thread usage histogram" information there into an
> answer to the actual question the sysadmin wants, which is "should I
> configure more nfsds?"

Agreed.

> >> 3. Updating the histogram requires taking a global spinlock and
> >>   dirtying the global variables nfsd_last_call, nfsd_busy, and
> >>   nfsdstats *twice* on every RPC call, which is a significant
> >>   scaling limitation.
> >
> > You might fix this by making the global variables into per-CPU
> > variables, then totaling the per-CPU variables only at presentation
> > time (ie when someone cats /proc/net/rpc/nfsd).  That would make the
> > collection logic lockless.
> This is how I fixed some of the other server stats in later patches. 
> IIRC that approach doesn't work for the thread usage histogram because
> it's scaled as it's gathered by a potentially time-varying global number
> so the on-demand totalling might not give correct results.

Right, so, amuse yourself watching me as I try to remember how the 'th'
line works: if it's attempting to report, e.g., what length of time 10%
of the threads are busy, it needs very local knowledge: if you know each
thread was busy 10 jiffies out of the last 100, you'd still need to know
whether they were all busy during the *same* 10-jiffy interval, or
whether that work was spread out more evenly over the 100 jiffies....

--b.

> Also, in the
> presence of multiple thread pools any thread usage information should be
> per-pool not global.  At the time I wrote this patch I concluded that I
> couldn't make the gathering scale and still preserve the exact semantics
> of the data gathered.
> 
> >
> >>
> >
> > Yeah.  The real issue here is deciding whether these stats are useful
> > or not; 
> In my experience, not.
> 
> > if not, can they be made useable?
> A different form of the data could certainly be made useful.
> 
> -- 
> Greg Banks, P.Engineer, SGI Australian Software Group.
> the brightly coloured sporks of revolution.
> I don't speak for SGI.
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages
  2009-01-13 10:26 ` [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages Greg Banks
  2009-01-13 14:33   ` Peter Staubach
@ 2009-02-11 23:10   ` J. Bruce Fields
  2009-02-19  6:25     ` Greg Banks
  1 sibling, 1 reply; 25+ messages in thread
From: J. Bruce Fields @ 2009-02-11 23:10 UTC (permalink / raw)
  To: Greg Banks; +Cc: Linux NFS ML

On Tue, Jan 13, 2009 at 09:26:35PM +1100, Greg Banks wrote:
> Avoid overloading the CPU scheduler with enormous load averages
> when handling high call-rate NFS loads.  When the knfsd bottom half
> is made aware of an incoming call by the socket layer, it tries to
> choose an nfsd thread and wake it up.  As long as there are idle
> threads, one will be woken up.
> 
> If there are a lot of nfsd threads (a sensible configuration when
> the server is disk-bound or is running an HSM), there will be many
> more nfsd threads than CPUs to run them.  Under a high call-rate
> low service-time workload, the result is that almost every nfsd is
> runnable, but only a handful are actually able to run.  This situation
> causes two significant problems:
> 
> 1. The CPU scheduler takes over 10% of each CPU, which is robbing
>    the nfsd threads of valuable CPU time.
> 
> 2. At a high enough load, the nfsd threads starve userspace threads
>    of CPU time, to the point where daemons like portmap and rpc.mountd
>    do not schedule for tens of seconds at a time.  Clients attempting
>    to mount an NFS filesystem timeout at the very first step (opening
>    a TCP connection to portmap) because portmap cannot wake up from
>    select() and call accept() in time.
> 
> Disclaimer: these effects were observed on a SLES9 kernel, modern
> kernels' schedulers may behave more gracefully.

Yes, googling for "SLES9 kernel"...   Was that really 2.6.5 based?

The scheduler's been through at least one complete rewrite since then,
so the obvious question is whether it's wise to apply something that may
turn out to have been very specific to an old version of the scheduler.

It's a simple enough patch, but without any suggestion for how to retest
on a more recent kernel, I'm uneasy.

--b.

> 
> The solution is simple: keep in each svc_pool a counter of the number
> of threads which have been woken but have not yet run, and do not wake
> any more if that count reaches an arbitrary small threshold.
> 
> Testing was on a 4 CPU 4 NIC Altix using 4 IRIX clients, each with 16
> synthetic client threads simulating an rsync (i.e. recursive directory
> listing) workload reading from an i386 RH9 install image (161480
> regular files in 10841 directories) on the server.  That tree is small
> enough to fill in the server's RAM so no disk traffic was involved.
> This setup gives a sustained call rate in excess of 60000 calls/sec
> before being CPU-bound on the server.  The server was running 128 nfsds.
> 
> Profiling showed schedule() taking 6.7% of every CPU, and __wake_up()
> taking 5.2%.  This patch drops those contributions to 3.0% and 2.2%.
> Load average was over 120 before the patch, and 20.9 after.
> 
> This patch is a forward-ported version of knfsd-avoid-nfsd-overload
> which has been shipping in the SGI "Enhanced NFS" product since 2006.
> It has been posted before:
> 
> http://article.gmane.org/gmane.linux.nfs/10374
> 
> Signed-off-by: Greg Banks <gnb@sgi.com>
> ---
> 
>  include/linux/sunrpc/svc.h |    2 ++
>  net/sunrpc/svc_xprt.c      |   25 ++++++++++++++++++-------
>  2 files changed, 20 insertions(+), 7 deletions(-)
> 
> Index: bfields/include/linux/sunrpc/svc.h
> ===================================================================
> --- bfields.orig/include/linux/sunrpc/svc.h
> +++ bfields/include/linux/sunrpc/svc.h
> @@ -41,6 +41,7 @@ struct svc_pool {
>  	struct list_head	sp_sockets;	/* pending sockets */
>  	unsigned int		sp_nrthreads;	/* # of threads in pool */
>  	struct list_head	sp_all_threads;	/* all server threads */
> +	int			sp_nwaking;	/* number of threads woken but not yet active */
>  } ____cacheline_aligned_in_smp;
>  
>  /*
> @@ -264,6 +265,7 @@ struct svc_rqst {
>  						 * cache pages */
>  	wait_queue_head_t	rq_wait;	/* synchronization */
>  	struct task_struct	*rq_task;	/* service thread */
> +	int			rq_waking;	/* 1 if thread is being woken */
>  };
>  
>  /*
> Index: bfields/net/sunrpc/svc_xprt.c
> ===================================================================
> --- bfields.orig/net/sunrpc/svc_xprt.c
> +++ bfields/net/sunrpc/svc_xprt.c
> @@ -14,6 +14,8 @@
>  
>  #define RPCDBG_FACILITY	RPCDBG_SVCXPRT
>  
> +#define SVC_MAX_WAKING 5
> +
>  static struct svc_deferred_req *svc_deferred_dequeue(struct svc_xprt *xprt);
>  static int svc_deferred_recv(struct svc_rqst *rqstp);
>  static struct cache_deferred_req *svc_defer(struct cache_req *req);
> @@ -298,6 +300,7 @@ void svc_xprt_enqueue(struct svc_xprt *x
>  	struct svc_pool *pool;
>  	struct svc_rqst	*rqstp;
>  	int cpu;
> +	int thread_avail;
>  
>  	if (!(xprt->xpt_flags &
>  	      ((1<<XPT_CONN)|(1<<XPT_DATA)|(1<<XPT_CLOSE)|(1<<XPT_DEFERRED))))
> @@ -309,12 +312,6 @@ void svc_xprt_enqueue(struct svc_xprt *x
>  
>  	spin_lock_bh(&pool->sp_lock);
>  
> -	if (!list_empty(&pool->sp_threads) &&
> -	    !list_empty(&pool->sp_sockets))
> -		printk(KERN_ERR
> -		       "svc_xprt_enqueue: "
> -		       "threads and transports both waiting??\n");
> -
>  	if (test_bit(XPT_DEAD, &xprt->xpt_flags)) {
>  		/* Don't enqueue dead transports */
>  		dprintk("svc: transport %p is dead, not enqueued\n", xprt);
> @@ -353,7 +350,14 @@ void svc_xprt_enqueue(struct svc_xprt *x
>  	}
>  
>   process:
> -	if (!list_empty(&pool->sp_threads)) {
> +	/* Work out whether threads are available */
> +	thread_avail = !list_empty(&pool->sp_threads);	/* threads are asleep */
> +	if (pool->sp_nwaking >= SVC_MAX_WAKING) {
> +		/* too many threads are runnable and trying to wake up */
> +		thread_avail = 0;
> +	}
> +
> +	if (thread_avail) {
>  		rqstp = list_entry(pool->sp_threads.next,
>  				   struct svc_rqst,
>  				   rq_list);
> @@ -368,6 +372,8 @@ void svc_xprt_enqueue(struct svc_xprt *x
>  		svc_xprt_get(xprt);
>  		rqstp->rq_reserved = serv->sv_max_mesg;
>  		atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
> +		rqstp->rq_waking = 1;
> +		pool->sp_nwaking++;
>  		BUG_ON(xprt->xpt_pool != pool);
>  		wake_up(&rqstp->rq_wait);
>  	} else {
> @@ -633,6 +639,11 @@ int svc_recv(struct svc_rqst *rqstp, lon
>  		return -EINTR;
>  
>  	spin_lock_bh(&pool->sp_lock);
> +	if (rqstp->rq_waking) {
> +		rqstp->rq_waking = 0;
> +		pool->sp_nwaking--;
> +		BUG_ON(pool->sp_nwaking < 0);
> +	}
>  	xprt = svc_xprt_dequeue(pool);
>  	if (xprt) {
>  		rqstp->rq_xprt = xprt;
> 
> --
> -- 
> Greg Banks, P.Engineer, SGI Australian Software Group.
> the brightly coloured sporks of revolution.
> I don't speak for SGI.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 3/3] knfsd: add file to export stats about nfsd pools
  2009-01-13 10:26 ` [patch 3/3] knfsd: add file to export stats about nfsd pools Greg Banks
@ 2009-02-12 17:11   ` J. Bruce Fields
  2009-02-13  1:53     ` Kevin Constantine
  2009-02-19  6:42     ` Greg Banks
  0 siblings, 2 replies; 25+ messages in thread
From: J. Bruce Fields @ 2009-02-12 17:11 UTC (permalink / raw)
  To: Greg Banks; +Cc: Linux NFS ML, Harshula Jayasuriya

On Tue, Jan 13, 2009 at 09:26:36PM +1100, Greg Banks wrote:
> Add /proc/fs/nfsd/pool_stats to export to userspace various
> statistics about the operation of rpc server thread pools.

Could you explain why these specific statistics (total packets,
sockets_queued, threads_woken, overloads_avoided, threads_timedout) are
the important ones to capture?  Could you give examples of what sort of
problems could be solved using them?

As you said, an important question for the sysadmin is "should I
configure more nfsds?"  How do they answer that?
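One plausible way a mountstats-style script could answer that is to rate-convert two samples of the proposed counters. This sketch uses only the packets and sockets-queued fields from the patch; the helper and struct names are invented here for illustration:

```c
#include <stdint.h>

/* Two snapshots of one pool's counters, taken some seconds apart. */
struct pool_sample {
	uint64_t packets;		/* packets-arrived */
	uint64_t sockets_queued;	/* queued because no thread was free */
};

/* Fraction of arriving packets that found no idle thread.  A sustained
 * nonzero value suggests the pool could use more nfsds. */
double queued_fraction(const struct pool_sample *a,
		       const struct pool_sample *b)
{
	uint64_t pkts = b->packets - a->packets;

	if (pkts == 0)
		return 0.0;
	return (double)(b->sockets_queued - a->sockets_queued)
		/ (double)pkts;
}
```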

--b.

> This patch is based on a forward-ported version of
> knfsd-add-pool-thread-stats which has been shipping in the SGI
> "Enhanced NFS" product since 2006 and which was previously
> posted:

> 
> http://article.gmane.org/gmane.linux.nfs/10375
> 
> It has also been updated thus:
> 
>  * moved EXPORT_SYMBOL() to near the function it exports
>  * made the new struct struct seq_operations const
>  * used SEQ_START_TOKEN instead of ((void *)1)
>  * merged fix from SGI PV 990526 "sunrpc: use dprintk instead of
>    printk in svc_pool_stats_*()" by Harshula Jayasuriya.
>  * merged fix from SGI PV 964001 "Crash reading pool_stats before
>    nfsds are started".
> 
> Signed-off-by: Greg Banks <gnb@sgi.com>
> Signed-off-by: Harshula Jayasuriya <harshula@sgi.com>
> ---
> 
>  fs/nfsd/nfsctl.c           |   12 ++++
>  fs/nfsd/nfssvc.c           |    7 ++
>  include/linux/sunrpc/svc.h |   11 +++
>  net/sunrpc/svc_xprt.c      |  100 +++++++++++++++++++++++++++++++++-
>  4 files changed, 129 insertions(+), 1 deletion(-)
> 
> Index: bfields/fs/nfsd/nfsctl.c
> ===================================================================
> --- bfields.orig/fs/nfsd/nfsctl.c
> +++ bfields/fs/nfsd/nfsctl.c
> @@ -60,6 +60,7 @@ enum {
>  	NFSD_FO_UnlockFS,
>  	NFSD_Threads,
>  	NFSD_Pool_Threads,
> +	NFSD_Pool_Stats,
>  	NFSD_Versions,
>  	NFSD_Ports,
>  	NFSD_MaxBlkSize,
> @@ -172,6 +173,16 @@ static const struct file_operations expo
>  	.owner		= THIS_MODULE,
>  };
>  
> +extern int nfsd_pool_stats_open(struct inode *inode, struct file *file);
> +
> +static struct file_operations pool_stats_operations = {
> +	.open		= nfsd_pool_stats_open,
> +	.read		= seq_read,
> +	.llseek		= seq_lseek,
> +	.release	= seq_release,
> +	.owner		= THIS_MODULE,
> +};
> +
>  /*----------------------------------------------------------------------------*/
>  /*
>   * payload - write methods
> @@ -1246,6 +1257,7 @@ static int nfsd_fill_super(struct super_
>  		[NFSD_Fh] = {"filehandle", &transaction_ops, S_IWUSR|S_IRUSR},
>  		[NFSD_Threads] = {"threads", &transaction_ops, S_IWUSR|S_IRUSR},
>  		[NFSD_Pool_Threads] = {"pool_threads", &transaction_ops, S_IWUSR|S_IRUSR},
> +		[NFSD_Pool_Stats] = {"pool_stats", &pool_stats_operations, S_IRUGO},
>  		[NFSD_Versions] = {"versions", &transaction_ops, S_IWUSR|S_IRUSR},
>  		[NFSD_Ports] = {"portlist", &transaction_ops, S_IWUSR|S_IRUGO},
>  		[NFSD_MaxBlkSize] = {"max_block_size", &transaction_ops, S_IWUSR|S_IRUGO},
> Index: bfields/fs/nfsd/nfssvc.c
> ===================================================================
> --- bfields.orig/fs/nfsd/nfssvc.c
> +++ bfields/fs/nfsd/nfssvc.c
> @@ -546,3 +546,10 @@ nfsd_dispatch(struct svc_rqst *rqstp, __
>  	nfsd_cache_update(rqstp, proc->pc_cachetype, statp + 1);
>  	return 1;
>  }
> +
> +int nfsd_pool_stats_open(struct inode *inode, struct file *file)
> +{
> +	if (nfsd_serv == NULL)
> +		return -ENODEV;
> +	return svc_pool_stats_open(nfsd_serv, file);
> +}
> Index: bfields/include/linux/sunrpc/svc.h
> ===================================================================
> --- bfields.orig/include/linux/sunrpc/svc.h
> +++ bfields/include/linux/sunrpc/svc.h
> @@ -24,6 +24,15 @@
>   */
>  typedef int		(*svc_thread_fn)(void *);
>  
> +/* statistics for svc_pool structures */
> +struct svc_pool_stats {
> +	unsigned long	packets;
> +	unsigned long	sockets_queued;
> +	unsigned long	threads_woken;
> +	unsigned long	overloads_avoided;
> +	unsigned long	threads_timedout;
> +};
> +
>  /*
>   *
>   * RPC service thread pool.
> @@ -42,6 +51,7 @@ struct svc_pool {
>  	unsigned int		sp_nrthreads;	/* # of threads in pool */
>  	struct list_head	sp_all_threads;	/* all server threads */
>  	int			sp_nwaking;	/* number of threads woken but not yet active */
> +	struct svc_pool_stats	sp_stats;	/* statistics on pool operation */
>  } ____cacheline_aligned_in_smp;
>  
>  /*
> @@ -396,6 +406,7 @@ struct svc_serv *  svc_create_pooled(str
>  			sa_family_t, void (*shutdown)(struct svc_serv *),
>  			svc_thread_fn, struct module *);
>  int		   svc_set_num_threads(struct svc_serv *, struct svc_pool *, int);
> +int		   svc_pool_stats_open(struct svc_serv *serv, struct file *file);
>  void		   svc_destroy(struct svc_serv *);
>  int		   svc_process(struct svc_rqst *);
>  int		   svc_register(const struct svc_serv *, const unsigned short,
> Index: bfields/net/sunrpc/svc_xprt.c
> ===================================================================
> --- bfields.orig/net/sunrpc/svc_xprt.c
> +++ bfields/net/sunrpc/svc_xprt.c
> @@ -318,6 +318,8 @@ void svc_xprt_enqueue(struct svc_xprt *x
>  		goto out_unlock;
>  	}
>  
> +	pool->sp_stats.packets++;
> +
>  	/* Mark transport as busy. It will remain in this state until
>  	 * the provider calls svc_xprt_received. We update XPT_BUSY
>  	 * atomically because it also guards against trying to enqueue
> @@ -355,6 +357,7 @@ void svc_xprt_enqueue(struct svc_xprt *x
>  	if (pool->sp_nwaking >= SVC_MAX_WAKING) {
>  		/* too many threads are runnable and trying to wake up */
>  		thread_avail = 0;
> +		pool->sp_stats.overloads_avoided++;
>  	}
>  
>  	if (thread_avail) {
> @@ -374,11 +377,13 @@ void svc_xprt_enqueue(struct svc_xprt *x
>  		atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
>  		rqstp->rq_waking = 1;
>  		pool->sp_nwaking++;
> +		pool->sp_stats.threads_woken++;
>  		BUG_ON(xprt->xpt_pool != pool);
>  		wake_up(&rqstp->rq_wait);
>  	} else {
>  		dprintk("svc: transport %p put into queue\n", xprt);
>  		list_add_tail(&xprt->xpt_ready, &pool->sp_sockets);
> +		pool->sp_stats.sockets_queued++;
>  		BUG_ON(xprt->xpt_pool != pool);
>  	}
>  
> @@ -591,6 +596,7 @@ int svc_recv(struct svc_rqst *rqstp, lon
>  	int			pages;
>  	struct xdr_buf		*arg;
>  	DECLARE_WAITQUEUE(wait, current);
> +	long			time_left;
>  
>  	dprintk("svc: server %p waiting for data (to = %ld)\n",
>  		rqstp, timeout);
> @@ -676,12 +682,14 @@ int svc_recv(struct svc_rqst *rqstp, lon
>  		add_wait_queue(&rqstp->rq_wait, &wait);
>  		spin_unlock_bh(&pool->sp_lock);
>  
> -		schedule_timeout(timeout);
> + 		time_left = schedule_timeout(timeout);
>  
>  		try_to_freeze();
>  
>  		spin_lock_bh(&pool->sp_lock);
>  		remove_wait_queue(&rqstp->rq_wait, &wait);
> +		if (!time_left)
> +			pool->sp_stats.threads_timedout++;
>  
>  		xprt = rqstp->rq_xprt;
>  		if (!xprt) {
> @@ -1103,3 +1111,93 @@ int svc_xprt_names(struct svc_serv *serv
>  	return totlen;
>  }
>  EXPORT_SYMBOL_GPL(svc_xprt_names);
> +
> +
> +/*----------------------------------------------------------------------------*/
> +
> +static void *svc_pool_stats_start(struct seq_file *m, loff_t *pos)
> +{
> +	unsigned int pidx = (unsigned int)*pos;
> +	struct svc_serv *serv = m->private;
> +
> +	dprintk("svc_pool_stats_start, *pidx=%u\n", pidx);
> +
> +	lock_kernel();
> +	/* bump up the pseudo refcount while traversing */
> +	svc_get(serv);
> +	unlock_kernel();
> +
> +	if (!pidx)
> +		return SEQ_START_TOKEN;
> +	return (pidx > serv->sv_nrpools ? NULL : &serv->sv_pools[pidx-1]);
> +}
> +
> +static void *svc_pool_stats_next(struct seq_file *m, void *p, loff_t *pos)
> +{
> +	struct svc_pool *pool = p;
> +	struct svc_serv *serv = m->private;
> +
> +	dprintk("svc_pool_stats_next, *pos=%llu\n", *pos);
> +
> +	if (p == SEQ_START_TOKEN) {
> +		pool = &serv->sv_pools[0];
> +	} else {
> +		unsigned int pidx = (pool - &serv->sv_pools[0]);
> +		if (pidx < serv->sv_nrpools-1)
> +			pool = &serv->sv_pools[pidx+1];
> +		else
> +			pool = NULL;
> +	}
> +	++*pos;
> +	return pool;
> +}
> +
> +static void svc_pool_stats_stop(struct seq_file *m, void *p)
> +{
> +	struct svc_serv *serv = m->private;
> +
> +	lock_kernel();
> +	/* this function really, really should have been called svc_put() */
> +	svc_destroy(serv);
> +	unlock_kernel();
> +}
> +
> +static int svc_pool_stats_show(struct seq_file *m, void *p)
> +{
> +	struct svc_pool *pool = p;
> +
> +	if (p == SEQ_START_TOKEN) {
> +		seq_puts(m, "# pool packets-arrived sockets-enqueued threads-woken overloads-avoided threads-timedout\n");
> +		return 0;
> +	}
> +
> +	seq_printf(m, "%u %lu %lu %lu %lu %lu\n",
> +		pool->sp_id,
> +		pool->sp_stats.packets,
> +		pool->sp_stats.sockets_queued,
> +		pool->sp_stats.threads_woken,
> +		pool->sp_stats.overloads_avoided,
> +		pool->sp_stats.threads_timedout);
> +
> +	return 0;
> +}
> +
> +static const struct seq_operations svc_pool_stats_seq_ops = {
> +	.start	= svc_pool_stats_start,
> +	.next	= svc_pool_stats_next,
> +	.stop	= svc_pool_stats_stop,
> +	.show	= svc_pool_stats_show,
> +};
> +
> +int svc_pool_stats_open(struct svc_serv *serv, struct file *file)
> +{
> +	int err;
> +
> +	err = seq_open(file, &svc_pool_stats_seq_ops);
> +	if (!err)
> +		((struct seq_file *) file->private_data)->private = serv;
> +	return err;
> +}
> +EXPORT_SYMBOL(svc_pool_stats_open);
> +
> +/*----------------------------------------------------------------------------*/
> 
> --
> -- 
> Greg Banks, P.Engineer, SGI Australian Software Group.
> the brightly coloured sporks of revolution.
> I don't speak for SGI.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 3/3] knfsd: add file to export stats about nfsd pools
  2009-02-12 17:11   ` J. Bruce Fields
@ 2009-02-13  1:53     ` Kevin Constantine
  2009-02-19  7:04       ` Greg Banks
  2009-02-19  6:42     ` Greg Banks
  1 sibling, 1 reply; 25+ messages in thread
From: Kevin Constantine @ 2009-02-13  1:53 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Greg Banks, Linux NFS ML, Harshula Jayasuriya

On 02/12/09 09:11, J. Bruce Fields wrote:
> On Tue, Jan 13, 2009 at 09:26:36PM +1100, Greg Banks wrote:
>> Add /proc/fs/nfsd/pool_stats to export to userspace various
>> statistics about the operation of rpc server thread pools.
> 
> Could you explain why these specific statistics (total packets,
> sockets_queued, threads_woken, overloads_avoided, threads_timedout) are
> the important ones to capture?  Could you give examples of what sort of
> problems could be solved using them?
> 
> As you said, an important question for the sysadmin is "should I
> configure more nfsds?"  How do they answer that?
> 

I typically use the "th" line to determine whether to add more threads 
or not by looking at the distribution of values across the histogram. 
If things are weighted more towards the 90-100% group, I'll add more 
threads and watch the traffic patterns.

Usually, the question of how many to add is answered by trial and error.

echo 32 > /proc/fs/nfsd/threads
Did that improve my throughput? yes?
echo 128 > /proc/fs/nfsd/threads
Did that improve my throughput? no it actually decreased.
rinse... repeat...

> --b.
> 
>> This patch is based on a forward-ported version of
>> knfsd-add-pool-thread-stats which has been shipping in the SGI
>> "Enhanced NFS" product since 2006 and which was previously
>> posted:
> 
>> http://article.gmane.org/gmane.linux.nfs/10375
>>
>> It has also been updated thus:
>>
>>  * moved EXPORT_SYMBOL() to near the function it exports
>>  * made the new struct struct seq_operations const
>>  * used SEQ_START_TOKEN instead of ((void *)1)
>>  * merged fix from SGI PV 990526 "sunrpc: use dprintk instead of
>>    printk in svc_pool_stats_*()" by Harshula Jayasuriya.
>>  * merged fix from SGI PV 964001 "Crash reading pool_stats before
>>    nfsds are started".
>>
>> Signed-off-by: Greg Banks <gnb@sgi.com>
>> Signed-off-by: Harshula Jayasuriya <harshula@sgi.com>
>> ---
>>
>>  fs/nfsd/nfsctl.c           |   12 ++++
>>  fs/nfsd/nfssvc.c           |    7 ++
>>  include/linux/sunrpc/svc.h |   11 +++
>>  net/sunrpc/svc_xprt.c      |  100 +++++++++++++++++++++++++++++++++-
>>  4 files changed, 129 insertions(+), 1 deletion(-)
>>
>> Index: bfields/fs/nfsd/nfsctl.c
>> ===================================================================
>> --- bfields.orig/fs/nfsd/nfsctl.c
>> +++ bfields/fs/nfsd/nfsctl.c
>> @@ -60,6 +60,7 @@ enum {
>>  	NFSD_FO_UnlockFS,
>>  	NFSD_Threads,
>>  	NFSD_Pool_Threads,
>> +	NFSD_Pool_Stats,
>>  	NFSD_Versions,
>>  	NFSD_Ports,
>>  	NFSD_MaxBlkSize,
>> @@ -172,6 +173,16 @@ static const struct file_operations expo
>>  	.owner		= THIS_MODULE,
>>  };
>>  
>> +extern int nfsd_pool_stats_open(struct inode *inode, struct file *file);
>> +
>> +static struct file_operations pool_stats_operations = {
>> +	.open		= nfsd_pool_stats_open,
>> +	.read		= seq_read,
>> +	.llseek		= seq_lseek,
>> +	.release	= seq_release,
>> +	.owner		= THIS_MODULE,
>> +};
>> +
>>  /*----------------------------------------------------------------------------*/
>>  /*
>>   * payload - write methods
>> @@ -1246,6 +1257,7 @@ static int nfsd_fill_super(struct super_
>>  		[NFSD_Fh] = {"filehandle", &transaction_ops, S_IWUSR|S_IRUSR},
>>  		[NFSD_Threads] = {"threads", &transaction_ops, S_IWUSR|S_IRUSR},
>>  		[NFSD_Pool_Threads] = {"pool_threads", &transaction_ops, S_IWUSR|S_IRUSR},
>> +		[NFSD_Pool_Stats] = {"pool_stats", &pool_stats_operations, S_IRUGO},
>>  		[NFSD_Versions] = {"versions", &transaction_ops, S_IWUSR|S_IRUSR},
>>  		[NFSD_Ports] = {"portlist", &transaction_ops, S_IWUSR|S_IRUGO},
>>  		[NFSD_MaxBlkSize] = {"max_block_size", &transaction_ops, S_IWUSR|S_IRUGO},
>> Index: bfields/fs/nfsd/nfssvc.c
>> ===================================================================
>> --- bfields.orig/fs/nfsd/nfssvc.c
>> +++ bfields/fs/nfsd/nfssvc.c
>> @@ -546,3 +546,10 @@ nfsd_dispatch(struct svc_rqst *rqstp, __
>>  	nfsd_cache_update(rqstp, proc->pc_cachetype, statp + 1);
>>  	return 1;
>>  }
>> +
>> +int nfsd_pool_stats_open(struct inode *inode, struct file *file)
>> +{
>> +	if (nfsd_serv == NULL)
>> +		return -ENODEV;
>> +	return svc_pool_stats_open(nfsd_serv, file);
>> +}
>> Index: bfields/include/linux/sunrpc/svc.h
>> ===================================================================
>> --- bfields.orig/include/linux/sunrpc/svc.h
>> +++ bfields/include/linux/sunrpc/svc.h
>> @@ -24,6 +24,15 @@
>>   */
>>  typedef int		(*svc_thread_fn)(void *);
>>  
>> +/* statistics for svc_pool structures */
>> +struct svc_pool_stats {
>> +	unsigned long	packets;
>> +	unsigned long	sockets_queued;
>> +	unsigned long	threads_woken;
>> +	unsigned long	overloads_avoided;
>> +	unsigned long	threads_timedout;
>> +};
>> +
>>  /*
>>   *
>>   * RPC service thread pool.
>> @@ -42,6 +51,7 @@ struct svc_pool {
>>  	unsigned int		sp_nrthreads;	/* # of threads in pool */
>>  	struct list_head	sp_all_threads;	/* all server threads */
>>  	int			sp_nwaking;	/* number of threads woken but not yet active */
>> +	struct svc_pool_stats	sp_stats;	/* statistics on pool operation */
>>  } ____cacheline_aligned_in_smp;
>>  
>>  /*
>> @@ -396,6 +406,7 @@ struct svc_serv *  svc_create_pooled(str
>>  			sa_family_t, void (*shutdown)(struct svc_serv *),
>>  			svc_thread_fn, struct module *);
>>  int		   svc_set_num_threads(struct svc_serv *, struct svc_pool *, int);
>> +int		   svc_pool_stats_open(struct svc_serv *serv, struct file *file);
>>  void		   svc_destroy(struct svc_serv *);
>>  int		   svc_process(struct svc_rqst *);
>>  int		   svc_register(const struct svc_serv *, const unsigned short,
>> Index: bfields/net/sunrpc/svc_xprt.c
>> ===================================================================
>> --- bfields.orig/net/sunrpc/svc_xprt.c
>> +++ bfields/net/sunrpc/svc_xprt.c
>> @@ -318,6 +318,8 @@ void svc_xprt_enqueue(struct svc_xprt *x
>>  		goto out_unlock;
>>  	}
>>  
>> +	pool->sp_stats.packets++;
>> +
>>  	/* Mark transport as busy. It will remain in this state until
>>  	 * the provider calls svc_xprt_received. We update XPT_BUSY
>>  	 * atomically because it also guards against trying to enqueue
>> @@ -355,6 +357,7 @@ void svc_xprt_enqueue(struct svc_xprt *x
>>  	if (pool->sp_nwaking >= SVC_MAX_WAKING) {
>>  		/* too many threads are runnable and trying to wake up */
>>  		thread_avail = 0;
>> +		pool->sp_stats.overloads_avoided++;
>>  	}
>>  
>>  	if (thread_avail) {
>> @@ -374,11 +377,13 @@ void svc_xprt_enqueue(struct svc_xprt *x
>>  		atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved);
>>  		rqstp->rq_waking = 1;
>>  		pool->sp_nwaking++;
>> +		pool->sp_stats.threads_woken++;
>>  		BUG_ON(xprt->xpt_pool != pool);
>>  		wake_up(&rqstp->rq_wait);
>>  	} else {
>>  		dprintk("svc: transport %p put into queue\n", xprt);
>>  		list_add_tail(&xprt->xpt_ready, &pool->sp_sockets);
>> +		pool->sp_stats.sockets_queued++;
>>  		BUG_ON(xprt->xpt_pool != pool);
>>  	}
>>  
>> @@ -591,6 +596,7 @@ int svc_recv(struct svc_rqst *rqstp, lon
>>  	int			pages;
>>  	struct xdr_buf		*arg;
>>  	DECLARE_WAITQUEUE(wait, current);
>> +	long			time_left;
>>  
>>  	dprintk("svc: server %p waiting for data (to = %ld)\n",
>>  		rqstp, timeout);
>> @@ -676,12 +682,14 @@ int svc_recv(struct svc_rqst *rqstp, lon
>>  		add_wait_queue(&rqstp->rq_wait, &wait);
>>  		spin_unlock_bh(&pool->sp_lock);
>>  
>> -		schedule_timeout(timeout);
>> + 		time_left = schedule_timeout(timeout);
>>  
>>  		try_to_freeze();
>>  
>>  		spin_lock_bh(&pool->sp_lock);
>>  		remove_wait_queue(&rqstp->rq_wait, &wait);
>> +		if (!time_left)
>> +			pool->sp_stats.threads_timedout++;
>>  
>>  		xprt = rqstp->rq_xprt;
>>  		if (!xprt) {
>> @@ -1103,3 +1111,93 @@ int svc_xprt_names(struct svc_serv *serv
>>  	return totlen;
>>  }
>>  EXPORT_SYMBOL_GPL(svc_xprt_names);
>> +
>> +
>> +/*----------------------------------------------------------------------------*/
>> +
>> +static void *svc_pool_stats_start(struct seq_file *m, loff_t *pos)
>> +{
>> +	unsigned int pidx = (unsigned int)*pos;
>> +	struct svc_serv *serv = m->private;
>> +
>> +	dprintk("svc_pool_stats_start, *pidx=%u\n", pidx);
>> +
>> +	lock_kernel();
>> +	/* bump up the pseudo refcount while traversing */
>> +	svc_get(serv);
>> +	unlock_kernel();
>> +
>> +	if (!pidx)
>> +		return SEQ_START_TOKEN;
>> +	return (pidx > serv->sv_nrpools ? NULL : &serv->sv_pools[pidx-1]);
>> +}
>> +
>> +static void *svc_pool_stats_next(struct seq_file *m, void *p, loff_t *pos)
>> +{
>> +	struct svc_pool *pool = p;
>> +	struct svc_serv *serv = m->private;
>> +
>> +	dprintk("svc_pool_stats_next, *pos=%llu\n", *pos);
>> +
>> +	if (p == SEQ_START_TOKEN) {
>> +		pool = &serv->sv_pools[0];
>> +	} else {
>> +		unsigned int pidx = (pool - &serv->sv_pools[0]);
>> +		if (pidx < serv->sv_nrpools-1)
>> +			pool = &serv->sv_pools[pidx+1];
>> +		else
>> +			pool = NULL;
>> +	}
>> +	++*pos;
>> +	return pool;
>> +}
>> +
>> +static void svc_pool_stats_stop(struct seq_file *m, void *p)
>> +{
>> +	struct svc_serv *serv = m->private;
>> +
>> +	lock_kernel();
>> +	/* this function really, really should have been called svc_put() */
>> +	svc_destroy(serv);
>> +	unlock_kernel();
>> +}
>> +
>> +static int svc_pool_stats_show(struct seq_file *m, void *p)
>> +{
>> +	struct svc_pool *pool = p;
>> +
>> +	if (p == SEQ_START_TOKEN) {
>> +		seq_puts(m, "# pool packets-arrived sockets-enqueued threads-woken overloads-avoided threads-timedout\n");
>> +		return 0;
>> +	}
>> +
>> +	seq_printf(m, "%u %lu %lu %lu %lu %lu\n",
>> +		pool->sp_id,
>> +		pool->sp_stats.packets,
>> +		pool->sp_stats.sockets_queued,
>> +		pool->sp_stats.threads_woken,
>> +		pool->sp_stats.overloads_avoided,
>> +		pool->sp_stats.threads_timedout);
>> +
>> +	return 0;
>> +}
>> +
>> +static const struct seq_operations svc_pool_stats_seq_ops = {
>> +	.start	= svc_pool_stats_start,
>> +	.next	= svc_pool_stats_next,
>> +	.stop	= svc_pool_stats_stop,
>> +	.show	= svc_pool_stats_show,
>> +};
>> +
>> +int svc_pool_stats_open(struct svc_serv *serv, struct file *file)
>> +{
>> +	int err;
>> +
>> +	err = seq_open(file, &svc_pool_stats_seq_ops);
>> +	if (!err)
>> +		((struct seq_file *) file->private_data)->private = serv;
>> +	return err;
>> +}
>> +EXPORT_SYMBOL(svc_pool_stats_open);
>> +
>> +/*----------------------------------------------------------------------------*/
>>
>> --
>> -- 
>> Greg Banks, P.Engineer, SGI Australian Software Group.
>> the brightly coloured sporks of revolution.
>> I don't speak for SGI.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages
  2009-02-11 23:10   ` J. Bruce Fields
@ 2009-02-19  6:25     ` Greg Banks
  2009-03-15 21:21       ` J. Bruce Fields
  0 siblings, 1 reply; 25+ messages in thread
From: Greg Banks @ 2009-02-19  6:25 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Linux NFS ML

J. Bruce Fields wrote:
> On Tue, Jan 13, 2009 at 09:26:35PM +1100, Greg Banks wrote:
>   
>> [...] Under a high call-rate
>> low service-time workload, the result is that almost every nfsd is
>> runnable, but only a handful are actually able to run.  This situation
>> causes two significant problems:
>>
>> 1. The CPU scheduler takes over 10% of each CPU, which is robbing
>>    the nfsd threads of valuable CPU time.
>>
>> 2. At a high enough load, the nfsd threads starve userspace threads
>>    of CPU time, to the point where daemons like portmap and rpc.mountd
>>    do not schedule for tens of seconds at a time.  Clients attempting
>>    to mount an NFS filesystem timeout at the very first step (opening
>>    a TCP connection to portmap) because portmap cannot wake up from
>>    select() and call accept() in time.
>>
>> Disclaimer: these effects were observed on a SLES9 kernel, modern
>> kernels' schedulers may behave more gracefully.
>>     
>
> Yes, googling for "SLES9 kernel"...   Was that really 2.6.5 based?
>
> The scheduler's been through at least one complete rewrite since then,
> so the obvious question is whether it's wise to apply something that may
> turn out to have been very specific to an old version of the scheduler.
>
> It's a simple enough patch, but without any suggestion for how to retest
> on a more recent kernel, I'm uneasy.
>
>   

Ok, fair enough.  I retested using my local GIT tree, which is cloned
from yours and was last git-pull'd a couple of days ago.  The test load
was the same as in my 2005 tests (multiple userspace threads each
simulating an rsync directory traversal from a 2.4 client, i.e. almost
entirely ACCESS calls with some READDIRs and GETATTRs, running as fast
as the server will respond).  This was run on much newer hardware (and a
different architecture as well: a quad-core Xeon) so the results are not
directly comparable with my 2005 tests.  However the effect with and
without the patch can be clearly seen, with otherwise identical hardware
software and load (I added a sysctl to enable and disable the effect of
the patch at runtime).

A quick summary: the 2.6.29-rc4 CPU scheduler is not magically better
than the 2.6.5 one and NFS can still benefit from reducing load on it.

Here's a table of measured call rates and steady-state 1-minute load
averages, before and after the patch, versus number of client load
threads.   The server was configured with 128 nfsds in the thread pool
which was under load.  In all cases except the the single CPU in the
thread pool was 100% busy (I've elided the 8-thread results where that
wasn't the case).

#threads    before                  after
            call/sec    loadavg     call/sec    loadavg
--------    --------    -------     --------    -------
16          57353       10.98       74965        6.11
24          57787       19.56       79397       13.58
32          57921       26.00       80746       21.35
40          57936       35.32       81629       31.73
48          57930       43.84       81775       42.64
56          57467       51.05       81411       52.39
64          57595       57.93       81543       64.61


As you can see, the patch improves NFS throughput for this load by up
to 40%, which is a surprisingly large improvement.  I suspect it's a
larger improvement because my 2005 tests had multiple CPUs serving NFS
traffic, and the improvements due to this patch were drowned in various
SMP effects which are absent from this test.

Also surprising is that the patch improves the reported load average
number only at higher numbers of client threads; at low client thread
counts the load average is unchanged or even slightly higher.  The patch
didn't have that effect back in 2005, so I'm confused by that
behaviour.  Perhaps the difference is due to changes in the scheduler or
the accounting that measures load averages?

Profiling at 16 client threads, 32 server threads shows differences in
the CPU usage in the CPU scheduler itself, with some ACPI effects too. 
The platform I ran on in 2005 did not support ACPI, so that's new to
me.  Nevertheless it makes a difference.  Here are the top samples from
a couple of 30-second flat profiles.

Before:

samples  %        image name               app name                 symbol name
3013      4.9327  processor.ko             processor                acpi_idle_enter_simple  <---
2583      4.2287  sunrpc.ko                sunrpc                   svc_recv
1273      2.0841  e1000e.ko                e1000e                   e1000_irq_enable
1235      2.0219  sunrpc.ko                sunrpc                   svc_process
1070      1.7517  e1000e.ko                e1000e                   e1000_intr_msi
966       1.5815  e1000e.ko                e1000e                   e1000_xmit_frame
884       1.4472  sunrpc.ko                sunrpc                   svc_xprt_enqueue
861       1.4096  e1000e.ko                e1000e                   e1000_clean_rx_irq
774       1.2671  xfs.ko                   xfs                      xfs_iget
772       1.2639  vmlinux-2.6.29-rc4-gnb   vmlinux-2.6.29-rc4-gnb   schedule                <---
726       1.1886  vmlinux-2.6.29-rc4-gnb   vmlinux-2.6.29-rc4-gnb   sched_clock             <---
693       1.1345  vmlinux-2.6.29-rc4-gnb   vmlinux-2.6.29-rc4-gnb   read_hpet               <---
680       1.1133  sunrpc.ko                sunrpc                   cache_check
671       1.0985  vmlinux-2.6.29-rc4-gnb   vmlinux-2.6.29-rc4-gnb   tcp_sendpage
641       1.0494  sunrpc.ko                sunrpc                   sunrpc_cache_lookup

Total % cpu from ACPI & scheduler: 8.5%

After:

samples  %        image name               app name                 symbol name
5145      5.2163  sunrpc.ko                sunrpc                   svc_recv
2908      2.9483  processor.ko             processor                acpi_idle_enter_simple  <---
2731      2.7688  sunrpc.ko                sunrpc                   svc_process
2092      2.1210  e1000e.ko                e1000e                   e1000_clean_rx_irq
1988      2.0155  e1000e.ko                e1000e                   e1000_xmit_frame
1863      1.8888  e1000e.ko                e1000e                   e1000_irq_enable
1606      1.6282  xfs.ko                   xfs                      xfs_iget
1514      1.5350  sunrpc.ko                sunrpc                   cache_check
1389      1.4082  vmlinux-2.6.29-rc4-gnb   vmlinux-2.6.29-rc4-gnb   tcp_recvmsg
1383      1.4022  vmlinux-2.6.29-rc4-gnb   vmlinux-2.6.29-rc4-gnb   tcp_sendpage
1310      1.3281  sunrpc.ko                sunrpc                   svc_xprt_enqueue
1177      1.1933  sunrpc.ko                sunrpc                   sunrpc_cache_lookup
1142      1.1578  vmlinux-2.6.29-rc4-gnb   vmlinux-2.6.29-rc4-gnb   get_page_from_freelist
1135      1.1507  sunrpc.ko                sunrpc                   svc_tcp_recvfrom
1126      1.1416  vmlinux-2.6.29-rc4-gnb   vmlinux-2.6.29-rc4-gnb   tcp_transmit_skb
1040      1.0544  e1000e.ko                e1000e                   e1000_intr_msi
1033      1.0473  vmlinux-2.6.29-rc4-gnb   vmlinux-2.6.29-rc4-gnb   tcp_ack
1030      1.0443  vmlinux-2.6.29-rc4-gnb   vmlinux-2.6.29-rc4-gnb   kref_get
1000      1.0138  nfsd.ko                  nfsd                     fh_verify

Total % cpu from ACPI & scheduler: 2.9%


Does that make you less uneasy?


-- 
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 3/3] knfsd: add file to export stats about nfsd pools
  2009-02-12 17:11   ` J. Bruce Fields
  2009-02-13  1:53     ` Kevin Constantine
@ 2009-02-19  6:42     ` Greg Banks
  2009-03-15 21:25       ` J. Bruce Fields
  1 sibling, 1 reply; 25+ messages in thread
From: Greg Banks @ 2009-02-19  6:42 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Linux NFS ML, Harshula Jayasuriya

J. Bruce Fields wrote:
> On Tue, Jan 13, 2009 at 09:26:36PM +1100, Greg Banks wrote:
>   
>> Add /proc/fs/nfsd/pool_stats to export to userspace various
>> statistics about the operation of rpc server thread pools.
>>     
>
> Could you explain why these specific statistics (total packets,
> sockets_queued, threads_woken, overloads_avoided, threads_timedout) are
> the important ones to capture?  Could you give examples of what sort of
> problems could be solved using them?
>   
Actually I originally added these stats to help debug the
overload-avoiding patch.  Then I thought to use them to drive a
userspace control loop for controlling the number of nfsds, which I
never finished writing.
> As you said, an important question for the sysadmin is "should I
> configure more nfsds?"  How do they answer that?
>   
You can work that out, but it's not obvious, i.e. not human-friendly.  
Firstly, you need to rate convert all the stats.

The total_packets stat tells you how many NFS packets are arriving on
each thread pool.  This is your primary load metric, i.e. with more load
you want more nfsd threads.

The sockets_queued stat tells you that calls are arriving which are not
being immediately serviced by threads, i.e. you're either thread-limited
or CPU-limited rather than network-limited and you might get better
throughput if there were more nfsd threads.

Conversely the overloads_avoided stat tells you if there are more
threads than can usefully be made runnable on the available CPUs, so
that adding more nfsd threads is unlikely to be helpful.

The threads_timedout stat will give you a first-level approximation of
whether there are threads that are completely idle, i.e. don't see any
calls for the svc_recv() timeout (which I reduced to IIRC 10 sec as part
of the original version of this patch).  This is a clue that you can now
reduce the number of threads.

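The userspace control loop mentioned above was never finished; a minimal
sketch of what it might have looked like follows.  This is hypothetical
code, not anything that shipped: the field order is taken from the header
line emitted by svc_pool_stats_show() in the patch, while the function
names and the thresholds are invented purely for illustration.

```python
# Hypothetical sketch of a pool_stats consumer: rate-convert two
# samples of /proc/fs/nfsd/pool_stats and apply the heuristics
# described above.  Field order matches the patch's header line;
# thresholds are illustrative, not tuned values.

FIELDS = ("packets", "sockets_queued", "threads_woken",
          "overloads_avoided", "threads_timedout")

def parse_pool_stats(text):
    """Parse pool_stats content into {pool_id: {field: counter}}."""
    pools = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip the header line and blank lines
        vals = [int(v) for v in line.split()]
        pools[vals[0]] = dict(zip(FIELDS, vals[1:]))
    return pools

def rates(prev, cur, interval):
    """Rate-convert two counter samples taken `interval` seconds apart."""
    return {f: (cur[f] - prev[f]) / interval for f in FIELDS}

def suggest(r):
    """Apply the heuristics above to one pool's per-second rates."""
    if r["overloads_avoided"] > 0:
        # more runnable threads than the CPUs can use
        return "enough threads (wakeups being throttled)"
    if r["sockets_queued"] > 0.05 * max(r["packets"], 1):
        # calls arriving with no idle thread to take them
        return "consider more threads (calls queueing)"
    if r["threads_timedout"] > 0:
        # threads sitting idle for a whole svc_recv() timeout
        return "consider fewer threads (idle threads timing out)"
    return "no change"
```

A driver would sample the file periodically, feed consecutive samples to
rates(), and act on suggest(), e.g. by writing to /proc/fs/nfsd/threads.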
-- 
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 3/3] knfsd: add file to export stats about nfsd pools
  2009-02-13  1:53     ` Kevin Constantine
@ 2009-02-19  7:04       ` Greg Banks
  0 siblings, 0 replies; 25+ messages in thread
From: Greg Banks @ 2009-02-19  7:04 UTC (permalink / raw)
  To: Kevin Constantine; +Cc: J. Bruce Fields, Linux NFS ML, Harshula Jayasuriya

Kevin Constantine wrote:
> On 02/12/09 09:11, J. Bruce Fields wrote:
>>
>>
>> As you said, an important question for the sysadmin is "should I
>> configure more nfsds?"  How do they answer that?
>>
>
> I typically use the "th" line to determine whether to add more threads
> or not by looking at the distribution of values across the histogram.
> If things are weighted more towards the 90-100% group, I'll add more
> threads and watch the traffic patterns.
That seems to have been more or less the idea behind the histogram. 
However, to get any useful indication you first need to rate-convert the
histogram.  The values are counters of jiffies which wrap every million
seconds (11.6 days) and there's no way to zero them.

Also, at a call rate above 1/jiffy (a trivially low NFS load) most of
the updates are adding 0 to the histogram.  Every now and again a call
will cross a jiffy boundary and actually add 1 to the histogram.  If
your load is high but not steady, the number of threads busy when that
happens is unpredictable.  In other words there's a sampling effect which
could in the worst case make the values in the histogram be more or less
completely random.
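The sampling effect above is easy to demonstrate with a toy model (this
is not the kernel code; it just mimics measuring sub-jiffy durations by
differencing an integer jiffy clock, using tenths of a jiffy as the
simulation unit):

```python
# Toy model of the histogram's sampling problem: durations are taken
# as end-jiffy minus start-jiffy, so a call shorter than a jiffy
# records 0 unless it happens to straddle a tick.  Time is simulated
# in integer tenths of a jiffy to avoid floating-point artifacts.

def measured_jiffies(start_tenths, service_tenths):
    """Duration as seen through an integer-jiffy clock."""
    return (start_tenths + service_tenths) // 10 - start_tenths // 10

def fraction_nonzero(n_calls, service_tenths):
    """Fraction of back-to-back calls that record a nonzero duration."""
    t, nonzero = 0, 0
    for _ in range(n_calls):
        if measured_jiffies(t, service_tenths) > 0:
            nonzero += 1
        t += service_tenths
    return nonzero / n_calls
```

With 0.1-jiffy service times only about one call in ten crosses a tick,
so roughly 90% of the histogram updates add zero; which calls land on a
tick depends on arrival phase, hence the near-random bucket values.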

>
> Usually, the question of how many to add is answered by trial and error.
>
> echo 32 > /proc/fs/nfsd/threads
> Did that improve my throughput? yes?
> echo 128 > /proc/fs/nfsd/threads
> Did that improve my throughput? no it actually decreased.
> rinse... repeat...

This procedure is what I would recommend admins do today, assuming their
load is steady or reproducible.  It's not only simple, but it uses a
real-world performance measure, which is "does my application go
faster?".  There's no point adding nfsds if you're limited by the
application, or some filesystem effect that causes the same load to go
slower.  It is of course rather tedious :-)

-- 
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages
  2009-02-19  6:25     ` Greg Banks
@ 2009-03-15 21:21       ` J. Bruce Fields
  2009-03-16  3:10         ` Greg Banks
  0 siblings, 1 reply; 25+ messages in thread
From: J. Bruce Fields @ 2009-03-15 21:21 UTC (permalink / raw)
  To: Greg Banks; +Cc: Linux NFS ML

On Thu, Feb 19, 2009 at 05:25:47PM +1100, Greg Banks wrote:
> J. Bruce Fields wrote:
> > On Tue, Jan 13, 2009 at 09:26:35PM +1100, Greg Banks wrote:
> >   
> >> [...] Under a high call-rate
> >> low service-time workload, the result is that almost every nfsd is
> >> runnable, but only a handful are actually able to run.  This situation
> >> causes two significant problems:
> >>
> >> 1. The CPU scheduler takes over 10% of each CPU, which is robbing
> >>    the nfsd threads of valuable CPU time.
> >>
> >> 2. At a high enough load, the nfsd threads starve userspace threads
> >>    of CPU time, to the point where daemons like portmap and rpc.mountd
> >>    do not schedule for tens of seconds at a time.  Clients attempting
> >>    to mount an NFS filesystem timeout at the very first step (opening
> >>    a TCP connection to portmap) because portmap cannot wake up from
> >>    select() and call accept() in time.
> >>
> >> Disclaimer: these effects were observed on a SLES9 kernel, modern
> >> kernels' schedulers may behave more gracefully.
> >>     
> >
> > Yes, googling for "SLES9 kernel"...   Was that really 2.6.5 based?
> >
> > The scheduler's been through at least one complete rewrite since then,
> > so the obvious question is whether it's wise to apply something that may
> > turn out to have been very specific to an old version of the scheduler.
> >
> > It's a simple enough patch, but without any suggestion for how to retest
> > on a more recent kernel, I'm uneasy.
> >
> >   
> 
> Ok, fair enough.  I retested using my local GIT tree, which is cloned
> from yours and was last git-pull'd a couple of days ago.  The test load
> was the same as in my 2005 tests (multiple userspace threads each
> simulating an rsync directory traversal from a 2.4 client, i.e. almost
> entirely ACCESS calls with some READDIRs and GETATTRs, running as fast
> as the server will respond).  This was run on much newer hardware (and a
> different architecture as well: a quad-core Xeon) so the results are not
> directly comparable with my 2005 tests.  However the effect with and
> without the patch can be clearly seen, with otherwise identical hardware
> software and load (I added a sysctl to enable and disable the effect of
> the patch at runtime).
> 
> A quick summary: the 2.6.29-rc4 CPU scheduler is not magically better
> than the 2.6.5 one and NFS can still benefit from reducing load on it.
> 
> Here's a table of measured call rates and steady-state 1-minute load
> averages, before and after the patch, versus number of client load
> threads.   The server was configured with 128 nfsds in the thread pool
> which was under load.  In all cases the single CPU in the
> thread pool was 100% busy (I've elided the 8-thread results where that
> wasn't the case).
> 
> #threads    before                  after
>             call/sec    loadavg     call/sec    loadavg
> --------    --------    -------     --------    -------
> 16          57353       10.98       74965        6.11
> 24          57787       19.56       79397       13.58
> 32          57921       26.00       80746       21.35
> 40          57936       35.32       81629       31.73
> 48          57930       43.84       81775       42.64
> 56          57467       51.05       81411       52.39
> 64          57595       57.93       81543       64.61
> 
> 
> As you can see, the patch improves NFS throughput for this load by up
> to  40%, which is a surprisingly large improvement.  I suspect it's a
> larger improvement because my 2005 tests had multiple CPUs serving NFS
> traffic, and the improvements due to this patch were drowned in various
> SMP effects which are absent from this test.
> 
> Also surprising is that the patch improves the reported load average
> number only at higher numbers of client threads; at low client thread
> counts the load average is unchanged or even slightly higher.  The patch
> didn't have that effect back in 2005, so I'm confused by that
> behaviour.  Perhaps the difference is due to changes in the scheduler or
> the accounting that measures load averages?
> 
> Profiling at 16 client threads, 32 server threads shows differences in
> the CPU usage in the CPU scheduler itself, with some ACPI effects too. 
> The platform I ran on in 2005 did not support ACPI, so that's new to
> me.  Nevertheless it makes a difference.  Here are the top samples from
> a couple of 30-second flat profiles.
> 
> Before:
> 
> samples  %        image name               app name                 symbol name
> 3013      4.9327  processor.ko             processor                acpi_idle_enter_simple  <---
> 2583      4.2287  sunrpc.ko                sunrpc                   svc_recv
> 1273      2.0841  e1000e.ko                e1000e                   e1000_irq_enable
> 1235      2.0219  sunrpc.ko                sunrpc                   svc_process
> 1070      1.7517  e1000e.ko                e1000e                   e1000_intr_msi
> 966       1.5815  e1000e.ko                e1000e                   e1000_xmit_frame
> 884       1.4472  sunrpc.ko                sunrpc                   svc_xprt_enqueue
> 861       1.4096  e1000e.ko                e1000e                   e1000_clean_rx_irq
> 774       1.2671  xfs.ko                   xfs                      xfs_iget
> 772       1.2639  vmlinux-2.6.29-rc4-gnb   vmlinux-2.6.29-rc4-gnb   schedule                <---
> 726       1.1886  vmlinux-2.6.29-rc4-gnb   vmlinux-2.6.29-rc4-gnb   sched_clock             <---
> 693       1.1345  vmlinux-2.6.29-rc4-gnb   vmlinux-2.6.29-rc4-gnb   read_hpet               <---
> 680       1.1133  sunrpc.ko                sunrpc                   cache_check
> 671       1.0985  vmlinux-2.6.29-rc4-gnb   vmlinux-2.6.29-rc4-gnb   tcp_sendpage
> 641       1.0494  sunrpc.ko                sunrpc                   sunrpc_cache_lookup
> 
> Total % cpu from ACPI & scheduler: 8.5%
> 
> After:
> 
> samples  %        image name               app name                 symbol name
> 5145      5.2163  sunrpc.ko                sunrpc                   svc_recv
> 2908      2.9483  processor.ko             processor                acpi_idle_enter_simple  <---
> 2731      2.7688  sunrpc.ko                sunrpc                   svc_process
> 2092      2.1210  e1000e.ko                e1000e                   e1000_clean_rx_irq
> 1988      2.0155  e1000e.ko                e1000e                   e1000_xmit_frame
> 1863      1.8888  e1000e.ko                e1000e                   e1000_irq_enable
> 1606      1.6282  xfs.ko                   xfs                      xfs_iget
> 1514      1.5350  sunrpc.ko                sunrpc                   cache_check
> 1389      1.4082  vmlinux-2.6.29-rc4-gnb   vmlinux-2.6.29-rc4-gnb   tcp_recvmsg
> 1383      1.4022  vmlinux-2.6.29-rc4-gnb   vmlinux-2.6.29-rc4-gnb   tcp_sendpage
> 1310      1.3281  sunrpc.ko                sunrpc                   svc_xprt_enqueue
> 1177      1.1933  sunrpc.ko                sunrpc                   sunrpc_cache_lookup
> 1142      1.1578  vmlinux-2.6.29-rc4-gnb   vmlinux-2.6.29-rc4-gnb   get_page_from_freelist
> 1135      1.1507  sunrpc.ko                sunrpc                   svc_tcp_recvfrom
> 1126      1.1416  vmlinux-2.6.29-rc4-gnb   vmlinux-2.6.29-rc4-gnb   tcp_transmit_skb
> 1040      1.0544  e1000e.ko                e1000e                   e1000_intr_msi
> 1033      1.0473  vmlinux-2.6.29-rc4-gnb   vmlinux-2.6.29-rc4-gnb   tcp_ack
> 1030      1.0443  vmlinux-2.6.29-rc4-gnb   vmlinux-2.6.29-rc4-gnb   kref_get
> 1000      1.0138  nfsd.ko                  nfsd                     fh_verify
> 
> Total % cpu from ACPI & scheduler: 2.9%
> 
> 
> Does that make you less uneasy?
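The "Total % cpu from ACPI & scheduler" figures quoted above are just the sum of the sample percentages for the scheduler- and ACPI-related symbols (the rows flagged "<---"). A minimal sketch of that bookkeeping, assuming the opreport-style flat-profile layout shown in the tables (samples, %, image, app, symbol) and using a hypothetical `SCHED_SYMBOLS` set matching the flagged rows:

```python
# Sum the "%" column of scheduler/ACPI-related rows in an opreport-style
# flat profile.  SCHED_SYMBOLS is an assumption matching the rows the
# thread flags with "<---"; adjust for your own profile.
SCHED_SYMBOLS = {"acpi_idle_enter_simple", "schedule", "sched_clock", "read_hpet"}

def sched_percent(profile_text):
    """Return the summed sample percentage of SCHED_SYMBOLS rows."""
    total = 0.0
    for line in profile_text.splitlines():
        fields = line.split()
        # flat-profile row: samples, %, image name, app name, symbol name
        if len(fields) >= 5 and fields[4] in SCHED_SYMBOLS:
            total += float(fields[1])
    return total

# The four flagged rows from the "Before" profile above.
BEFORE = """\
3013 4.9327 processor.ko processor acpi_idle_enter_simple
772 1.2639 vmlinux vmlinux schedule
726 1.1886 vmlinux vmlinux sched_clock
693 1.1345 vmlinux vmlinux read_hpet
"""
print("%.1f%%" % sched_percent(BEFORE))  # prints "8.5%"
```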

Yes, thanks!

Queued up for 2.6.30, barring objections.  But perhaps we should pass on
the patch and your results to people who know the scheduler better and
see if they can explain e.g. the loadavg numbers.

--b.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [patch 3/3] knfsd: add file to export stats about nfsd pools
  2009-02-19  6:42     ` Greg Banks
@ 2009-03-15 21:25       ` J. Bruce Fields
  2009-03-16  3:21         ` Greg Banks
  0 siblings, 1 reply; 25+ messages in thread
From: J. Bruce Fields @ 2009-03-15 21:25 UTC (permalink / raw)
  To: Greg Banks; +Cc: Linux NFS ML, Harshula Jayasuriya

On Thu, Feb 19, 2009 at 05:42:49PM +1100, Greg Banks wrote:
> J. Bruce Fields wrote:
> > On Tue, Jan 13, 2009 at 09:26:36PM +1100, Greg Banks wrote:
> >   
> >> Add /proc/fs/nfsd/pool_stats to export to userspace various
> >> statistics about the operation of rpc server thread pools.
> >>     
> >
> > Could you explain why these specific statistics (total packets,
> > Could you explain why these specific statistics (total packets,
> > sockets_queued, threads_woken, overloads_avoided, threads_timedout) are
> > the important ones to capture?  Could you give examples of what sort of
> > problems could be solved using them?
> >   
> Actually I originally added these stats to help debug the
> overload-avoiding patch.  Then I thought to use them to drive a
> userspace control loop for controlling the number of nfsds, which I
> never finished writing.
> > As you said, an important question for the sysadmin is "should I
> > configure more nfsds?"  How do they answer that?
> >   
> You can work that out, but it's not obvious, i.e. not human-friendly.  
> Firstly, you need to rate convert all the stats.
> 
> The total_packets stat tells you how many NFS packets are arriving on
> each thread pool.  This is your primary load metric, i.e. with more load
> you want more nfsd threads.
> 
> The sockets_queued stat tells you that calls are arriving which are not
> being immediately serviced by threads, i.e. you're either thread-limited
> or CPU-limited rather than network-limited and you might get better
> throughput if there were more nfsd threads.
> 
> Conversely the overloads_avoided stat tells you if there are more
> threads than can usefully be made runnable on the available CPUs, so
> that adding more nfsd threads is unlikely to be helpful.
> 
> The threads_timedout stat will give you a first-level approximation of
> whether there are threads that are completely idle, i.e. don't see any
> calls for the svc_recv() timeout (which I reduced to IIRC 10 sec as part
> of the original version of this patch).  This is a clue that you can now
> reduce the number of threads.
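The rate conversion and interpretation described above can be sketched as follows. This is a hedged illustration, not the unfinished userspace control loop: the column layout (pool, packets-arrived, sockets-enqueued, threads-woken, overloads-avoided, threads-timedout) is an assumption reconstructed from this thread, so check the header line of /proc/fs/nfsd/pool_stats on your kernel, and the synthetic samples and thresholds are hypothetical.

```python
# Hedged sketch: rate-convert nfsd pool statistics and apply the
# interpretation from the message above.  The column layout is an
# assumption based on this thread; verify against your kernel's
# /proc/fs/nfsd/pool_stats header line.

def parse_pool_stats(text):
    """Return {pool_id: [counters...]} from pool_stats-style text."""
    stats = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip the header/comment line
        fields = line.split()
        stats[int(fields[0])] = [int(f) for f in fields[1:]]
    return stats

def rates(before, after, interval):
    """Per-second rate of each counter over the sampling interval."""
    return {pool: [(b - a) / interval for a, b in zip(before[pool], after[pool])]
            for pool in after}

# Two synthetic samples taken 10 seconds apart (hypothetical numbers).
SAMPLE_T0 = ("# pool packets-arrived sockets-enqueued threads-woken"
             " overloads-avoided threads-timedout\n"
             "0 1000 50 900 20 3\n")
SAMPLE_T1 = ("# pool packets-arrived sockets-enqueued threads-woken"
             " overloads-avoided threads-timedout\n"
             "0 6000 450 5300 20 3\n")

r = rates(parse_pool_stats(SAMPLE_T0), parse_pool_stats(SAMPLE_T1), 10.0)
arrived, queued, woken, avoided, timedout = r[0]
print("packets/s=%.0f sockets_queued/s=%.0f" % (arrived, queued))
if queued > 0 and avoided == 0:
    # Calls queue but threads are never held back: likely thread-limited.
    print("calls are waiting for threads: consider more nfsds")
if timedout > 0:
    # Threads idle past the svc_recv() timeout: likely over-provisioned.
    print("idle threads timing out: consider fewer nfsds")
```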

OK, thanks, that makes sense, but: it would be really fantastic if we
could update and/or replace some of the howto's and faq's that are out
there.  The above would be useful, as would be some of the material from
your nfs-tuning presentation.

Queued up this and the old-stats removal for 2.6.30.

--b.


* Re: [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages
  2009-03-15 21:21       ` J. Bruce Fields
@ 2009-03-16  3:10         ` Greg Banks
  0 siblings, 0 replies; 25+ messages in thread
From: Greg Banks @ 2009-03-16  3:10 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Linux NFS ML

J. Bruce Fields wrote:
> On Thu, Feb 19, 2009 at 05:25:47PM +1100, Greg Banks wrote:
>   
>> J. Bruce Fields wrote:
>>     
>>> On Tue, Jan 13, 2009 at 09:26:35PM +1100, Greg Banks wrote:  
>>>       
>>> [...]
>>> It's a simple enough patch, but without any suggestion for how to retest
>>> on a more recent kernel, I'm uneasy.
>>>       
>> [...]
>>
>> Does that make you less uneasy?
>>     
>
> Yes, thanks!
>
> Queued up for 2.6.30, barring objections. 
Thanks.
>  But perhaps we should pass on
> the patch and your results to people who know the scheduler better and
> see if they can explain e.g. the loadavg numbers.
>   
If you like.  Personally I'm happy with assuming that it's because nfsd
is putting an unnaturally harsh load on the scheduler.

-- 
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.



* Re: [patch 3/3] knfsd: add file to export stats about nfsd pools
  2009-03-15 21:25       ` J. Bruce Fields
@ 2009-03-16  3:21         ` Greg Banks
  2009-03-16 13:37           ` J. Bruce Fields
  0 siblings, 1 reply; 25+ messages in thread
From: Greg Banks @ 2009-03-16  3:21 UTC (permalink / raw)
  To: J. Bruce Fields; +Cc: Linux NFS ML

J. Bruce Fields wrote:
> On Thu, Feb 19, 2009 at 05:42:49PM +1100, Greg Banks wrote:
>   
>> J. Bruce Fields wrote:
>>     
>>> On Tue, Jan 13, 2009 at 09:26:36PM +1100, Greg Banks wrote:
>>>   
>>>       
>>> As you said, an important question for the sysadmin is "should I
>>> configure more nfsds?"  How do they answer that?
>>>   
>>>       
>> You can work that out, [...]
>>     
>
> OK, thanks, that makes sense, but: it would be really fantastic if we
> could update and/or replace some of the howto's and faq's that are out
> there.  

Fair enough.

After I composed the above reply, I wondered whether the right thing to
do would have been to add that text to a new file somewhere in
Documentation/.  Would that suit as an interim measure while I look at
howtos and faqs?  I mention this because I'll be adding more new stats
in the next tranche of patches.
>
> Queued up this and the old-stats removal for 2.6.30.
>   

Thanks.

-- 
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.



* Re: [patch 3/3] knfsd: add file to export stats about nfsd pools
  2009-03-16  3:21         ` Greg Banks
@ 2009-03-16 13:37           ` J. Bruce Fields
  0 siblings, 0 replies; 25+ messages in thread
From: J. Bruce Fields @ 2009-03-16 13:37 UTC (permalink / raw)
  To: Greg Banks; +Cc: Linux NFS ML

On Mon, Mar 16, 2009 at 02:21:02PM +1100, Greg Banks wrote:
> J. Bruce Fields wrote:
> > On Thu, Feb 19, 2009 at 05:42:49PM +1100, Greg Banks wrote:
> >   
> >> J. Bruce Fields wrote:
> >>     
> >>> On Tue, Jan 13, 2009 at 09:26:36PM +1100, Greg Banks wrote:
> >>>   
> >>>       
> >>> As you said, an important question for the sysadmin is "should I
> >>> configure more nfsds?"  How do they answer that?
> >>>   
> >>>       
> >> You can work that out, [...]
> >>     
> >
> > OK, thanks, that makes sense, but: it would be really fantastic if we
> > could update and/or replace some of the howto's and faq's that are out
> > there.  
> 
> Fair enough.
> 
> After I composed the above reply, I wondered whether the right thing to
> do would have been to add that text to a new file somewhere in
> Documentation/.  Would that suit as an interim measure while I look at
> howtos and faqs?

Yep, I think that's a great idea.

> I mention this because I'll be adding more new stats
> in the next tranche of patches.

OK!

--b.


end of thread, other threads:[~2009-03-16 13:37 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-01-13 10:26 [patch 0/3] First tranche of SGI Enhanced NFS patches Greg Banks
2009-01-13 10:26 ` [patch 1/3] knfsd: remove the nfsd thread busy histogram Greg Banks
2009-01-13 16:41   ` Chuck Lever
2009-01-13 22:50     ` Greg Banks
     [not found]       ` <496D1ACC.7070106-cP1dWloDopni96+mSzHFpQC/G2K4zDHf@public.gmane.org>
2009-02-11 21:59         ` J. Bruce Fields
2009-01-13 10:26 ` [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages Greg Banks
2009-01-13 14:33   ` Peter Staubach
2009-01-13 22:15     ` Greg Banks
     [not found]       ` <496D1294.1060407-cP1dWloDopni96+mSzHFpQC/G2K4zDHf@public.gmane.org>
2009-01-13 22:35         ` Peter Staubach
2009-01-13 23:04           ` Greg Banks
2009-02-11 23:10   ` J. Bruce Fields
2009-02-19  6:25     ` Greg Banks
2009-03-15 21:21       ` J. Bruce Fields
2009-03-16  3:10         ` Greg Banks
2009-01-13 10:26 ` [patch 3/3] knfsd: add file to export stats about nfsd pools Greg Banks
2009-02-12 17:11   ` J. Bruce Fields
2009-02-13  1:53     ` Kevin Constantine
2009-02-19  7:04       ` Greg Banks
2009-02-19  6:42     ` Greg Banks
2009-03-15 21:25       ` J. Bruce Fields
2009-03-16  3:21         ` Greg Banks
2009-03-16 13:37           ` J. Bruce Fields
2009-02-09  5:24 ` [patch 0/3] First tranche of SGI Enhanced NFS patches Greg Banks
2009-02-09 20:47   ` J. Bruce Fields
2009-02-09 23:26     ` Greg Banks
