* [patch 0/3] First tranche of SGI Enhanced NFS patches
@ 2009-01-13 10:26 Greg Banks
2009-01-13 10:26 ` [patch 1/3] knfsd: remove the nfsd thread busy histogram Greg Banks
` (3 more replies)
0 siblings, 4 replies; 25+ messages in thread
From: Greg Banks @ 2009-01-13 10:26 UTC (permalink / raw)
To: J. Bruce Fields; +Cc: Linux NFS ML
G'day,
This is the first tranche of patches from SGI's Enhanced NFS product.
These are forward-ported (and sometimes updated) versions of patches
which have been shipping in SGI's NAS server products since around
2006. Testing after porting consisted of running cthon04. The
patches in this particular group were posted before, in August 2006.
--
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.
^ permalink raw reply	[flat|nested] 25+ messages in thread

* [patch 1/3] knfsd: remove the nfsd thread busy histogram
  2009-01-13 10:26 [patch 0/3] First tranche of SGI Enhanced NFS patches Greg Banks
@ 2009-01-13 10:26 ` Greg Banks
  2009-01-13 16:41   ` Chuck Lever
  2009-01-13 10:26 ` [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages Greg Banks
  ` (2 subsequent siblings)
  3 siblings, 1 reply; 25+ messages in thread
From: Greg Banks @ 2009-01-13 10:26 UTC (permalink / raw)
To: J. Bruce Fields; +Cc: Linux NFS ML

Stop gathering the data that feeds the 'th' line in /proc/net/rpc/nfsd
because the questionable data provided is not worth the scalability
impact of calculating it.  Instead, always report zeroes.  The current
approach suffers from three major issues:

1. update_thread_usage() increments buckets by call service
   time or call arrival time...in jiffies.  On lightly loaded
   machines, call service times are usually < 1 jiffy; on
   heavily loaded machines call arrival times will be << 1 jiffy.
   So a large portion of the updates to the buckets are rounded
   down to zero, and the histogram is undercounting.

2. As seen previously on the nfs mailing list, the format in which
   the histogram is presented is cryptic, difficult to explain,
   and difficult to use.

3. Updating the histogram requires taking a global spinlock and
   dirtying the global variables nfsd_last_call, nfsd_busy, and
   nfsdstats *twice* on every RPC call, which is a significant
   scaling limitation.

Testing on a 4 CPU 4 NIC Altix using 4 IRIX clients each doing
1K streaming reads at full line rate shows the stats update code
(inlined into nfsd()) takes about 1.7% of each CPU.  This patch drops
the contribution from nfsd() into the profile noise.

This patch is a forward-ported version of knfsd-remove-nfsd-threadstats
which has been shipping in the SGI "Enhanced NFS" product since 2006.
In that time, exactly one customer has noticed that the threadstats
were missing.
It has been previously posted: http://article.gmane.org/gmane.linux.nfs/10376 and more recently requested to be posted again. Signed-off-by: Greg Banks <gnb@sgi.com> --- fs/nfsd/nfssvc.c | 28 ---------------------------- 1 file changed, 28 deletions(-) Index: bfields/fs/nfsd/nfssvc.c =================================================================== --- bfields.orig/fs/nfsd/nfssvc.c +++ bfields/fs/nfsd/nfssvc.c @@ -40,9 +40,6 @@ extern struct svc_program nfsd_program; static int nfsd(void *vrqstp); struct timeval nfssvc_boot; -static atomic_t nfsd_busy; -static unsigned long nfsd_last_call; -static DEFINE_SPINLOCK(nfsd_call_lock); /* * nfsd_mutex protects nfsd_serv -- both the pointer itself and the members @@ -227,7 +224,6 @@ int nfsd_create_serv(void) nfsd_max_blksize /= 2; } - atomic_set(&nfsd_busy, 0); nfsd_serv = svc_create_pooled(&nfsd_program, nfsd_max_blksize, AF_INET, nfsd_last_thread, nfsd, THIS_MODULE); @@ -376,26 +372,6 @@ nfsd_svc(unsigned short port, int nrserv return error; } -static inline void -update_thread_usage(int busy_threads) -{ - unsigned long prev_call; - unsigned long diff; - int decile; - - spin_lock(&nfsd_call_lock); - prev_call = nfsd_last_call; - nfsd_last_call = jiffies; - decile = busy_threads*10/nfsdstats.th_cnt; - if (decile>0 && decile <= 10) { - diff = nfsd_last_call - prev_call; - if ( (nfsdstats.th_usage[decile-1] += diff) >= NFSD_USAGE_WRAP) - nfsdstats.th_usage[decile-1] -= NFSD_USAGE_WRAP; - if (decile == 10) - nfsdstats.th_fullcnt++; - } - spin_unlock(&nfsd_call_lock); -} /* * This is the NFS server kernel thread @@ -464,8 +440,6 @@ nfsd(void *vrqstp) continue; } - update_thread_usage(atomic_read(&nfsd_busy)); - atomic_inc(&nfsd_busy); /* Lock the export hash tables for reading. 
*/ exp_readlock(); @@ -474,8 +448,6 @@ nfsd(void *vrqstp) /* Unlock export hash tables */ exp_readunlock(); - update_thread_usage(atomic_read(&nfsd_busy)); - atomic_dec(&nfsd_busy); } /* Clear signals before calling svc_exit_thread() */ -- -- Greg Banks, P.Engineer, SGI Australian Software Group. the brightly coloured sporks of revolution. I don't speak for SGI. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [patch 1/3] knfsd: remove the nfsd thread busy histogram 2009-01-13 10:26 ` [patch 1/3] knfsd: remove the nfsd thread busy histogram Greg Banks @ 2009-01-13 16:41 ` Chuck Lever 2009-01-13 22:50 ` Greg Banks 0 siblings, 1 reply; 25+ messages in thread From: Chuck Lever @ 2009-01-13 16:41 UTC (permalink / raw) To: Greg Banks; +Cc: J. Bruce Fields, Linux NFS ML On Jan 13, 2009, at Jan 13, 2009, 5:26 AM, Greg Banks wrote: > Stop gathering the data that feeds the 'th' line in /proc/net/rpc/nfsd > because the questionable data provided is not worth the scalability > impact of calculating it. Instead, always report zeroes. The current > approach suffers from three major issues: > > 1. update_thread_usage() increments buckets by call service > time or call arrival time...in jiffies. On lightly loaded > machines, call service times are usually < 1 jiffy; on > heavily loaded machines call arrival times will be << 1 jiffy. > So a large portion of the updates to the buckets are rounded > down to zero, and the histogram is undercounting. Use ktime_get_real() instead. This is what the network layer uses. You could even steal the start timestamp from the first skbuff in an incoming RPC request. This problem is made worse on "server" configurations and in virtual guests which may still use HZ=100, though with tickless HZ this is a less frequently seen configuration. > 2. As seen previously on the nfs mailing list, the format in which > the histogram is presented is cryptic, difficult to explain, > and difficult to use. A user space script similar to mountstats that interprets these metrics might help here. > 3. Updating the histogram requires taking a global spinlock and > dirtying the global variables nfsd_last_call, nfsd_busy, and > nfsdstats *twice* on every RPC call, which is a significant > scaling limitation. 
You might fix this by making the global variables into per-CPU variables, then totaling the per-CPU variables only at presentation time (ie when someone cats /proc/net/rpc/nfsd). That would make the collection logic lockless. > Testing on a 4 CPU 4 NIC Altix using 4 IRIX clients each doing > 1K streaming reads at full line rate, shows the stats update code > (inlined into nfsd()) takes about 1.7% of each CPU. This patch drops > the contribution from nfsd() into the profile noise. > > This patch is a forward-ported version of knfsd-remove-nfsd- > threadstats > which has been shipping in the SGI "Enhanced NFS" product since 2006. > In that time, exactly one customer has noticed that the threadstats > were missing. Yeah. The real issue here is deciding whether these stats are useful or not; if not, can they be made useable? > It has been previously posted: > > http://article.gmane.org/gmane.linux.nfs/10376 > > and more recently requested to be posted again. > > Signed-off-by: Greg Banks <gnb@sgi.com> > --- > > fs/nfsd/nfssvc.c | 28 ---------------------------- > 1 file changed, 28 deletions(-) > > Index: bfields/fs/nfsd/nfssvc.c > =================================================================== > --- bfields.orig/fs/nfsd/nfssvc.c > +++ bfields/fs/nfsd/nfssvc.c > @@ -40,9 +40,6 @@ > extern struct svc_program nfsd_program; > static int nfsd(void *vrqstp); > struct timeval nfssvc_boot; > -static atomic_t nfsd_busy; > -static unsigned long nfsd_last_call; > -static DEFINE_SPINLOCK(nfsd_call_lock); > > /* > * nfsd_mutex protects nfsd_serv -- both the pointer itself and the > members > @@ -227,7 +224,6 @@ int nfsd_create_serv(void) > nfsd_max_blksize /= 2; > } > > - atomic_set(&nfsd_busy, 0); > nfsd_serv = svc_create_pooled(&nfsd_program, nfsd_max_blksize, > AF_INET, > nfsd_last_thread, nfsd, THIS_MODULE); > @@ -376,26 +372,6 @@ nfsd_svc(unsigned short port, int nrserv > return error; > } > > -static inline void > -update_thread_usage(int busy_threads) > -{ > - unsigned 
long prev_call; > - unsigned long diff; > - int decile; > - > - spin_lock(&nfsd_call_lock); > - prev_call = nfsd_last_call; > - nfsd_last_call = jiffies; > - decile = busy_threads*10/nfsdstats.th_cnt; > - if (decile>0 && decile <= 10) { > - diff = nfsd_last_call - prev_call; > - if ( (nfsdstats.th_usage[decile-1] += diff) >= NFSD_USAGE_WRAP) > - nfsdstats.th_usage[decile-1] -= NFSD_USAGE_WRAP; > - if (decile == 10) > - nfsdstats.th_fullcnt++; > - } > - spin_unlock(&nfsd_call_lock); > -} > > /* > * This is the NFS server kernel thread > @@ -464,8 +440,6 @@ nfsd(void *vrqstp) > continue; > } > > - update_thread_usage(atomic_read(&nfsd_busy)); > - atomic_inc(&nfsd_busy); > > /* Lock the export hash tables for reading. */ > exp_readlock(); > @@ -474,8 +448,6 @@ nfsd(void *vrqstp) > > /* Unlock export hash tables */ > exp_readunlock(); > - update_thread_usage(atomic_read(&nfsd_busy)); > - atomic_dec(&nfsd_busy); > } > > /* Clear signals before calling svc_exit_thread() */ > -- Chuck Lever chuck[dot]lever[at]oracle[dot]com ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [patch 1/3] knfsd: remove the nfsd thread busy histogram 2009-01-13 16:41 ` Chuck Lever @ 2009-01-13 22:50 ` Greg Banks [not found] ` <496D1ACC.7070106-cP1dWloDopni96+mSzHFpQC/G2K4zDHf@public.gmane.org> 0 siblings, 1 reply; 25+ messages in thread From: Greg Banks @ 2009-01-13 22:50 UTC (permalink / raw) To: Chuck Lever; +Cc: J. Bruce Fields, Linux NFS ML Chuck Lever wrote: > On Jan 13, 2009, at Jan 13, 2009, 5:26 AM, Greg Banks wrote: >> Stop gathering the data that feeds the 'th' line in /proc/net/rpc/nfsd >> because the questionable data provided is not worth the scalability >> impact of calculating it. Instead, always report zeroes. The current >> approach suffers from three major issues: >> >> 1. update_thread_usage() increments buckets by call service >> time or call arrival time...in jiffies. On lightly loaded >> machines, call service times are usually < 1 jiffy; on >> heavily loaded machines call arrival times will be << 1 jiffy. >> So a large portion of the updates to the buckets are rounded >> down to zero, and the histogram is undercounting. > > Use ktime_get_real() instead. This is what the network layer uses. IIRC that wasn't available when I wrote the patch (2.6.5 kernel in SLES9 in late 2005). I haven't looked at it again since. Later, I looked at gathering better statistics on thread usage, and I investigated real time clock (more precisely, monotonic clock) implementations in Linux and came to the sad conclusion that there was no API I could call that would be both accurate and efficient on two or more platforms, so I gave up. The HPET hardware timer looked promising for a while, but it turned out that a 32b kernel used a global spinlock to access the 64b HPET registers, which created the same scalability problem I was trying to fix. Things may have improved since then. If we had such a clock though, the solution is very simple. 
Each nfsd maintains two new 64b counters of nanoseconds spent in each of two states "busy" and "idle", where "idle" is asleep waiting for a call and "busy" is everything else. These are maintained in svc_recv(). Interfaces are provided for userspace to read an aggregation of these counters, per-pool and globally. Userspace rate-converts the counters; the rate of increase of the two counters tells you both how many threads there are and how much actual demand on thread time there is. This is how I did it in Irix (SGI Origin machines had a global distributed monotonic clock in hardware). > You could even steal the start timestamp from the first skbuff in an > incoming RPC request. This would help if we had skbuffs: NFS/RDMA doesn't. > > This problem is made worse on "server" configurations and in virtual > guests which may still use HZ=100, though with tickless HZ this is a > less frequently seen configuration. Indeed. > >> 2. As seen previously on the nfs mailing list, the format in which >> the histogram is presented is cryptic, difficult to explain, >> and difficult to use. > > A user space script similar to mountstats that interprets these > metrics might help here. The formatting in the pseudofile isn't the entire problem. The problem is translating the "thread usage histogram" information there into an answer to the actual question the sysadmin wants, which is "should I configure more nfsds?" > >> 3. Updating the histogram requires taking a global spinlock and >> dirtying the global variables nfsd_last_call, nfsd_busy, and >> nfsdstats *twice* on every RPC call, which is a significant >> scaling limitation. > > You might fix this by making the global variables into per-CPU > variables, then totaling the per-CPU variables only at presentation > time (ie when someone cats /proc/net/rpc/nfsd). That would make the > collection logic lockless. This is how I fixed some of the other server stats in later patches. 
IIRC that approach doesn't work for the thread usage histogram because it's scaled as it's gathered by a potentially time-varying global number so the on-demand totalling might not give correct results. Also, in the presence of multiple thread pools any thread usage information should be per-pool not global. At the time I wrote this patch I concluded that I couldn't make the gathering scale and still preserve the exact semantics of the data gathered. > >> > > Yeah. The real issue here is deciding whether these stats are useful > or not; In my experience, not. > if not, can they be made useable? A different form of the data could certainly be made useful. -- Greg Banks, P.Engineer, SGI Australian Software Group. the brightly coloured sporks of revolution. I don't speak for SGI. ^ permalink raw reply [flat|nested] 25+ messages in thread
[parent not found: <496D1ACC.7070106-cP1dWloDopni96+mSzHFpQC/G2K4zDHf@public.gmane.org>]
* Re: [patch 1/3] knfsd: remove the nfsd thread busy histogram [not found] ` <496D1ACC.7070106-cP1dWloDopni96+mSzHFpQC/G2K4zDHf@public.gmane.org> @ 2009-02-11 21:59 ` J. Bruce Fields 0 siblings, 0 replies; 25+ messages in thread From: J. Bruce Fields @ 2009-02-11 21:59 UTC (permalink / raw) To: Greg Banks; +Cc: Chuck Lever, Linux NFS ML On Wed, Jan 14, 2009 at 09:50:52AM +1100, Greg Banks wrote: > Chuck Lever wrote: > > On Jan 13, 2009, at Jan 13, 2009, 5:26 AM, Greg Banks wrote: > >> Stop gathering the data that feeds the 'th' line in /proc/net/rpc/nfsd > >> because the questionable data provided is not worth the scalability > >> impact of calculating it. Instead, always report zeroes. The current > >> approach suffers from three major issues: > >> > >> 1. update_thread_usage() increments buckets by call service > >> time or call arrival time...in jiffies. On lightly loaded > >> machines, call service times are usually < 1 jiffy; on > >> heavily loaded machines call arrival times will be << 1 jiffy. > >> So a large portion of the updates to the buckets are rounded > >> down to zero, and the histogram is undercounting. > > > > Use ktime_get_real() instead. This is what the network layer uses. > IIRC that wasn't available when I wrote the patch (2.6.5 kernel in SLES9 > in late 2005). I haven't looked at it again since. > > Later, I looked at gathering better statistics on thread usage, and I > investigated real time clock (more precisely, monotonic clock) > implementations in Linux and came to the sad conclusion that there was > no API I could call that would be both accurate and efficient on two or > more platforms, so I gave up. The HPET hardware timer looked promising > for a while, but it turned out that a 32b kernel used a global spinlock > to access the 64b HPET registers, which created the same scalability > problem I was trying to fix. Things may have improved since then. > > If we had such a clock though, the solution is very simple. 
Each nfsd > maintains two new 64b counters of nanoseconds spent in each of two > states "busy" and "idle", where "idle" is asleep waiting for a call and > "busy" is everything else. These are maintained in svc_recv(). > Interfaces are provided for userspace to read an aggregation of these > counters, per-pool and globally. Userspace rate-converts the counters; > the rate of increase of the two counters tells you both how many threads > there are and how much actual demand on thread time there is. This is > how I did it in Irix (SGI Origin machines had a global distributed > monotonic clock in hardware). Is it worth looking into this again now? > > You could even steal the start timestamp from the first skbuff in an > > incoming RPC request. > > This would help if we had skbuffs: NFS/RDMA doesn't. > > > > This problem is made worse on "server" configurations and in virtual > > guests which may still use HZ=100, though with tickless HZ this is a > > less frequently seen configuration. > > Indeed. > > > >> 2. As seen previously on the nfs mailing list, the format in which > >> the histogram is presented is cryptic, difficult to explain, > >> and difficult to use. > > > > A user space script similar to mountstats that interprets these > > metrics might help here. > > The formatting in the pseudofile isn't the entire problem. The problem > is translating the "thread usage histogram" information there into an > answer to the actual question the sysadmin wants, which is "should I > configure more nfsds?" Agreed. > >> 3. Updating the histogram requires taking a global spinlock and > >> dirtying the global variables nfsd_last_call, nfsd_busy, and > >> nfsdstats *twice* on every RPC call, which is a significant > >> scaling limitation. > > > > You might fix this by making the global variables into per-CPU > > variables, then totaling the per-CPU variables only at presentation > > time (ie when someone cats /proc/net/rpc/nfsd). 
That would make the > > collection logic lockless. > This is how I fixed some of the other server stats in later patches. > IIRC that approach doesn't work for the thread usage histogram because > it's scaled as it's gathered by a potentially time-varying global number > so the on-demand totalling might not give correct results. Right, so, amuse yourself watching me as I try to remember how the 'th' line works: if it's attempting to report, e.g., what length of time 10% of the threads are busy, it needs very local knowledge: if you know each thread was busy 10 jiffies out of the last 100, you'd still need to know whether they were all busy during the *same* 10-jiffy interval, or whether that work was spread out more evenly over the 100 jiffies.... --b. > Also, in the > presence of multiple thread pools any thread usage information should be > per-pool not global. At the time I wrote this patch I concluded that I > couldn't make the gathering scale and still preserve the exact semantics > of the data gathered. > > > > >> > > > > Yeah. The real issue here is deciding whether these stats are useful > > or not; > In my experience, not. > > > if not, can they be made useable? > A different form of the data could certainly be made useful. > > -- > Greg Banks, P.Engineer, SGI Australian Software Group. > the brightly coloured sporks of revolution. > I don't speak for SGI. > ^ permalink raw reply [flat|nested] 25+ messages in thread
* [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages
  2009-01-13 10:26 [patch 0/3] First tranche of SGI Enhanced NFS patches Greg Banks
  2009-01-13 10:26 ` [patch 1/3] knfsd: remove the nfsd thread busy histogram Greg Banks
@ 2009-01-13 10:26 ` Greg Banks
  2009-01-13 14:33   ` Peter Staubach
  2009-02-11 23:10   ` J. Bruce Fields
  2009-01-13 10:26 ` [patch 3/3] knfsd: add file to export stats about nfsd pools Greg Banks
  2009-02-09  5:24 ` [patch 0/3] First tranche of SGI Enhanced NFS patches Greg Banks
  3 siblings, 2 replies; 25+ messages in thread
From: Greg Banks @ 2009-01-13 10:26 UTC (permalink / raw)
To: J. Bruce Fields; +Cc: Linux NFS ML

Avoid overloading the CPU scheduler with enormous load averages
when handling high call-rate NFS loads.  When the knfsd bottom half
is made aware of an incoming call by the socket layer, it tries to
choose an nfsd thread and wake it up.  As long as there are idle
threads, one will be woken up.

If there are a lot of nfsd threads (a sensible configuration when
the server is disk-bound or is running an HSM), there will be many
more nfsd threads than CPUs to run them.  Under a high call-rate
low service-time workload, the result is that almost every nfsd is
runnable, but only a handful are actually able to run.  This situation
causes two significant problems:

1. The CPU scheduler takes over 10% of each CPU, which is robbing
   the nfsd threads of valuable CPU time.

2. At a high enough load, the nfsd threads starve userspace threads
   of CPU time, to the point where daemons like portmap and rpc.mountd
   do not schedule for tens of seconds at a time.  Clients attempting
   to mount an NFS filesystem time out at the very first step (opening
   a TCP connection to portmap) because portmap cannot wake up from
   select() and call accept() in time.

Disclaimer: these effects were observed on a SLES9 kernel; modern
kernels' schedulers may behave more gracefully.

The solution is simple: keep in each svc_pool a counter of the number
of threads which have been woken but have not yet run, and do not wake
any more if that count reaches an arbitrary small threshold.

Testing was on a 4 CPU 4 NIC Altix using 4 IRIX clients, each with 16
synthetic client threads simulating an rsync (i.e. recursive directory
listing) workload reading from an i386 RH9 install image (161480
regular files in 10841 directories) on the server.  That tree is small
enough to fit in the server's RAM, so no disk traffic was involved.
This setup gives a sustained call rate in excess of 60000 calls/sec
before being CPU-bound on the server.  The server was running 128 nfsds.

Profiling showed schedule() taking 6.7% of every CPU, and __wake_up()
taking 5.2%.  This patch drops those contributions to 3.0% and 2.2%.
Load average was over 120 before the patch, and 20.9 after.

This patch is a forward-ported version of knfsd-avoid-nfsd-overload
which has been shipping in the SGI "Enhanced NFS" product since 2006.
It has been posted before: http://article.gmane.org/gmane.linux.nfs/10374 Signed-off-by: Greg Banks <gnb@sgi.com> --- include/linux/sunrpc/svc.h | 2 ++ net/sunrpc/svc_xprt.c | 25 ++++++++++++++++++------- 2 files changed, 20 insertions(+), 7 deletions(-) Index: bfields/include/linux/sunrpc/svc.h =================================================================== --- bfields.orig/include/linux/sunrpc/svc.h +++ bfields/include/linux/sunrpc/svc.h @@ -41,6 +41,7 @@ struct svc_pool { struct list_head sp_sockets; /* pending sockets */ unsigned int sp_nrthreads; /* # of threads in pool */ struct list_head sp_all_threads; /* all server threads */ + int sp_nwaking; /* number of threads woken but not yet active */ } ____cacheline_aligned_in_smp; /* @@ -264,6 +265,7 @@ struct svc_rqst { * cache pages */ wait_queue_head_t rq_wait; /* synchronization */ struct task_struct *rq_task; /* service thread */ + int rq_waking; /* 1 if thread is being woken */ }; /* Index: bfields/net/sunrpc/svc_xprt.c =================================================================== --- bfields.orig/net/sunrpc/svc_xprt.c +++ bfields/net/sunrpc/svc_xprt.c @@ -14,6 +14,8 @@ #define RPCDBG_FACILITY RPCDBG_SVCXPRT +#define SVC_MAX_WAKING 5 + static struct svc_deferred_req *svc_deferred_dequeue(struct svc_xprt *xprt); static int svc_deferred_recv(struct svc_rqst *rqstp); static struct cache_deferred_req *svc_defer(struct cache_req *req); @@ -298,6 +300,7 @@ void svc_xprt_enqueue(struct svc_xprt *x struct svc_pool *pool; struct svc_rqst *rqstp; int cpu; + int thread_avail; if (!(xprt->xpt_flags & ((1<<XPT_CONN)|(1<<XPT_DATA)|(1<<XPT_CLOSE)|(1<<XPT_DEFERRED)))) @@ -309,12 +312,6 @@ void svc_xprt_enqueue(struct svc_xprt *x spin_lock_bh(&pool->sp_lock); - if (!list_empty(&pool->sp_threads) && - !list_empty(&pool->sp_sockets)) - printk(KERN_ERR - "svc_xprt_enqueue: " - "threads and transports both waiting??\n"); - if (test_bit(XPT_DEAD, &xprt->xpt_flags)) { /* Don't enqueue dead transports */ dprintk("svc: 
transport %p is dead, not enqueued\n", xprt); @@ -353,7 +350,14 @@ void svc_xprt_enqueue(struct svc_xprt *x } process: - if (!list_empty(&pool->sp_threads)) { + /* Work out whether threads are available */ + thread_avail = !list_empty(&pool->sp_threads); /* threads are asleep */ + if (pool->sp_nwaking >= SVC_MAX_WAKING) { + /* too many threads are runnable and trying to wake up */ + thread_avail = 0; + } + + if (thread_avail) { rqstp = list_entry(pool->sp_threads.next, struct svc_rqst, rq_list); @@ -368,6 +372,8 @@ void svc_xprt_enqueue(struct svc_xprt *x svc_xprt_get(xprt); rqstp->rq_reserved = serv->sv_max_mesg; atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved); + rqstp->rq_waking = 1; + pool->sp_nwaking++; BUG_ON(xprt->xpt_pool != pool); wake_up(&rqstp->rq_wait); } else { @@ -633,6 +639,11 @@ int svc_recv(struct svc_rqst *rqstp, lon return -EINTR; spin_lock_bh(&pool->sp_lock); + if (rqstp->rq_waking) { + rqstp->rq_waking = 0; + pool->sp_nwaking--; + BUG_ON(pool->sp_nwaking < 0); + } xprt = svc_xprt_dequeue(pool); if (xprt) { rqstp->rq_xprt = xprt; -- -- Greg Banks, P.Engineer, SGI Australian Software Group. the brightly coloured sporks of revolution. I don't speak for SGI. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages 2009-01-13 10:26 ` [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages Greg Banks @ 2009-01-13 14:33 ` Peter Staubach 2009-01-13 22:15 ` Greg Banks 2009-02-11 23:10 ` J. Bruce Fields 1 sibling, 1 reply; 25+ messages in thread From: Peter Staubach @ 2009-01-13 14:33 UTC (permalink / raw) To: Greg Banks; +Cc: J. Bruce Fields, Linux NFS ML Greg Banks wrote: > Avoid overloading the CPU scheduler with enormous load averages > when handling high call-rate NFS loads. When the knfsd bottom half > is made aware of an incoming call by the socket layer, it tries to > choose an nfsd thread and wake it up. As long as there are idle > threads, one will be woken up. > > If there are lot of nfsd threads (a sensible configuration when > the server is disk-bound or is running an HSM), there will be many > more nfsd threads than CPUs to run them. Under a high call-rate > low service-time workload, the result is that almost every nfsd is > runnable, but only a handful are actually able to run. This situation > causes two significant problems: > > 1. The CPU scheduler takes over 10% of each CPU, which is robbing > the nfsd threads of valuable CPU time. > > 2. At a high enough load, the nfsd threads starve userspace threads > of CPU time, to the point where daemons like portmap and rpc.mountd > do not schedule for tens of seconds at a time. Clients attempting > to mount an NFS filesystem timeout at the very first step (opening > a TCP connection to portmap) because portmap cannot wake up from > select() and call accept() in time. > > Disclaimer: these effects were observed on a SLES9 kernel, modern > kernels' schedulers may behave more gracefully. > > The solution is simple: keep in each svc_pool a counter of the number > of threads which have been woken but have not yet run, and do not wake > any more if that count reaches an arbitrary small threshold. 
> > Testing was on a 4 CPU 4 NIC Altix using 4 IRIX clients, each with 16 > synthetic client threads simulating an rsync (i.e. recursive directory > listing) workload reading from an i386 RH9 install image (161480 > regular files in 10841 directories) on the server. That tree is small > enough to fill in the server's RAM so no disk traffic was involved. > This setup gives a sustained call rate in excess of 60000 calls/sec > before being CPU-bound on the server. The server was running 128 nfsds. > > Profiling showed schedule() taking 6.7% of every CPU, and __wake_up() > taking 5.2%. This patch drops those contributions to 3.0% and 2.2%. > Load average was over 120 before the patch, and 20.9 after. > > This patch is a forward-ported version of knfsd-avoid-nfsd-overload > which has been shipping in the SGI "Enhanced NFS" product since 2006. > It has been posted before: > > http://article.gmane.org/gmane.linux.nfs/10374 > > Signed-off-by: Greg Banks <gnb@sgi.com> > --- Have you measured the impact of these changes for something like SpecSFS? Thanx... ps ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages 2009-01-13 14:33 ` Peter Staubach @ 2009-01-13 22:15 ` Greg Banks [not found] ` <496D1294.1060407-cP1dWloDopni96+mSzHFpQC/G2K4zDHf@public.gmane.org> 0 siblings, 1 reply; 25+ messages in thread From: Greg Banks @ 2009-01-13 22:15 UTC (permalink / raw) To: Peter Staubach; +Cc: J. Bruce Fields, Linux NFS ML Peter Staubach wrote: > Greg Banks wrote: >> [...] >> Testing was on a 4 CPU 4 NIC Altix using 4 IRIX clients, each with 16 >> synthetic client threads simulating an rsync (i.e. recursive directory >> listing) workload[...] >> >> Profiling showed schedule() taking 6.7% of every CPU, and __wake_up() >> taking 5.2%. This patch drops those contributions to 3.0% and 2.2%. >> Load average was over 120 before the patch, and 20.9 after. >> [...] > > Have you measured the impact of these changes for something > like SpecSFS? Not individually. This patch was part of some work I did in late 2005/early 2006 which was aimed at improving NFS server performance in general. I do know that the server's SpecSFS numbers jumped by a factor of somewhere over 2x, from embarrassingly bad to publishable, when SpecSFS was re-run after that work. However at the time I did not have the ability to run SpecSFS myself, it was run by a separate group of people who had dedicated hardware and experience. So I can't tell what contribution this particular patch made to the overall SpecSFS improvements. Sorry. -- Greg Banks, P.Engineer, SGI Australian Software Group. the brightly coloured sporks of revolution. I don't speak for SGI. ^ permalink raw reply [flat|nested] 25+ messages in thread
[parent not found: <496D1294.1060407-cP1dWloDopni96+mSzHFpQC/G2K4zDHf@public.gmane.org>]
* Re: [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages [not found] ` <496D1294.1060407-cP1dWloDopni96+mSzHFpQC/G2K4zDHf@public.gmane.org> @ 2009-01-13 22:35 ` Peter Staubach 2009-01-13 23:04 ` Greg Banks 0 siblings, 1 reply; 25+ messages in thread From: Peter Staubach @ 2009-01-13 22:35 UTC (permalink / raw) To: Greg Banks; +Cc: J. Bruce Fields, Linux NFS ML Greg Banks wrote: > Peter Staubach wrote: > >> Greg Banks wrote: >> >>> [...] >>> Testing was on a 4 CPU 4 NIC Altix using 4 IRIX clients, each with 16 >>> synthetic client threads simulating an rsync (i.e. recursive directory >>> listing) workload[...] >>> >>> Profiling showed schedule() taking 6.7% of every CPU, and __wake_up() >>> taking 5.2%. This patch drops those contributions to 3.0% and 2.2%. >>> Load average was over 120 before the patch, and 20.9 after. >>> [...] >>> >> Have you measured the impact of these changes for something >> like SpecSFS? >> > > Not individually. This patch was part of some work I did in late > 2005/early 2006 which was aimed at improving NFS server performance in > general. I do know that the server's SpecSFS numbers jumped by a factor > of somewhere over 2x, from embarrassingly bad to publishable, when > SpecSFS was re-run after that work. However at the time I did not have > the ability to run SpecSFS myself, it was run by a separate group of > people who had dedicated hardware and experience. So I can't tell what > contribution this particular patch made to the overall SpecSFS > improvements. Sorry. That does sound promising though. :-) It would be interesting to get some better information regarding some of the measurable performance ramifications such as SFS though. The Linux NFS server has not had much attention paid to it and I suspect that it could use some work in the performance area. Thanx... ps ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages 2009-01-13 22:35 ` Peter Staubach @ 2009-01-13 23:04 ` Greg Banks 0 siblings, 0 replies; 25+ messages in thread From: Greg Banks @ 2009-01-13 23:04 UTC (permalink / raw) To: Peter Staubach; +Cc: J. Bruce Fields, Linux NFS ML Peter Staubach wrote: > Greg Banks wrote: > > It would be interesting to get some better information regarding > some of the measurable performance ramifications such as SFS > though. My NFS server work got SpecSFS to the condition of being disk subsystem bound instead of CPU bound, at which point it was the XFS folks' problem. > The Linux NFS server has not had much attention paid to > it It's had attention paid to it, just not yet published :-( -- Greg Banks, P.Engineer, SGI Australian Software Group. the brightly coloured sporks of revolution. I don't speak for SGI. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages 2009-01-13 10:26 ` [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages Greg Banks 2009-01-13 14:33 ` Peter Staubach @ 2009-02-11 23:10 ` J. Bruce Fields 2009-02-19 6:25 ` Greg Banks 1 sibling, 1 reply; 25+ messages in thread From: J. Bruce Fields @ 2009-02-11 23:10 UTC (permalink / raw) To: Greg Banks; +Cc: Linux NFS ML On Tue, Jan 13, 2009 at 09:26:35PM +1100, Greg Banks wrote: > Avoid overloading the CPU scheduler with enormous load averages > when handling high call-rate NFS loads. When the knfsd bottom half > is made aware of an incoming call by the socket layer, it tries to > choose an nfsd thread and wake it up. As long as there are idle > threads, one will be woken up. > > If there are a lot of nfsd threads (a sensible configuration when > the server is disk-bound or is running an HSM), there will be many > more nfsd threads than CPUs to run them. Under a high call-rate > low service-time workload, the result is that almost every nfsd is > runnable, but only a handful are actually able to run. This situation > causes two significant problems: > > 1. The CPU scheduler takes over 10% of each CPU, which is robbing > the nfsd threads of valuable CPU time. > > 2. At a high enough load, the nfsd threads starve userspace threads > of CPU time, to the point where daemons like portmap and rpc.mountd > do not schedule for tens of seconds at a time. Clients attempting > to mount an NFS filesystem time out at the very first step (opening > a TCP connection to portmap) because portmap cannot wake up from > select() and call accept() in time. > > Disclaimer: these effects were observed on a SLES9 kernel; modern > kernels' schedulers may behave more gracefully. Yes, googling for "SLES9 kernel"... Was that really 2.6.5 based? 
The scheduler's been through at least one complete rewrite since then, so the obvious question is whether it's wise to apply something that may turn out to have been very specific to an old version of the scheduler. It's a simple enough patch, but without any suggestion for how to retest on a more recent kernel, I'm uneasy. --b. > > The solution is simple: keep in each svc_pool a counter of the number > of threads which have been woken but have not yet run, and do not wake > any more if that count reaches an arbitrary small threshold. > > Testing was on a 4 CPU 4 NIC Altix using 4 IRIX clients, each with 16 > synthetic client threads simulating an rsync (i.e. recursive directory > listing) workload reading from an i386 RH9 install image (161480 > regular files in 10841 directories) on the server. That tree is small > enough to fit in the server's RAM so no disk traffic was involved. > This setup gives a sustained call rate in excess of 60000 calls/sec > before being CPU-bound on the server. The server was running 128 nfsds. > > Profiling showed schedule() taking 6.7% of every CPU, and __wake_up() > taking 5.2%. This patch drops those contributions to 3.0% and 2.2%. > Load average was over 120 before the patch, and 20.9 after. > > This patch is a forward-ported version of knfsd-avoid-nfsd-overload > which has been shipping in the SGI "Enhanced NFS" product since 2006. 
> It has been posted before: > > http://article.gmane.org/gmane.linux.nfs/10374 > > Signed-off-by: Greg Banks <gnb@sgi.com> > --- > > include/linux/sunrpc/svc.h | 2 ++ > net/sunrpc/svc_xprt.c | 25 ++++++++++++++++++------- > 2 files changed, 20 insertions(+), 7 deletions(-) > > Index: bfields/include/linux/sunrpc/svc.h > =================================================================== > --- bfields.orig/include/linux/sunrpc/svc.h > +++ bfields/include/linux/sunrpc/svc.h > @@ -41,6 +41,7 @@ struct svc_pool { > struct list_head sp_sockets; /* pending sockets */ > unsigned int sp_nrthreads; /* # of threads in pool */ > struct list_head sp_all_threads; /* all server threads */ > + int sp_nwaking; /* number of threads woken but not yet active */ > } ____cacheline_aligned_in_smp; > > /* > @@ -264,6 +265,7 @@ struct svc_rqst { > * cache pages */ > wait_queue_head_t rq_wait; /* synchronization */ > struct task_struct *rq_task; /* service thread */ > + int rq_waking; /* 1 if thread is being woken */ > }; > > /* > Index: bfields/net/sunrpc/svc_xprt.c > =================================================================== > --- bfields.orig/net/sunrpc/svc_xprt.c > +++ bfields/net/sunrpc/svc_xprt.c > @@ -14,6 +14,8 @@ > > #define RPCDBG_FACILITY RPCDBG_SVCXPRT > > +#define SVC_MAX_WAKING 5 > + > static struct svc_deferred_req *svc_deferred_dequeue(struct svc_xprt *xprt); > static int svc_deferred_recv(struct svc_rqst *rqstp); > static struct cache_deferred_req *svc_defer(struct cache_req *req); > @@ -298,6 +300,7 @@ void svc_xprt_enqueue(struct svc_xprt *x > struct svc_pool *pool; > struct svc_rqst *rqstp; > int cpu; > + int thread_avail; > > if (!(xprt->xpt_flags & > ((1<<XPT_CONN)|(1<<XPT_DATA)|(1<<XPT_CLOSE)|(1<<XPT_DEFERRED)))) > @@ -309,12 +312,6 @@ void svc_xprt_enqueue(struct svc_xprt *x > > spin_lock_bh(&pool->sp_lock); > > - if (!list_empty(&pool->sp_threads) && > - !list_empty(&pool->sp_sockets)) > - printk(KERN_ERR > - "svc_xprt_enqueue: " > - "threads and 
transports both waiting??\n"); > - > if (test_bit(XPT_DEAD, &xprt->xpt_flags)) { > /* Don't enqueue dead transports */ > dprintk("svc: transport %p is dead, not enqueued\n", xprt); > @@ -353,7 +350,14 @@ void svc_xprt_enqueue(struct svc_xprt *x > } > > process: > - if (!list_empty(&pool->sp_threads)) { > + /* Work out whether threads are available */ > + thread_avail = !list_empty(&pool->sp_threads); /* threads are asleep */ > + if (pool->sp_nwaking >= SVC_MAX_WAKING) { > + /* too many threads are runnable and trying to wake up */ > + thread_avail = 0; > + } > + > + if (thread_avail) { > rqstp = list_entry(pool->sp_threads.next, > struct svc_rqst, > rq_list); > @@ -368,6 +372,8 @@ void svc_xprt_enqueue(struct svc_xprt *x > svc_xprt_get(xprt); > rqstp->rq_reserved = serv->sv_max_mesg; > atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved); > + rqstp->rq_waking = 1; > + pool->sp_nwaking++; > BUG_ON(xprt->xpt_pool != pool); > wake_up(&rqstp->rq_wait); > } else { > @@ -633,6 +639,11 @@ int svc_recv(struct svc_rqst *rqstp, lon > return -EINTR; > > spin_lock_bh(&pool->sp_lock); > + if (rqstp->rq_waking) { > + rqstp->rq_waking = 0; > + pool->sp_nwaking--; > + BUG_ON(pool->sp_nwaking < 0); > + } > xprt = svc_xprt_dequeue(pool); > if (xprt) { > rqstp->rq_xprt = xprt; > > -- > -- > Greg Banks, P.Engineer, SGI Australian Software Group. > the brightly coloured sporks of revolution. > I don't speak for SGI. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages 2009-02-11 23:10 ` J. Bruce Fields @ 2009-02-19 6:25 ` Greg Banks 2009-03-15 21:21 ` J. Bruce Fields 0 siblings, 1 reply; 25+ messages in thread From: Greg Banks @ 2009-02-19 6:25 UTC (permalink / raw) To: J. Bruce Fields; +Cc: Linux NFS ML J. Bruce Fields wrote: > On Tue, Jan 13, 2009 at 09:26:35PM +1100, Greg Banks wrote: > >> [...] Under a high call-rate >> low service-time workload, the result is that almost every nfsd is >> runnable, but only a handful are actually able to run. This situation >> causes two significant problems: >> >> 1. The CPU scheduler takes over 10% of each CPU, which is robbing >> the nfsd threads of valuable CPU time. >> >> 2. At a high enough load, the nfsd threads starve userspace threads >> of CPU time, to the point where daemons like portmap and rpc.mountd >> do not schedule for tens of seconds at a time. Clients attempting >> to mount an NFS filesystem timeout at the very first step (opening >> a TCP connection to portmap) because portmap cannot wake up from >> select() and call accept() in time. >> >> Disclaimer: these effects were observed on a SLES9 kernel, modern >> kernels' schedulers may behave more gracefully. >> > > Yes, googling for "SLES9 kernel"... Was that really 2.6.5 based? > > The scheduler's been through at least one complete rewrite since then, > so the obvious question is whether it's wise to apply something that may > turn out to have been very specific to an old version of the scheduler. > > It's a simple enough patch, but without any suggestion for how to retest > on a more recent kernel, I'm uneasy. > > Ok, fair enough. I retested using my local GIT tree, which is cloned from yours and was last git-pull'd a couple of days ago. The test load was the same as in my 2005 tests (multiple userspace threads each simulating an rsync directory traversal from a 2.4 client, i.e. 
almost entirely ACCESS calls with some READDIRs and GETATTRs, running as fast as the server will respond). This was run on much newer hardware (and a different architecture as well: a quad-core Xeon) so the results are not directly comparable with my 2005 tests. However the effect with and without the patch can be clearly seen, with otherwise identical hardware, software and load (I added a sysctl to enable and disable the effect of the patch at runtime). A quick summary: the 2.6.29-rc4 CPU scheduler is not magically better than the 2.6.5 one and NFS can still benefit from reducing load on it. Here's a table of measured call rates and steady-state 1-minute load averages, before and after the patch, versus number of client load threads. The server was configured with 128 nfsds in the thread pool which was under load. In all cases the single CPU in the thread pool was 100% busy (I've elided the 8-thread results where that wasn't the case). #threads before after call/sec loadavg call/sec loadavg -------- -------- ------- -------- ------- 16 57353 10.98 74965 6.11 24 57787 19.56 79397 13.58 32 57921 26.00 80746 21.35 40 57936 35.32 81629 31.73 48 57930 43.84 81775 42.64 56 57467 51.05 81411 52.39 64 57595 57.93 81543 64.61 As you can see, the patch improves NFS throughput for this load by up to 40%, which is a surprisingly large improvement. I suspect it's a larger improvement because my 2005 tests had multiple CPUs serving NFS traffic, and the improvements due to this patch were drowned in various SMP effects which are absent from this test. Also surprising is that the patch improves the reported load average number only at higher numbers of client threads; at low client thread counts the load average is unchanged or even slightly higher. The patch didn't have that effect back in 2005, so I'm confused by that behaviour. Perhaps the difference is due to changes in the scheduler or the accounting that measures load averages? 
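The "up to 40%" figure can be checked directly against the call/sec columns of the table above; a quick arithmetic sketch (values copied from the table):

```python
# Per-thread-count throughput gain, computed from the before/after
# call/sec columns of the table above.
threads = [16, 24, 32, 40, 48, 56, 64]
before = [57353, 57787, 57921, 57936, 57930, 57467, 57595]
after = [74965, 79397, 80746, 81629, 81775, 81411, 81543]

gains = [(a - b) / b * 100.0 for b, a in zip(before, after)]
for t, g in zip(threads, gains):
    print(f"{t:2d} threads: +{g:.1f}%")
print(f"max gain: +{max(gains):.1f}%")  # a little over 41%
```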
Profiling at 16 client threads, 32 server threads shows differences in the CPU usage in the CPU scheduler itself, with some ACPI effects too. The platform I ran on in 2005 did not support ACPI, so that's new to me. Nevertheless it makes a difference. Here are the top samples from a couple of 30-second flat profiles. Before: samples % image name app name symbol name 3013 4.9327 processor.ko processor acpi_idle_enter_simple <--- 2583 4.2287 sunrpc.ko sunrpc svc_recv 1273 2.0841 e1000e.ko e1000e e1000_irq_enable 1235 2.0219 sunrpc.ko sunrpc svc_process 1070 1.7517 e1000e.ko e1000e e1000_intr_msi 966 1.5815 e1000e.ko e1000e e1000_xmit_frame 884 1.4472 sunrpc.ko sunrpc svc_xprt_enqueue 861 1.4096 e1000e.ko e1000e e1000_clean_rx_irq 774 1.2671 xfs.ko xfs xfs_iget 772 1.2639 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb schedule <--- 726 1.1886 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb sched_clock <--- 693 1.1345 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb read_hpet <--- 680 1.1133 sunrpc.ko sunrpc cache_check 671 1.0985 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb tcp_sendpage 641 1.0494 sunrpc.ko sunrpc sunrpc_cache_lookup Total % cpu from ACPI & scheduler: 8.5% After: samples % image name app name symbol name 5145 5.2163 sunrpc.ko sunrpc svc_recv 2908 2.9483 processor.ko processor acpi_idle_enter_simple <--- 2731 2.7688 sunrpc.ko sunrpc svc_process 2092 2.1210 e1000e.ko e1000e e1000_clean_rx_irq 1988 2.0155 e1000e.ko e1000e e1000_xmit_frame 1863 1.8888 e1000e.ko e1000e e1000_irq_enable 1606 1.6282 xfs.ko xfs xfs_iget 1514 1.5350 sunrpc.ko sunrpc cache_check 1389 1.4082 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb tcp_recvmsg 1383 1.4022 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb tcp_sendpage 1310 1.3281 sunrpc.ko sunrpc svc_xprt_enqueue 1177 1.1933 sunrpc.ko sunrpc sunrpc_cache_lookup 1142 1.1578 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb get_page_from_freelist 1135 1.1507 sunrpc.ko sunrpc svc_tcp_recvfrom 1126 1.1416 vmlinux-2.6.29-rc4-gnb 
vmlinux-2.6.29-rc4-gnb tcp_transmit_skb 1040 1.0544 e1000e.ko e1000e e1000_intr_msi 1033 1.0473 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb tcp_ack 1030 1.0443 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb kref_get 1000 1.0138 nfsd.ko nfsd fh_verify Total % cpu from ACPI & scheduler: 2.9% Does that make you less uneasy? -- Greg Banks, P.Engineer, SGI Australian Software Group. the brightly coloured sporks of revolution. I don't speak for SGI. ^ permalink raw reply [flat|nested] 25+ messages in thread
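The two "Total % cpu from ACPI & scheduler" figures quoted in the message above are simply the sums of the profile rows flagged with "<---"; a minimal check:

```python
# Sum the oprofile rows flagged "<---" (ACPI idle plus scheduler
# symbols) in each of the two 30-second flat profiles above.
before = {  # symbol: % of samples
    "acpi_idle_enter_simple": 4.9327,
    "schedule": 1.2639,
    "sched_clock": 1.1886,
    "read_hpet": 1.1345,
}
after = {"acpi_idle_enter_simple": 2.9483}

print(f"before: {sum(before.values()):.1f}%")  # 8.5%
print(f"after: {sum(after.values()):.1f}%")    # 2.9%
```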
* Re: [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages 2009-02-19 6:25 ` Greg Banks @ 2009-03-15 21:21 ` J. Bruce Fields 2009-03-16 3:10 ` Greg Banks 0 siblings, 1 reply; 25+ messages in thread From: J. Bruce Fields @ 2009-03-15 21:21 UTC (permalink / raw) To: Greg Banks; +Cc: Linux NFS ML On Thu, Feb 19, 2009 at 05:25:47PM +1100, Greg Banks wrote: > J. Bruce Fields wrote: > > On Tue, Jan 13, 2009 at 09:26:35PM +1100, Greg Banks wrote: > > > >> [...] Under a high call-rate > >> low service-time workload, the result is that almost every nfsd is > >> runnable, but only a handful are actually able to run. This situation > >> causes two significant problems: > >> > >> 1. The CPU scheduler takes over 10% of each CPU, which is robbing > >> the nfsd threads of valuable CPU time. > >> > >> 2. At a high enough load, the nfsd threads starve userspace threads > >> of CPU time, to the point where daemons like portmap and rpc.mountd > >> do not schedule for tens of seconds at a time. Clients attempting > >> to mount an NFS filesystem timeout at the very first step (opening > >> a TCP connection to portmap) because portmap cannot wake up from > >> select() and call accept() in time. > >> > >> Disclaimer: these effects were observed on a SLES9 kernel, modern > >> kernels' schedulers may behave more gracefully. > >> > > > > Yes, googling for "SLES9 kernel"... Was that really 2.6.5 based? > > > > The scheduler's been through at least one complete rewrite since then, > > so the obvious question is whether it's wise to apply something that may > > turn out to have been very specific to an old version of the scheduler. > > > > It's a simple enough patch, but without any suggestion for how to retest > > on a more recent kernel, I'm uneasy. > > > > > > Ok, fair enough. I retested using my local GIT tree, which is cloned > from yours and was last git-pull'd a couple of days ago. 
The test load > was the same as in my 2005 tests (multiple userspace threads each > simulating an rsync directory traversal from a 2.4 client, i.e. almost > entirely ACCESS calls with some READDIRs and GETATTRs, running as fast > as the server will respond). This was run on much newer hardware (and a > different architecture as well: a quad-core Xeon) so the results are not > directly comparable with my 2005 tests. However the effect with and > without the patch can be clearly seen, with otherwise identical hardware > software and load (I added a sysctl to enable and disable the effect of > the patch at runtime). > > A quick summary: the 2.6.29-rc4 CPU scheduler is not magically better > than the 2.6.5 one and NFS can still benefit from reducing load on it. > > Here's a table of measured call rates and steady-state 1-minute load > averages, before and after the patch, versus number of client load > threads. The server was configured with 128 nfsds in the thread pool > which was under load. In all cases except the the single CPU in the > thread pool was 100% busy (I've elided the 8-thread results where that > wasn't the case). > > #threads before after > call/sec loadavg call/sec loadavg > -------- -------- ------- -------- ------- > 16 57353 10.98 74965 6.11 > 24 57787 19.56 79397 13.58 > 32 57921 26.00 80746 21.35 > 40 57936 35.32 81629 31.73 > 48 57930 43.84 81775 42.64 > 56 57467 51.05 81411 52.39 > 64 57595 57.93 81543 64.61 > > > As you can see, the patch improves NFS throughput for this load by up > to 40%, which is a surprisingly large improvement. I suspect it's a > larger improvement because my 2005 tests had multiple CPUs serving NFS > traffic, and the improvements due to this patch were drowned in various > SMP effects which are absent from this test. 
> > Also surprising is that the patch improves the reported load average > number only at higher numbers of client threads; at low client thread > counts the load average is unchanged or even slightly higher. The patch > didn't have that effect back in 2005, so I'm confused by that > behaviour. Perhaps the difference is due to changes in the scheduler or > the accounting that measures load averages? > > Profiling at 16 client threads, 32 server threads shows differences in > the CPU usage in the CPU scheduler itself, with some ACPI effects too. > The platform I ran on in 2005 did not support ACPI, so that's new to > me. Nevertheless it makes a difference. Here are the top samples from > a couple of 30-second flat profiles. > > Before: > > samples % image name app name symbol name > 3013 4.9327 processor.ko processor acpi_idle_enter_simple <--- > 2583 4.2287 sunrpc.ko sunrpc svc_recv > 1273 2.0841 e1000e.ko e1000e e1000_irq_enable > 1235 2.0219 sunrpc.ko sunrpc svc_process > 1070 1.7517 e1000e.ko e1000e e1000_intr_msi > 966 1.5815 e1000e.ko e1000e e1000_xmit_frame > 884 1.4472 sunrpc.ko sunrpc svc_xprt_enqueue > 861 1.4096 e1000e.ko e1000e e1000_clean_rx_irq > 774 1.2671 xfs.ko xfs xfs_iget > 772 1.2639 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb schedule <--- > 726 1.1886 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb sched_clock <--- > 693 1.1345 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb read_hpet <--- > 680 1.1133 sunrpc.ko sunrpc cache_check > 671 1.0985 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb tcp_sendpage > 641 1.0494 sunrpc.ko sunrpc sunrpc_cache_lookup > > Total % cpu from ACPI & scheduler: 8.5% > > After: > > samples % image name app name symbol name > 5145 5.2163 sunrpc.ko sunrpc svc_recv > 2908 2.9483 processor.ko processor acpi_idle_enter_simple <--- > 2731 2.7688 sunrpc.ko sunrpc svc_process > 2092 2.1210 e1000e.ko e1000e e1000_clean_rx_irq > 1988 2.0155 e1000e.ko e1000e e1000_xmit_frame > 1863 1.8888 e1000e.ko e1000e e1000_irq_enable > 1606 
1.6282 xfs.ko xfs xfs_iget > 1514 1.5350 sunrpc.ko sunrpc cache_check > 1389 1.4082 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb tcp_recvmsg > 1383 1.4022 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb tcp_sendpage > 1310 1.3281 sunrpc.ko sunrpc svc_xprt_enqueue > 1177 1.1933 sunrpc.ko sunrpc sunrpc_cache_lookup > 1142 1.1578 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb get_page_from_freelist > 1135 1.1507 sunrpc.ko sunrpc svc_tcp_recvfrom > 1126 1.1416 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb tcp_transmit_skb > 1040 1.0544 e1000e.ko e1000e e1000_intr_msi > 1033 1.0473 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb tcp_ack > 1030 1.0443 vmlinux-2.6.29-rc4-gnb vmlinux-2.6.29-rc4-gnb kref_get > 1000 1.0138 nfsd.ko nfsd fh_verify > > Total % cpu from ACPI & scheduler: 2.9% > > > Does that make you less uneasy? Yes, thanks! Queued up for 2.6.30, barring objections. But perhaps we should pass on the patch and your results to people who know the scheduler better and see if they can explain e.g. the loadavg numbers. --b. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages 2009-03-15 21:21 ` J. Bruce Fields @ 2009-03-16 3:10 ` Greg Banks 0 siblings, 0 replies; 25+ messages in thread From: Greg Banks @ 2009-03-16 3:10 UTC (permalink / raw) To: J. Bruce Fields; +Cc: Linux NFS ML J. Bruce Fields wrote: > On Thu, Feb 19, 2009 at 05:25:47PM +1100, Greg Banks wrote: > >> J. Bruce Fields wrote: >> >>> On Tue, Jan 13, 2009 at 09:26:35PM +1100, Greg Banks wrote: >>> >>> [...] >>> It's a simple enough patch, but without any suggestion for how to retest >>> on a more recent kernel, I'm uneasy. >>> >> [...] >> >> Does that make you less uneasy? >> > > Yes, thanks! > > Queued up for 2.6.30, barring objections. Thanks. > But perhaps we should pass on > the patch and your results to people who know the scheduler better and > see if they can explain e.g. the loadavg numbers. > If you like. Personally I'm happy with assuming that it's because nfsd is putting an unnaturally harsh load on the scheduler. -- Greg Banks, P.Engineer, SGI Australian Software Group. the brightly coloured sporks of revolution. I don't speak for SGI. ^ permalink raw reply [flat|nested] 25+ messages in thread
* [patch 3/3] knfsd: add file to export stats about nfsd pools 2009-01-13 10:26 [patch 0/3] First tranche of SGI Enhanced NFS patches Greg Banks 2009-01-13 10:26 ` [patch 1/3] knfsd: remove the nfsd thread busy histogram Greg Banks 2009-01-13 10:26 ` [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages Greg Banks @ 2009-01-13 10:26 ` Greg Banks 2009-02-12 17:11 ` J. Bruce Fields 2009-02-09 5:24 ` [patch 0/3] First tranche of SGI Enhanced NFS patches Greg Banks 3 siblings, 1 reply; 25+ messages in thread From: Greg Banks @ 2009-01-13 10:26 UTC (permalink / raw) To: J. Bruce Fields; +Cc: Linux NFS ML, Harshula Jayasuriya Add /proc/fs/nfsd/pool_stats to export to userspace various statistics about the operation of rpc server thread pools. This patch is based on a forward-ported version of knfsd-add-pool-thread-stats which has been shipping in the SGI "Enhanced NFS" product since 2006 and which was previously posted: http://article.gmane.org/gmane.linux.nfs/10375 It has also been updated thus: * moved EXPORT_SYMBOL() to near the function it exports * made the new struct seq_operations const * used SEQ_START_TOKEN instead of ((void *)1) * merged fix from SGI PV 990526 "sunrpc: use dprintk instead of printk in svc_pool_stats_*()" by Harshula Jayasuriya. * merged fix from SGI PV 964001 "Crash reading pool_stats before nfsds are started". 
Signed-off-by: Greg Banks <gnb@sgi.com> Signed-off-by: Harshula Jayasuriya <harshula@sgi.com> --- fs/nfsd/nfsctl.c | 12 ++++ fs/nfsd/nfssvc.c | 7 ++ include/linux/sunrpc/svc.h | 11 +++ net/sunrpc/svc_xprt.c | 100 +++++++++++++++++++++++++++++++++- 4 files changed, 129 insertions(+), 1 deletion(-) Index: bfields/fs/nfsd/nfsctl.c =================================================================== --- bfields.orig/fs/nfsd/nfsctl.c +++ bfields/fs/nfsd/nfsctl.c @@ -60,6 +60,7 @@ enum { NFSD_FO_UnlockFS, NFSD_Threads, NFSD_Pool_Threads, + NFSD_Pool_Stats, NFSD_Versions, NFSD_Ports, NFSD_MaxBlkSize, @@ -172,6 +173,16 @@ static const struct file_operations expo .owner = THIS_MODULE, }; +extern int nfsd_pool_stats_open(struct inode *inode, struct file *file); + +static struct file_operations pool_stats_operations = { + .open = nfsd_pool_stats_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release, + .owner = THIS_MODULE, +}; + /*----------------------------------------------------------------------------*/ /* * payload - write methods @@ -1246,6 +1257,7 @@ static int nfsd_fill_super(struct super_ [NFSD_Fh] = {"filehandle", &transaction_ops, S_IWUSR|S_IRUSR}, [NFSD_Threads] = {"threads", &transaction_ops, S_IWUSR|S_IRUSR}, [NFSD_Pool_Threads] = {"pool_threads", &transaction_ops, S_IWUSR|S_IRUSR}, + [NFSD_Pool_Stats] = {"pool_stats", &pool_stats_operations, S_IRUGO}, [NFSD_Versions] = {"versions", &transaction_ops, S_IWUSR|S_IRUSR}, [NFSD_Ports] = {"portlist", &transaction_ops, S_IWUSR|S_IRUGO}, [NFSD_MaxBlkSize] = {"max_block_size", &transaction_ops, S_IWUSR|S_IRUGO}, Index: bfields/fs/nfsd/nfssvc.c =================================================================== --- bfields.orig/fs/nfsd/nfssvc.c +++ bfields/fs/nfsd/nfssvc.c @@ -546,3 +546,10 @@ nfsd_dispatch(struct svc_rqst *rqstp, __ nfsd_cache_update(rqstp, proc->pc_cachetype, statp + 1); return 1; } + +int nfsd_pool_stats_open(struct inode *inode, struct file *file) +{ + if (nfsd_serv == NULL) + 
return -ENODEV; + return svc_pool_stats_open(nfsd_serv, file); +} Index: bfields/include/linux/sunrpc/svc.h =================================================================== --- bfields.orig/include/linux/sunrpc/svc.h +++ bfields/include/linux/sunrpc/svc.h @@ -24,6 +24,15 @@ */ typedef int (*svc_thread_fn)(void *); +/* statistics for svc_pool structures */ +struct svc_pool_stats { + unsigned long packets; + unsigned long sockets_queued; + unsigned long threads_woken; + unsigned long overloads_avoided; + unsigned long threads_timedout; +}; + /* * * RPC service thread pool. @@ -42,6 +51,7 @@ struct svc_pool { unsigned int sp_nrthreads; /* # of threads in pool */ struct list_head sp_all_threads; /* all server threads */ int sp_nwaking; /* number of threads woken but not yet active */ + struct svc_pool_stats sp_stats; /* statistics on pool operation */ } ____cacheline_aligned_in_smp; /* @@ -396,6 +406,7 @@ struct svc_serv * svc_create_pooled(str sa_family_t, void (*shutdown)(struct svc_serv *), svc_thread_fn, struct module *); int svc_set_num_threads(struct svc_serv *, struct svc_pool *, int); +int svc_pool_stats_open(struct svc_serv *serv, struct file *file); void svc_destroy(struct svc_serv *); int svc_process(struct svc_rqst *); int svc_register(const struct svc_serv *, const unsigned short, Index: bfields/net/sunrpc/svc_xprt.c =================================================================== --- bfields.orig/net/sunrpc/svc_xprt.c +++ bfields/net/sunrpc/svc_xprt.c @@ -318,6 +318,8 @@ void svc_xprt_enqueue(struct svc_xprt *x goto out_unlock; } + pool->sp_stats.packets++; + /* Mark transport as busy. It will remain in this state until * the provider calls svc_xprt_received. 
We update XPT_BUSY * atomically because it also guards against trying to enqueue @@ -355,6 +357,7 @@ void svc_xprt_enqueue(struct svc_xprt *x if (pool->sp_nwaking >= SVC_MAX_WAKING) { /* too many threads are runnable and trying to wake up */ thread_avail = 0; + pool->sp_stats.overloads_avoided++; } if (thread_avail) { @@ -374,11 +377,13 @@ void svc_xprt_enqueue(struct svc_xprt *x atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved); rqstp->rq_waking = 1; pool->sp_nwaking++; + pool->sp_stats.threads_woken++; BUG_ON(xprt->xpt_pool != pool); wake_up(&rqstp->rq_wait); } else { dprintk("svc: transport %p put into queue\n", xprt); list_add_tail(&xprt->xpt_ready, &pool->sp_sockets); + pool->sp_stats.sockets_queued++; BUG_ON(xprt->xpt_pool != pool); } @@ -591,6 +596,7 @@ int svc_recv(struct svc_rqst *rqstp, lon int pages; struct xdr_buf *arg; DECLARE_WAITQUEUE(wait, current); + long time_left; dprintk("svc: server %p waiting for data (to = %ld)\n", rqstp, timeout); @@ -676,12 +682,14 @@ int svc_recv(struct svc_rqst *rqstp, lon add_wait_queue(&rqstp->rq_wait, &wait); spin_unlock_bh(&pool->sp_lock); - schedule_timeout(timeout); + time_left = schedule_timeout(timeout); try_to_freeze(); spin_lock_bh(&pool->sp_lock); remove_wait_queue(&rqstp->rq_wait, &wait); + if (!time_left) + pool->sp_stats.threads_timedout++; xprt = rqstp->rq_xprt; if (!xprt) { @@ -1103,3 +1111,93 @@ int svc_xprt_names(struct svc_serv *serv return totlen; } EXPORT_SYMBOL_GPL(svc_xprt_names); + + +/*----------------------------------------------------------------------------*/ + +static void *svc_pool_stats_start(struct seq_file *m, loff_t *pos) +{ + unsigned int pidx = (unsigned int)*pos; + struct svc_serv *serv = m->private; + + dprintk("svc_pool_stats_start, *pidx=%u\n", pidx); + + lock_kernel(); + /* bump up the pseudo refcount while traversing */ + svc_get(serv); + unlock_kernel(); + + if (!pidx) + return SEQ_START_TOKEN; + return (pidx > serv->sv_nrpools ? 
NULL : &serv->sv_pools[pidx-1]); +} + +static void *svc_pool_stats_next(struct seq_file *m, void *p, loff_t *pos) +{ + struct svc_pool *pool = p; + struct svc_serv *serv = m->private; + + dprintk("svc_pool_stats_next, *pos=%llu\n", *pos); + + if (p == SEQ_START_TOKEN) { + pool = &serv->sv_pools[0]; + } else { + unsigned int pidx = (pool - &serv->sv_pools[0]); + if (pidx < serv->sv_nrpools-1) + pool = &serv->sv_pools[pidx+1]; + else + pool = NULL; + } + ++*pos; + return pool; +} + +static void svc_pool_stats_stop(struct seq_file *m, void *p) +{ + struct svc_serv *serv = m->private; + + lock_kernel(); + /* this function really, really should have been called svc_put() */ + svc_destroy(serv); + unlock_kernel(); +} + +static int svc_pool_stats_show(struct seq_file *m, void *p) +{ + struct svc_pool *pool = p; + + if (p == SEQ_START_TOKEN) { + seq_puts(m, "# pool packets-arrived sockets-enqueued threads-woken overloads-avoided threads-timedout\n"); + return 0; + } + + seq_printf(m, "%u %lu %lu %lu %lu %lu\n", + pool->sp_id, + pool->sp_stats.packets, + pool->sp_stats.sockets_queued, + pool->sp_stats.threads_woken, + pool->sp_stats.overloads_avoided, + pool->sp_stats.threads_timedout); + + return 0; +} + +static const struct seq_operations svc_pool_stats_seq_ops = { + .start = svc_pool_stats_start, + .next = svc_pool_stats_next, + .stop = svc_pool_stats_stop, + .show = svc_pool_stats_show, +}; + +int svc_pool_stats_open(struct svc_serv *serv, struct file *file) +{ + int err; + + err = seq_open(file, &svc_pool_stats_seq_ops); + if (!err) + ((struct seq_file *) file->private_data)->private = serv; + return err; +} +EXPORT_SYMBOL(svc_pool_stats_open); + +/*----------------------------------------------------------------------------*/ -- -- Greg Banks, P.Engineer, SGI Australian Software Group. the brightly coloured sporks of revolution. I don't speak for SGI. ^ permalink raw reply [flat|nested] 25+ messages in thread
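For anyone consuming the new file from userspace: svc_pool_stats_show() above emits a "#" header line followed by one line of six integers per pool. A hedged parser sketch follows; the sample values are invented for illustration, only the field layout comes from the patch.

```python
# Parse the pool_stats format emitted by svc_pool_stats_show():
# a "#" header line, then per pool one line of
# "<pool-id> <packets> <sockets-enqueued> <threads-woken>
#  <overloads-avoided> <threads-timedout>".
FIELDS = ("packets-arrived", "sockets-enqueued", "threads-woken",
          "overloads-avoided", "threads-timedout")

def parse_pool_stats(text):
    pools = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip the header comment and blank lines
        pool_id, *counters = line.split()
        pools[int(pool_id)] = dict(zip(FIELDS, map(int, counters)))
    return pools

# Invented sample matching the header printed by the patch:
sample = """\
# pool packets-arrived sockets-enqueued threads-woken overloads-avoided threads-timedout
0 60000 120 59000 880 3
"""
stats = parse_pool_stats(sample)
print(stats[0]["overloads-avoided"])  # 880
```

In real use the text would come from reading /proc/fs/nfsd/pool_stats.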
* Re: [patch 3/3] knfsd: add file to export stats about nfsd pools 2009-01-13 10:26 ` [patch 3/3] knfsd: add file to export stats about nfsd pools Greg Banks @ 2009-02-12 17:11 ` J. Bruce Fields 2009-02-13 1:53 ` Kevin Constantine 2009-02-19 6:42 ` Greg Banks 0 siblings, 2 replies; 25+ messages in thread From: J. Bruce Fields @ 2009-02-12 17:11 UTC (permalink / raw) To: Greg Banks; +Cc: Linux NFS ML, Harshula Jayasuriya On Tue, Jan 13, 2009 at 09:26:36PM +1100, Greg Banks wrote: > Add /proc/fs/nfsd/pool_stats to export to userspace various > statistics about the operation of rpc server thread pools. Could you explain why these specific statistics (total packets, sockets_queued, threads_woken, overloads_avoided, threads_timedout) are the important ones to capture? Could you give examples of what sort of problems could be solved using them? As you said, an important question for the sysadmin is "should I configure more nfsds?" How do they answer that? --b. > This patch is based on a forward-ported version of > knfsd-add-pool-thread-stats which has been shipping in the SGI > "Enhanced NFS" product since 2006 and which was previously > posted: > > http://article.gmane.org/gmane.linux.nfs/10375 > > It has also been updated thus: > > * moved EXPORT_SYMBOL() to near the function it exports > * made the new struct struct seq_operations const > * used SEQ_START_TOKEN instead of ((void *)1) > * merged fix from SGI PV 990526 "sunrpc: use dprintk instead of > printk in svc_pool_stats_*()" by Harshula Jayasuriya. > * merged fix from SGI PV 964001 "Crash reading pool_stats before > nfsds are started". 
> > Signed-off-by: Greg Banks <gnb@sgi.com> > Signed-off-by: Harshula Jayasuriya <harshula@sgi.com> > --- > > fs/nfsd/nfsctl.c | 12 ++++ > fs/nfsd/nfssvc.c | 7 ++ > include/linux/sunrpc/svc.h | 11 +++ > net/sunrpc/svc_xprt.c | 100 +++++++++++++++++++++++++++++++++- > 4 files changed, 129 insertions(+), 1 deletion(-) > > Index: bfields/fs/nfsd/nfsctl.c > =================================================================== > --- bfields.orig/fs/nfsd/nfsctl.c > +++ bfields/fs/nfsd/nfsctl.c > @@ -60,6 +60,7 @@ enum { > NFSD_FO_UnlockFS, > NFSD_Threads, > NFSD_Pool_Threads, > + NFSD_Pool_Stats, > NFSD_Versions, > NFSD_Ports, > NFSD_MaxBlkSize, > @@ -172,6 +173,16 @@ static const struct file_operations expo > .owner = THIS_MODULE, > }; > > +extern int nfsd_pool_stats_open(struct inode *inode, struct file *file); > + > +static struct file_operations pool_stats_operations = { > + .open = nfsd_pool_stats_open, > + .read = seq_read, > + .llseek = seq_lseek, > + .release = seq_release, > + .owner = THIS_MODULE, > +}; > + > /*----------------------------------------------------------------------------*/ > /* > * payload - write methods > @@ -1246,6 +1257,7 @@ static int nfsd_fill_super(struct super_ > [NFSD_Fh] = {"filehandle", &transaction_ops, S_IWUSR|S_IRUSR}, > [NFSD_Threads] = {"threads", &transaction_ops, S_IWUSR|S_IRUSR}, > [NFSD_Pool_Threads] = {"pool_threads", &transaction_ops, S_IWUSR|S_IRUSR}, > + [NFSD_Pool_Stats] = {"pool_stats", &pool_stats_operations, S_IRUGO}, > [NFSD_Versions] = {"versions", &transaction_ops, S_IWUSR|S_IRUSR}, > [NFSD_Ports] = {"portlist", &transaction_ops, S_IWUSR|S_IRUGO}, > [NFSD_MaxBlkSize] = {"max_block_size", &transaction_ops, S_IWUSR|S_IRUGO}, > Index: bfields/fs/nfsd/nfssvc.c > =================================================================== > --- bfields.orig/fs/nfsd/nfssvc.c > +++ bfields/fs/nfsd/nfssvc.c > @@ -546,3 +546,10 @@ nfsd_dispatch(struct svc_rqst *rqstp, __ > nfsd_cache_update(rqstp, proc->pc_cachetype, statp + 1); > 
return 1; > } > + > +int nfsd_pool_stats_open(struct inode *inode, struct file *file) > +{ > + if (nfsd_serv == NULL) > + return -ENODEV; > + return svc_pool_stats_open(nfsd_serv, file); > +} > Index: bfields/include/linux/sunrpc/svc.h > =================================================================== > --- bfields.orig/include/linux/sunrpc/svc.h > +++ bfields/include/linux/sunrpc/svc.h > @@ -24,6 +24,15 @@ > */ > typedef int (*svc_thread_fn)(void *); > > +/* statistics for svc_pool structures */ > +struct svc_pool_stats { > + unsigned long packets; > + unsigned long sockets_queued; > + unsigned long threads_woken; > + unsigned long overloads_avoided; > + unsigned long threads_timedout; > +}; > + > /* > * > * RPC service thread pool. > @@ -42,6 +51,7 @@ struct svc_pool { > unsigned int sp_nrthreads; /* # of threads in pool */ > struct list_head sp_all_threads; /* all server threads */ > int sp_nwaking; /* number of threads woken but not yet active */ > + struct svc_pool_stats sp_stats; /* statistics on pool operation */ > } ____cacheline_aligned_in_smp; > > /* > @@ -396,6 +406,7 @@ struct svc_serv * svc_create_pooled(str > sa_family_t, void (*shutdown)(struct svc_serv *), > svc_thread_fn, struct module *); > int svc_set_num_threads(struct svc_serv *, struct svc_pool *, int); > +int svc_pool_stats_open(struct svc_serv *serv, struct file *file); > void svc_destroy(struct svc_serv *); > int svc_process(struct svc_rqst *); > int svc_register(const struct svc_serv *, const unsigned short, > Index: bfields/net/sunrpc/svc_xprt.c > =================================================================== > --- bfields.orig/net/sunrpc/svc_xprt.c > +++ bfields/net/sunrpc/svc_xprt.c > @@ -318,6 +318,8 @@ void svc_xprt_enqueue(struct svc_xprt *x > goto out_unlock; > } > > + pool->sp_stats.packets++; > + > /* Mark transport as busy. It will remain in this state until > * the provider calls svc_xprt_received. 
We update XPT_BUSY > * atomically because it also guards against trying to enqueue > @@ -355,6 +357,7 @@ void svc_xprt_enqueue(struct svc_xprt *x > if (pool->sp_nwaking >= SVC_MAX_WAKING) { > /* too many threads are runnable and trying to wake up */ > thread_avail = 0; > + pool->sp_stats.overloads_avoided++; > } > > if (thread_avail) { > @@ -374,11 +377,13 @@ void svc_xprt_enqueue(struct svc_xprt *x > atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved); > rqstp->rq_waking = 1; > pool->sp_nwaking++; > + pool->sp_stats.threads_woken++; > BUG_ON(xprt->xpt_pool != pool); > wake_up(&rqstp->rq_wait); > } else { > dprintk("svc: transport %p put into queue\n", xprt); > list_add_tail(&xprt->xpt_ready, &pool->sp_sockets); > + pool->sp_stats.sockets_queued++; > BUG_ON(xprt->xpt_pool != pool); > } > > @@ -591,6 +596,7 @@ int svc_recv(struct svc_rqst *rqstp, lon > int pages; > struct xdr_buf *arg; > DECLARE_WAITQUEUE(wait, current); > + long time_left; > > dprintk("svc: server %p waiting for data (to = %ld)\n", > rqstp, timeout); > @@ -676,12 +682,14 @@ int svc_recv(struct svc_rqst *rqstp, lon > add_wait_queue(&rqstp->rq_wait, &wait); > spin_unlock_bh(&pool->sp_lock); > > - schedule_timeout(timeout); > + time_left = schedule_timeout(timeout); > > try_to_freeze(); > > spin_lock_bh(&pool->sp_lock); > remove_wait_queue(&rqstp->rq_wait, &wait); > + if (!time_left) > + pool->sp_stats.threads_timedout++; > > xprt = rqstp->rq_xprt; > if (!xprt) { > @@ -1103,3 +1111,93 @@ int svc_xprt_names(struct svc_serv *serv > return totlen; > } > EXPORT_SYMBOL_GPL(svc_xprt_names); > + > + > +/*----------------------------------------------------------------------------*/ > + > +static void *svc_pool_stats_start(struct seq_file *m, loff_t *pos) > +{ > + unsigned int pidx = (unsigned int)*pos; > + struct svc_serv *serv = m->private; > + > + dprintk("svc_pool_stats_start, *pidx=%u\n", pidx); > + > + lock_kernel(); > + /* bump up the pseudo refcount while traversing */ > + svc_get(serv); > + 
unlock_kernel(); > + > + if (!pidx) > + return SEQ_START_TOKEN; > + return (pidx > serv->sv_nrpools ? NULL : &serv->sv_pools[pidx-1]); > +} > + > +static void *svc_pool_stats_next(struct seq_file *m, void *p, loff_t *pos) > +{ > + struct svc_pool *pool = p; > + struct svc_serv *serv = m->private; > + > + dprintk("svc_pool_stats_next, *pos=%llu\n", *pos); > + > + if (p == SEQ_START_TOKEN) { > + pool = &serv->sv_pools[0]; > + } else { > + unsigned int pidx = (pool - &serv->sv_pools[0]); > + if (pidx < serv->sv_nrpools-1) > + pool = &serv->sv_pools[pidx+1]; > + else > + pool = NULL; > + } > + ++*pos; > + return pool; > +} > + > +static void svc_pool_stats_stop(struct seq_file *m, void *p) > +{ > + struct svc_serv *serv = m->private; > + > + lock_kernel(); > + /* this function really, really should have been called svc_put() */ > + svc_destroy(serv); > + unlock_kernel(); > +} > + > +static int svc_pool_stats_show(struct seq_file *m, void *p) > +{ > + struct svc_pool *pool = p; > + > + if (p == SEQ_START_TOKEN) { > + seq_puts(m, "# pool packets-arrived sockets-enqueued threads-woken overloads-avoided threads-timedout\n"); > + return 0; > + } > + > + seq_printf(m, "%u %lu %lu %lu %lu %lu\n", > + pool->sp_id, > + pool->sp_stats.packets, > + pool->sp_stats.sockets_queued, > + pool->sp_stats.threads_woken, > + pool->sp_stats.overloads_avoided, > + pool->sp_stats.threads_timedout); > + > + return 0; > +} > + > +static const struct seq_operations svc_pool_stats_seq_ops = { > + .start = svc_pool_stats_start, > + .next = svc_pool_stats_next, > + .stop = svc_pool_stats_stop, > + .show = svc_pool_stats_show, > +}; > + > +int svc_pool_stats_open(struct svc_serv *serv, struct file *file) > +{ > + int err; > + > + err = seq_open(file, &svc_pool_stats_seq_ops); > + if (!err) > + ((struct seq_file *) file->private_data)->private = serv; > + return err; > +} > +EXPORT_SYMBOL(svc_pool_stats_open); > + > +/*----------------------------------------------------------------------------*/ > 
> -- > -- > Greg Banks, P.Engineer, SGI Australian Software Group. > the brightly coloured sporks of revolution. > I don't speak for SGI. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [patch 3/3] knfsd: add file to export stats about nfsd pools 2009-02-12 17:11 ` J. Bruce Fields @ 2009-02-13 1:53 ` Kevin Constantine 2009-02-19 7:04 ` Greg Banks 2009-02-19 6:42 ` Greg Banks 1 sibling, 1 reply; 25+ messages in thread From: Kevin Constantine @ 2009-02-13 1:53 UTC (permalink / raw) To: J. Bruce Fields; +Cc: Greg Banks, Linux NFS ML, Harshula Jayasuriya On 02/12/09 09:11, J. Bruce Fields wrote: > On Tue, Jan 13, 2009 at 09:26:36PM +1100, Greg Banks wrote: >> Add /proc/fs/nfsd/pool_stats to export to userspace various >> statistics about the operation of rpc server thread pools. > > Could you explainw hy these specific statistics (total packets, > sockets_queued, threads_woken, overloads_avoided, threads_timedout) are > the important ones to capture? Could you give examples of what sort of > problems could be solved using them? > > As you said, an important question for the sysadmin is "should I > configure more nfsds?" How do they answer that? > I typically use the "th" line to determine whether to add more threads or not by looking at the distribution of values across the histogram. If things are weighted more towards the 90-100% group, I'll add more threads and watch the traffic patterns. Usually, the question of how many to add is answered by trial and error. echo 32 > /proc/fs/nfsd/threads Did that improve my throughput? yes? echo 128 > /proc/fs/nfsd/threads Did that improve my throughput? no it actually decreased. rinse... repeat... > --b. 
> >> This patch is based on a forward-ported version of >> knfsd-add-pool-thread-stats which has been shipping in the SGI >> "Enhanced NFS" product since 2006 and which was previously >> posted: > >> http://article.gmane.org/gmane.linux.nfs/10375 >> >> It has also been updated thus: >> >> * moved EXPORT_SYMBOL() to near the function it exports >> * made the new struct struct seq_operations const >> * used SEQ_START_TOKEN instead of ((void *)1) >> * merged fix from SGI PV 990526 "sunrpc: use dprintk instead of >> printk in svc_pool_stats_*()" by Harshula Jayasuriya. >> * merged fix from SGI PV 964001 "Crash reading pool_stats before >> nfsds are started". >> >> Signed-off-by: Greg Banks <gnb@sgi.com> >> Signed-off-by: Harshula Jayasuriya <harshula@sgi.com> >> --- >> >> fs/nfsd/nfsctl.c | 12 ++++ >> fs/nfsd/nfssvc.c | 7 ++ >> include/linux/sunrpc/svc.h | 11 +++ >> net/sunrpc/svc_xprt.c | 100 +++++++++++++++++++++++++++++++++- >> 4 files changed, 129 insertions(+), 1 deletion(-) >> >> Index: bfields/fs/nfsd/nfsctl.c >> =================================================================== >> --- bfields.orig/fs/nfsd/nfsctl.c >> +++ bfields/fs/nfsd/nfsctl.c >> @@ -60,6 +60,7 @@ enum { >> NFSD_FO_UnlockFS, >> NFSD_Threads, >> NFSD_Pool_Threads, >> + NFSD_Pool_Stats, >> NFSD_Versions, >> NFSD_Ports, >> NFSD_MaxBlkSize, >> @@ -172,6 +173,16 @@ static const struct file_operations expo >> .owner = THIS_MODULE, >> }; >> >> +extern int nfsd_pool_stats_open(struct inode *inode, struct file *file); >> + >> +static struct file_operations pool_stats_operations = { >> + .open = nfsd_pool_stats_open, >> + .read = seq_read, >> + .llseek = seq_lseek, >> + .release = seq_release, >> + .owner = THIS_MODULE, >> +}; >> + >> /*----------------------------------------------------------------------------*/ >> /* >> * payload - write methods >> @@ -1246,6 +1257,7 @@ static int nfsd_fill_super(struct super_ >> [NFSD_Fh] = {"filehandle", &transaction_ops, S_IWUSR|S_IRUSR}, >> [NFSD_Threads] = 
{"threads", &transaction_ops, S_IWUSR|S_IRUSR}, >> [NFSD_Pool_Threads] = {"pool_threads", &transaction_ops, S_IWUSR|S_IRUSR}, >> + [NFSD_Pool_Stats] = {"pool_stats", &pool_stats_operations, S_IRUGO}, >> [NFSD_Versions] = {"versions", &transaction_ops, S_IWUSR|S_IRUSR}, >> [NFSD_Ports] = {"portlist", &transaction_ops, S_IWUSR|S_IRUGO}, >> [NFSD_MaxBlkSize] = {"max_block_size", &transaction_ops, S_IWUSR|S_IRUGO}, >> Index: bfields/fs/nfsd/nfssvc.c >> =================================================================== >> --- bfields.orig/fs/nfsd/nfssvc.c >> +++ bfields/fs/nfsd/nfssvc.c >> @@ -546,3 +546,10 @@ nfsd_dispatch(struct svc_rqst *rqstp, __ >> nfsd_cache_update(rqstp, proc->pc_cachetype, statp + 1); >> return 1; >> } >> + >> +int nfsd_pool_stats_open(struct inode *inode, struct file *file) >> +{ >> + if (nfsd_serv == NULL) >> + return -ENODEV; >> + return svc_pool_stats_open(nfsd_serv, file); >> +} >> Index: bfields/include/linux/sunrpc/svc.h >> =================================================================== >> --- bfields.orig/include/linux/sunrpc/svc.h >> +++ bfields/include/linux/sunrpc/svc.h >> @@ -24,6 +24,15 @@ >> */ >> typedef int (*svc_thread_fn)(void *); >> >> +/* statistics for svc_pool structures */ >> +struct svc_pool_stats { >> + unsigned long packets; >> + unsigned long sockets_queued; >> + unsigned long threads_woken; >> + unsigned long overloads_avoided; >> + unsigned long threads_timedout; >> +}; >> + >> /* >> * >> * RPC service thread pool. 
>> @@ -42,6 +51,7 @@ struct svc_pool { >> unsigned int sp_nrthreads; /* # of threads in pool */ >> struct list_head sp_all_threads; /* all server threads */ >> int sp_nwaking; /* number of threads woken but not yet active */ >> + struct svc_pool_stats sp_stats; /* statistics on pool operation */ >> } ____cacheline_aligned_in_smp; >> >> /* >> @@ -396,6 +406,7 @@ struct svc_serv * svc_create_pooled(str >> sa_family_t, void (*shutdown)(struct svc_serv *), >> svc_thread_fn, struct module *); >> int svc_set_num_threads(struct svc_serv *, struct svc_pool *, int); >> +int svc_pool_stats_open(struct svc_serv *serv, struct file *file); >> void svc_destroy(struct svc_serv *); >> int svc_process(struct svc_rqst *); >> int svc_register(const struct svc_serv *, const unsigned short, >> Index: bfields/net/sunrpc/svc_xprt.c >> =================================================================== >> --- bfields.orig/net/sunrpc/svc_xprt.c >> +++ bfields/net/sunrpc/svc_xprt.c >> @@ -318,6 +318,8 @@ void svc_xprt_enqueue(struct svc_xprt *x >> goto out_unlock; >> } >> >> + pool->sp_stats.packets++; >> + >> /* Mark transport as busy. It will remain in this state until >> * the provider calls svc_xprt_received. 
We update XPT_BUSY >> * atomically because it also guards against trying to enqueue >> @@ -355,6 +357,7 @@ void svc_xprt_enqueue(struct svc_xprt *x >> if (pool->sp_nwaking >= SVC_MAX_WAKING) { >> /* too many threads are runnable and trying to wake up */ >> thread_avail = 0; >> + pool->sp_stats.overloads_avoided++; >> } >> >> if (thread_avail) { >> @@ -374,11 +377,13 @@ void svc_xprt_enqueue(struct svc_xprt *x >> atomic_add(rqstp->rq_reserved, &xprt->xpt_reserved); >> rqstp->rq_waking = 1; >> pool->sp_nwaking++; >> + pool->sp_stats.threads_woken++; >> BUG_ON(xprt->xpt_pool != pool); >> wake_up(&rqstp->rq_wait); >> } else { >> dprintk("svc: transport %p put into queue\n", xprt); >> list_add_tail(&xprt->xpt_ready, &pool->sp_sockets); >> + pool->sp_stats.sockets_queued++; >> BUG_ON(xprt->xpt_pool != pool); >> } >> >> @@ -591,6 +596,7 @@ int svc_recv(struct svc_rqst *rqstp, lon >> int pages; >> struct xdr_buf *arg; >> DECLARE_WAITQUEUE(wait, current); >> + long time_left; >> >> dprintk("svc: server %p waiting for data (to = %ld)\n", >> rqstp, timeout); >> @@ -676,12 +682,14 @@ int svc_recv(struct svc_rqst *rqstp, lon >> add_wait_queue(&rqstp->rq_wait, &wait); >> spin_unlock_bh(&pool->sp_lock); >> >> - schedule_timeout(timeout); >> + time_left = schedule_timeout(timeout); >> >> try_to_freeze(); >> >> spin_lock_bh(&pool->sp_lock); >> remove_wait_queue(&rqstp->rq_wait, &wait); >> + if (!time_left) >> + pool->sp_stats.threads_timedout++; >> >> xprt = rqstp->rq_xprt; >> if (!xprt) { >> @@ -1103,3 +1111,93 @@ int svc_xprt_names(struct svc_serv *serv >> return totlen; >> } >> EXPORT_SYMBOL_GPL(svc_xprt_names); >> + >> + >> +/*----------------------------------------------------------------------------*/ >> + >> +static void *svc_pool_stats_start(struct seq_file *m, loff_t *pos) >> +{ >> + unsigned int pidx = (unsigned int)*pos; >> + struct svc_serv *serv = m->private; >> + >> + dprintk("svc_pool_stats_start, *pidx=%u\n", pidx); >> + >> + lock_kernel(); >> + /* bump up the 
pseudo refcount while traversing */ >> + svc_get(serv); >> + unlock_kernel(); >> + >> + if (!pidx) >> + return SEQ_START_TOKEN; >> + return (pidx > serv->sv_nrpools ? NULL : &serv->sv_pools[pidx-1]); >> +} >> + >> +static void *svc_pool_stats_next(struct seq_file *m, void *p, loff_t *pos) >> +{ >> + struct svc_pool *pool = p; >> + struct svc_serv *serv = m->private; >> + >> + dprintk("svc_pool_stats_next, *pos=%llu\n", *pos); >> + >> + if (p == SEQ_START_TOKEN) { >> + pool = &serv->sv_pools[0]; >> + } else { >> + unsigned int pidx = (pool - &serv->sv_pools[0]); >> + if (pidx < serv->sv_nrpools-1) >> + pool = &serv->sv_pools[pidx+1]; >> + else >> + pool = NULL; >> + } >> + ++*pos; >> + return pool; >> +} >> + >> +static void svc_pool_stats_stop(struct seq_file *m, void *p) >> +{ >> + struct svc_serv *serv = m->private; >> + >> + lock_kernel(); >> + /* this function really, really should have been called svc_put() */ >> + svc_destroy(serv); >> + unlock_kernel(); >> +} >> + >> +static int svc_pool_stats_show(struct seq_file *m, void *p) >> +{ >> + struct svc_pool *pool = p; >> + >> + if (p == SEQ_START_TOKEN) { >> + seq_puts(m, "# pool packets-arrived sockets-enqueued threads-woken overloads-avoided threads-timedout\n"); >> + return 0; >> + } >> + >> + seq_printf(m, "%u %lu %lu %lu %lu %lu\n", >> + pool->sp_id, >> + pool->sp_stats.packets, >> + pool->sp_stats.sockets_queued, >> + pool->sp_stats.threads_woken, >> + pool->sp_stats.overloads_avoided, >> + pool->sp_stats.threads_timedout); >> + >> + return 0; >> +} >> + >> +static const struct seq_operations svc_pool_stats_seq_ops = { >> + .start = svc_pool_stats_start, >> + .next = svc_pool_stats_next, >> + .stop = svc_pool_stats_stop, >> + .show = svc_pool_stats_show, >> +}; >> + >> +int svc_pool_stats_open(struct svc_serv *serv, struct file *file) >> +{ >> + int err; >> + >> + err = seq_open(file, &svc_pool_stats_seq_ops); >> + if (!err) >> + ((struct seq_file *) file->private_data)->private = serv; >> + return err; >> 
+} >> +EXPORT_SYMBOL(svc_pool_stats_open); >> + >> +/*----------------------------------------------------------------------------*/ >> >> -- >> -- >> Greg Banks, P.Engineer, SGI Australian Software Group. >> the brightly coloured sporks of revolution. >> I don't speak for SGI. > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 25+ messages in thread
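Kevin's echo-and-measure loop lends itself to simple automation. The sketch below is illustrative only: the `autotune()` helper, its doubling strategy, and the 5% improvement threshold are invented for this example; the only detail taken from the thread is the `/proc/fs/nfsd/threads` control file.

```python
def set_nfsd_threads(n):
    # The nfsd thread-count control file from the discussion above;
    # writing it requires root on an NFS server.
    with open("/proc/fs/nfsd/threads", "w") as f:
        f.write("%d\n" % n)

def autotune(measure_throughput, set_threads, start=16, factor=2):
    """Grow the thread count while measured throughput keeps improving,
    then fall back to the best count seen.  Both callables are supplied
    by the caller: a benchmark that reproduces the production load, and
    a setter (set_nfsd_threads on a real server)."""
    best_n = n = start
    best_tp = None
    while True:
        set_threads(n)
        tp = measure_throughput()
        if best_tp is None or tp > best_tp * 1.05:  # insist on a 5% win to keep growing
            best_n, best_tp = n, tp
            n *= factor
        else:
            set_threads(best_n)  # throughput regressed: settle on the best count seen
            return best_n, best_tp
```

On a live server this would be invoked as `autotune(run_benchmark, set_nfsd_threads)`; the hypothetical `run_benchmark` is whatever timed workload the admin trusts as a throughput measure.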
* Re: [patch 3/3] knfsd: add file to export stats about nfsd pools 2009-02-13 1:53 ` Kevin Constantine @ 2009-02-19 7:04 ` Greg Banks 0 siblings, 0 replies; 25+ messages in thread From: Greg Banks @ 2009-02-19 7:04 UTC (permalink / raw) To: Kevin Constantine; +Cc: J. Bruce Fields, Linux NFS ML, Harshula Jayasuriya Kevin Constantine wrote: > On 02/12/09 09:11, J. Bruce Fields wrote: >> >> >> As you said, an important question for the sysadmin is "should I >> configure more nfsds?" How do they answer that? >> > > I typically use the "th" line to determine whether to add more threads > or not by looking at the distribution of values across the histogram. > If things are weighted more towards the 90-100% group, I'll add more > threads and watch the traffic patterns. That seems to have been more or less the idea behind the histogram. However, to get any useful indication you first need to rate-convert the histogram. The values are counters of jiffies which wrap every million seconds (11.6 days) and there's no way to zero them. Also, at a call rate above 1/jiffy (a trivially low NFS load) most of the updates are adding 0 to the histogram. Every now and again a call will cross a jiffy boundary and actually add 1 to the histogram. If your load is high but not steady, the number of threads busy when that happens is unpredictable. In other words there's a sampling effect which could in the worst case make the values in the histogram more or less completely random. > > Usually, the question of how many to add is answered by trial and error. > > echo 32 > /proc/fs/nfsd/threads > Did that improve my throughput? yes? > echo 128 > /proc/fs/nfsd/threads > Did that improve my throughput? no it actually decreased. > rinse... repeat... This procedure is what I would recommend admins do today, assuming their load is steady or reproducible. It's not only simple, but it uses a real-world performance measure, which is "does my application go faster?". 
There's no point adding nfsds if you're limited by the application, or some filesystem effect that causes the same load to go slower. It is of course rather tedious :-) -- Greg Banks, P.Engineer, SGI Australian Software Group. the brightly coloured sporks of revolution. I don't speak for SGI. ^ permalink raw reply [flat|nested] 25+ messages in thread
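The rate conversion Greg describes can be sketched as two samples of the 'th' line with a per-interval delta. The field layout assumed here — "th", thread count, an all-threads-busy counter, then ten cumulative busy-time buckets — is this example's reading of /proc/net/rpc/nfsd, not something spelled out in the thread, and as Greg notes the buckets are jiffy-derived counters underneath:

```python
def parse_th(line):
    """Parse an nfsd 'th' line, assumed to look like:
    'th <nthreads> <all-busy count> <ten cumulative busy-time buckets>'."""
    fields = line.split()
    assert fields[0] == "th"
    nthreads = int(fields[1])
    buckets = [float(x) for x in fields[3:13]]  # one bucket per 10%-busy band
    return nthreads, buckets

def th_rates(old, new, interval):
    """Rate-convert two bucket samples taken `interval` seconds apart:
    time-per-second spent in each 10%-busy band."""
    return [(b - a) / interval for a, b in zip(old, new)]
```

Even rate-converted, the sampling effect Greg describes still applies: under bursty load the band that gets credited at each jiffy boundary is essentially arbitrary.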
* Re: [patch 3/3] knfsd: add file to export stats about nfsd pools 2009-02-12 17:11 ` J. Bruce Fields 2009-02-13 1:53 ` Kevin Constantine @ 2009-02-19 6:42 ` Greg Banks 2009-03-15 21:25 ` J. Bruce Fields 1 sibling, 1 reply; 25+ messages in thread From: Greg Banks @ 2009-02-19 6:42 UTC (permalink / raw) To: J. Bruce Fields; +Cc: Linux NFS ML, Harshula Jayasuriya J. Bruce Fields wrote: > On Tue, Jan 13, 2009 at 09:26:36PM +1100, Greg Banks wrote: > >> Add /proc/fs/nfsd/pool_stats to export to userspace various >> statistics about the operation of rpc server thread pools. >> > > Could you explainw hy these specific statistics (total packets, > sockets_queued, threads_woken, overloads_avoided, threads_timedout) are > the important ones to capture? Could you give examples of what sort of > problems could be solved using them? > Actually I originally added these stats to help debug the overload-avoiding patch. Then I thought to use them to drive a userspace control loop for controlling the number of nfsds, which I never finished writing. > As you said, an important question for the sysadmin is "should I > configure more nfsds?" How do they answer that? > You can work that out, but it's not obvious, i.e. not human-friendly. Firstly, you need to rate convert all the stats. The total_packets stat tells you how many NFS packets are arriving on each thread pool. This is your primary load metric, i.e. with more load you want more nfsd threads. The sockets_queued stat tells you that calls are arriving which are not being immediately serviced by threads, i.e. you're either thread-limited or CPU-limited rather than network-limited and you might get better throughput if there were more nfsd threads. Conversely the overloads_avoided stat tells you if there are more threads than can usefully be made runnable on the available CPUs, so that adding more nfsd threads is unlikely to be helpful. 
The threads_timedout stat will give you a first-level approximation of whether there are threads that are completely idle, i.e. don't see any calls for the svc_recv() timeout (which I reduced to IIRC 10 sec as part of the original version of this patch). This is a clue that you can now reduce the number of threads. -- Greg Banks, P.Engineer, SGI Australian Software Group. the brightly coloured sporks of revolution. I don't speak for SGI. ^ permalink raw reply [flat|nested] 25+ messages in thread
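Greg's interpretation rules map naturally onto a small monitoring sketch. The column layout below comes from svc_pool_stats_show() in the patch itself; the `advise()` heuristic and its zero-rate thresholds are this example's own invention, not anything from the patch:

```python
def parse_pool_stats(text):
    """Parse /proc/fs/nfsd/pool_stats as printed by svc_pool_stats_show():
    '<pool> <packets> <sockets-enqueued> <threads-woken>
     <overloads-avoided> <threads-timedout>' per line, '#' header skipped."""
    stats = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        pool, *vals = [int(x) for x in line.split()]
        stats[pool] = dict(zip(
            ("packets", "sockets_queued", "threads_woken",
             "overloads_avoided", "threads_timedout"), vals))
    return stats

def advise(old, new, interval):
    """Apply the rules from the discussion to rate-converted counters
    for one pool, sampled `interval` seconds apart."""
    rate = {k: (new[k] - old[k]) / interval for k in new}
    if rate["sockets_queued"] > 0 and rate["overloads_avoided"] == 0:
        return "add threads"      # calls are queueing; CPUs not saturated
    if rate["overloads_avoided"] > 0:
        return "hold"             # more runnable threads than CPUs can use
    if rate["threads_timedout"] > 0:
        return "remove threads"   # idle threads hitting the svc_recv() timeout
    return "ok"
```

The counters are cumulative, so as with the 'th' line, two samples and a delta are needed before any of the rules apply.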
* Re: [patch 3/3] knfsd: add file to export stats about nfsd pools 2009-02-19 6:42 ` Greg Banks @ 2009-03-15 21:25 ` J. Bruce Fields 2009-03-16 3:21 ` Greg Banks 0 siblings, 1 reply; 25+ messages in thread From: J. Bruce Fields @ 2009-03-15 21:25 UTC (permalink / raw) To: Greg Banks; +Cc: Linux NFS ML, Harshula Jayasuriya On Thu, Feb 19, 2009 at 05:42:49PM +1100, Greg Banks wrote: > J. Bruce Fields wrote: > > On Tue, Jan 13, 2009 at 09:26:36PM +1100, Greg Banks wrote: > > > >> Add /proc/fs/nfsd/pool_stats to export to userspace various > >> statistics about the operation of rpc server thread pools. > >> > > > > Could you explainw hy these specific statistics (total packets, > > sockets_queued, threads_woken, overloads_avoided, threads_timedout) are > > the important ones to capture? Could you give examples of what sort of > > problems could be solved using them? > > > Actually I originally added these stats to help debug the > overload-avoiding patch. Then I thought to use them to drive a > userspace control loop for controlling the number of nfsds, which I > never finished writing. > > As you said, an important question for the sysadmin is "should I > > configure more nfsds?" How do they answer that? > > > You can work that out, but it's not obvious, i.e. not human-friendly. > Firstly, you need to rate convert all the stats. > > The total_packets stat tells you how many NFS packets are arriving on > each thread pool. This is your primary load metric, i.e. with more load > you want more nfsd threads. > > The sockets_queued stat tells you that calls are arriving which are not > being immediately serviced by threads, i.e. you're either thread-limited > or CPU-limited rather than network-limited and you might get better > throughput if there were more nfsd threads. > > Conversely the overloads_avoided stat tells you if there are more > threads than can usefully be made runnable on the available CPUs, so > that adding more nfsd threads is unlikely to be helpful. 
> > The threads_timedout stat will give you a first-level approximation of > whether there are threads that are completely idle, i.e. don't see any > calls for the svc_recv() timeout (which I reduced to IIRC 10 sec as part > of the original version of this patch). This is a clue that you can now > reduce the number of threads. OK, thanks, that makes sense, but: it would be really fantastic if we could update and/or replace some of the howto's and faq's that are out there. The above would be useful, as would be some of the material from your nfs-tuning presentation. Queued up this and the old-stats removal for 2.6.30. --b. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [patch 3/3] knfsd: add file to export stats about nfsd pools 2009-03-15 21:25 ` J. Bruce Fields @ 2009-03-16 3:21 ` Greg Banks 2009-03-16 13:37 ` J. Bruce Fields 0 siblings, 1 reply; 25+ messages in thread From: Greg Banks @ 2009-03-16 3:21 UTC (permalink / raw) To: J. Bruce Fields; +Cc: Linux NFS ML J. Bruce Fields wrote: > On Thu, Feb 19, 2009 at 05:42:49PM +1100, Greg Banks wrote: > >> J. Bruce Fields wrote: >> >>> On Tue, Jan 13, 2009 at 09:26:36PM +1100, Greg Banks wrote: >>> >>> >>> As you said, an important question for the sysadmin is "should I >>> configure more nfsds?" How do they answer that? >>> >>> >> You can work that out, [...] >> > > OK, thanks, that makes sense, but: it would be really fantastic if we > could update and/or replace some of the howto's and faq's that are out > there. Fair enough. After I composed the above reply, I wondered whether the right thing to do would have been to add that text to a new file somewhere in Documentation/. Would that suit as an interim measure while I look at howtos and faqs? I mention this because I'll be adding more new stats in the next tranche of patches. > > Queued up this and the old-stats removal for 2.6.30. > Thanks. -- Greg Banks, P.Engineer, SGI Australian Software Group. the brightly coloured sporks of revolution. I don't speak for SGI. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [patch 3/3] knfsd: add file to export stats about nfsd pools 2009-03-16 3:21 ` Greg Banks @ 2009-03-16 13:37 ` J. Bruce Fields 0 siblings, 0 replies; 25+ messages in thread From: J. Bruce Fields @ 2009-03-16 13:37 UTC (permalink / raw) To: Greg Banks; +Cc: Linux NFS ML On Mon, Mar 16, 2009 at 02:21:02PM +1100, Greg Banks wrote: > J. Bruce Fields wrote: > > On Thu, Feb 19, 2009 at 05:42:49PM +1100, Greg Banks wrote: > > > >> J. Bruce Fields wrote: > >> > >>> On Tue, Jan 13, 2009 at 09:26:36PM +1100, Greg Banks wrote: > >>> > >>> > >>> As you said, an important question for the sysadmin is "should I > >>> configure more nfsds?" How do they answer that? > >>> > >>> > >> You can work that out, [...] > >> > > > > OK, thanks, that makes sense, but: it would be really fantastic if we > > could update and/or replace some of the howto's and faq's that are out > > there. > > Fair enough. > > After I composed the above reply, I wondered whether the right thing to > do would have been to add that text to a new file somewhere in > Documentation/. Would that suit as an interim measure while I look at > howtos and faqs? Yep, I think that's a great idea. > I mention this because I'll be adding more new stats > in the next tranche of patches. OK!--b. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [patch 0/3] First tranche of SGI Enhanced NFS patches 2009-01-13 10:26 [patch 0/3] First tranche of SGI Enhanced NFS patches Greg Banks ` (2 preceding siblings ...) 2009-01-13 10:26 ` [patch 3/3] knfsd: add file to export stats about nfsd pools Greg Banks @ 2009-02-09 5:24 ` Greg Banks 2009-02-09 20:47 ` J. Bruce Fields 3 siblings, 1 reply; 25+ messages in thread From: Greg Banks @ 2009-02-09 5:24 UTC (permalink / raw) To: J. Bruce Fields; +Cc: Linux NFS ML Greg Banks wrote: > G'day, > > This is the first tranche of patches from SGI's Enhanced NFS product. > > http://marc.info/?l=linux-nfs&m=123184242909159&w=2 http://marc.info/?l=linux-nfs&m=123184242709153&w=2 http://marc.info/?l=linux-nfs&m=123184242809157&w=2 Bruce, any word on these? I don't seem to have any specific review items that I need to pay attention to with these patches, and I don't see them in your for-2.6.30 branch, so can I get an ack or a nack or feedback on things that need fixing? -- Greg Banks, P.Engineer, SGI Australian Software Group. the brightly coloured sporks of revolution. I don't speak for SGI. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [patch 0/3] First tranche of SGI Enhanced NFS patches 2009-02-09 5:24 ` [patch 0/3] First tranche of SGI Enhanced NFS patches Greg Banks @ 2009-02-09 20:47 ` J. Bruce Fields 2009-02-09 23:26 ` Greg Banks 0 siblings, 1 reply; 25+ messages in thread From: J. Bruce Fields @ 2009-02-09 20:47 UTC (permalink / raw) To: Greg Banks; +Cc: Linux NFS ML On Mon, Feb 09, 2009 at 04:24:27PM +1100, Greg Banks wrote: > Bruce, any word on these? I don't seem to have any specific review > items that I need to pay attention to with these patches, and I don't > see them in your for-2.6.30 branch, so can I get an ack or a nack or > feedback on things that need fixing? Sorry, that came around the time of the citi compromise, so I just registered that it had gotten some responses, figured it'd probably be resent, and filed it away.... (And, by the way, if anyone's waiting for me to respond to email from the last month--you might want to resend. The longer version: We now believe that password-logging ssh and sshd were installed on citi machines as early as November. We got reports of ssh scanning in December and January, but just took down the misbehaving machines. In mid-January we finally realized the problem was serious, disconnected ourselves from the internet completely, took everything on our local network offline (including our main mail server and linux-nfs.org), then brought our external connection back up and slowly reconnected machines to our local network as we audited and/or rebuilt them as appropriate. To be cautious, I also did the same for my personal machines (including my personal mail server), though I didn't have specific evidence they'd been compromised. The upshot is: there were a few days when mail wasn't getting through at all, and I know at least some was never delivered. When it did get through, I wasn't necessarily able to pay it much attention. 
So besides just a sob-story, this is a request that people ping me if I haven't responded to something I should have lately.) --b. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [patch 0/3] First tranche of SGI Enhanced NFS patches
  2009-02-09 20:47   ` J. Bruce Fields
@ 2009-02-09 23:26     ` Greg Banks
  0 siblings, 0 replies; 25+ messages in thread
From: Greg Banks @ 2009-02-09 23:26 UTC (permalink / raw)
To: J. Bruce Fields; +Cc: Linux NFS ML

J. Bruce Fields wrote:
> On Mon, Feb 09, 2009 at 04:24:27PM +1100, Greg Banks wrote:
>
>> Bruce, any word on these? I don't seem to have any specific review
>> items that I need to pay attention to with these patches, and I don't
>> see them in your for-2.6.30 branch, so can I get an ack or a nack or
>> feedback on things that need fixing?
>
> Sorry, that came around the time of the citi compromise, so I just
> registered that it had gotten some responses, figured it'd probably be
> resent, and filed it away....

Aha. The conversations didn't result in any specific feedback items or
improvements that I can see, unless I'm misunderstanding what people
said. So I don't have any newer versions of the patches to send. Do you
want me to resend anyway?

> We now believe that password-logging ssh and sshd were installed on citi
> machines as early as November. [...]

Ouch. Well, that explains the linux-nfs.org outages.

--
Greg Banks, P.Engineer, SGI Australian Software Group.
the brightly coloured sporks of revolution.
I don't speak for SGI.
end of thread, other threads: [~2009-03-16 13:37 UTC | newest]
Thread overview: 25+ messages
2009-01-13 10:26 [patch 0/3] First tranche of SGI Enhanced NFS patches Greg Banks
2009-01-13 10:26 ` [patch 1/3] knfsd: remove the nfsd thread busy histogram Greg Banks
2009-01-13 16:41 ` Chuck Lever
2009-01-13 22:50 ` Greg Banks
[not found] ` <496D1ACC.7070106-cP1dWloDopni96+mSzHFpQC/G2K4zDHf@public.gmane.org>
2009-02-11 21:59 ` J. Bruce Fields
2009-01-13 10:26 ` [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages Greg Banks
2009-01-13 14:33 ` Peter Staubach
2009-01-13 22:15 ` Greg Banks
[not found] ` <496D1294.1060407-cP1dWloDopni96+mSzHFpQC/G2K4zDHf@public.gmane.org>
2009-01-13 22:35 ` Peter Staubach
2009-01-13 23:04 ` Greg Banks
2009-02-11 23:10 ` J. Bruce Fields
2009-02-19 6:25 ` Greg Banks
2009-03-15 21:21 ` J. Bruce Fields
2009-03-16 3:10 ` Greg Banks
2009-01-13 10:26 ` [patch 3/3] knfsd: add file to export stats about nfsd pools Greg Banks
2009-02-12 17:11 ` J. Bruce Fields
2009-02-13 1:53 ` Kevin Constantine
2009-02-19 7:04 ` Greg Banks
2009-02-19 6:42 ` Greg Banks
2009-03-15 21:25 ` J. Bruce Fields
2009-03-16 3:21 ` Greg Banks
2009-03-16 13:37 ` J. Bruce Fields
2009-02-09 5:24 ` [patch 0/3] First tranche of SGI Enhanced NFS patches Greg Banks
2009-02-09 20:47 ` J. Bruce Fields
2009-02-09 23:26 ` Greg Banks