From: "J. Bruce Fields" <bfields@fieldses.org>
To: Greg Banks <gnb-cP1dWloDopni96+mSzHFpQC/G2K4zDHf@public.gmane.org>
Cc: Chuck Lever <chuck.lever@oracle.com>,
Linux NFS ML <linux-nfs@vger.kernel.org>
Subject: Re: [patch 1/3] knfsd: remove the nfsd thread busy histogram
Date: Wed, 11 Feb 2009 16:59:47 -0500
Message-ID: <20090211215947.GH27686@fieldses.org>
In-Reply-To: <496D1ACC.7070106-cP1dWloDopni96+mSzHFpQC/G2K4zDHf@public.gmane.org>
On Wed, Jan 14, 2009 at 09:50:52AM +1100, Greg Banks wrote:
> Chuck Lever wrote:
> > On Jan 13, 2009, at 5:26 AM, Greg Banks wrote:
> >> Stop gathering the data that feeds the 'th' line in /proc/net/rpc/nfsd
> >> because the questionable data provided is not worth the scalability
> >> impact of calculating it. Instead, always report zeroes. The current
> >> approach suffers from three major issues:
> >>
> >> 1. update_thread_usage() increments buckets by call service
> >> time or call arrival time...in jiffies. On lightly loaded
> >> machines, call service times are usually < 1 jiffy; on
> >> heavily loaded machines call arrival times will be << 1 jiffy.
> >> So a large portion of the updates to the buckets are rounded
> >> down to zero, and the histogram is undercounting.
> >
> > Use ktime_get_real() instead. This is what the network layer uses.
> IIRC that wasn't available when I wrote the patch (2.6.5 kernel in SLES9
> in late 2005). I haven't looked at it again since.
>
> Later, I looked at gathering better statistics on thread usage, and I
> investigated real time clock (more precisely, monotonic clock)
> implementations in Linux and came to the sad conclusion that there was
> no API I could call that would be both accurate and efficient on two or
> more platforms, so I gave up. The HPET hardware timer looked promising
> for a while, but it turned out that a 32b kernel used a global spinlock
> to access the 64b HPET registers, which created the same scalability
> problem I was trying to fix. Things may have improved since then.
>
> If we had such a clock though, the solution is very simple. Each nfsd
> maintains two new 64b counters of nanoseconds spent in each of two
> states "busy" and "idle", where "idle" is asleep waiting for a call and
> "busy" is everything else. These are maintained in svc_recv().
> Interfaces are provided for userspace to read an aggregation of these
> counters, per-pool and globally. Userspace rate-converts the counters;
> the rate of increase of the two counters tells you both how many threads
> there are and how much actual demand on thread time there is. This is
> how I did it in Irix (SGI Origin machines had a global distributed
> monotonic clock in hardware).
Is it worth looking into this again now?
> > You could even steal the start timestamp from the first skbuff in an
> > incoming RPC request.
>
> This would help only where we have skbuffs: NFS/RDMA doesn't use them.
> >
> > This problem is made worse on "server" configurations and in virtual
> > guests which may still use HZ=100, though with tickless HZ this is a
> > less frequently seen configuration.
>
> Indeed.
> >
> >> 2. As seen previously on the nfs mailing list, the format in which
> >> the histogram is presented is cryptic, difficult to explain,
> >> and difficult to use.
> >
> > A user space script similar to mountstats that interprets these
> > metrics might help here.
>
> The formatting in the pseudofile isn't the entire problem. The problem
> is translating the "thread usage histogram" information there into an
> answer to the actual question the sysadmin wants, which is "should I
> configure more nfsds?"
Agreed.
> >> 3. Updating the histogram requires taking a global spinlock and
> >> dirtying the global variables nfsd_last_call, nfsd_busy, and
> >> nfsdstats *twice* on every RPC call, which is a significant
> >> scaling limitation.
> >
> > You might fix this by making the global variables into per-CPU
> > variables, then totaling the per-CPU variables only at presentation
> > time (ie when someone cats /proc/net/rpc/nfsd). That would make the
> > collection logic lockless.
> This is how I fixed some of the other server stats in later patches.
> IIRC that approach doesn't work for the thread usage histogram because
> it's scaled as it's gathered by a potentially time-varying global number
> so the on-demand totalling might not give correct results.
Right, so, amuse yourself watching me as I try to remember how the 'th'
line works: if it's attempting to report, e.g., what length of time 10%
of the threads are busy, it needs very local knowledge: if you know each
thread was busy 10 jiffies out of the last 100, you'd still need to know
whether they were all busy during the *same* 10-jiffy interval, or
whether that work was spread out more evenly over the 100 jiffies....
--b.
> Also, in the
> presence of multiple thread pools any thread usage information should be
> per-pool not global. At the time I wrote this patch I concluded that I
> couldn't make the gathering scale and still preserve the exact semantics
> of the data gathered.
>
> >
> >>
> >
> > Yeah. The real issue here is deciding whether these stats are useful
> > or not;
> In my experience, not.
>
> > if not, can they be made usable?
> A different form of the data could certainly be made useful.
>
> --
> Greg Banks, P.Engineer, SGI Australian Software Group.
> the brightly coloured sporks of revolution.
> I don't speak for SGI.
>
Thread overview: 25+ messages
2009-01-13 10:26 [patch 0/3] First tranche of SGI Enhanced NFS patches Greg Banks
2009-01-13 10:26 ` [patch 1/3] knfsd: remove the nfsd thread busy histogram Greg Banks
2009-01-13 16:41 ` Chuck Lever
2009-01-13 22:50 ` Greg Banks
[not found] ` <496D1ACC.7070106-cP1dWloDopni96+mSzHFpQC/G2K4zDHf@public.gmane.org>
2009-02-11 21:59 ` J. Bruce Fields [this message]
2009-01-13 10:26 ` [patch 2/3] knfsd: avoid overloading the CPU scheduler with enormous load averages Greg Banks
2009-01-13 14:33 ` Peter Staubach
2009-01-13 22:15 ` Greg Banks
[not found] ` <496D1294.1060407-cP1dWloDopni96+mSzHFpQC/G2K4zDHf@public.gmane.org>
2009-01-13 22:35 ` Peter Staubach
2009-01-13 23:04 ` Greg Banks
2009-02-11 23:10 ` J. Bruce Fields
2009-02-19 6:25 ` Greg Banks
2009-03-15 21:21 ` J. Bruce Fields
2009-03-16 3:10 ` Greg Banks
2009-01-13 10:26 ` [patch 3/3] knfsd: add file to export stats about nfsd pools Greg Banks
2009-02-12 17:11 ` J. Bruce Fields
2009-02-13 1:53 ` Kevin Constantine
2009-02-19 7:04 ` Greg Banks
2009-02-19 6:42 ` Greg Banks
2009-03-15 21:25 ` J. Bruce Fields
2009-03-16 3:21 ` Greg Banks
2009-03-16 13:37 ` J. Bruce Fields
2009-02-09 5:24 ` [patch 0/3] First tranche of SGI Enhanced NFS patches Greg Banks
2009-02-09 20:47 ` J. Bruce Fields
2009-02-09 23:26 ` Greg Banks