From: Steven Whitehouse
Date: Mon, 07 Nov 2011 12:06:52 +0000
Subject: [Cluster-devel] GFS2: glock statistics gathering (RFC)
In-Reply-To: <20111104172102.GC15232@redhat.com>
References: <1320419989.2732.60.camel@menhir> <20111104163152.GA15232@redhat.com> <1320425851.2732.82.camel@menhir> <20111104172102.GC15232@redhat.com>
Message-ID: <1320667612.2762.59.camel@menhir>
To: cluster-devel.redhat.com

Hi,

On Fri, 2011-11-04 at 13:21 -0400, David Teigland wrote:
> On Fri, Nov 04, 2011 at 04:57:31PM +0000, Steven Whitehouse wrote:
> > Hi,
> >
> > On Fri, 2011-11-04 at 12:31 -0400, David Teigland wrote:
> > > On Fri, Nov 04, 2011 at 03:19:49PM +0000, Steven Whitehouse wrote:
> > > > The three pairs of mean/variance measure the following
> > > > things:
> > > >
> > > > 1. DLM lock time (non-blocking requests)
> > >
> > > You don't need to track and save this value, because all results will be
> > > one of three values which you can gather once:
> > >
> > > short: the dir node and master node are local: 0 network round trips
> > > medium: one is local, one is remote: 1 network round trip
> > > long: both are remote: 2 network round trips
> > >
> > > Once you've measured values for short/med/long, then you're done.
> > > The distribution will depend on the usage pattern.
> > >
> > The reason for tracking this is to be able to compare it with the
> > blocking request value to (I hope) get a rough idea of the difference
> > between the two, which may indicate contention on the lock. So this
> > is really a "baseline" measurement.
> >
> > Plus we do need to measure it, since it will vary according to a
> > number of things, such as what hardware is in use.
>
> Right, but the baseline shouldn't change once you have it.
>
Well, I'm not sure I agree with that. The conditions on the network might change, and the conditions on a (remote) lock master may change. We have no way within GFS2 to measure such things directly, so we need a proxy for them.

> > > > 2. To spot performance issues more easily
> > > Apart from contention, I'm not sure there are many perf issues that dlm
> > > measurements would help with.
> >
> > That is the #1 cause of reported performance issues, so top of our list
> > to work on. The goal is to make it easier to track down the source of
> > these kinds of problems.
>
> I still think that time averages and computations sound like a difficult
> and indirect way of measuring contention... but see how it works.
>
> Dave
>
It is, but what else do we have? There is no way to communicate between GFS2 nodes except via the DLM. Contention is really a state of all the nodes collectively, rather than of any one node, which makes it tricky to measure. We could try something using the LVBs, but that is rather tricky if the nodes are (largely) using read locks, and thus we can't write to the LVB from those nodes.

For a long time we've had the "min hold time" concept, which was introduced to ensure that we continue to make progress even when contention occurs. That has been based on a rather rough estimate of the time since the last state change, with an adjustment for a corner case which was discovered recently. I'd like to put that calculation on a more scientific footing, and this seems to be the easiest way to do that.
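To make the comparison I'm describing a little more concrete, here is the rough shape of the test I have in mind (a sketch only, with made-up names; it is not code from the patch): keep a smoothed time for non-blocking DLM requests as the baseline, another for blocking requests, and treat a blocking mean which sits well above the baseline, after allowing for the variance, as a hint that other nodes are queueing for the lock:

/*
 * Sketch only: the names here are illustrative, not from the patch.
 * srtt_nb and srtt_b are the smoothed (exponentially averaged) request
 * times for non-blocking and blocking DLM requests on this glock, and
 * var_b is the variance estimate for the blocking case, all in the
 * same units.
 */
static inline int glock_looks_contended(s64 srtt_nb, s64 srtt_b, s64 var_b)
{
        /*
         * If blocking requests are taking noticeably longer than the
         * non-blocking baseline, beyond what the variance estimate can
         * account for, other nodes are probably queueing for the lock.
         */
        return srtt_b > srtt_nb + (var_b << 1);
}

Whether twice the variance is the right margin is exactly the kind of thing the stats should let us tune, rather than something to hard-code up front.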
I've had a look at what OCFS2 collects wrt locking stats. They have a cumulative total of wait time and a counter for the number of waits which have occurred. That will give a good collective estimate of the time spent waiting, but it is not suitable for driving the dynamic behaviour of the filesystem, because how quickly it reacts to a change in the system's characteristics itself varies with time. In other words, when the filesystem has just been mounted the statistic is very sensitive to new data, but it becomes less and less so as time goes on.

The exponential averages which I'm proposing in the patch are designed so that, once they have gained enough information (say 12 samples or so) to reach a steady state, they change in response to new data in the same way however many lock requests have been made. This is also why such things are used in the networking code: we don't want the response to change according to how much data has passed over the socket. In addition, the patch keeps track of the variance, so that we know how accurate the mean is, which is vitally important if we are to avoid drawing incorrect conclusions from the data.

What I am concerned about is whether we could measure other quantities which would give us a better proxy for what's happening in the cluster as a whole, and at what point we start running the risk of the stats slowing things down more than the information they return will speed things up. As a general rule, I don't want to collect stats just for the sake of it. We can use tracepoints to collect data for performance analysis later; the aim here is to collect data to directly influence the future operation of the filesystem in an automatic manner, and to be able to monitor that so we can check that it is really functioning correctly. I've tried to ensure the stats are collected in the most efficient way possible (no divisions or multiplications, aside from bit shifts), and all the global stats are kept in per-cpu structures.

Also, another difference from OCFS2 is that this patch will collect global stats, so we have a good starting point from which to initialise the per-glock stats. Newly-created glocks should thus start with a good estimate of their correct stats values, once the fs has been mounted for a little while,

Steve.
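P.S. In case it helps to see the kind of update step I mean, here is a sketch (illustrative names only, not the code from the patch). It follows the same shift-only scheme the networking code uses for its smoothed RTT and deviation estimates: each sample moves the mean an eighth of the way towards itself, and the variance estimate a quarter of the way towards the new absolute error, so the fast path needs no divides or multiplies at all:

/*
 * Sketch only: illustrative names, not the actual patch. One of these
 * would exist per statistic being tracked (and per cpu for the global
 * versions). Relies on arithmetic right shift of negative values, as
 * the networking code already does.
 */
struct lkstat_sketch {
        s64 mean;       /* exponentially smoothed request time */
        s64 var;        /* smoothed mean absolute deviation */
};

static inline void lkstat_update(struct lkstat_sketch *s, s64 sample)
{
        s64 delta = sample - s->mean;

        s->mean += delta >> 3;                  /* mean += (sample - mean) / 8 */
        if (delta < 0)
                delta = -delta;
        s->var += (delta - s->var) >> 2;        /* var += (|delta| - var) / 4 */
}

Because each update only moves the estimates a fixed fraction of the way towards the latest sample, the responsiveness stays the same no matter how many requests have gone before, which is exactly the property the cumulative totals lack. The gearing (1/8 and 1/4 here) is the usual networking choice; whether it suits glocks is one of the things the RFC is meant to find out.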