From: Steven Whitehouse
Date: Mon, 07 Nov 2011 12:06:52 +0000
Subject: [Cluster-devel] GFS2: glock statistics gathering (RFC)
In-Reply-To: <20111104172102.GC15232@redhat.com>
References: <1320419989.2732.60.camel@menhir> <20111104163152.GA15232@redhat.com> <1320425851.2732.82.camel@menhir> <20111104172102.GC15232@redhat.com>
Message-ID: <1320667612.2762.59.camel@menhir>
To: cluster-devel.redhat.com

Hi,

On Fri, 2011-11-04 at 13:21 -0400, David Teigland wrote:
> On Fri, Nov 04, 2011 at 04:57:31PM +0000, Steven Whitehouse wrote:
> > Hi,
> >
> > On Fri, 2011-11-04 at 12:31 -0400, David Teigland wrote:
> > > On Fri, Nov 04, 2011 at 03:19:49PM +0000, Steven Whitehouse wrote:
> > > > The three pairs of mean/variance measure the following
> > > > things:
> > > >
> > > > 1. DLM lock time (non-blocking requests)
> > >
> > > You don't need to track and save this value, because all results will be
> > > one of three values which you can gather once:
> > >
> > > short: the dir node and master node are local: 0 network round trips
> > > medium: one is local, one is remote: 1 network round trip
> > > long: both are remote: 2 network round trips
> > >
> > > Once you've measured values for short/med/long, then you're done.
> > > The distribution will depend on the usage pattern.
> > >
> > The reason for tracking this is to be able to compare it with the
> > blocking request value to (I hope) get a rough idea of the difference
> > between the two, which may indicate contention on the lock. So this
> > is really a "baseline" measurement.
> >
> > Plus we do need to measure it, since it will vary according to a
> > number of things, such as what hardware is in use.
>
> Right, but the baseline shouldn't change once you have it.
>
Well, I'm not sure I agree with that. The conditions on the network might change, and the conditions on a (remote) lock master may change. We have no way within GFS2 to measure such things directly, so we need a proxy for them.

> > > > 2. To spot performance issues more easily
> > > Apart from contention, I'm not sure there are many perf issues that dlm
> > > measurements would help with.
> >
> > That is the #1 cause of reported performance issues, so top of our list
> > to work on. The goal is to make it easier to track down the source of
> > these kinds of problems.
>
> I still think that time averages and computations sound like a difficult
> and indirect way of measuring contention... but see how it works.
>
> Dave
>
It is, but what else do we have? There is no way to communicate between GFS2 nodes except via the DLM. Contention is really a state of all the nodes collectively, rather than of any one node, which makes it tricky to measure. We could try something using the LVBs, but that is rather tricky if the nodes are (largely) using read locks, and thus we can't write to the LVB from those nodes.

For a long time we've had the "min hold time" concept, which was introduced to ensure that we continue to make progress even when contention occurs. That has been based on a rather rough estimate of the time since the last state change, with an adjustment for a corner case which was discovered recently. I'd like to put that calculation on a more scientific footing, and this seems to be the easiest way to do that.
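To make the comparison I'm describing a little more concrete, here is the rough shape of the test I have in mind (a sketch only, with made-up names; it is not code from the patch): keep a smoothed time for non-blocking DLM requests as the baseline, another for blocking requests, and treat a blocking mean which sits well above the baseline, after allowing for the variance, as a hint that other nodes are queueing for the lock:

/*
 * Sketch only: the names here are illustrative, not from the patch.
 * srtt_nb and srtt_b are the smoothed (exponentially averaged) request
 * times for non-blocking and blocking DLM requests on this glock, and
 * var_b is the variance estimate for the blocking case, all in the
 * same units.
 */
static inline int glock_looks_contended(s64 srtt_nb, s64 srtt_b, s64 var_b)
{
        /*
         * If blocking requests are taking noticeably longer than the
         * non-blocking baseline, beyond what the variance estimate can
         * account for, other nodes are probably queueing for the lock.
         */
        return srtt_b > srtt_nb + (var_b << 1);
}

Whether twice the variance is the right margin is exactly the kind of thing the stats should let us tune, rather than something to hard-code up front.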
I've had a look at what OCFS2 collects wrt locking stats. They have a cumulative total of wait time and a counter for the number of waits which have occurred. That will give a good collective estimate of the time spent waiting, but it is not suitable for driving the dynamic behaviour of the filesystem, because how quickly it reacts to a change in the system's characteristics itself varies with time. In other words, when the filesystem has just been mounted the statistic is very sensitive to new data, but it becomes less and less so as time goes on.

The exponential averages which I'm proposing in the patch are designed so that, once they have gained enough information (say 12 samples or so) to reach a steady state, they change in response to new data in the same way however many lock requests have been made. This is also why such things are used in the networking code: we don't want the response to change according to how much data has passed over the socket. In addition, the patch keeps track of the variance, so that we know how accurate the mean is, which is vitally important if we are to avoid drawing incorrect conclusions from the data.

What I am concerned about is whether we could measure other quantities which would give us a better proxy for what's happening in the cluster as a whole, and at what point we start running the risk of the stats slowing things down more than the information they return will speed things up. As a general rule, I don't want to collect stats just for the sake of it. We can use tracepoints to collect data for performance analysis later; the aim here is to collect data to directly influence the future operation of the filesystem in an automatic manner, and to be able to monitor that so we can check that it is really functioning correctly. I've tried to ensure the stats are collected in the most efficient way possible (no divisions or multiplications, aside from bit shifts), and all the global stats are kept in per-cpu structures.

Also, another difference from OCFS2 is that this patch will collect global stats, so we have a good starting point from which to initialise the per-glock stats. Newly-created glocks should thus start with a good estimate of their correct stats values, once the fs has been mounted for a little while,

Steve.
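P.S. In case it helps to see the kind of update step I mean, here is a sketch (illustrative names only, not the code from the patch). It follows the same shift-only scheme the networking code uses for its smoothed RTT and deviation estimates: each sample moves the mean an eighth of the way towards itself, and the variance estimate a quarter of the way towards the new absolute error, so the fast path needs no divides or multiplies at all:

/*
 * Sketch only: illustrative names, not the actual patch. One of these
 * would exist per statistic being tracked (and per cpu for the global
 * versions). Relies on arithmetic right shift of negative values, as
 * the networking code already does.
 */
struct lkstat_sketch {
        s64 mean;       /* exponentially smoothed request time */
        s64 var;        /* smoothed mean absolute deviation */
};

static inline void lkstat_update(struct lkstat_sketch *s, s64 sample)
{
        s64 delta = sample - s->mean;

        s->mean += delta >> 3;                  /* mean += (sample - mean) / 8 */
        if (delta < 0)
                delta = -delta;
        s->var += (delta - s->var) >> 2;        /* var += (|delta| - var) / 4 */
}

Because each update only moves the estimates a fixed fraction of the way towards the latest sample, the responsiveness stays the same no matter how many requests have gone before, which is exactly the property the cumulative totals lack. The gearing (1/8 and 1/4 here) is the usual networking choice; whether it suits glocks is one of the things the RFC is meant to find out.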