From mboxrd@z Thu Jan 1 00:00:00 1970
From: David Teigland
Date: Mon, 10 Oct 2011 13:01:18 -0400
Subject: [Cluster-devel] [Upstream patch] DLM: Convert rsb data from linked list to rb_tree
In-Reply-To: <1318261861.2949.91.camel@menhir>
References: <20111005200543.GD11895@redhat.com> <20111010144320.GA18764@redhat.com> <1318261861.2949.91.camel@menhir>
Message-ID: <20111010170118.GC18764@redhat.com>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

On Mon, Oct 10, 2011 at 04:51:01PM +0100, Steven Whitehouse wrote:
> Hi,
>
> On Mon, 2011-10-10 at 10:43 -0400, David Teigland wrote:
> > On Sat, Oct 08, 2011 at 06:13:52AM -0400, Bob Peterson wrote:
> > > ----- Original Message -----
> > > | On Wed, Oct 05, 2011 at 03:25:39PM -0400, Bob Peterson wrote:
> > > | > Hi,
> > > | >
> > > | > This upstream patch changes the way DLM keeps track of RSBs.
> > > | > Before, they were in a linked list off a hash table. Now,
> > > | > they're an rb_tree off the same hash table. This speeds up
> > > | > DLM lookups greatly.
> > > | >
> > > | > Today's DLM is faster than older DLMs for many file systems,
> > > | > (e.g. in RHEL5) due to the larger hash table size. However,
> > > | > this rb_tree implementation scales much better. For my
> > > | > 1000-directories-with-1000-files test, the patch doesn't
> > > | > show much of an improvement. But when I scale the file system
> > > | > to 4000 directories with 4000 files (16 million files), it
> > > | > helps greatly. The time to do rm -fR /mnt/gfs2/* drops from
> > > | > 42.01 hours to 23.68 hours.
> > > |
> > > | How many hash table buckets were you using in that test?
> > > | If it was the default (1024), I'd be interested to know how
> > > | 16k compares.
> > >
> > > Hi,
> > >
> > > Interestingly, on the stock 2.6.32-206.el6.x86_64 kernel
> > > and 16K hash buckets, the time was virtually the same as
> > > with my patch: 1405m46.519s (23.43 hours). So perhaps we
> > > should re-evaluate whether we should use the rb_tree
> > > implementation or just increase the hash buckets as needed.
> > > I guess the question is now mainly related to scaling and
> > > memory usage for all those hash tables at this point.
> >
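(As a rough illustration of the rb_tree option under discussion: the idea
is to hang an rb_tree off each hash table bucket in place of the
per-bucket linked list, keyed on the resource name. The sketch below shows
the general shape of such a lookup; the struct and field names are made up
for illustration and this is not the actual patch.)

#include <linux/rbtree.h>
#include <linux/string.h>

/* illustrative only: one of these per hash table bucket */
struct example_bucket {
	struct rb_root tree;		/* was: a linked list head */
};

struct example_rsb {
	struct rb_node node;		/* links this rsb into the bucket's tree */
	int namelen;
	char name[64];
};

/* walk the bucket's tree instead of scanning a chain */
static struct example_rsb *example_search(struct example_bucket *b,
					  const char *name, int len)
{
	struct rb_node *n = b->tree.rb_node;

	while (n) {
		struct example_rsb *r = rb_entry(n, struct example_rsb, node);
		int rc = len - r->namelen;

		if (!rc)
			rc = memcmp(name, r->name, len);
		if (rc < 0)
			n = n->rb_left;
		else if (rc > 0)
			n = n->rb_right;
		else
			return r;
	}
	return NULL;
}

(The win is that a lookup costs log n comparisons within a bucket rather
than walking half the chain on average, so the number of buckets matters
far less as the table grows.)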
> > I'm still interested in possibly using an rbtree with fewer hash buckets.
> >
> > At the same time, I think the bigger problem may be why gfs2 is caching so
> > many locks in the first place, especially for millions of unlinked files
> > whose locks will never benefit you again.
> >
> I doubt that it is caching locks for unlinked files for any great period
> of time if there is any memory pressure. It is memory pressure which
> instigates the dropping of glocks relating to inodes usually. If the
> glock is unlocked then simply dropping the ref count on the glock will
> cause the deallocation (dlm lock drop) to occur.
>
> The reason why this tends to be seen at unlink time is just that
> unlinking small files is a good way to iterate over a lot of inodes in a
> relatively short time period, and thus locking can start to dominate the
> timings if it isn't very low latency. I suspect there are lots of other
> workloads which would see this (find perhaps?) as well. The faster the
> array, the more noticeable the effect is likely to be.
>
> Also, we do get some benefit if glocks are cached after inodes have been
> unlinked. If an inode is then recreated in the same disk block, we'll
> have all the glocks ready for it, without having to wait for them to be
> set up from scratch.
>
> I am slightly concerned that we ought to be testing with much greater
> numbers of inodes for these kinds of tests. Considering the memory sizes
> of modern servers, having 1m or even 10m files in cache is really not a
> lot these days.

The fact remains that caching "as much as possible" tends to be harmful,
and some careful limiting would be a good investment.

> There is a second consideration in selecting the data structure for the
> dlm lock/resource etc. tables, which is the locking. Using a simple hash
> table does make it much easier to use RCU based locking which is a great
> advantage if there are a lot of look up operations compared with the
> number of insert/remove operations. That becomes a lot more difficult
> (and requires a much greater lookup to insert/remove ratio) when trees
> are involved.
>
> We have run into the same kind of issues with GFS2's glock hash table.
> We firstly reduced the size of the hash buckets to a single pointer so
> that we could have a lot more hash heads for the given amount of memory.
> Currently we have 2^16 hash table entries.
>
> Also, we are now using RCU in order to reduce the locking overhead as
> much as possible. We still have a spinlock for the lru list, but that is
> not taken that much by comparison, and there is no locking in the lookup
> path now except to disable preempt through the small critical section.
>
> I'm not yet convinced that this is enough, even so. I would like to see
> a tree based system for the glocks at some future point, but the
> complexity of tree based RCU is currently putting me off, despite the
> obvious benefits of a log N (where N is the number of items in the tree)
> lookup compared with the N/2 average lookup time for a hash chain. We'd
> need to add a seq lock on the read side which would take away some of
> the benefit of using RCU in the first place.
>
> However, where we have scored real performance gains is simply by
> avoiding looking up the glocks at all. Hence why we cache them in the
> inodes, and this is also why the void pointer we pass to the dlm is a
> pointer to the glock, so there are then no hash table lookups when we
> receive dlm replies. I think there is scope to explore this option in
> the dlm in a few places too.
>
> Our most recent expansion of this concept in gfs2 was the addition of a
> variable in the inode to cache the last resource group used. This
> reduces the number of lookups of the resource group data structure
> dramatically during block allocations and deallocations.
>
> Anyway, I just wanted to point out some of the tradeoffs of the possible
> choices available in making gfs2/dlm scale to larger numbers of inodes.
> I would very much like to have an RCU based rbtree (or similar) which
> doesn't require the seq_lock if anybody has suggestions on how to
> achieve that,

While you don't want to be blatantly inefficient, I suspect investing time
into things like RCU runs afoul of Amdahl's Law, i.e. you could reduce the
time of the data structure operations to zero and not affect real-world
performance. When deciding where to focus improvements, I think cache
limiting could produce a big payoff, and data structure optimizations a
small payoff.
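(For reference, the lock-free read side Steve describes for the glock hash
table, i.e. many single-pointer hash heads with an RCU protected lookup
that takes nothing beyond the implicit preempt disable in rcu_read_lock(),
follows the pattern sketched below. The names are made up; this is not the
gfs2 code, just the general shape of the technique.)

#include <linux/rcupdate.h>
#include <linux/atomic.h>
#include <linux/types.h>

#define EXAMPLE_HASH_SIZE (1 << 16)	/* lots of single-pointer heads */

struct example_glock {
	struct example_glock __rcu *next;	/* hash chain, published with RCU */
	u64 number;
	atomic_t ref;
};

static struct example_glock __rcu *example_hash[EXAMPLE_HASH_SIZE];

static struct example_glock *example_find(unsigned int bucket, u64 number)
{
	struct example_glock *gl;

	rcu_read_lock();
	for (gl = rcu_dereference(example_hash[bucket]); gl;
	     gl = rcu_dereference(gl->next)) {
		/* skip entries whose last reference is already gone */
		if (gl->number == number && atomic_inc_not_zero(&gl->ref))
			break;
	}
	rcu_read_unlock();
	return gl;		/* caller drops the reference when done */
}

(Updaters still serialize among themselves and freeing has to go through
call_rcu() or synchronize_rcu(), but the read path takes no lock at all,
which is why a flat hash table is so much easier to run under RCU than a
tree that rebalances nodes during inserts and removals.)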