From mboxrd@z Thu Jan 1 00:00:00 1970
From: David Teigland
Date: Mon, 10 Oct 2011 13:01:18 -0400
Subject: [Cluster-devel] [Upstream patch] DLM: Convert rsb data from linked list to rb_tree
In-Reply-To: <1318261861.2949.91.camel@menhir>
References: <20111005200543.GD11895@redhat.com> <20111010144320.GA18764@redhat.com> <1318261861.2949.91.camel@menhir>
Message-ID: <20111010170118.GC18764@redhat.com>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

On Mon, Oct 10, 2011 at 04:51:01PM +0100, Steven Whitehouse wrote:
> Hi,
>
> On Mon, 2011-10-10 at 10:43 -0400, David Teigland wrote:
> > On Sat, Oct 08, 2011 at 06:13:52AM -0400, Bob Peterson wrote:
> > > ----- Original Message -----
> > > | On Wed, Oct 05, 2011 at 03:25:39PM -0400, Bob Peterson wrote:
> > > | > Hi,
> > > | >
> > > | > This upstream patch changes the way DLM keeps track of RSBs.
> > > | > Before, they were in a linked list off a hash table. Now,
> > > | > they're an rb_tree off the same hash table. This speeds up
> > > | > DLM lookups greatly.
> > > | >
> > > | > Today's DLM is faster than older DLMs for many file systems,
> > > | > (e.g. in RHEL5) due to the larger hash table size. However,
> > > | > this rb_tree implementation scales much better. For my
> > > | > 1000-directories-with-1000-files test, the patch doesn't
> > > | > show much of an improvement. But when I scale the file system
> > > | > to 4000 directories with 4000 files (16 million files), it
> > > | > helps greatly. The time to do rm -fR /mnt/gfs2/* drops from
> > > | > 42.01 hours to 23.68 hours.
> > > |
> > > | How many hash table buckets were you using in that test?
> > > | If it was the default (1024), I'd be interested to know how
> > > | 16k compares.
> > >
> > > Hi,
> > >
> > > Interestingly, on the stock 2.6.32-206.el6.x86_64 kernel
> > > and 16K hash buckets, the time was virtually the same as
> > > with my patch: 1405m46.519s (23.43 hours). So perhaps we
> > > should re-evaluate whether we should use the rb_tree
> > > implementation or just increase the hash buckets as needed.
> > > I guess the question is now mainly related to scaling and
> > > memory usage for all those hash tables at this point.
> >
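(As a rough illustration of the rb_tree option under discussion: the idea
is to hang an rb_tree off each hash table bucket in place of the
per-bucket linked list, keyed on the resource name. The sketch below shows
the general shape of such a lookup; the struct and field names are made up
for illustration and this is not the actual patch.)

#include <linux/rbtree.h>
#include <linux/string.h>

/* illustrative only: one of these per hash table bucket */
struct example_bucket {
	struct rb_root tree;		/* was: a linked list head */
};

struct example_rsb {
	struct rb_node node;		/* links this rsb into the bucket's tree */
	int namelen;
	char name[64];
};

/* walk the bucket's tree instead of scanning a chain */
static struct example_rsb *example_search(struct example_bucket *b,
					  const char *name, int len)
{
	struct rb_node *n = b->tree.rb_node;

	while (n) {
		struct example_rsb *r = rb_entry(n, struct example_rsb, node);
		int rc = len - r->namelen;

		if (!rc)
			rc = memcmp(name, r->name, len);
		if (rc < 0)
			n = n->rb_left;
		else if (rc > 0)
			n = n->rb_right;
		else
			return r;
	}
	return NULL;
}

(The win is that a lookup costs log n comparisons within a bucket rather
than walking half the chain on average, so the number of buckets matters
far less as the table grows.)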
> > I'm still interested in possibly using an rbtree with fewer hash buckets.
> >
> > At the same time, I think the bigger problem may be why gfs2 is caching so
> > many locks in the first place, especially for millions of unlinked files
> > whose locks will never benefit you again.
> >
> I doubt that it is caching locks for unlinked files for any great period
> of time if there is any memory pressure. It is memory pressure which
> instigates the dropping of glocks relating to inodes usually. If the
> glock is unlocked then simply dropping the ref count on the glock will
> cause the deallocation (dlm lock drop) to occur.
>
> The reason why this tends to be seen at unlink time is just that
> unlinking small files is a good way to iterate over a lot of inodes in a
> relatively short time period, and thus locking can start to dominate the
> timings if it isn't very low latency. I suspect there are lots of other
> workloads which would see this (find perhaps?) as well. The faster the
> array, the more noticeable the effect is likely to be.
>
> Also, we do get some benefit if glocks are cached after inodes have been
> unlinked. If an inode is then recreated in the same disk block, we'll
> have all the glocks ready for it, without having to wait for them to be
> set up from scratch.
>
> I am slightly concerned that we ought to be testing with much greater
> numbers of inodes for these kinds of tests. Considering the memory sizes
> of modern servers, having 1m or even 10m files in cache is really not a
> lot these days.

The fact remains that caching "as much as possible" tends to be harmful,
and some careful limiting would be a good investment.

> There is a second consideration in selecting the data structure for the
> dlm lock/resource etc. tables, which is the locking. Using a simple hash
> table does make it much easier to use RCU based locking which is a great
> advantage if there are a lot of look up operations compared with the
> number of insert/remove operations. That becomes a lot more difficult
> (and requires a much greater lookup to insert/remove ratio) when trees
> are involved.
>
> We have run into the same kind of issues with GFS2's glock hash table.
> We firstly reduced the size of the hash buckets to a single pointer so
> that we could have a lot more hash heads for the given amount of memory.
> Currently we have 2^16 hash table entries.
>
> Also, we are now using RCU in order to reduce the locking overhead as
> much as possible. We still have a spinlock for the lru list, but that is
> not taken that much by comparison, and there is no locking in the lookup
> path now except to disable preempt through the small critical section.
>
> I'm not yet convinced that this is enough, even so. I would like to see
> a tree based system for the glocks at some future point, but the
> complexity of tree based RCU is currently putting me off, despite the
> obvious benefits of a log N (where N is the number of items in the tree)
> lookup compared with the N/2 average lookup time for a hash chain. We'd
> need to add a seq lock on the read side which would take away some of
> the benefit of using RCU in the first place.
>
> However, where we have scored real performance gains is simply by
> avoiding looking up the glocks at all. Hence why we cache them in the
> inodes, and this is also why the void pointer we pass to the dlm is a
> pointer to the glock, so there are then no hash table lookups when we
> receive dlm replies. I think there is scope to explore this option in
> the dlm in a few places too.
>
> Our most recent expansion of this concept in gfs2 was the addition of a
> variable in the inode to cache the last resource group used. This
> reduces the number of lookups of the resource group data structure
> dramatically during block allocations and deallocations.
>
> Anyway, I just wanted to point out some of the tradeoffs of the possible
> choices available in making gfs2/dlm scale to larger numbers of inodes.
> I would very much like to have an RCU based rbtree (or similar) which
> doesn't require the seq_lock if anybody has suggestions on how to
> achieve that,

While you don't want to be blatantly inefficient, I suspect investing time
into things like RCU runs afoul of Amdahl's Law, i.e. you could reduce the
time of the data structure operations to zero and not affect real-world
performance. When deciding where to focus improvements, I think cache
limiting could produce a big payoff, and data structure optimizations a
small payoff.
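(For reference, the lock-free read side Steve describes for the glock hash
table, i.e. many single-pointer hash heads with an RCU protected lookup
that takes nothing beyond the implicit preempt disable in rcu_read_lock(),
follows the pattern sketched below. The names are made up; this is not the
gfs2 code, just the general shape of the technique.)

#include <linux/rcupdate.h>
#include <linux/atomic.h>
#include <linux/types.h>

#define EXAMPLE_HASH_SIZE (1 << 16)	/* lots of single-pointer heads */

struct example_glock {
	struct example_glock __rcu *next;	/* hash chain, published with RCU */
	u64 number;
	atomic_t ref;
};

static struct example_glock __rcu *example_hash[EXAMPLE_HASH_SIZE];

static struct example_glock *example_find(unsigned int bucket, u64 number)
{
	struct example_glock *gl;

	rcu_read_lock();
	for (gl = rcu_dereference(example_hash[bucket]); gl;
	     gl = rcu_dereference(gl->next)) {
		/* skip entries whose last reference is already gone */
		if (gl->number == number && atomic_inc_not_zero(&gl->ref))
			break;
	}
	rcu_read_unlock();
	return gl;		/* caller drops the reference when done */
}

(Updaters still serialize among themselves and freeing has to go through
call_rcu() or synchronize_rcu(), but the read path takes no lock at all,
which is why a flat hash table is so much easier to run under RCU than a
tree that rebalances nodes during inserts and removals.)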