cluster-devel.redhat.com archive mirror
From: David Teigland <teigland@redhat.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] [Upstream patch] DLM: Convert rsb data from linked list to rb_tree
Date: Mon, 10 Oct 2011 13:01:18 -0400
Message-ID: <20111010170118.GC18764@redhat.com>
In-Reply-To: <1318261861.2949.91.camel@menhir>

On Mon, Oct 10, 2011 at 04:51:01PM +0100, Steven Whitehouse wrote:
> Hi,
> 
> On Mon, 2011-10-10 at 10:43 -0400, David Teigland wrote:
> > On Sat, Oct 08, 2011 at 06:13:52AM -0400, Bob Peterson wrote:
> > > ----- Original Message -----
> > > | On Wed, Oct 05, 2011 at 03:25:39PM -0400, Bob Peterson wrote:
> > > | > Hi,
> > > | > 
> > > | > This upstream patch changes the way DLM keeps track of RSBs.
> > > | > Before, they were in a linked list off a hash table.  Now,
> > > | > they're an rb_tree off the same hash table.  This speeds up
> > > | > DLM lookups greatly.
> > > | > 
> > > | > Today's DLM is faster than older DLMs (e.g. in RHEL5) for many
> > > | > file systems due to the larger hash table size.  However,
> > > | > this rb_tree implementation scales much better.  For my
> > > | > 1000-directories-with-1000-files test, the patch doesn't
> > > | > show much of an improvement.  But when I scale the file system
> > > | > to 4000 directories with 4000 files (16 million files), it
> > > | > helps greatly. The time to do rm -fR /mnt/gfs2/* drops from
> > > | > 42.01 hours to 23.68 hours.
> > > | 
> > > | How many hash table buckets were you using in that test?
> > > | If it was the default (1024), I'd be interested to know how
> > > | 16k compares.
> > > 
> > > Hi,
> > > 
> > > Interestingly, on the stock 2.6.32-206.el6.x86_64 kernel
> > > and 16K hash buckets, the time was virtually the same as
> > > with my patch: 1405m46.519s (23.43 hours). So perhaps we
> > > should re-evaluate whether to use the rb_tree implementation
> > > or just increase the hash buckets as needed.  I guess the
> > > question now mainly comes down to scaling and memory usage
> > > for all those hash tables.
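
To make the data structure change concrete, a minimal lookup sketch of an
rb_tree hung off each hash bucket might look like the following.  This is
purely illustrative: the struct and field names are invented for this
example and are not taken from the actual patch.

#include <linux/rbtree.h>
#include <linux/string.h>

struct example_rsb {
	struct rb_node	res_node;	/* linked into the bucket's rb_root */
	int		res_length;
	char		res_name[64];
};

/* Compare by resource name, exactly as a hash-chain walk would, but in
 * O(log n) steps per bucket instead of a linear scan. */
static struct example_rsb *example_search(struct rb_root *root,
					  const char *name, int len)
{
	struct rb_node *node = root->rb_node;

	while (node) {
		struct example_rsb *r = rb_entry(node, struct example_rsb,
						 res_node);
		int rv = memcmp(name, r->res_name,
				len < r->res_length ? len : r->res_length);

		if (!rv)
			rv = len - r->res_length;

		if (rv < 0)
			node = node->rb_left;
		else if (rv > 0)
			node = node->rb_right;
		else
			return r;
	}
	return NULL;
}
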
> > 
> > I'm still interested in possibly using an rbtree with fewer hash buckets.
> > 
> > At the same time, I think the bigger problem may be why gfs2 is caching so
> > many locks in the first place, especially for millions of unlinked files
> > whose locks will never benefit you again.
> > 
> 
> I doubt that it is caching locks for unlinked files for any great period
> of time if there is any memory pressure. It is usually memory pressure
> which instigates the dropping of glocks relating to inodes. If the
> glock is unlocked, then simply dropping the ref count on the glock will
> cause the deallocation (dlm lock drop) to occur.
>
> The reason why this tends to be seen at unlink time is just that
> unlinking small files is a good way to iterate over a lot of inodes in a
> relatively short time period, and thus locking can start to dominate the
> timings if it isn't very low latency. I suspect there are lots of other
> workloads which would see this (find perhaps?) as well. The faster the
> array, the more noticeable the effect is likely to be.
> 
> Also, we do get some benefit if glocks are cached after inodes have been
> unlinked. If an inode is then recreated in the same disk block, we'll
> have all the glocks ready for it, without having to wait for them to be
> set up from scratch.
> 
> I am slightly concerned that we ought to be testing with much greater
> numbers of inodes in these kinds of tests. Considering the memory sizes
> of modern servers, having 1m or even 10m files in cache is really not a
> lot these days.

The fact remains that caching "as much as possible" tends to be harmful,
and some careful limiting would be a good investment.

> There is a second consideration in selecting the data structure for the
> dlm lock/resource etc. tables, which is the locking. Using a simple hash
> table does make it much easier to use RCU-based locking, which is a great
> advantage if there are a lot of lookup operations compared with the
> number of insert/remove operations. That becomes a lot more difficult
> (and requires a much greater lookup to insert/remove ratio) when trees
> are involved.
> 
> We have run into the same kind of issues with GFS2's glock hash table.
> We first reduced the size of the hash buckets to a single pointer so
> that we could have a lot more hash heads for the given amount of memory.
> Currently we have 2^16 hash table entries.
> 
> Also, we are now using RCU in order to reduce the locking overhead as
> much as possible. We still have a spinlock for the lru list, but that is
> not taken much by comparison, and there is no locking in the lookup
> path now except disabling preemption across the small critical section.
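
As a rough illustration (this is not the gfs2 code) of that kind of
RCU-protected lookup over single-pointer hlist heads, a sketch might look
like the one below.  The names are invented, and it assumes the
four-argument hlist_for_each_entry_rcu() form used by kernels of that era.

#include <linux/rculist.h>
#include <linux/atomic.h>
#include <linux/types.h>

struct example_glock {
	struct hlist_node	gl_hash;
	u64			gl_number;
	atomic_t		gl_ref;
};

static struct example_glock *example_find(struct hlist_head *table,
					  unsigned int bucket, u64 number)
{
	struct example_glock *gl;
	struct hlist_node *pos;

	rcu_read_lock();	/* readers take no lock beyond this */
	hlist_for_each_entry_rcu(gl, pos, &table[bucket], gl_hash) {
		if (gl->gl_number == number &&
		    atomic_inc_not_zero(&gl->gl_ref)) {
			/* reference taken before leaving the RCU section */
			rcu_read_unlock();
			return gl;
		}
	}
	rcu_read_unlock();
	return NULL;
}
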
> 
> I'm not yet convinced that this is enough, even so. I would like to see
> a tree-based system for the glocks at some future point, but the
> complexity of tree-based RCU is currently putting me off, despite the
> obvious benefit of an O(log N) lookup (where N is the number of items in
> the tree) compared with the N/2 comparisons, on average, for a hash
> chain. We'd need to add a seq lock on the read side, which would take
> away some of the benefit of using RCU in the first place.
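
A minimal sketch of the read-side retry a seq lock would force onto a tree
lookup, reusing the invented names from the sketch above;
example_tree_search() stands in for a hypothetical plain rbtree walk, and
node lifetime would still have to be covered by RCU or reference counting.

#include <linux/seqlock.h>
#include <linux/rbtree.h>

static seqlock_t example_tree_seq;	/* seqlock_init() at setup time */

static struct example_glock *example_tree_find(struct rb_root *root,
					       u64 number)
{
	struct example_glock *gl;
	unsigned int seq;

	do {
		seq = read_seqbegin(&example_tree_seq);
		gl = example_tree_search(root, number);	/* hypothetical walk */
	} while (read_seqretry(&example_tree_seq, seq));
	/* every retry is work a lock-free RCU hash lookup would not do */

	return gl;
}
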
> 
> However, where we have scored real performance gains is simply in
> avoiding looking up the glocks at all. That is why we cache them in the
> inodes, and it is also why the void pointer we pass to the dlm is a
> pointer to the glock, so there are no hash table lookups at all when we
> receive dlm replies. I think there is scope to explore this option in
> the dlm in a few places too.
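
Purely as an illustration (these are not the real gfs2/dlm interfaces),
the pointer-through-the-callback idea might look like this, with invented
names throughout:

struct example_glock {
	int	gl_last_status;
	/* ... */
};

/* The argument registered with the lock request is the glock pointer
 * itself, so when the reply arrives the handler gets the object back
 * directly and never touches a hash table. */
static void example_lock_complete(void *astarg, int status)
{
	struct example_glock *gl = astarg;

	gl->gl_last_status = status;	/* ...update state, wake waiters... */
}
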
> 
> Our most recent expansion of this concept in gfs2 was the addition of a
> variable in the inode to cache the last resource group used. This
> reduces the number of lookups of the resource group data structure
> dramatically during block allocations and deallocations.
> 
> Anyway, I just wanted to point out some of the tradeoffs of the possible
> choices available in making gfs2/dlm scale to larger numbers of inodes.
> I would very much like to have an RCU-based rbtree (or similar) which
> doesn't require the seq_lock, if anybody has suggestions on how to
> achieve that,

While you don't want to be blatantly inefficient, I suspect investing time
into things like RCU runs afoul of Amdahl's Law, i.e. you could reduce
the time of the data structure operations to zero and not affect real-world
performance.

When deciding where to focus improvements, I think cache limiting could
produce a big payoff, and data structure optimizations a small one.


