From: Jan Kara
Subject: Re: [PATCH RFC 0/2] skiplists for range indexes
Date: Fri, 3 May 2013 11:19:24 +0200
Message-ID: <20130503091924.GD10633@quack.suse.cz>
In-Reply-To: <20130503020211.3599.72404@localhost.localdomain>
References: <20130503020211.3599.72404@localhost.localdomain>
To: Chris Mason
Cc: Linux FS Devel, David Woodhouse, dchinner@redhat.com, bo.li.liu@oracle.com

Hi,

On Thu 02-05-13 22:02:11, Chris Mason wrote:
> I've been working on skiplists for indexing off/on for a little while
> now. For btrfs, I really need a fast way to manage extents that doesn't
> bog down in lock contention for reads or writes. Dave Chinner has
> mentioned this as well, so I've cc'd him.
  But I guess you still need writer-writer exclusion, don't you? Or are
you able to do some partial locking of the structure to allow parallel
modifications?

> Liu Bo started this skiplist code, but at the time it didn't make
> a huge difference. I've reworked things a bit and used the IOMMU
> to test things. I ran some benchmarks on a single socket server with
> two SSD cards to get a baseline of how much it is helping.
>
> I tried to keep the basic structure of the IOMMU code, and there are
> probably many better ways to fix the IOMMU contention. Really I'm just
> trying to compare with rbtrees, not fix the IOMMU.
  There also exist some research papers on RCU-friendly RB-trees (or
other trees). Maybe they would be interesting to try? The basic trick is
to use copy-on-write when you need to do a rotation. This is slightly
impractical because it requires memory allocation (or alternatively
doubles the memory footprint of an rb node), but you need to do similar
stuff in skip lists anyway. Just an idea...
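Very roughly, I mean something like this (just a sketch: the names are
made up and error handling is omitted, so it illustrates the trick
rather than any real implementation). The point is that a rotation
never modifies a node readers can already see; it publishes a fixed-up
copy instead:

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct cow_rb_node {
	unsigned long key;
	struct cow_rb_node __rcu *left;
	struct cow_rb_node __rcu *right;
	struct rcu_head rcu;
};

/*
 * Right rotation around 'x'; 'link' is the parent's pointer to x.
 * The caller holds the writer-side lock. Concurrent RCU readers see
 * either the old subtree or the new one, never a half-rotated state.
 */
static void cow_rotate_right(struct cow_rb_node *x,
			     struct cow_rb_node __rcu **link)
{
	struct cow_rb_node *y = rcu_dereference_protected(x->left, 1);
	struct cow_rb_node *xc = kmalloc(sizeof(*xc), GFP_KERNEL);

	/* Build a private copy of x that adopts y's right subtree. */
	xc->key = x->key;
	RCU_INIT_POINTER(xc->left, rcu_dereference_protected(y->right, 1));
	RCU_INIT_POINTER(xc->right, rcu_dereference_protected(x->right, 1));

	/* Hook the copy under y, then publish y in x's place. */
	rcu_assign_pointer(y->right, xc);
	rcu_assign_pointer(*link, y);

	/* The old x is freed only after current readers are done. */
	kfree_rcu(x, rcu);
}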
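  The payoff is that the read side then needs no locking at all, e.g.
(same made-up types as above; the caller must be inside
rcu_read_lock()):

static struct cow_rb_node *cow_rb_lookup(struct cow_rb_node __rcu **rootp,
					 unsigned long key)
{
	struct cow_rb_node *n = rcu_dereference(*rootp);

	/* Every pointer we follow leads to either the old or the new
	 * version of a node, both internally consistent. */
	while (n && n->key != key) {
		if (key < n->key)
			n = rcu_dereference(n->left);
		else
			n = rcu_dereference(n->right);
	}
	return n;
}

The price is the extra allocation in the rotation path I mentioned
above, and writers still have to serialize among themselves somehow.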

								Honza

> Basic numbers (all aio/dio):
>
> Streaming writes
>     IOMMU off   2,575MB/s
>     skiplist    1,715MB/s
>     rbtree      1,659MB/s
>
> Not a huge improvement, but the CPU time was lower.
>
> 16 threads, iodepth 10, 20MB random reads
>     IOMMU off   2,861MB/s (mostly idle)
>     skiplist    2,484MB/s (100% system time)
>     rbtree         99MB/s (100% system time)
>
> 16 threads, iodepth 10, 20MB random writes
>     IOMMU off   2,548MB/s
>     skiplist    1,649MB/s
>     rbtree         33MB/s
>
> I ran this a bunch of times to make sure I got the rbtree numbers
> right. Lowering the thread count did improve the rbtree performance,
> but the best I could do was around 300MB/s. For skiplists, all of the
> CPU time is being spent in skiplist_insert_hole, which has a lot of
> room for improvement.
>
> From here the code needs a lot of testing, and I need to fill out the
> API to make it a little easier to use. But I wanted to send this
> around for comments and to see how others might want to use things.
> More details on all the internals are below.
>
> It starts with a basic skiplist implementation where:
>
> * skiplist nodes are dynamically sized. There is a slab cache for each
>   node height, so every node is actively using all of the pointers
>   allocated for it.
>
> * Each node has a back pointer as well as a forward pointer.
>
> * Each skiplist node stores an array of pointers to items. This is a
>   huge speed improvement because the array of keys is much more cache
>   friendly.
>
> Since the array of item pointers is something of an extension to the
> basic skiplist, I've separated them out into two structs: first
> sl_node (just the indexing pointers) and then sl_leaf, which has an
> sl_node and the array of item pointers.
>
> There's no cache benefit to the leaves if I'm just storing an array of
> pointers though, so the leaf also has an array of keys:
>
>         unsigned long keys[SKIP_KEYS_PER_NODE];
>         struct sl_slot *ptrs[SKIP_KEYS_PER_NODE];
>
> The keys array is used for the first search pass, and based on that
> result I narrow things down and only have to follow one or two
> pointers from the corresponding ptrs array.
>
> The sl_slot is very simple:
>
>         struct sl_slot {
>                 unsigned long key;
>                 unsigned long size;
>         };
>
> The idea is to embed the sl_slot in your struct, giving us something
> like this:
>
>             sl_leaf
>        ________________
>        |  node ptr N  |
>        |     ....     |
>        |  node ptr 0  |
>        |______________|
>        |     keys     |
>        |______________|
>        |     ptrs     |
>        |______________|
>           /        \
>          /          \
>   -------------    -------------
>   | sl_slot 0 |    | sl_slot N |
>   |           |    |           |
>   |   data    |    |   data    |
>   -------------    -------------
>       your             your
>      struct           struct
>
> This basic setup gives us similar performance to rbtrees, but uses
> less memory. The performance varies (a lot, really) with the size of
> the keys array.
>
> Locking is a mixture of RCU and per-node locking. For searches, it can
> be pure RCU, or it can use the per-node lock to synchronize the final
> search through the last leaf. Either way, everything from the skiplist
> head down to that final node is RCU.
>
> Insert locking is slightly different. Before the insert is started, a
> new node is preallocated. The height of this node is used to pick the
> level where we start locking nodes during the insert. Much of the time
> we're able to get through huge portions of the list without any locks.
>
> For deletes, we do an RCU search for the leaf and hold the per-node
> lock while we remove a specific slot. If the leaf is empty, it is
> marked as dead and then unlinked from the skiplist level by level.
>
> All the locking and retries do slow things down a bit, but the
> benefits for concurrent access make a big difference.

--
Jan Kara
SUSE Labs, CR