From: Jan Kara <jack@suse.cz>
To: Chris Mason <chris.mason@fusionio.com>
Cc: Linux FS Devel <linux-fsdevel@vger.kernel.org>,
David Woodhouse <David.Woodhouse@intel.com>,
"dchinner@redhat.com" <dchinner@redhat.com>,
bo.li.liu@oracle.com
Subject: Re: [PATCH RFC 0/2] skiplists for range indexes
Date: Fri, 3 May 2013 11:19:24 +0200 [thread overview]
Message-ID: <20130503091924.GD10633@quack.suse.cz> (raw)
In-Reply-To: <20130503020211.3599.72404@localhost.localdomain>
Hi,
On Thu 02-05-13 22:02:11, Chris Mason wrote:
> I've been working on skiplists for indexing off/on for a little while
> now. For btrfs, I really need a fast way to manage extents that doesn't
> bog down in lock contention for reads or writes. Dave Chinner has
> mentioned this as well, so I've cc'd him.
But I guess you still need writer-writer exclusion, don't you? Or are you
able to do some partial locking of the structure to allow parallel
modifications?
> Liu Bo started this skiplist code, but at the time it didn't make
> a huge difference. I've reworked things a bit and used the IOMMU
> to test things. I ran some benchmarks on a single socket
> server with two SSD cards to get a baseline of how much it is helping.
>
> I tried to keep the basic structure of the IOMMU code, and there are probably
> many better ways to fix the IOMMU contention. Really I'm just trying to
> compare with rbtrees, not fix the IOMMU.
There also exist some research papers on RCU friendly RB-trees (or other
trees). Maybe they would be interesting to try? The basic trick is to use a
copy-on-write when you need to do a rotation. This is slightly impractical
because it requires memory allocation (or alternatively doubles memory
footprint of an rb node) but you need to do a similar stuff in skip lists
anyway. Just an idea...
Honza
> Basic numbers (all aio/dio):
>
> Streaming writes
> IOMMU off 2,575MB/s
> skiplist 1,715MB/s
> rbtree 1,659MB/s
>
> Not a huge improvement, but the CPU time was lower.
>
> 16 threads, iodepth 10, 20MB random reads
> IOMMU off 2,861MB/s (mostly idle)
> skiplist 2,484MB/s (100% system time)
> rbtree 99MB/s (100% system time)
>
> 16 threads, iodepth 10, 20MB random writes
> IOMMU off 2,548MB/s
> skiplist 1,649MB/s
> rbtree 33MB/s
>
> I ran this a bunch of times to make sure I got the rbtree numbers right,
> lowering the thread count did improve the rbtree performance, but the
> best I could do was around 300MB/s. For skiplists, all of the CPU time
> is being spent in skiplist_insert_hole, which has a lot of room for
> improvement.
>
> From here the code needs a lot of testing, and I need to fill out the API
> to make it a little easier to use. But, I wanted to send this around
> for comments and to see how other might want to use things. More
> details on all internals are below.
>
> It starts with a basic skiplist implementation where:
>
> * skiplist nodes are dynamically sized. There is a slab cache for each
> node height, so every node is actively using all of the pointers allocated
> for it.
>
> * Each node has a back pointer as well as a forward pointer
>
> * Each skiplist node stores an array of pointers to items. This is a huge
> speed improvement because the array of keys is much more cache friendly.
>
> Since the array of item pointers is something of an extension
> to the basic skiplist, I've separated them out into two structs.
> First sl_node (just the indexing pointers) and then sl_leaf,
> which has an sl_node and the array of item pointers.
>
> There's no cache benefit to the leaves if I'm just storing
> an array of pointers though, so it also has an array
> of keys:
>
> unsigned long keys[SKIP_KEYS_PER_NODE];
> struct sl_slot *ptrs[SKIP_KEYS_PER_NODE];
>
> The keys array is used for the first search pass, and based
> on that result I narrow things down and only have to follow
> one or two pointers from the corresponding ptrs array.
>
> The sl_slot is very simple:
>
> struct sl_slot {
> unsigned long key;
> unsigned long size;
> }
>
> The idea is to embed the sl_slot in your struct, giving us something like
> this:
>
> sl_leaf
> ________________
> | node ptr N |
> | .... |
> | node ptr 0 |
> |_______________|
> | | keys | |
> |___|________|__|
> | . | ptrs |. |
> |___|________|__|
> / \
> / \
> ------------- ------------
> | sl_slot 0 | | sl_slot N |
> | | | |
> | data | | data |
> ------------- -------------
> your
> struct
>
> This basic setup gives us similar performance to rbtrees, but uses less memory.
> The performance varies (a lot really) with the size of the keys array.
>
> Locking is a mixture of RCU and per-node locking. For searches,
> it can be pure RCU, or it can use the per-node lock to synchronize
> the final search though the last leaf. But, everything from the
> skiplist head to that final node is RCU either way.
>
> Insert locking is slightly different. Before the insert is started,
> a new node is preallocated. The height of this node is used to
> pick the level where we start locking nodes during the insert. Much
> of the time we're able to get through huge portions of the list
> without any locks.
>
> For deletes, it does an RCU search for the leaf, and hold the per-node lock
> while we remove a specific slot. If the leaf is empty it is marked as dead and
> then unlinked from the skiplist level by level.
>
> All the locking and tries does slow things down a bit, but the benefits
> for concurrent access make a big difference.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
next prev parent reply other threads:[~2013-05-03 9:19 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-05-03 2:02 [PATCH RFC 0/2] skiplists for range indexes Chris Mason
2013-05-03 2:06 ` [PATCH RFC 1/2] core skiplist code Chris Mason
2013-05-03 2:10 ` [PATCH RFC 2/2] skiplists for the IOMMU Chris Mason
2013-05-03 9:19 ` Jan Kara [this message]
2013-05-03 10:45 ` [PATCH RFC 0/2] skiplists for range indexes Chris Mason
2013-05-04 3:25 ` Dave Chinner
2013-05-04 11:11 ` Chris Mason
2013-05-05 7:33 ` Dave Chinner
2013-05-05 14:38 ` Chris Mason
2013-05-05 22:44 ` Dave Chinner
2013-05-06 11:28 ` [BULK] " Chris Mason
2013-05-07 2:12 ` Dave Chinner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20130503091924.GD10633@quack.suse.cz \
--to=jack@suse.cz \
--cc=David.Woodhouse@intel.com \
--cc=bo.li.liu@oracle.com \
--cc=chris.mason@fusionio.com \
--cc=dchinner@redhat.com \
--cc=linux-fsdevel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).