Re: [PATCH RFC 0/2] skiplists for range indexes

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Jan Kara <jack@suse.cz>
To: Chris Mason <chris.mason@fusionio.com>
Cc: Linux FS Devel <linux-fsdevel@vger.kernel.org>,
	David Woodhouse <David.Woodhouse@intel.com>,
	"dchinner@redhat.com" <dchinner@redhat.com>,
	bo.li.liu@oracle.com
Subject: Re: [PATCH RFC 0/2] skiplists for range indexes
Date: Fri, 3 May 2013 11:19:24 +0200	[thread overview]
Message-ID: <20130503091924.GD10633@quack.suse.cz> (raw)
In-Reply-To: <20130503020211.3599.72404@localhost.localdomain>

  Hi,

On Thu 02-05-13 22:02:11, Chris Mason wrote:
> I've been working on skiplists for indexing off/on for a little while
> now.  For btrfs, I really need a fast way to manage extents that doesn't
> bog down in lock contention for reads or writes.  Dave Chinner has
> mentioned this as well, so I've cc'd him.
  But I guess you still need writer-writer exclusion, don't you? Or are you
able to do some partial locking of the structure to allow parallel
modifications?

> Liu Bo started this skiplist code, but at the time it didn't make
> a huge difference.  I've reworked things a bit and used the IOMMU
> to test things.  I ran some benchmarks on a single socket
> server with two SSD cards to get a baseline of how much it is helping.
> 
> I tried to keep the basic structure of the IOMMU code, and there are probably
> many better ways to fix the IOMMU contention.  Really I'm just trying to
> compare with rbtrees, not fix the IOMMU.
  There also exist some research papers on RCU friendly RB-trees (or other
trees). Maybe they would be interesting to try? The basic trick is to use a
copy-on-write when you need to do a rotation. This is slightly impractical
because it requires memory allocation (or alternatively doubles memory
footprint of an rb node) but you need to do a similar stuff in skip lists
anyway. Just an idea...

								Honza

> Basic numbers (all aio/dio):
> 
> Streaming writes
> IOMMU off  2,575MB/s
> skiplist   1,715MB/s
> rbtree     1,659MB/s
> 
> Not a huge improvement, but the CPU time was lower.
> 
> 16 threads, iodepth 10, 20MB random reads
> IOMMU off  2,861MB/s (mostly idle)
> skiplist   2,484MB/s (100% system time)
> rbtree        99MB/s (100% system time)
> 
> 16 threads, iodepth 10, 20MB random writes
> IOMMU off  2,548MB/s
> skiplist   1,649MB/s
> rbtree        33MB/s
> 
> I ran this a bunch of times to make sure I got the rbtree numbers right,
> lowering the thread count did improve the rbtree performance, but the
> best I could do was around 300MB/s.  For skiplists, all of the CPU time
> is being spent in skiplist_insert_hole, which has a lot of room for
> improvement.
> 
> From here the code needs a lot of testing, and I need to fill out the API
> to make it a little easier to use.  But, I wanted to send this around
> for comments and to see how other might want to use things.  More
> details on all internals are below.
> 
> It starts with a basic skiplist implementation where:
> 
> * skiplist nodes are dynamically sized.  There is a slab cache for each
> node height, so every node is actively using all of the pointers allocated
> for it.
> 
> * Each node has a back pointer as well as a forward pointer
> 
> * Each skiplist node stores an array of pointers to items.  This is a huge
> speed improvement because the array of keys is much more cache friendly.
> 
> Since the array of item pointers is something of an extension
> to the basic skiplist, I've separated them out into two structs.
> First sl_node (just the indexing pointers) and then sl_leaf,
> which has an sl_node and the array of item pointers.
> 
> There's no cache benefit to the leaves if I'm just storing
> an array of pointers though, so it also has an array
> of keys:
> 
>         unsigned long keys[SKIP_KEYS_PER_NODE];
> 	struct sl_slot *ptrs[SKIP_KEYS_PER_NODE];
> 
> The keys array is used for the first search pass, and based
> on that result I narrow things down and only have to follow
> one or two pointers from the corresponding ptrs array.
> 
> The sl_slot is very simple:
> 
> struct sl_slot {
> 	unsigned long key;
> 	unsigned long size;
> }
> 
> The idea is to embed the sl_slot in your struct, giving us something like
> this:
> 
>                    sl_leaf
>                 ________________
>                 | node ptr N    |
>                 |   ....        |
>                 | node ptr 0    |
>                 |_______________|
>                 |   | keys   |  |
>                 |___|________|__|
>                 | . | ptrs   |. |
>                 |___|________|__|
>                  /             \
>                 /               \
>          -------------            ------------
>          | sl_slot 0 |           | sl_slot N |
>          |           |           |           |
>          | data      |           |  data     |
>          -------------           -------------
> 	    your
> 	    struct
> 
> This basic setup gives us similar performance to rbtrees, but uses less memory.
> The performance varies (a lot really) with the size of the keys array.
> 
> Locking is a mixture of RCU and per-node locking.  For searches,
> it can be pure RCU, or it can use the per-node lock to synchronize
> the final search though the last leaf.  But, everything from the
> skiplist head to that final node is RCU either way.
> 
> Insert locking is slightly different.  Before the insert is started,
> a new node is preallocated.  The height of this node is used to
> pick the level where we start locking nodes during the insert.  Much
> of the time we're able to get through huge portions of the list
> without any locks.
> 
> For deletes, it does an RCU search for the leaf, and hold the per-node lock
> while we remove a specific slot.  If the leaf is empty it is marked as dead and
> then unlinked from the skiplist level by level.
> 
> All the locking and tries does slow things down a bit, but the benefits
> for concurrent access make a big difference.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

next prev parent reply	other threads:[~2013-05-03  9:19 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-05-03  2:02 [PATCH RFC 0/2] skiplists for range indexes Chris Mason
2013-05-03  2:06 ` [PATCH RFC 1/2] core skiplist code Chris Mason
2013-05-03  2:10 ` [PATCH RFC 2/2] skiplists for the IOMMU Chris Mason
2013-05-03  9:19 ` Jan Kara [this message]
2013-05-03 10:45   ` [PATCH RFC 0/2] skiplists for range indexes Chris Mason
2013-05-04  3:25     ` Dave Chinner
2013-05-04 11:11       ` Chris Mason
2013-05-05  7:33         ` Dave Chinner
2013-05-05 14:38           ` Chris Mason
2013-05-05 22:44             ` Dave Chinner
2013-05-06 11:28               ` [BULK] " Chris Mason
2013-05-07  2:12                 ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130503091924.GD10633@quack.suse.cz \
    --to=jack@suse.cz \
    --cc=David.Woodhouse@intel.com \
    --cc=bo.li.liu@oracle.com \
    --cc=chris.mason@fusionio.com \
    --cc=dchinner@redhat.com \
    --cc=linux-fsdevel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).