From: Jan Kara
Subject: Re: [PATCH RFC 0/2] skiplists for range indexes
Date: Fri, 3 May 2013 11:19:24 +0200
Message-ID: <20130503091924.GD10633@quack.suse.cz>
In-Reply-To: <20130503020211.3599.72404@localhost.localdomain>
References: <20130503020211.3599.72404@localhost.localdomain>
To: Chris Mason
Cc: Linux FS Devel, David Woodhouse, dchinner@redhat.com, bo.li.liu@oracle.com

Hi,

On Thu 02-05-13 22:02:11, Chris Mason wrote:
> I've been working on skiplists for indexing off/on for a little while
> now. For btrfs, I really need a fast way to manage extents that doesn't
> bog down in lock contention for reads or writes. Dave Chinner has
> mentioned this as well, so I've cc'd him.
  But I guess you still need writer-writer exclusion, don't you? Or are
you able to do some partial locking of the structure to allow parallel
modifications?

> Liu Bo started this skiplist code, but at the time it didn't make
> a huge difference. I've reworked things a bit and used the IOMMU
> to test things. I ran some benchmarks on a single socket server with
> two SSD cards to get a baseline of how much it is helping.
>
> I tried to keep the basic structure of the IOMMU code, and there are
> probably many better ways to fix the IOMMU contention. Really I'm just
> trying to compare with rbtrees, not fix the IOMMU.
  There also exist some research papers on RCU-friendly RB-trees (or
other trees). Maybe they would be interesting to try? The basic trick is
to use copy-on-write when you need to do a rotation. This is slightly
impractical because it requires memory allocation (or alternatively
doubles the memory footprint of an rb node), but you need to do similar
stuff in skip lists anyway. Just an idea...
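Very roughly, I mean something like this (just a sketch: the names are
made up and error handling is omitted, so it illustrates the trick
rather than any real implementation). The point is that a rotation
never modifies a node readers can already see; it publishes a fixed-up
copy instead:

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct cow_rb_node {
	unsigned long key;
	struct cow_rb_node __rcu *left;
	struct cow_rb_node __rcu *right;
	struct rcu_head rcu;
};

/*
 * Right rotation around 'x'; 'link' is the parent's pointer to x.
 * The caller holds the writer-side lock. Concurrent RCU readers see
 * either the old subtree or the new one, never a half-rotated state.
 */
static void cow_rotate_right(struct cow_rb_node *x,
			     struct cow_rb_node __rcu **link)
{
	struct cow_rb_node *y = rcu_dereference_protected(x->left, 1);
	struct cow_rb_node *xc = kmalloc(sizeof(*xc), GFP_KERNEL);

	/* Build a private copy of x that adopts y's right subtree. */
	xc->key = x->key;
	RCU_INIT_POINTER(xc->left, rcu_dereference_protected(y->right, 1));
	RCU_INIT_POINTER(xc->right, rcu_dereference_protected(x->right, 1));

	/* Hook the copy under y, then publish y in x's place. */
	rcu_assign_pointer(y->right, xc);
	rcu_assign_pointer(*link, y);

	/* The old x is freed only after current readers are done. */
	kfree_rcu(x, rcu);
}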
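  The payoff is that the read side then needs no locking at all, e.g.
(same made-up types as above; the caller must be inside
rcu_read_lock()):

static struct cow_rb_node *cow_rb_lookup(struct cow_rb_node __rcu **rootp,
					 unsigned long key)
{
	struct cow_rb_node *n = rcu_dereference(*rootp);

	/* Every pointer we follow leads to either the old or the new
	 * version of a node, both internally consistent. */
	while (n && n->key != key) {
		if (key < n->key)
			n = rcu_dereference(n->left);
		else
			n = rcu_dereference(n->right);
	}
	return n;
}

The price is the extra allocation in the rotation path I mentioned
above, and writers still have to serialize among themselves somehow.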

								Honza

> Basic numbers (all aio/dio):
>
> Streaming writes
>     IOMMU off   2,575MB/s
>     skiplist    1,715MB/s
>     rbtree      1,659MB/s
>
> Not a huge improvement, but the CPU time was lower.
>
> 16 threads, iodepth 10, 20MB random reads
>     IOMMU off   2,861MB/s (mostly idle)
>     skiplist    2,484MB/s (100% system time)
>     rbtree         99MB/s (100% system time)
>
> 16 threads, iodepth 10, 20MB random writes
>     IOMMU off   2,548MB/s
>     skiplist    1,649MB/s
>     rbtree         33MB/s
>
> I ran this a bunch of times to make sure I got the rbtree numbers
> right. Lowering the thread count did improve the rbtree performance,
> but the best I could do was around 300MB/s. For skiplists, all of the
> CPU time is being spent in skiplist_insert_hole, which has a lot of
> room for improvement.
>
> From here the code needs a lot of testing, and I need to fill out the
> API to make it a little easier to use. But I wanted to send this
> around for comments and to see how others might want to use things.
> More details on all the internals are below.
>
> It starts with a basic skiplist implementation where:
>
> * skiplist nodes are dynamically sized. There is a slab cache for each
>   node height, so every node is actively using all of the pointers
>   allocated for it.
>
> * Each node has a back pointer as well as a forward pointer.
>
> * Each skiplist node stores an array of pointers to items. This is a
>   huge speed improvement because the array of keys is much more cache
>   friendly.
>
> Since the array of item pointers is something of an extension to the
> basic skiplist, I've separated them out into two structs: first
> sl_node (just the indexing pointers) and then sl_leaf, which has an
> sl_node and the array of item pointers.
>
> There's no cache benefit to the leaves if I'm just storing an array of
> pointers though, so the leaf also has an array of keys:
>
>         unsigned long keys[SKIP_KEYS_PER_NODE];
>         struct sl_slot *ptrs[SKIP_KEYS_PER_NODE];
>
> The keys array is used for the first search pass, and based on that
> result I narrow things down and only have to follow one or two
> pointers from the corresponding ptrs array.
>
> The sl_slot is very simple:
>
>         struct sl_slot {
>                 unsigned long key;
>                 unsigned long size;
>         };
>
> The idea is to embed the sl_slot in your struct, giving us something
> like this:
>
>             sl_leaf
>        ________________
>        |  node ptr N  |
>        |     ....     |
>        |  node ptr 0  |
>        |______________|
>        |     keys     |
>        |______________|
>        |     ptrs     |
>        |______________|
>           /        \
>          /          \
>   -------------    -------------
>   | sl_slot 0 |    | sl_slot N |
>   |           |    |           |
>   |   data    |    |   data    |
>   -------------    -------------
>       your             your
>      struct           struct
>
> This basic setup gives us similar performance to rbtrees, but uses
> less memory. The performance varies (a lot, really) with the size of
> the keys array.
>
> Locking is a mixture of RCU and per-node locking. For searches, it can
> be pure RCU, or it can use the per-node lock to synchronize the final
> search through the last leaf. Either way, everything from the skiplist
> head down to that final node is RCU.
>
> Insert locking is slightly different. Before the insert is started, a
> new node is preallocated. The height of this node is used to pick the
> level where we start locking nodes during the insert. Much of the time
> we're able to get through huge portions of the list without any locks.
>
> For deletes, we do an RCU search for the leaf and hold the per-node
> lock while we remove a specific slot. If the leaf is empty, it is
> marked as dead and then unlinked from the skiplist level by level.
>
> All the locking and retries do slow things down a bit, but the
> benefits for concurrent access make a big difference.

--
Jan Kara
SUSE Labs, CR