* [RFC] RCU Judy array with distributed locking for FS extents @ 2013-06-03 5:27 Mathieu Desnoyers 2013-06-03 12:40 ` Chris Mason 0 siblings, 1 reply; 12+ messages in thread From: Mathieu Desnoyers @ 2013-06-03 5:27 UTC (permalink / raw) To: Chris Mason Cc: Linux FS Devel, David Woodhouse, dchinner@redhat.com, bo.li.liu, rp, Paul E. McKenney, Lai Jiangshan, Stephen Hemminger, Alan Stern Hi Chris, I stumbled the LWN article "A kernel skiplist implementation (Part 1)", and really thought I should tell you about the side-project I am currently working on: RCU Judy array, with locking distributed within the data structure. I developed my own design of this structure specifically to be usable with RCU. (ref. my LPC2012 presentation: https://www.efficios.com/lpc2012-scaling-rcu-judy-arrays-cache-efficient-compact-fast-and-scalable-trie) I think it would fit your extents use-case perfectly, with excellent cache usage, constant lookup time, near-linear scalability for lookups, and pretty good update scability. The code is available in a development branch of the Userspace RCU library: git://git.lttng.org/userspace-rcu.git branch: urcu/rcuja https://git.lttng.org/?p=userspace-rcu.git;a=shortlog;h=refs/heads/urcu/rcuja It runs entirely in user-space, although it can be ported easily to the Linux kernel. I'd be curious to see how it behaves with your workload. Relevant files: - rcuja/design.txt: design document of the data structure, - urcu/rcuja.h (public API) - rcuja/*.[ch]: implementation - tests/test_urcu_ja.c: stress-tests, example usage. After reading the article on skiplists, I added a new API specifically to handle non-overlapping ranges: struct cds_hlist_head cds_ja_lookup_lower_equal(struct cds_ja *ja, uint64_t key); AFAIU, "extents", as far as keys are concerned, are nothing more than segments, with start and end values. Extents can be indexed by the Judy array in the following way: we use the start of segment as node key, and keep the end of segment value within the leaf node. We add those nodes into the judy array, indexed by start-of-segment value. Then, when we want to figure out if a value matches a segment, we do the following: 1) lookup the possible segment match with cds_ja_lookup_lower_equal(), passing the key to lookup as parameter, 2) check that our key fits within the range of the segment using the end-of-segment value within the leaf node. Of course, this approach is limited to non-overlapping segments, so I hope this is what you need. Feedback is welcome! Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC] RCU Judy array with distributed locking for FS extents 2013-06-03 5:27 [RFC] RCU Judy array with distributed locking for FS extents Mathieu Desnoyers @ 2013-06-03 12:40 ` Chris Mason 2013-06-03 12:46 ` Mathieu Desnoyers 0 siblings, 1 reply; 12+ messages in thread From: Chris Mason @ 2013-06-03 12:40 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Linux FS Devel, David Woodhouse, dchinner@redhat.com, bo.li.liu@oracle.com, rp@svcs.cs.pdx.edu, Paul E. McKenney, Lai Jiangshan, Stephen Hemminger, Alan Stern Quoting Mathieu Desnoyers (2013-06-03 01:27:58) > Hi Chris, > > I stumbled the LWN article "A kernel skiplist implementation (Part 1)", > and really thought I should tell you about the side-project I am > currently working on: RCU Judy array, with locking distributed within > the data structure. I developed my own design of this structure > specifically to be usable with RCU. (ref. my LPC2012 presentation: > https://www.efficios.com/lpc2012-scaling-rcu-judy-arrays-cache-efficient-compact-fast-and-scalable-trie) > > I think it would fit your extents use-case perfectly, with excellent > cache usage, constant lookup time, near-linear scalability for lookups, > and pretty good update scability. > > The code is available in a development branch of the Userspace RCU > library: > > git://git.lttng.org/userspace-rcu.git branch: urcu/rcuja > https://git.lttng.org/?p=userspace-rcu.git;a=shortlog;h=refs/heads/urcu/rcuja > > It runs entirely in user-space, although it can be ported easily to the > Linux kernel. I'd be curious to see how it behaves with your workload. > > Relevant files: > - rcuja/design.txt: design document of the data structure, > - urcu/rcuja.h (public API) > - rcuja/*.[ch]: implementation > - tests/test_urcu_ja.c: stress-tests, example usage. > > After reading the article on skiplists, I added a new API specifically > to handle non-overlapping ranges: > > struct cds_hlist_head cds_ja_lookup_lower_equal(struct cds_ja *ja, > uint64_t key); > > AFAIU, "extents", as far as keys are concerned, are nothing more than > segments, with start and end values. > > Extents can be indexed by the Judy array in the following way: we use > the start of segment as node key, and keep the end of segment value > within the leaf node. We add those nodes into the judy array, indexed by > start-of-segment value. Then, when we want to figure out if a value > matches a segment, we do the following: > > 1) lookup the possible segment match with cds_ja_lookup_lower_equal(), > passing the key to lookup as parameter, > > 2) check that our key fits within the range of the segment using the > end-of-segment value within the leaf node. > > Of course, this approach is limited to non-overlapping segments, so I > hope this is what you need. Hi Mathieu, One problem here is that XFS wants to allow duplicate keys in the tree. This is possible with some modifications to the skiplist code, but I'm not sure if it fits into your description above. Regardless, it is worth trying out. I'll pull my current code back into userland and try out the userspace-rcu port (I had planned this anyway). That way we can have a proper head-to-head benchmark ;) -chris ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC] RCU Judy array with distributed locking for FS extents 2013-06-03 12:40 ` Chris Mason @ 2013-06-03 12:46 ` Mathieu Desnoyers 2013-06-03 13:07 ` Chris Mason 2013-06-04 11:54 ` Dave Chinner 0 siblings, 2 replies; 12+ messages in thread From: Mathieu Desnoyers @ 2013-06-03 12:46 UTC (permalink / raw) To: Chris Mason Cc: Linux FS Devel, David Woodhouse, dchinner@redhat.com, bo.li.liu@oracle.com, rp@svcs.cs.pdx.edu, Paul E. McKenney, Lai Jiangshan, Stephen Hemminger, Alan Stern * Chris Mason (clmason@fusionio.com) wrote: > Quoting Mathieu Desnoyers (2013-06-03 01:27:58) > > Hi Chris, > > > > I stumbled the LWN article "A kernel skiplist implementation (Part 1)", > > and really thought I should tell you about the side-project I am > > currently working on: RCU Judy array, with locking distributed within > > the data structure. I developed my own design of this structure > > specifically to be usable with RCU. (ref. my LPC2012 presentation: > > https://www.efficios.com/lpc2012-scaling-rcu-judy-arrays-cache-efficient-compact-fast-and-scalable-trie) > > > > I think it would fit your extents use-case perfectly, with excellent > > cache usage, constant lookup time, near-linear scalability for lookups, > > and pretty good update scability. > > > > The code is available in a development branch of the Userspace RCU > > library: > > > > git://git.lttng.org/userspace-rcu.git branch: urcu/rcuja > > https://git.lttng.org/?p=userspace-rcu.git;a=shortlog;h=refs/heads/urcu/rcuja > > > > It runs entirely in user-space, although it can be ported easily to the > > Linux kernel. I'd be curious to see how it behaves with your workload. > > > > Relevant files: > > - rcuja/design.txt: design document of the data structure, > > - urcu/rcuja.h (public API) > > - rcuja/*.[ch]: implementation > > - tests/test_urcu_ja.c: stress-tests, example usage. > > > > After reading the article on skiplists, I added a new API specifically > > to handle non-overlapping ranges: > > > > struct cds_hlist_head cds_ja_lookup_lower_equal(struct cds_ja *ja, > > uint64_t key); > > > > AFAIU, "extents", as far as keys are concerned, are nothing more than > > segments, with start and end values. > > > > Extents can be indexed by the Judy array in the following way: we use > > the start of segment as node key, and keep the end of segment value > > within the leaf node. We add those nodes into the judy array, indexed by > > start-of-segment value. Then, when we want to figure out if a value > > matches a segment, we do the following: > > > > 1) lookup the possible segment match with cds_ja_lookup_lower_equal(), > > passing the key to lookup as parameter, > > > > 2) check that our key fits within the range of the segment using the > > end-of-segment value within the leaf node. > > > > Of course, this approach is limited to non-overlapping segments, so I > > hope this is what you need. > > Hi Mathieu, > > One problem here is that XFS wants to allow duplicate keys in the tree. > This is possible with some modifications to the skiplist code, but I'm > not sure if it fits into your description above. Are those segments that completely overlap, or partially overlap ? > > Regardless, it is worth trying out. I'll pull my current code back into > userland and try out the userspace-rcu port (I had planned this anyway). > That way we can have a proper head-to-head benchmark ;) Cool! Don't hesitate to ping me if I can be of any help. Thanks, Mathieu > > -chris > -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC] RCU Judy array with distributed locking for FS extents 2013-06-03 12:46 ` Mathieu Desnoyers @ 2013-06-03 13:07 ` Chris Mason 2013-06-03 13:50 ` Mathieu Desnoyers 2013-06-04 11:54 ` Dave Chinner 1 sibling, 1 reply; 12+ messages in thread From: Chris Mason @ 2013-06-03 13:07 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Linux FS Devel, David Woodhouse, dchinner@redhat.com, bo.li.liu@oracle.com, rp@svcs.cs.pdx.edu, Paul E. McKenney, Lai Jiangshan, Stephen Hemminger, Alan Stern Quoting Mathieu Desnoyers (2013-06-03 08:46:01) > > Hi Mathieu, > > > > One problem here is that XFS wants to allow duplicate keys in the tree. > > This is possible with some modifications to the skiplist code, but I'm > > not sure if it fits into your description above. > > Are those segments that completely overlap, or partially overlap ? I believe completely overlap. On the skiplist side I'll make it possible for either one. > > > > > Regardless, it is worth trying out. I'll pull my current code back into > > userland and try out the userspace-rcu port (I had planned this anyway). > > That way we can have a proper head-to-head benchmark ;) > > Cool! Don't hesitate to ping me if I can be of any help. Once I get into the userland rcu code I'll probably have more questions. -chris ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC] RCU Judy array with distributed locking for FS extents 2013-06-03 13:07 ` Chris Mason @ 2013-06-03 13:50 ` Mathieu Desnoyers 0 siblings, 0 replies; 12+ messages in thread From: Mathieu Desnoyers @ 2013-06-03 13:50 UTC (permalink / raw) To: Chris Mason Cc: Linux FS Devel, David Woodhouse, dchinner@redhat.com, bo.li.liu@oracle.com, rp@svcs.cs.pdx.edu, Paul E. McKenney, Lai Jiangshan, Stephen Hemminger, Alan Stern * Chris Mason (clmason@fusionio.com) wrote: > Quoting Mathieu Desnoyers (2013-06-03 08:46:01) > > > Hi Mathieu, > > > > > > One problem here is that XFS wants to allow duplicate keys in the tree. > > > This is possible with some modifications to the skiplist code, but I'm > > > not sure if it fits into your description above. > > > > Are those segments that completely overlap, or partially overlap ? > > I believe completely overlap. On the skiplist side I'll make it > possible for either one. A complete segment overlap is conceptually the same as having duplicate segments, which makes it easy to handle in my Judy implementation: the nodes returned by a key lookup are a linked RCU hlist. So when a segment match is found for a key, iterating on the linked list of duplicate segments should achieve the intended goal. Partial overlap can be trickier. There might be extra overhead and complexity required to support those, so I'll wait until this becomes really needed before trying to figure out how to support them in Judy. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC] RCU Judy array with distributed locking for FS extents 2013-06-03 12:46 ` Mathieu Desnoyers 2013-06-03 13:07 ` Chris Mason @ 2013-06-04 11:54 ` Dave Chinner 2013-06-04 14:21 ` Chris Mason 1 sibling, 1 reply; 12+ messages in thread From: Dave Chinner @ 2013-06-04 11:54 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Chris Mason, Linux FS Devel, David Woodhouse, bo.li.liu@oracle.com, rp@svcs.cs.pdx.edu, Paul E. McKenney, Lai Jiangshan, Stephen Hemminger, Alan Stern On Mon, Jun 03, 2013 at 08:46:01AM -0400, Mathieu Desnoyers wrote: > * Chris Mason (clmason@fusionio.com) wrote: > > Quoting Mathieu Desnoyers (2013-06-03 01:27:58) > > > Hi Chris, > > > > > > I stumbled the LWN article "A kernel skiplist implementation (Part 1)", > > > and really thought I should tell you about the side-project I am > > > currently working on: RCU Judy array, with locking distributed within > > > the data structure. I developed my own design of this structure > > > specifically to be usable with RCU. (ref. my LPC2012 presentation: > > > https://www.efficios.com/lpc2012-scaling-rcu-judy-arrays-cache-efficient-compact-fast-and-scalable-trie) > > > > > > I think it would fit your extents use-case perfectly, with excellent > > > cache usage, constant lookup time, near-linear scalability for lookups, > > > and pretty good update scability. > > > > > > The code is available in a development branch of the Userspace RCU > > > library: > > > > > > git://git.lttng.org/userspace-rcu.git branch: urcu/rcuja > > > https://git.lttng.org/?p=userspace-rcu.git;a=shortlog;h=refs/heads/urcu/rcuja > > > > > > It runs entirely in user-space, although it can be ported easily to the > > > Linux kernel. I'd be curious to see how it behaves with your workload. > > > > > > Relevant files: > > > - rcuja/design.txt: design document of the data structure, > > > - urcu/rcuja.h (public API) > > > - rcuja/*.[ch]: implementation > > > - tests/test_urcu_ja.c: stress-tests, example usage. > > > > > > After reading the article on skiplists, I added a new API specifically > > > to handle non-overlapping ranges: > > > > > > struct cds_hlist_head cds_ja_lookup_lower_equal(struct cds_ja *ja, > > > uint64_t key); > > > > > > AFAIU, "extents", as far as keys are concerned, are nothing more than > > > segments, with start and end values. > > > > > > Extents can be indexed by the Judy array in the following way: we use > > > the start of segment as node key, and keep the end of segment value > > > within the leaf node. We add those nodes into the judy array, indexed by > > > start-of-segment value. Then, when we want to figure out if a value > > > matches a segment, we do the following: > > > > > > 1) lookup the possible segment match with cds_ja_lookup_lower_equal(), > > > passing the key to lookup as parameter, > > > > > > 2) check that our key fits within the range of the segment using the > > > end-of-segment value within the leaf node. > > > > > > Of course, this approach is limited to non-overlapping segments, so I > > > hope this is what you need. > > > > Hi Mathieu, > > > > One problem here is that XFS wants to allow duplicate keys in the tree. > > This is possible with some modifications to the skiplist code, but I'm > > not sure if it fits into your description above. > > Are those segments that completely overlap, or partially overlap ? Partial overlap. Can be wholly within an existing exist, or overlap start, end or both. Cheers, Dave. -- Dave Chinner dchinner@redhat.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC] RCU Judy array with distributed locking for FS extents 2013-06-04 11:54 ` Dave Chinner @ 2013-06-04 14:21 ` Chris Mason 2013-06-04 18:57 ` Mathieu Desnoyers ` (2 more replies) 0 siblings, 3 replies; 12+ messages in thread From: Chris Mason @ 2013-06-04 14:21 UTC (permalink / raw) To: Dave Chinner, Mathieu Desnoyers Cc: Linux FS Devel, David Woodhouse, bo.li.liu@oracle.com, rp@svcs.cs.pdx.edu, Paul E. McKenney, Lai Jiangshan, Stephen Hemminger, Alan Stern Quoting Dave Chinner (2013-06-04 07:54:56) > On Mon, Jun 03, 2013 at 08:46:01AM -0400, Mathieu Desnoyers wrote: > > * Chris Mason (clmason@fusionio.com) wrote: > > > Quoting Mathieu Desnoyers (2013-06-03 01:27:58) > > > > Hi Chris, > > > > > > > > I stumbled the LWN article "A kernel skiplist implementation (Part 1)", > > > > and really thought I should tell you about the side-project I am > > > > currently working on: RCU Judy array, with locking distributed within > > > > the data structure. I developed my own design of this structure > > > > specifically to be usable with RCU. (ref. my LPC2012 presentation: > > > > https://www.efficios.com/lpc2012-scaling-rcu-judy-arrays-cache-efficient-compact-fast-and-scalable-trie) > > > > > > > > I think it would fit your extents use-case perfectly, with excellent > > > > cache usage, constant lookup time, near-linear scalability for lookups, > > > > and pretty good update scability. > > > > > > > > The code is available in a development branch of the Userspace RCU > > > > library: > > > > > > > > git://git.lttng.org/userspace-rcu.git branch: urcu/rcuja > > > > https://git.lttng.org/?p=userspace-rcu.git;a=shortlog;h=refs/heads/urcu/rcuja > > > > > > > > It runs entirely in user-space, although it can be ported easily to the > > > > Linux kernel. I'd be curious to see how it behaves with your workload. > > > > > > > > Relevant files: > > > > - rcuja/design.txt: design document of the data structure, > > > > - urcu/rcuja.h (public API) > > > > - rcuja/*.[ch]: implementation > > > > - tests/test_urcu_ja.c: stress-tests, example usage. > > > > > > > > After reading the article on skiplists, I added a new API specifically > > > > to handle non-overlapping ranges: > > > > > > > > struct cds_hlist_head cds_ja_lookup_lower_equal(struct cds_ja *ja, > > > > uint64_t key); > > > > > > > > AFAIU, "extents", as far as keys are concerned, are nothing more than > > > > segments, with start and end values. > > > > > > > > Extents can be indexed by the Judy array in the following way: we use > > > > the start of segment as node key, and keep the end of segment value > > > > within the leaf node. We add those nodes into the judy array, indexed by > > > > start-of-segment value. Then, when we want to figure out if a value > > > > matches a segment, we do the following: > > > > > > > > 1) lookup the possible segment match with cds_ja_lookup_lower_equal(), > > > > passing the key to lookup as parameter, > > > > > > > > 2) check that our key fits within the range of the segment using the > > > > end-of-segment value within the leaf node. > > > > > > > > Of course, this approach is limited to non-overlapping segments, so I > > > > hope this is what you need. > > > > > > Hi Mathieu, > > > > > > One problem here is that XFS wants to allow duplicate keys in the tree. > > > This is possible with some modifications to the skiplist code, but I'm > > > not sure if it fits into your description above. > > > > Are those segments that completely overlap, or partially overlap ? > > Partial overlap. Can be wholly within an existing exist, or overlap > start, end or both. Ouch, ok. In private email yesterday I talked with Mathieu about how his current setup can't prevent the concurrent insertion of overlapping extents. He does have a plan to address this where the insertion is synchronized by keeping placeholders in the tree for the free space. I think it'll work, but I'm worried about doubling the cost of the insert. Mathieu, I'm going to spend some time on the kernel side of the skiplist code filling out the API and reworking the insert code to get rid of my cursors. I won't have a chance to move it into userland until at least next week. In terms of comparing the two, I'd rather compare in the kernel so we don't have any surprises. Are you willing to port the judy code in? -chris ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC] RCU Judy array with distributed locking for FS extents 2013-06-04 14:21 ` Chris Mason @ 2013-06-04 18:57 ` Mathieu Desnoyers 2013-06-05 23:48 ` Dave Chinner 2013-06-12 1:12 ` Mathieu Desnoyers 2 siblings, 0 replies; 12+ messages in thread From: Mathieu Desnoyers @ 2013-06-04 18:57 UTC (permalink / raw) To: Chris Mason Cc: Dave Chinner, Linux FS Devel, David Woodhouse, bo.li.liu@oracle.com, rp@svcs.cs.pdx.edu, Paul E. McKenney, Lai Jiangshan, Stephen Hemminger, Alan Stern * Chris Mason (clmason@fusionio.com) wrote: [...] > In terms of comparing the two, I'd rather compare in the kernel so we > don't have any surprises. Are you willing to port the judy code in? I'm in a similar situation as yours on the Judy side: I still need to clean up the API and extend its documentation (this is why the Judy code is still in a development branch), and I'd prefer to have the API stabilized on the userspace RCU side before porting it to the kernel. Moreover, I see that it might be a good thing to create an API on top of Judy to handle overlapping and non-overlapping sparse index ranges. Ideally, I'd like to implement this and test it in user-space before porting to the kernel. I'd be very much interested in helping out in porting your skip list code to Userspace RCU, I think this should be very much straightforward. By the way, would you be willing to license it under LGPLv2.1 so it could be integrated within the Userspace RCU code base ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC] RCU Judy array with distributed locking for FS extents 2013-06-04 14:21 ` Chris Mason 2013-06-04 18:57 ` Mathieu Desnoyers @ 2013-06-05 23:48 ` Dave Chinner 2013-06-12 1:12 ` Mathieu Desnoyers 2 siblings, 0 replies; 12+ messages in thread From: Dave Chinner @ 2013-06-05 23:48 UTC (permalink / raw) To: Chris Mason Cc: Dave Chinner, Mathieu Desnoyers, Linux FS Devel, David Woodhouse, bo.li.liu@oracle.com, rp@svcs.cs.pdx.edu, Paul E. McKenney, Lai Jiangshan, Stephen Hemminger, Alan Stern On Tue, Jun 04, 2013 at 10:21:32AM -0400, Chris Mason wrote: > Quoting Dave Chinner (2013-06-04 07:54:56) > > On Mon, Jun 03, 2013 at 08:46:01AM -0400, Mathieu Desnoyers wrote: > > > * Chris Mason (clmason@fusionio.com) wrote: > > > > Quoting Mathieu Desnoyers (2013-06-03 01:27:58) > > > > > Hi Chris, > > > > > > > > > > I stumbled the LWN article "A kernel skiplist implementation (Part 1)", > > > > > and really thought I should tell you about the side-project I am > > > > > currently working on: RCU Judy array, with locking distributed within > > > > > the data structure. I developed my own design of this structure > > > > > specifically to be usable with RCU. (ref. my LPC2012 presentation: > > > > > https://www.efficios.com/lpc2012-scaling-rcu-judy-arrays-cache-efficient-compact-fast-and-scalable-trie) > > > > > > > > > > I think it would fit your extents use-case perfectly, with excellent > > > > > cache usage, constant lookup time, near-linear scalability for lookups, > > > > > and pretty good update scability. > > > > > > > > > > The code is available in a development branch of the Userspace RCU > > > > > library: > > > > > > > > > > git://git.lttng.org/userspace-rcu.git branch: urcu/rcuja > > > > > https://git.lttng.org/?p=userspace-rcu.git;a=shortlog;h=refs/heads/urcu/rcuja > > > > > > > > > > It runs entirely in user-space, although it can be ported easily to the > > > > > Linux kernel. I'd be curious to see how it behaves with your workload. > > > > > > > > > > Relevant files: > > > > > - rcuja/design.txt: design document of the data structure, > > > > > - urcu/rcuja.h (public API) > > > > > - rcuja/*.[ch]: implementation > > > > > - tests/test_urcu_ja.c: stress-tests, example usage. > > > > > > > > > > After reading the article on skiplists, I added a new API specifically > > > > > to handle non-overlapping ranges: > > > > > > > > > > struct cds_hlist_head cds_ja_lookup_lower_equal(struct cds_ja *ja, > > > > > uint64_t key); > > > > > > > > > > AFAIU, "extents", as far as keys are concerned, are nothing more than > > > > > segments, with start and end values. > > > > > > > > > > Extents can be indexed by the Judy array in the following way: we use > > > > > the start of segment as node key, and keep the end of segment value > > > > > within the leaf node. We add those nodes into the judy array, indexed by > > > > > start-of-segment value. Then, when we want to figure out if a value > > > > > matches a segment, we do the following: > > > > > > > > > > 1) lookup the possible segment match with cds_ja_lookup_lower_equal(), > > > > > passing the key to lookup as parameter, > > > > > > > > > > 2) check that our key fits within the range of the segment using the > > > > > end-of-segment value within the leaf node. > > > > > > > > > > Of course, this approach is limited to non-overlapping segments, so I > > > > > hope this is what you need. > > > > > > > > Hi Mathieu, > > > > > > > > One problem here is that XFS wants to allow duplicate keys in the tree. > > > > This is possible with some modifications to the skiplist code, but I'm > > > > not sure if it fits into your description above. > > > > > > Are those segments that completely overlap, or partially overlap ? > > > > Partial overlap. Can be wholly within an existing exist, or overlap > > start, end or both. > > Ouch, ok. In private email yesterday I talked with Mathieu about how > his current setup can't prevent the concurrent insertion of overlapping > extents. He does have a plan to address this where the insertion is > synchronized by keeping placeholders in the tree for the free space. I > think it'll work, but I'm worried about doubling the cost of the insert. Insert cost probably isn't an issue for XFS. Most workloads have a buffer lookup ratio that is dominated by reads and cache hits. FWIW, I've been thinking of ways of avoiding needing overlaps in the tree to make it simpler to drop in replacements. The overlaps stem from buffers that map freed extents that haven't yet been written to the journal. They are free space, but we have to keep tracking it until the transactiont hat freed the extent hits the disk. In the mean time, it's free space, so it can be reallocated and a new buffer can be built that overlaps part or all of that extent that has been freed. If it's a perfect match we just reuse the existing buffer, but if it isn't we need to have both the new buffer and the old buffer in cache for some period of time... There's a couple of things I think I can do to rework this so that we don't need to track buffers over free space in the main cache, and that would remove all the overlap issues from the cache. That would certainly make things easier for dropping in some other type of index that doesn't support overlaps easily... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC] RCU Judy array with distributed locking for FS extents 2013-06-04 14:21 ` Chris Mason 2013-06-04 18:57 ` Mathieu Desnoyers 2013-06-05 23:48 ` Dave Chinner @ 2013-06-12 1:12 ` Mathieu Desnoyers 2013-06-13 1:25 ` Chris Mason 2 siblings, 1 reply; 12+ messages in thread From: Mathieu Desnoyers @ 2013-06-12 1:12 UTC (permalink / raw) To: Chris Mason Cc: Dave Chinner, Linux FS Devel, David Woodhouse, bo.li.liu@oracle.com, rp@svcs.cs.pdx.edu, Paul E. McKenney, Lai Jiangshan, Stephen Hemminger, Alan Stern * Chris Mason (clmason@fusionio.com) wrote: [...] > Ouch, ok. In private email yesterday I talked with Mathieu about how > his current setup can't prevent the concurrent insertion of overlapping > extents. He does have a plan to address this where the insertion is > synchronized by keeping placeholders in the tree for the free space. I > think it'll work, but I'm worried about doubling the cost of the insert. Hi Chris, The weekend and early week has been productive on my side. My updated work is available on this new branch: git://git.lttng.org/userspace-rcu.git branch: urcu/rcuja-range Since last week, I managed to: - expand the RCU Judy Array API documentation: https://git.lttng.org/?p=userspace-rcu.git;a=blob;f=urcu/rcuja.h;h=82e272bd4ede1aec436845aef287754dd1dab8b6;hb=03a50ae89ec4d7f39e91d0d49c4639c4cf6e894c - create an API for Judy Array Ranges, as discussed via email privately: API: https://git.lttng.org/?p=userspace-rcu.git;a=blob;f=urcu/rcuja-range.h;h=63035a1660888aa5f9b20548046571dcb54ad193;hb=03a50ae89ec4d7f39e91d0d49c4639c4cf6e894c Implementation: https://git.lttng.org/?p=userspace-rcu.git;a=blob;f=rcuja/rcuja-range.c;h=7e4585ef942d76f1811f3c958fff3138ac120ca3;hb=03a50ae89ec4d7f39e91d0d49c4639c4cf6e894c Please keep in mind that this code has only been moderately stress-tested (with up to 24 cores, on small keyspaces of 3, 5, 10, 100 keys, so races occur much more frequently). It should not be considered production-ready yet. The test code (and thus examples usage) is available here: https://git.lttng.org/?p=userspace-rcu.git;a=blob;f=tests/test_urcu_ja_range.c;h=12abcc51465b64a7124fb3e48a2150e225e145af;hb=03a50ae89ec4d7f39e91d0d49c4639c4cf6e894c https://git.lttng.org/?p=userspace-rcu.git;a=blob;f=tests/test_urcu_ja_range.h;h=e9bbdbc3ed7eb8f57e30c26b8789ba609a6bfdd9;hb=03a50ae89ec4d7f39e91d0d49c4639c4cf6e894c So far, my benchmarks shows near-linear read-side scalability (as expected from RCU). However, early results does not show the scalability I would have expected for concurrent updates. It's not as bad as, e.g., a global lock making performances crawl due to ping-pong between processors, but so far, roughly speaking, if I multiply the number of cores doing updates by e.g. 12, the per-core throughput of update stress-test gets divided by approximately 12. Therefore, the number of updates system-wide seems to stay constant as we increase the number of cores. I will try to get more info as I dig into more benchmarking, which may point at some memory-throughput bottlenecks. I stopped working on the range implementation thinking that I should wait to get some feedback before I start implementing more complex features like RCU-friendly range resize. Feedback is welcome! Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC] RCU Judy array with distributed locking for FS extents 2013-06-12 1:12 ` Mathieu Desnoyers @ 2013-06-13 1:25 ` Chris Mason 2013-06-16 14:02 ` Liu Bo 0 siblings, 1 reply; 12+ messages in thread From: Chris Mason @ 2013-06-13 1:25 UTC (permalink / raw) To: Mathieu Desnoyers Cc: Dave Chinner, Linux FS Devel, David Woodhouse, bo.li.liu@oracle.com, rp@svcs.cs.pdx.edu, Paul E. McKenney, Lai Jiangshan, Stephen Hemminger, Alan Stern Quoting Mathieu Desnoyers (2013-06-11 21:12:31) > * Chris Mason (clmason@fusionio.com) wrote: > [...] > > Ouch, ok. In private email yesterday I talked with Mathieu about how > > his current setup can't prevent the concurrent insertion of overlapping > > extents. He does have a plan to address this where the insertion is > > synchronized by keeping placeholders in the tree for the free space. I > > think it'll work, but I'm worried about doubling the cost of the insert. > > Hi Chris, > > The weekend and early week has been productive on my side. My updated > work is available on this new branch: > > git://git.lttng.org/userspace-rcu.git > branch: urcu/rcuja-range > > Since last week, I managed to: > > - expand the RCU Judy Array API documentation: > https://git.lttng.org/?p=userspace-rcu.git;a=blob;f=urcu/rcuja.h;h=82e272bd4ede1aec436845aef287754dd1dab8b6;hb=03a50ae89ec4d7f39e91d0d49c4639c4cf6e894c Nice > > - create an API for Judy Array Ranges, as discussed via email privately: > > API: > https://git.lttng.org/?p=userspace-rcu.git;a=blob;f=urcu/rcuja-range.h;h=63035a1660888aa5f9b20548046571dcb54ad193;hb=03a50ae89ec4d7f39e91d0d49c4639c4cf6e894c > > Implementation: > https://git.lttng.org/?p=userspace-rcu.git;a=blob;f=rcuja/rcuja-range.c;h=7e4585ef942d76f1811f3c958fff3138ac120ca3;hb=03a50ae89ec4d7f39e91d0d49c4639c4cf6e894c > > Please keep in mind that this code has only been moderately > stress-tested (with up to 24 cores, on small keyspaces of 3, 5, 10, 100 > keys, so races occur much more frequently). It should not be considered > production-ready yet. Ok, I'll definitely take a look. > > The test code (and thus examples usage) is available here: > https://git.lttng.org/?p=userspace-rcu.git;a=blob;f=tests/test_urcu_ja_range.c;h=12abcc51465b64a7124fb3e48a2150e225e145af;hb=03a50ae89ec4d7f39e91d0d49c4639c4cf6e894c > https://git.lttng.org/?p=userspace-rcu.git;a=blob;f=tests/test_urcu_ja_range.h;h=e9bbdbc3ed7eb8f57e30c26b8789ba609a6bfdd9;hb=03a50ae89ec4d7f39e91d0d49c4639c4cf6e894c > > So far, my benchmarks shows near-linear read-side scalability (as > expected from RCU). However, early results does not show the scalability > I would have expected for concurrent updates. It's not as bad as, e.g., > a global lock making performances crawl due to ping-pong between > processors, but so far, roughly speaking, if I multiply the number of > cores doing updates by e.g. 12, the per-core throughput of update > stress-test gets divided by approximately 12. Therefore, the number of > updates system-wide seems to stay constant as we increase the number of > cores. I will try to get more info as I dig into more benchmarking, > which may point at some memory-throughput bottlenecks. We're benchmarking different workloads, and I'm not sure how much of the scalability difference is from being in the kernel. One test I have here is a batch of deletion and reinsertion of keys at random. I'm running on a key space of 10 million keys. If I run the same number of random operations on 100,000 keys I get similar (but slightly faster) numbers: 100,000 random insertion and deletions batches: skiplist: 3.01s per thread rbtree: 2.1s per thread With 16 threads: skiplist: 5.8s per thread rbtree: ~70s per thread (ranges from 15s to 76s) The random part is crucial for scaling with the skiplists. The locks are per node, and as long as all the threads are working in different places things scale fairly well. > > I stopped working on the range implementation thinking that I should > wait to get some feedback before I start implementing more complex > features like RCU-friendly range resize. I really wanted to send out my code this morning, but I also wanted to match rbtrees single threaded first. It's much closer now, so I'm commenting and cleaning up what I have for posting tomorrow. I'll talk with Liu Bo about putting the skiplists under LGPL, but I'd love some help getting numbers against librcu. -chris ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [RFC] RCU Judy array with distributed locking for FS extents 2013-06-13 1:25 ` Chris Mason @ 2013-06-16 14:02 ` Liu Bo 0 siblings, 0 replies; 12+ messages in thread From: Liu Bo @ 2013-06-16 14:02 UTC (permalink / raw) To: Chris Mason Cc: Mathieu Desnoyers, Dave Chinner, Linux FS Devel, David Woodhouse, rp@svcs.cs.pdx.edu, Paul E. McKenney, Lai Jiangshan, Stephen Hemminger, Alan Stern On Wed, Jun 12, 2013 at 09:25:49PM -0400, Chris Mason wrote: > Quoting Mathieu Desnoyers (2013-06-11 21:12:31) > > * Chris Mason (clmason@fusionio.com) wrote: ... > > > > I stopped working on the range implementation thinking that I should > > wait to get some feedback before I start implementing more complex > > features like RCU-friendly range resize. > > I really wanted to send out my code this morning, but I also wanted to > match rbtrees single threaded first. It's much closer now, so I'm > commenting and cleaning up what I have for posting tomorrow. > > I'll talk with Liu Bo about putting the skiplists under LGPL, but I'd > love some help getting numbers against librcu. I'm ok with LGPL :) thanks, liubo > > -chris > > -- > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2013-06-16 14:03 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-06-03 5:27 [RFC] RCU Judy array with distributed locking for FS extents Mathieu Desnoyers 2013-06-03 12:40 ` Chris Mason 2013-06-03 12:46 ` Mathieu Desnoyers 2013-06-03 13:07 ` Chris Mason 2013-06-03 13:50 ` Mathieu Desnoyers 2013-06-04 11:54 ` Dave Chinner 2013-06-04 14:21 ` Chris Mason 2013-06-04 18:57 ` Mathieu Desnoyers 2013-06-05 23:48 ` Dave Chinner 2013-06-12 1:12 ` Mathieu Desnoyers 2013-06-13 1:25 ` Chris Mason 2013-06-16 14:02 ` Liu Bo
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).