From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S266572AbUGQA42 (ORCPT ); Fri, 16 Jul 2004 20:56:28 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S266666AbUGQA42 (ORCPT ); Fri, 16 Jul 2004 20:56:28 -0400 Received: from mail.ocs.com.au ([202.147.117.210]:24771 "EHLO mail.ocs.com.au") by vger.kernel.org with ESMTP id S266572AbUGQA4Y (ORCPT ); Fri, 16 Jul 2004 20:56:24 -0400 X-Mailer: exmh version 2.6.3_20040314 03/14/2004 with nmh-1.0.4 From: Keith Owens To: William Lee Irwin III Cc: Jesse Barnes , Chris Wright , Ravikiran G Thirumalai , linux-kernel@vger.kernel.org, dipankar@in.ibm.com Subject: Re: [RFC] Lock free fd lookup In-reply-to: Your message of "Thu, 15 Jul 2004 23:27:32 MST." <20040716062732.GG3411@holomorphy.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Date: Sat, 17 Jul 2004 10:55:59 +1000 Message-ID: <903.1090025759@ocs3.ocs.com.au> Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 15 Jul 2004 23:27:32 -0700, William Lee Irwin III wrote: >The gist of all this is that busywait-free atomic updates are only >implementable for data structures that don't link through the object >but rather maintain an external index with a single pointer to elements >needing updates, like radix trees, B+ trees, arrays of pointers, and >open-addressed hashtables. There are algorithms for busywait-free (lock free) traversal and concurrent update of lists and structures that contain embedded pointers. I once had to implement one on an O/S where the user space to kernel wait/alarm semantics could not satisfy the timing requirements of the calling code (don't ask). The beauty of these lockfree algorithms on large structures is that nothing ever stalls indefinitely. If the underlying SMP cache hardware is fair, everything running a lockfree list will make progress. These algorithms do not suffer from reader vs. writer starvation. The downside is that the algorithms rely on an extra sequence field per word. They copy out the structure, modify the local copy, write back with an update of sequence field. Writing back detects if any other update has invalidated the current structure, at which point the second update is discarded and the writer redrives its update. Readers are guaranteed to see a consistent view of the copy, even if the master structure is being updated at the same time as it is being read. Bottom line, it can be done but ... * The structure size doubles with the addition of the sequence fields. * The hardware must support cmpxchg on the combination of the sequence field and the data word that it protects. LL/SC can be used instead of cmpxchg if the hardware supports LL/SC. * If the hardware only supports cmpxchg4 then the sequence field is restricted to 2 bytes, which increases the risk of wraparound. If an update is delayed for exactly the wraparound interval then the data may be corrupted, that is a standard risk of small sequence fields. * The extra copy out just to read a structure needs more memory bandwidth, plus local allocation for the copy. * The internal code of the API to traverse a list containing lockfree structures is pretty messy, although that is hidden from the caller. * All writers must preserve enough state to redrive their update from scratch if a concurrent update has changed the same structure. Note, only the curent structure, not the rest of the list. * It requires type stable storage. That is, once a data area has been allocated to a particular structure type, it always contains that structure type, even when it has been freed from the list. Each list requires its own free pool, which can never be returned to the OS. Nice research topics, but not suitable for day to day work. I only used the lockfree algorithms because I could not find an alternative on that particular OS. I am not sure that the Linux kernel has any problems that require the additional complexity. For some light reading, the code I implemented was based on Practical Implementations of Non-Blocking Synchronization Primitives by Mark Moir http://research.sun.com/scalable/Papers/moir-podc97.ps. Note that page 6, line 6 has an error. Replace "y := z" with "y := addr->data[i];". I implemented a completely lockfree 2-3-4 tree with embedded pointers using Moir's algorithm. The implementation allowed multiple concurrent readers and writers. It even allowed concurrent list insert, delete and rebalancing, while the structures on the were being read and updated. 2-3-4 trees are self balancing, which gave decent lookup performance. Since red-black trees are logically equivalent to 2-3-4 trees it should be possible to use lockfree red-black trees. However I could not come up with a lockfree red-black tree, mainly because an read-black insert requires atomic changes to multiple structures. The 2-3-4 tree only needs atomic update to one structure at a time.