Re: [RFC] Lock free fd lookup

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Keith Owens <kaos@ocs.com.au>
To: William Lee Irwin III <wli@holomorphy.com>
Cc: Jesse Barnes <jbarnes@engr.sgi.com>,
	Chris Wright <chrisw@osdl.org>,
	Ravikiran G Thirumalai <kiran@in.ibm.com>,
	linux-kernel@vger.kernel.org, dipankar@in.ibm.com
Subject: Re: [RFC] Lock free fd lookup
Date: Sat, 17 Jul 2004 10:55:59 +1000	[thread overview]
Message-ID: <903.1090025759@ocs3.ocs.com.au> (raw)
In-Reply-To: Your message of "Thu, 15 Jul 2004 23:27:32 MST." <20040716062732.GG3411@holomorphy.com>

On Thu, 15 Jul 2004 23:27:32 -0700, 
William Lee Irwin III <wli@holomorphy.com> wrote:
>The gist of all this is that busywait-free atomic updates are only
>implementable for data structures that don't link through the object
>but rather maintain an external index with a single pointer to elements
>needing updates, like radix trees, B+ trees, arrays of pointers, and
>open-addressed hashtables.

There are algorithms for busywait-free (lock free) traversal and
concurrent update of lists and structures that contain embedded
pointers.  I once had to implement one on an O/S where the user space
to kernel wait/alarm semantics could not satisfy the timing
requirements of the calling code (don't ask).

The beauty of these lockfree algorithms on large structures is that
nothing ever stalls indefinitely.  If the underlying SMP cache hardware
is fair, everything running a lockfree list will make progress.  These
algorithms do not suffer from reader vs. writer starvation.

The downside is that the algorithms rely on an extra sequence field per
word.  They copy out the structure, modify the local copy, write back
with an update of sequence field.  Writing back detects if any other
update has invalidated the current structure, at which point the second
update is discarded and the writer redrives its update.  Readers are
guaranteed to see a consistent view of the copy, even if the master
structure is being updated at the same time as it is being read.

Bottom line, it can be done but ...

* The structure size doubles with the addition of the sequence fields.
* The hardware must support cmpxchg on the combination of the sequence
  field and the data word that it protects.  LL/SC can be used instead
  of cmpxchg if the hardware supports LL/SC.
* If the hardware only supports cmpxchg4 then the sequence field is
  restricted to 2 bytes, which increases the risk of wraparound.  If an
  update is delayed for exactly the wraparound interval then the data
  may be corrupted, that is a standard risk of small sequence fields.
* The extra copy out just to read a structure needs more memory
  bandwidth, plus local allocation for the copy.
* The internal code of the API to traverse a list containing lockfree
  structures is pretty messy, although that is hidden from the caller.
* All writers must preserve enough state to redrive their update from
  scratch if a concurrent update has changed the same structure.  Note,
  only the curent structure, not the rest of the list.
* It requires type stable storage.  That is, once a data area has been
  allocated to a particular structure type, it always contains that
  structure type, even when it has been freed from the list.  Each list
  requires its own free pool, which can never be returned to the OS.

Nice research topics, but not suitable for day to day work.  I only
used the lockfree algorithms because I could not find an alternative on
that particular OS.  I am not sure that the Linux kernel has any
problems that require the additional complexity.

For some light reading, the code I implemented was based on Practical
Implementations of Non-Blocking Synchronization Primitives by Mark Moir
http://research.sun.com/scalable/Papers/moir-podc97.ps.  Note that page
6, line 6 has an error.  Replace "y := z" with "y := addr->data[i];".

I implemented a completely lockfree 2-3-4 tree with embedded pointers
using Moir's algorithm.  The implementation allowed multiple concurrent
readers and writers.  It even allowed concurrent list insert, delete
and rebalancing, while the structures on the were being read and
updated.

2-3-4 trees are self balancing, which gave decent lookup performance.
Since red-black trees are logically equivalent to 2-3-4 trees it should
be possible to use lockfree red-black trees.  However I could not come
up with a lockfree red-black tree, mainly because an read-black insert
requires atomic changes to multiple structures.  The 2-3-4 tree only
needs atomic update to one structure at a time.

next prev parent reply	other threads:[~2004-07-17  0:56 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2004-07-14  4:53 [RFC] Refcounting of objects part of a lockfree collection Ravikiran G Thirumalai
2004-07-14  4:56 ` [RFC] Lock free fd lookup Ravikiran G Thirumalai
2004-07-14 15:17   ` Chris Wright
2004-07-15 14:22     ` Jesse Barnes
2004-07-15 16:10       ` Dipankar Sarma
2004-07-15 16:22         ` Jesse Barnes
2004-07-15 16:34         ` Chris Wright
2004-07-16  5:38           ` Ravikiran G Thirumalai
2004-07-16  6:27       ` William Lee Irwin III
2004-07-17  0:55         ` Keith Owens [this message]
2004-07-17  1:19           ` William Lee Irwin III
2004-07-17  2:12             ` Keith Owens
2004-07-17  2:34               ` William Lee Irwin III
2004-07-17  2:28             ` Keith Owens
2004-07-17  3:16               ` William Lee Irwin III
2004-07-17 13:48     ` Peter Zijlstra
2004-07-14  7:07 ` [RFC] Refcounting of objects part of a lockfree collection Greg KH
2004-07-14  8:26   ` Dipankar Sarma
2004-07-14 14:26     ` Greg KH
2004-07-14 15:22       ` Dipankar Sarma
2004-07-14 17:03         ` Greg KH
2004-07-14 17:49           ` Dipankar Sarma
2004-07-14 18:03             ` Greg KH
2004-07-15  6:21       ` Ravikiran G Thirumalai
2004-07-15  6:56         ` Dipankar Sarma
2004-07-14  8:57   ` Ravikiran G Thirumalai
2004-07-14 17:08     ` Greg KH
2004-07-14 18:17       ` Dipankar Sarma
2004-07-15  8:02       ` Ravikiran G Thirumalai
2004-07-15  9:36         ` Dipankar Sarma
2004-07-16 14:32         ` Greg KH
2004-07-16 15:50           ` Ravikiran G Thirumalai
  -- strict thread matches above, loose matches on Subject: below --
2004-07-17  8:50 [RFC] Lock free fd lookup Manfred Spraul
2004-07-17  9:30 ` William Lee Irwin III
2004-07-17 19:17 Albert Cahalan
2004-07-29  0:14 ` William Lee Irwin III

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=903.1090025759@ocs3.ocs.com.au \
    --to=kaos@ocs.com.au \
    --cc=chrisw@osdl.org \
    --cc=dipankar@in.ibm.com \
    --cc=jbarnes@engr.sgi.com \
    --cc=kiran@in.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=wli@holomorphy.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox