From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Eric Dumazet <dada1@cosmosbay.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>,
David Miller <davem@davemloft.net>,
Corey Minyard <minyard@acm.org>,
Stephen Hemminger <shemminger@vyatta.com>,
benny+usenet@amorsen.dk,
Linux Netdev List <netdev@vger.kernel.org>,
Christoph Lameter <cl@linux-foundation.org>,
Evgeniy Polyakov <zbr@ioremap.net>,
Christian Bell <christian@myri.com>
Subject: Re: [PATCH 4/3] rcu: documents rculist_nulls
Date: Wed, 19 Nov 2008 09:07:03 -0800 [thread overview]
Message-ID: <20081119170703.GB6753@linux.vnet.ibm.com> (raw)
In-Reply-To: <491C4FA8.4020704@cosmosbay.com>
On Thu, Nov 13, 2008 at 05:02:48PM +0100, Eric Dumazet wrote:
> Eric Dumazet a écrit :
>> Peter Zijlstra a écrit :
>>> So by not using some memory barriers (would be nice to have it
>>> illustrated which ones), we can race and end up on the wrong chain, in
>>> case that happens we detect this by using this per-chain terminator and
>>> try again.
>>>
>>> It would be really good to have it explained in the rculist_nulls.h
>>> comments what memory barriers are missing, what races they open, and how
>>> the this special terminator trick closes that race.
>> OK, maybe I should add a Documentation/RCU/rculist_nulls.txt file with
>> appropriate examples and documentation.
>> (Say the lookup/insert algorithms, with standard hlist and memory
>> barriers,
>> and with hlist_nulls without those two memory barriers.
>
> [PATCH 4/3] rcu: documents rculist_nulls
Very good -- only one small suggestion below.
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Adds Documentation/RCU/rculist_nulls.txt file to describe how 'nulls'
> end-of-list can help in some RCU algos.
>
>
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> ---
> Documentation/RCU/rculist_nulls.txt | 167 ++++++++++++++++++++++++++
> 1 files changed, 167 insertions(+)
> diff --git a/Documentation/RCU/rculist_nulls.txt b/Documentation/RCU/rculist_nulls.txt
> new file mode 100644
> index 0000000..5db5549
> --- /dev/null
> +++ b/Documentation/RCU/rculist_nulls.txt
> @@ -0,0 +1,167 @@
> +Using hlist_nulls to protect read-mostly linked lists and
> +objects using SLAB_DESTROY_BY_RCU allocations.
> +
> +Please read the basics in Documentation/RCU/listRCU.txt
> +
> +Using special makers (called 'nulls') is a convenient way
> +to solve following problem :
> +
> +A typical RCU linked list managing objects which are
> +allocated with SLAB_DESTROY_BY_RCU kmem_cache can
> +use following algos :
> +
> +1) Lookup algo
> +--------------
> +rcu_read_lock()
> +begin:
> +obj = lockless_lookup(key);
> +if (obj) {
> + if (!try_get_ref(obj)) // might fail for free objects
> + goto begin;
> + /*
> + * Because a writer could delete object, and a writer could
> + * reuse these object before the RCU grace period, we
> + * must check key after geting the reference on object
> + */
> + if (obj->key != key) { // not the object we expected
In some cases, a "generation number" will be needed. For example, there
are algorithms that must detect when an object has been removed and then
re-inserted with the same key. One increments the generation number
on each free and sometimes also on each allocation.
> + put_ref(obj);
> + goto begin;
> + }
> +}
> +rcu_read_unlock();
> +
> +Beware that lockless_lookup(key) cannot use traditional hlist_for_each_entry_rcu()
> +but a version with an additional memory barrier (smp_rmb())
> +
> +lockless_lookup(key)
> +{
> + struct hlist_node *node, *next;
> + for (pos = rcu_dereference((head)->first);
> + pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) &&
> + ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; });
> + pos = rcu_dereference(next))
> + if (obj->key == key)
> + return obj;
> + return NULL;
> +
> +And note the traditional hlist_for_each_entry_rcu() misses this smp_rmb() :
> +
> + struct hlist_node *node;
> + for (pos = rcu_dereference((head)->first);
> + pos && ({ prefetch(pos->next); 1; }) &&
> + ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; });
> + pos = rcu_dereference(pos->next))
> + if (obj->key == key)
> + return obj;
> + return NULL;
> +}
> +
> +Quoting Corey Minyard :
> +
> +"If the object is moved from one list to another list in-between the
> + time the hash is calculated and the next field is accessed, and the
> + object has moved to the end of a new list, the traversal will not
> + complete properly on the list it should have, since the object will
> + be on the end of the new list and there's not a way to tell it's on a
> + new list and restart the list traversal. I think that this can be
> + solved by pre-fetching the "next" field (with proper barriers) before
> + checking the key."
> +
> +2) Insert algo :
> +----------------
> +
> +We need to make sure a reader cannot read the new 'obj->obj_next' value
> +and previous value of 'obj->key'. Or else, an item could be deleted
> +from a chain, and inserted into another chain. If new chain was empty
> +before the move, 'next' pointer is NULL, and lockless reader can
> +not detect it missed following items in original chain.
> +
> +/*
> + * Please note that new inserts are done at the head of list,
> + * not in the middle or end.
> + */
> +obj = kmem_cache_alloc(...);
> +lock_chain(); // typically a spin_lock()
> +obj->key = key;
> +atomic_inc(&obj->refcnt);
> +/*
> + * we need to make sure obj->key is updated before obj->next
> + */
> +smp_wmb();
> +hlist_add_head_rcu(&obj->obj_node, list);
> +unlock_chain(); // typically a spin_unlock()
> +
> +
> +3) Remove algo
> +--------------
> +Nothing special here, we can use a standard RCU hlist deletion.
> +But thanks to SLAB_DESTROY_BY_RCU, beware a deleted object can be reused
> +very very fast (before the end of RCU grace period)
> +
> +if (put_last_reference_on(obj) {
> + lock_chain(); // typically a spin_lock()
> + hlist_del_init_rcu(&obj->obj_node);
> + unlock_chain(); // typically a spin_unlock()
> + kmem_cache_free(cachep, obj);
> +}
> +
> +
> +
> +--------------------------------------------------------------------------
> +With hlist_nulls we can avoid extra smp_rmb() in lockless_lookup()
> +and extra smp_wmb() in insert function.
> +
> +For example, if we choose to store the slot number as the 'nulls'
> +end-of-list marker for each slot of the hash table, we can detect
> +a race (some writer did a delete and/or a move of an object
> +to another chain) checking the final 'nulls' value if
> +the lookup met the end of chain. If final 'nulls' value
> +is not the slot number, then we must restart the lookup at
> +the begining. If the object was moved to same chain,
> +then the reader doesnt care : It might eventually
> +scan the list again without harm.
> +
> +
> +1) lookup algo
> +
> + head = &table[slot];
> + rcu_read_lock();
> +begin:
> + hlist_nulls_for_each_entry_rcu(obj, node, head, member) {
> + if (obj->key == key) {
> + if (!try_get_ref(obj)) // might fail for free objects
> + goto begin;
> + if (obj->key != key) { // not the object we expected
> + put_ref(obj);
> + goto begin;
> + }
> + goto out;
> + }
> +/*
> + * if the nulls value we got at the end of this lookup is
> + * not the expected one, we must restart lookup.
> + * We probably met an item that was moved to another chain.
> + */
> + if (get_nulls_value(node) != slot)
> + goto begin;
> + obj = NULL;
> +
> +out:
> + rcu_read_unlock();
> +
> +2) Insert function :
> +--------------------
> +
> +/*
> + * Please note that new inserts are done at the head of list,
> + * not in the middle or end.
> + */
> +obj = kmem_cache_alloc(cachep);
> +lock_chain(); // typically a spin_lock()
> +obj->key = key;
> +atomic_set(&obj->refcnt, 1);
> +/*
> + * insert obj in RCU way (readers might be traversing chain)
> + */
> +hlist_nulls_add_head_rcu(&obj->obj_node, list);
> +unlock_chain(); // typically a spin_unlock()
next prev parent reply other threads:[~2008-11-19 17:07 UTC|newest]
Thread overview: 134+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-10-06 18:50 [PATCH 3/3] Convert the UDP hash lock to RCU Corey Minyard
2008-10-06 21:22 ` Eric Dumazet
2008-10-06 21:40 ` David Miller
2008-10-06 23:08 ` Corey Minyard
2008-10-07 8:37 ` Evgeniy Polyakov
2008-10-07 14:16 ` Christoph Lameter
2008-10-07 14:29 ` Evgeniy Polyakov
2008-10-07 14:38 ` Christoph Lameter
2008-10-07 14:33 ` Paul E. McKenney
2008-10-07 14:45 ` Christoph Lameter
2008-10-07 15:07 ` Eric Dumazet
2008-10-07 15:07 ` Paul E. McKenney
2008-10-07 5:24 ` Eric Dumazet
2008-10-07 8:54 ` Benny Amorsen
2008-10-07 12:59 ` Eric Dumazet
2008-10-07 14:07 ` Stephen Hemminger
2008-10-07 20:55 ` David Miller
2008-10-07 21:20 ` Stephen Hemminger
2008-10-08 13:55 ` Eric Dumazet
2008-10-08 18:45 ` David Miller
2008-10-28 20:37 ` [PATCH 1/2] udp: introduce struct udp_table and multiple rwlocks Eric Dumazet
2008-10-28 21:23 ` Christian Bell
2008-10-28 21:31 ` Evgeniy Polyakov
2008-10-28 21:48 ` Eric Dumazet
2008-10-28 21:28 ` Evgeniy Polyakov
2008-10-28 20:42 ` [PATCH 2/2] udp: RCU handling for Unicast packets Eric Dumazet
2008-10-28 22:45 ` Eric Dumazet
2008-10-29 5:05 ` David Miller
2008-10-29 8:23 ` Eric Dumazet
2008-10-29 8:56 ` David Miller
2008-10-29 10:19 ` Eric Dumazet
2008-10-29 18:19 ` David Miller
2008-10-29 9:04 ` Eric Dumazet
2008-10-29 9:17 ` David Miller
2008-10-29 13:17 ` Corey Minyard
2008-10-29 14:36 ` Eric Dumazet
2008-10-29 15:34 ` Corey Minyard
2008-10-29 16:09 ` Eric Dumazet
2008-10-29 16:37 ` Paul E. McKenney
2008-10-29 17:22 ` Corey Minyard
2008-10-29 17:45 ` Eric Dumazet
2008-10-29 18:28 ` Corey Minyard
2008-10-29 18:52 ` Paul E. McKenney
2008-10-29 20:00 ` Eric Dumazet
2008-10-29 20:17 ` Paul E. McKenney
2008-10-29 21:29 ` Corey Minyard
2008-10-29 21:57 ` Eric Dumazet
2008-10-29 21:58 ` Paul E. McKenney
2008-10-29 22:08 ` Eric Dumazet
2008-10-30 3:22 ` Corey Minyard
2008-10-30 5:50 ` Eric Dumazet
2008-11-02 4:19 ` David Miller
2008-10-30 5:40 ` David Miller
2008-10-30 5:51 ` Eric Dumazet
2008-10-30 7:04 ` Eric Dumazet
2008-10-30 7:05 ` David Miller
2008-10-30 15:40 ` [PATCH] udp: Introduce special NULL pointers for hlist termination Eric Dumazet
2008-10-30 15:51 ` Stephen Hemminger
2008-10-30 16:28 ` Corey Minyard
2008-10-31 14:37 ` Eric Dumazet
2008-10-31 14:55 ` Pavel Emelyanov
2008-11-02 4:22 ` David Miller
2008-10-30 17:12 ` Eric Dumazet
2008-10-31 7:51 ` David Miller
2008-10-30 16:01 ` Peter Zijlstra
2008-10-31 0:14 ` Keith Owens
2008-11-13 13:13 ` [PATCH 0/3] net: RCU lookups for UDP, DCCP and TCP protocol Eric Dumazet
2008-11-13 17:20 ` Andi Kleen
2008-11-17 3:41 ` David Miller
2008-11-19 19:52 ` Christoph Lameter
2008-11-13 13:14 ` [PATCH 1/3] rcu: Introduce hlist_nulls variant of hlist Eric Dumazet
2008-11-13 13:29 ` Peter Zijlstra
2008-11-13 13:44 ` Eric Dumazet
2008-11-13 16:02 ` [PATCH 4/3] rcu: documents rculist_nulls Eric Dumazet
2008-11-14 15:16 ` Peter Zijlstra
2008-11-17 3:36 ` David Miller
2008-11-19 17:07 ` Paul E. McKenney [this message]
2008-11-14 15:16 ` [PATCH 1/3] rcu: Introduce hlist_nulls variant of hlist Peter Zijlstra
2008-11-19 17:01 ` Paul E. McKenney
2008-11-19 17:53 ` Eric Dumazet
2008-11-19 18:46 ` Paul E. McKenney
2008-11-19 18:53 ` Arnaldo Carvalho de Melo
2008-11-19 21:17 ` Paul E. McKenney
2008-11-19 20:39 ` Eric Dumazet
2008-11-19 21:21 ` Paul E. McKenney
2008-11-13 13:15 ` [PATCH 2/3] udp: Use hlist_nulls in UDP RCU code Eric Dumazet
2008-11-19 17:29 ` Paul E. McKenney
2008-11-19 17:53 ` Eric Dumazet
2008-11-13 13:15 ` [PATCH 3/3] net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls Eric Dumazet
2008-11-13 13:34 ` Peter Zijlstra
2008-11-13 13:51 ` Eric Dumazet
2008-11-13 14:08 ` Christoph Lameter
2008-11-13 14:22 ` Peter Zijlstra
2008-11-13 14:27 ` Christoph Lameter
2008-11-19 17:53 ` Paul E. McKenney
2008-11-23 9:33 ` [PATCH] net: Convert TCP/DCCP listening hash tables to use RCU Eric Dumazet
2008-11-23 15:59 ` Paul E. McKenney
2008-11-23 18:42 ` Eric Dumazet
2008-11-23 19:17 ` Paul E. McKenney
2008-11-23 20:18 ` Eric Dumazet
2008-11-23 22:33 ` Paul E. McKenney
2008-11-24 1:23 ` David Miller
2008-10-30 11:04 ` [PATCH 2/2] udp: RCU handling for Unicast packets Peter Zijlstra
2008-10-30 11:30 ` Eric Dumazet
2008-10-30 18:25 ` Paul E. McKenney
2008-10-31 16:40 ` Eric Dumazet
2008-11-01 3:10 ` Paul E. McKenney
2008-10-29 17:32 ` Eric Dumazet
2008-10-29 18:11 ` Paul E. McKenney
2008-10-29 18:29 ` David Miller
2008-10-29 18:38 ` Paul E. McKenney
2008-10-29 18:36 ` Eric Dumazet
2008-10-29 18:20 ` David Miller
2008-10-30 11:12 ` Peter Zijlstra
2008-10-30 11:29 ` Eric Dumazet
2008-10-28 20:37 ` [PATCH 0/2] udp: Convert the UDP hash lock to RCU Eric Dumazet
2008-10-28 21:28 ` Stephen Hemminger
2008-10-28 21:50 ` Eric Dumazet
2008-10-07 16:43 ` [PATCH 3/3] " Corey Minyard
2008-10-07 18:26 ` David Miller
2008-10-08 8:35 ` Eric Dumazet
2008-10-08 16:38 ` David Miller
2008-10-07 8:31 ` Peter Zijlstra
2008-10-07 14:36 ` Paul E. McKenney
2008-10-07 18:29 ` David Miller
2008-10-06 22:07 ` Corey Minyard
2008-10-07 8:17 ` Peter Zijlstra
2008-10-07 9:24 ` Eric Dumazet
2008-10-07 14:15 ` Christoph Lameter
2008-10-07 14:38 ` Paul E. McKenney
2008-10-07 14:50 ` Eric Dumazet
2008-10-07 15:05 ` Paul E. McKenney
2008-10-07 15:09 ` Peter Zijlstra
2008-10-07 15:23 ` Christoph Lameter
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20081119170703.GB6753@linux.vnet.ibm.com \
--to=paulmck@linux.vnet.ibm.com \
--cc=a.p.zijlstra@chello.nl \
--cc=benny+usenet@amorsen.dk \
--cc=christian@myri.com \
--cc=cl@linux-foundation.org \
--cc=dada1@cosmosbay.com \
--cc=davem@davemloft.net \
--cc=minyard@acm.org \
--cc=netdev@vger.kernel.org \
--cc=shemminger@vyatta.com \
--cc=zbr@ioremap.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).