From mboxrd@z Thu Jan 1 00:00:00 1970 From: Josh Triplett Subject: Re: [PATCH 4/4] rhashtable: Add parent argument to mutex_is_held Date: Sat, 15 Nov 2014 11:12:20 -0800 Message-ID: <20141115191219.GA19060@thin> References: <20141113101025.GA3728@gondor.apana.org.au> <20141113103723.GO19157@casper.infradead.org> <20141113103834.GA4024@gondor.apana.org.au> <20141113104124.GA24379@casper.infradead.org> <20141113104343.GA4112@gondor.apana.org.au> <20141115032512.GA19299@gondor.apana.org.au> <20141115111626.GP19157@casper.infradead.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Herbert Xu , netdev@vger.kernel.org, eric.dumazet@gmail.com, paulmck@linux.vnet.ibm.com To: Thomas Graf Return-path: Received: from relay5-d.mail.gandi.net ([217.70.183.197]:57743 "EHLO relay5-d.mail.gandi.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754305AbaKOTMb (ORCPT ); Sat, 15 Nov 2014 14:12:31 -0500 Content-Disposition: inline In-Reply-To: <20141115111626.GP19157@casper.infradead.org> Sender: netdev-owner@vger.kernel.org List-ID: On Sat, Nov 15, 2014 at 11:16:26AM +0000, Thomas Graf wrote: > On 11/15/14 at 11:25am, Herbert Xu wrote: > > On Thu, Nov 13, 2014 at 06:43:43PM +0800, Herbert Xu wrote: > > > On Thu, Nov 13, 2014 at 10:41:24AM +0000, Thomas Graf wrote: > > > > > > > > Never mind. You did fix it. I looked at the wrong patch. > > > > > > OK. Now comes the fun part of shoehorning the xfrm_policy bydst > > > hash into rhashtable :) > > > > Thomas, it appears that rhashtable as it stands cannot handle > > changing the random seed because of the way it constructs the > > new hash table without degrading into a linked list. Is there > > something I'm missing? > > > > FWIW my hashtable in net/bridge/br_multicast.c handles rehashing > > correctly. Any objections to me converting rhashtable to use my > > scheme instead? > > Can you elaborate a bit? > > The point of rhashtable is to not require two sets of linked list > pointers as done by MDB or OVS flow tables to work around the > increased cache footprint of that approach. The difference of the > two algos is dicussed in this paper [0]. > > The disadvantage of rhashtable is that, AFAIK, the hash function > cannot change while resizing as it would break the mutual linked > lists. You can handle hash function changes with rhashtable without needing a second set of linked-list pointers, actually. You'd have to add ~1 unlikely() conditional to the reader common case, and you'd make readers *during* a hash algorithm change (which I'd hope happens as rarely as a resize) somewhat less efficient, but you wouldn't use any more memory than a resize currently does, and you wouldn't use the memory of extra linked-list pointers in the common case. Rather than the current approach of switching out the hash table pointer and having readers only search the new table (which will have valid-but-crosslinked buckets during a resize), instead keep two hash table pointers (each with their own hash parameters) and a single toggleable old/new indicator. Readers check both hash table pointers, and if valid, check both tables, first old then new (order important to make sure they don't miss a node). Then, the rehashing algorithm can incrementally move nodes from the old buckets to the new buckets, *without* disrupting concurrent readers. Rough rehashing algorithm sketch: - Set up the new empty table with the new set of hash parameters. - synchronize_rcu(). Readers will now search both old and new tables. - Peel nodes off the ends of the old hash table and add them to the new hash table (similar to the unzip step in the resize algorithm). Because readers search old-then-new, and writers make each node appear in the new table before making it disappear from the old, you don't need a synchronize_rcu() here, just an smp_wmb() after linking the node into the new table and before unlinking the node from the old table. Also, because you remove nodes from the *ends* of old-table buckets, you don't have to worry about a reader's linked-list walk getting dragged over to the new table and missing nodes from the old one. - Once all nodes have moved to the new table, mark the pointer to the old table as NULL, synchronize_rcu(), and free the old table. - Toggle the old/new flag. Ordering principles in this algorithm: - You want readers to see the changes made by the rehasher in order. - If the reader reads location A then B, and the rehasher writes location B then A, the rehasher just needs an smp_wmb() in between. - If the reader reads location A then B, and the rehasher writes location A then B, the rehasher needs a synchronize_rcu() in between. - Josh Triplett