From mboxrd@z Thu Jan 1 00:00:00 1970 From: Stephen Hemminger Subject: Re: [PATCH 0/2] udp: Convert the UDP hash lock to RCU Date: Tue, 28 Oct 2008 14:28:21 -0700 Message-ID: <20081028142821.5f4718e8@extreme> References: <48EB5D28.7000503@cosmosbay.com> <20081007160729.60c076c4@speedy> <20081007.135548.56141000.davem@davemloft.net> <48ECBBD8.9060602@cosmosbay.com> <490777FB.2070506@cosmosbay.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Cc: David Miller , benny+usenet@amorsen.dk, minyard@acm.org, netdev@vger.kernel.org, paulmck@linux.vnet.ibm.com, Christoph Lameter , Peter Zijlstra , Evgeniy Polyakov To: Eric Dumazet Return-path: Received: from mail.vyatta.com ([76.74.103.46]:58681 "EHLO mail.vyatta.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753162AbYJ1V2Y (ORCPT ); Tue, 28 Oct 2008 17:28:24 -0400 In-Reply-To: <490777FB.2070506@cosmosbay.com> Sender: netdev-owner@vger.kernel.org List-ID: On Tue, 28 Oct 2008 21:37:15 +0100 Eric Dumazet wrote: > UDP sockets are hashed in a 128 slots hash table. > > This hash table is protected by *one* rwlock. > > This rwlock is readlocked each time an incoming UDP message is handled. > > This rwlock is writelocked each time a socket must be inserted in > hash table (bind time), or deleted from this table (unbind time) > > This is not scalable on SMP machines : > > 1) Even in read mode, lock() and unlock() are atomic operations and > must dirty a contended cache line, shared by all cpus. > > 2) A writer might be starved if many readers are 'in flight'. This can > happen on a machine with some NIC receiving many UDP messages. User > process can be delayed a long time at socket creation/dismantle time. > > > What Corey and I propose is to use RCU to protect this hash table. > > Goals are : > > 1) Optimizing handling of incoming Unicast UDP frames, so that no memory > writes should happen in the fast path. Using an array of rwlocks (one per > slot for example is not an option in this regard) > > Note: Multicasts and broadcasts still will need to take a lock, > because doing a full lockless lookup in this case is difficult. > > 2) No expensive operations in the socket bind/unhash phases : > - No expensive synchronize_rcu() calls. > > - No added rcu_head in socket structure, increasing memory needs, > but more important, forcing us to use call_rcu() calls, > that have the bad property of making sockets structure cold. > (rcu grace period between socket freeing and its potential reuse > make this socket being cold in CPU cache). > David did a previous patch using call_rcu() and noticed a 20% > impact on TCP connection rates. > > Quoting Cristopher Lameter : > "Right. That results in cacheline cooldown. You'd want to recycle > the object as they are cache hot on a per cpu basis. That is screwed > up by the delayed regular rcu processing. We have seen multiple > regressions due to cacheline cooldown. > The only choice in cacheline hot sensitive areas is to deal with the > complexity that comes with SLAB_DESTROY_BY_RCU or give up on RCU." > > - Because udp sockets are allocated from dedicated kmem_cache, > use of SLAB_DESTROY_BY_RCU can help here. > > Theory of operation : > --------------------- > > As the lookup is lockfree (using rcu_read_lock()/rcu_read_unlock()), > special attention must be taken by readers and writers. > > Use of SLAB_DESTROY_BY_RCU is tricky too, because a socket can be freed, > reused, inserted in a different chain or in worst case in the same chain > while readers could do lookups in the same time. > > In order to avoid loops, a reader must check each socket found in a chain > really belongs to the chain the reader was traversing. If it finds a > mismatch, lookup must start again at the begining. This *restart* loop > is the reason we had to use rdlock for the multicast case, because > we dont want to send same message several times to the same socket. > > We use RCU only for fast path. Thus, /proc/net/udp still take rdlocks. We should just make it a spin_lock later and speed up udp socket creation.