From: Eric Dumazet
Subject: Re: [RFC] [PATCH] udp: optimize lookup of UDP sockets by including destination address in the hash key
Date: Thu, 05 Nov 2009 16:07:25 +0100
Message-ID: <4AF2EA2D.6040301@gmail.com>
References: <4AF1EC18.9090106@ixiacom.com> <4AF1F273.5020207@gmail.com> <200911050104.09538.opurdila@ixiacom.com> <4AF20F02.7000601@gmail.com> <877hu5892g.fsf@basil.nowhere.org> <4AF2CCD9.7010507@gmail.com> <20091105145428.GS31511@one.firstfloor.org>
Cc: Octavian Purdila, Lucian Adrian Grijincu, netdev@vger.kernel.org
To: Andi Kleen
In-Reply-To: <20091105145428.GS31511@one.firstfloor.org>

Andi Kleen wrote:
>> I assume cache is cold or even on another cpu (worst case), dealing with
>> 100.000+ sockets or so...
>
> Other CPU cache hit is actually typically significantly
> faster than a DRAM access (unless you're talking about a very large NUMA
> system and a remote CPU far away)

Even if the data is dirty in the remote CPU cache?

I don't mean shared data (if data is shared, the workload mostly fits in caches).

>> If workload fits in one CPU cache/registers, we dont mind taking one
>> or two cache lines per object, obviously.
>
> It's more like part of your workload needs to fit.
>
> For example if you use a tree and the higher levels fit into
> the cache, having a few levels in the tree is (approximately) free.
>
> That's why I'm not always fond of large hash tables. They pretty
> much guarantee a lot of cache misses under high load, because
> they have little locality.
We already had this discussion, Andi, and you know some servers handle
1.000.000+ sockets and 100.000+ frames per second over XX.XXX different
flows. A binary tree means about 20 accesses before reaching the target,
and only the first 5 or 6 levels are in cache; the machine is barely
usable.

A hash table with 2.000.000 slots gives one or two accesses before the
target, and RCU is trivial with hash tables. B-trees are fine for
generalist workloads, but RCU is more complex with them.