From mboxrd@z Thu Jan 1 00:00:00 1970 From: Eric Dumazet Subject: Re: [RFC][PATCH] optimise iptables interface matching Date: Wed, 06 Jun 2007 08:13:13 +0200 Message-ID: <46665079.80108@cosmosbay.com> References: <465528CB.4020108@snapgear.com> <200705250044.l4P0i3f8007580@toshiba.co.jp> <4656342A.5040202@snapgear.com> <4657FBD4.80506@trash.net> <1180381991.6505.46.camel@henriknordstrom.net> <465B72C0.5000907@snapgear.com> <1180488846.6505.310.camel@henriknordstrom.net> <46664AE8.7020901@snapgear.com> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: quoted-printable Cc: netfilter-devel@lists.netfilter.org, Patrick McHardy , Henrik Nordstrom To: Philip Craig Return-path: In-Reply-To: <46664AE8.7020901@snapgear.com> List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: netfilter-devel-bounces@lists.netfilter.org Errors-To: netfilter-devel-bounces@lists.netfilter.org List-Id: netfilter-devel.vger.kernel.org Philip Craig a =E9crit : > Henrik Nordstrom wrote: >> tis 2007-05-29 klockan 10:24 +1000 skrev Philip Craig: >>> I'll try that, but that will also prevent loop unrolling if you're >>> using -funroll-loops. Not sure if any builds use this, but the >>> comment in the code implies some do. >> My GCC unrolls that loop just fine... >> >> x86_64 gcc (GCC) 4.1.1 20070105 (Red Hat 4.1.1-51) >=20 > You're right, I should check things before making statements. >=20 >> Here is a corrected version of the for loop: >> >> for (i =3D 0, ret =3D 0; i < IFNAMSIZ/sizeof(unsigned long) && ((cons= t unsigned long *)ipinfo->outiface_mask)[i]; i++) { >> >> >> but for modern 64-bit CPUs I suspect the original is actually fastest = as >> IFNAMSIZ is only 16 bytes and fits in two parallel 64-bit operations.. >=20 > For my ARM platform, changing the for loop is a win for no interface > matches, or short interface names (I didn't test longer interface names= ). >=20 > But looking at the generated assembly for x86_64, this results in more > instructions and branches, and I can't see this being a win. (I'm not > set up to profile this.)=20 >=20 > Also, the two optimisations are not mutually exclusive: one is to skip > the whole comparison completely (including inversion), and one is to > terminate the for loop early. So we can use your loop termination > condition to skip the whole comparison too. >=20 > The attached patch is logically equivalent to my first (assuming a > contiguous and zero padded mask), but it avoids messing with flags. In > practice, my profiling says it is slightly slower than the first patch > for 0 interface matches, but slightly faster for 1 or 2. >=20 > Note: this patch (and the original patch) change the behaviour when > inverting a zero length mask. That is for either of: > iptables -A INPUT ! -i + > iptables -A INPUT ! -i "" > Not sure if this matters. >=20 > I also tried testing just the first byte, instead of a long, but that > was slower. >=20 >=20 In my analysis (oprofiling), I found instructions were not really a probl= em. The big problem comes from the size of data that have to be read by the C= PU to=20 perform a typical table lookup. I consider 16 bytes masks (32 bytes per rule, one for iface, one for ofac= e) is=20 a waste of memory. And on current CPUS, with 64 bytes cache lines, memory= =20 bandwidth is the limiting factor. We probably could use a mask length (one byte for iface, one for oface) t= o=20 reduce memory footprint, and have better chance to keep tables in cpu cac= hes.