From: Stephen Hemminger
Subject: Re: [RFC] fib_trie: flush improvement
Date: Wed, 2 Apr 2008 11:03:35 -0700
Message-ID: <20080402110335.66b04181@extreme>
References: <20080401172702.094c0700@extreme> <47F33D42.9080302@cosmosbay.com> <47F39998.8040605@cosmosbay.com>
In-Reply-To: <47F39998.8040605@cosmosbay.com>
To: Eric Dumazet
Cc: Eric Dumazet, Robert Olsson, David Miller, netdev@vger.kernel.org

On Wed, 02 Apr 2008 16:35:04 +0200
Eric Dumazet wrote:

> Eric Dumazet wrote:
> > Stephen Hemminger wrote:
> >> This is an attempt to fix the problem described in:
> >>      http://bugzilla.kernel.org/show_bug.cgi?id=6648
> >> I can reproduce this by loading lots and lots of routes and then taking
> >> the interface down. This causes all entries in the trie to be flushed, but
> >> each leaf removal causes a rebalance of the trie. And since the removal
> >> is depth first, it creates lots of needless work.
> >>
> >> Instead on flush, just walk the trie and prune as we go.
> >> The implementation is for description only, it probably doesn't work
> >> yet.
> >>
> >>
> >
> > I don't get it, since the bug reporter mentions with recent kernels:
> >
> > Fix inflate_threshold_root. Now=15 size=11 bits
> >
> > Is that what you get with your tests?
> >
> > Pawel reports:
> >
> > cat /proc/net/fib_triestat
> > Main: Aver depth: 2.26 Max depth: 6 Leaves: 235924
> > Internal nodes: 57854 1: 31632 2: 11422 3: 8475 4: 3755 5: 1676 6: 893
> > 18: 1
> >
> > Pointers: 609760 Null ptrs: 315983 Total size: 16240 kB
> >
> > The warning messages come from the root node, which cannot be expanded
> > since it hits MAX_ORDER (on a 32bit x86)
> >
> >
> >
> > (sizeof(struct tnode) + (sizeof(struct node *) << bits);) is rounded
> > to 4 << (bit + 1), ie 2 << 20
> >
> > For larger allocations Pawel has two choices:
> >
> > change MAX_ORDER from 11 to 13 or 14
> > If this machine is a pure router, this change won't have performance
> > impact.
> >
> > Or (more difficult, but more appropriate for mainline) change
> > fib_trie.c to use vmalloc() for very big allocations (for the root
> > only), and vfree()
> >
> > Since vfree() cannot be called from an rcu callback, one has to set up a
> > struct work_struct helper.
> >
> Here is a patch (untested unfortunately) to implement this.
>
> [IPV4] fib_trie: root_tnode can benefit of vmalloc()
>
> FIB_TRIE root node can be very large and currently hits MAX_ORDER limit.
> It also wastes about 50% of allocated size, because of power of two
> rounding of tnode.
>
> A switch to vmalloc() can improve FIB_TRIE performance by allowing root
> node to grow
> past the alloc_pages() limit, while preserving memory.
>
> Special care must be taken to free such zone, as rcu handler is not
> allowed to call vfree(),
> we use a worker instead.
>
> Signed-off-by: Eric Dumazet
>
>

Rather than switching between three allocation strategies, I would rather
just have kmalloc and vmalloc.
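
To make the size arithmetic quoted above concrete, here is a rough userspace
sketch (not kernel code, and not the patch under discussion). It computes the
tnode size for a given number of bits, the power-of-two rounding Eric
describes, and a two-way kmalloc/vmalloc choice. Everything prefixed sketch_
or SKETCH_ is hypothetical: the 4 kB page size and 4-byte pointers are
assumptions for a 32bit x86, SKETCH_HDR_SIZE is only a stand-in for
sizeof(struct tnode), and the one-page cutoff is an assumed threshold, since
none is given in this thread.

```c
#include <assert.h>

/* Assumptions for a 32bit x86 box; none of these come from fib_trie.c. */
#define SKETCH_PAGE_SIZE 4096UL /* assumed 4 kB pages */
#define SKETCH_PTR_SIZE  4UL    /* assumed sizeof(struct node *) */
#define SKETCH_HDR_SIZE  32UL   /* hypothetical stand-in for sizeof(struct tnode) */

enum sketch_alloc { SKETCH_KMALLOC, SKETCH_VMALLOC };

/* Bytes requested for a tnode holding 2^bits child pointers. */
unsigned long sketch_tnode_size(unsigned int bits)
{
	assert(bits < 28); /* keep the shift well inside a 32-bit unsigned long */
	return SKETCH_HDR_SIZE + (SKETCH_PTR_SIZE << bits);
}

/* Round up to the next power of two, as a page-order allocation would. */
unsigned long sketch_round_pow2(unsigned long size)
{
	unsigned long p = 1;

	while (p < size)
		p <<= 1;
	return p;
}

/* Bytes lost to the power-of-two rounding. */
unsigned long sketch_wasted(unsigned int bits)
{
	unsigned long size = sketch_tnode_size(bits);

	return sketch_round_pow2(size) - size;
}

/* Two strategies only: small nodes from kmalloc, everything else vmalloc. */
enum sketch_alloc sketch_pick_allocator(unsigned int bits)
{
	if (sketch_tnode_size(bits) <= SKETCH_PAGE_SIZE)
		return SKETCH_KMALLOC;
	return SKETCH_VMALLOC;
}
```

For bits = 18 the request is just over 1 MB, so the rounded size is 2 << 20
and roughly half of it is unused, which matches the "about 50%" figure in the
patch description. With the assumed one-page cutoff, such a node would go to
vmalloc, while a 4-bit node would stay with kmalloc.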