Re: [RFC] fib_trie: flush improvement

From: Eric Dumazet <dada1@cosmosbay.com>
To: Stephen Hemminger <shemminger@vyatta.com>
Cc: Robert Olsson <Robert.Olsson@data.slu.se>,
	David Miller <davem@davemloft.net>,
	netdev@vger.kernel.org
Subject: Re: [RFC] fib_trie: flush improvement
Date: Wed, 02 Apr 2008 21:36:17 +0200
Message-ID: <47F3E031.1030806@cosmosbay.com>
In-Reply-To: <20080402110335.66b04181@extreme>

[-- Attachment #1: Type: text/plain, Size: 3172 bytes --]

Stephen Hemminger wrote:
> On Wed, 02 Apr 2008 16:35:04 +0200
> Eric Dumazet <dada1@cosmosbay.com> wrote:
> 
>> Eric Dumazet wrote:
>>> Stephen Hemminger wrote:
>>>> This is an attempt to fix the problem described in:
>>>>      http://bugzilla.kernel.org/show_bug.cgi?id=6648
>>>> I can reproduce this by loading lots and lots of routes and then taking
>>>> the interface down. This causes all entries in trie to be flushed, but
>>>> each leaf removal causes a rebalance of the trie. And since the removal
>>>> is depth first, it creates lots of needless work.
>>>>
>>>> Instead, on flush, just walk the trie and prune as we go.
>>>> The implementation is for description only; it probably doesn't work
>>>> yet.
>>>>
>>>>   
>>> I don't get it, since the bug reporter mentions that with recent kernels he sees:
>>>
>>> Fix inflate_threshold_root. Now=15 size=11 bits
>>>
>>> Is that what you get in your tests?
>>>
>>> Pawel reports:
>>>
>>> cat /proc/net/fib_triestat
>>> Main: Aver depth: 2.26 Max depth: 6 Leaves: 235924
>>> Internal nodes: 57854 1: 31632 2: 11422 3: 8475 4: 3755 5: 1676 6: 893 
>>> 18: 1
>>>
>>> Pointers: 609760 Null ptrs: 315983 Total size: 16240 kB
>>>
>>> The warning messages come from the root node, which cannot be expanded
>>> because it hits MAX_ORDER (on a 32-bit x86).
>>>
>>>
>>>
>>> For bits = 18, sizeof(struct tnode) + (sizeof(struct node *) << bits)
>>> is rounded up to 4 << (bits + 1), i.e. 2 << 20 bytes.
>>>
>>> For larger allocations Pawel has two choices:
>>>
>>> Change MAX_ORDER from 11 to 13 or 14.
>>> If this machine is a pure router, this change won't have a performance
>>> impact.
>>>
>>> Or (more difficult, but more appropriate for mainline) change
>>> fib_trie.c to use vmalloc() for very big allocations (for the root
>>> only), and vfree() to release them.
>>>
>>> Since vfree() cannot be called from an RCU callback, one has to set up a
>>> struct work_struct helper.
>>>
>> Here is a patch (untested, unfortunately) to implement this.
>>
>> [IPV4] fib_trie: root_tnode can benefit from vmalloc()
>>
>> FIB_TRIE root node can be very large and currently hits the MAX_ORDER limit.
>> It also wastes about 50% of the allocated size, because of power-of-two
>> rounding of the tnode.
>>
>> A switch to vmalloc() can improve FIB_TRIE performance by allowing the root
>> node to grow past the alloc_pages() limit, while saving memory.
>>
>> Special care must be taken to free such a zone: an RCU handler is not
>> allowed to call vfree(), so we use a worker instead.
>>
>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>>
>>
> 
> Rather than switching between three allocation strategies, I would rather
> just have kmalloc and vmalloc.

Yes, probably :)

[IPV4] fib_trie: root_tnode can benefit from vmalloc()

FIB_TRIE root node can be very large and currently hits the MAX_ORDER limit.
It also wastes about 50% of the allocated size, because of power-of-two
rounding of the tnode.

A switch to vmalloc() can improve FIB_TRIE performance by allowing the root
node to grow past the alloc_pages() limit, while saving memory.

Special care must be taken to free such a zone: an RCU handler is not
allowed to call vfree(), so we use a worker instead.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>



[-- Attachment #2: trie_vmalloc.patch --]
[-- Type: text/plain, Size: 1837 bytes --]

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 9e491e7..c7d7d9e 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -122,7 +122,10 @@ struct tnode {
 	unsigned char bits;		/* 2log(KEYLENGTH) bits needed */
 	unsigned int full_children;	/* KEYLENGTH bits needed */
 	unsigned int empty_children;	/* KEYLENGTH bits needed */
-	struct rcu_head rcu;
+	union {
+		struct rcu_head rcu;
+		struct tnode *next;
+	};
 	struct node *child[0];
 };
 
@@ -346,18 +349,17 @@ static inline void free_leaf_info(struct leaf_info *leaf)
 
 static struct tnode *tnode_alloc(size_t size)
 {
-	struct page *pages;
-
 	if (size <= PAGE_SIZE)
 		return kzalloc(size, GFP_KERNEL);
 
-	pages = alloc_pages(GFP_KERNEL|__GFP_ZERO, get_order(size));
-	if (!pages)
-		return NULL;
-
-	return page_address(pages);
+	return __vmalloc(size, GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL);
 }
 
+static void fb_worker_func(struct work_struct *work);
+static DECLARE_WORK(fb_vfree_work, fb_worker_func);
+static DEFINE_SPINLOCK(fb_vfree_lock);
+static struct tnode *fb_vfree_list;
+
 static void __tnode_free_rcu(struct rcu_head *head)
 {
 	struct tnode *tn = container_of(head, struct tnode, rcu);
@@ -366,8 +368,28 @@ static void __tnode_free_rcu(struct rcu_head *head)
 
 	if (size <= PAGE_SIZE)
 		kfree(tn);
-	else
-		free_pages((unsigned long)tn, get_order(size));
+	else {
+		spin_lock(&fb_vfree_lock);
+		tn->next = fb_vfree_list;
+		fb_vfree_list = tn;
+		schedule_work(&fb_vfree_work);
+		spin_unlock(&fb_vfree_lock);
+	}
+}
+
+static void fb_worker_func(struct work_struct *work)
+{
+	struct tnode *tn, *next;
+
+	spin_lock_bh(&fb_vfree_lock);
+	tn = fb_vfree_list;
+	fb_vfree_list = NULL;
+	spin_unlock_bh(&fb_vfree_lock);
+	while (tn) {
+		next = tn->next;
+		vfree(tn);
+		tn = next;
+	}
 }
 
 static inline void tnode_free(struct tnode *tn)

Thread overview: 15+ messages
2008-04-02  0:27 [RFC] fib_trie: flush improvement Stephen Hemminger
2008-04-02  8:01 ` Eric Dumazet
2008-04-02 14:35   ` Eric Dumazet
2008-04-02 18:03     ` Stephen Hemminger
2008-04-02 19:36       ` Eric Dumazet [this message]
2008-04-04 16:02         ` [RFC] fib_trie: memory waste solutions Stephen Hemminger
2008-04-07  6:55           ` Robert Olsson
2008-04-07  7:58             ` Andi Kleen
2008-04-07 14:42               ` Robert Olsson
2008-04-07 15:15                 ` Andi Kleen
2008-04-07 15:36                   ` Eric Dumazet
2008-04-07 16:46           ` Eric Dumazet
2008-04-07 22:48             ` Stephen Hemminger
2008-04-10  9:57               ` David Miller
2008-04-02  9:31 ` [RFC] fib_trie: flush improvement Robert Olsson
