From: Eric Dumazet <dada1@cosmosbay.com>
To: Stephen Hemminger <shemminger@vyatta.com>
Cc: Robert Olsson <Robert.Olsson@data.slu.se>,
David Miller <davem@davemloft.net>,
netdev@vger.kernel.org
Subject: Re: [RFC] fib_trie: flush improvement
Date: Wed, 02 Apr 2008 21:36:17 +0200
Message-ID: <47F3E031.1030806@cosmosbay.com>
In-Reply-To: <20080402110335.66b04181@extreme>
Stephen Hemminger wrote:
> On Wed, 02 Apr 2008 16:35:04 +0200
> Eric Dumazet <dada1@cosmosbay.com> wrote:
>
>> Eric Dumazet wrote:
>>> Stephen Hemminger wrote:
>>>> This is an attempt to fix the problem described in:
>>>> http://bugzilla.kernel.org/show_bug.cgi?id=6648
>>>> I can reproduce this by loading lots and lots of routes and then taking
>>>> the interface down. This causes all entries in trie to be flushed, but
>>>> each leaf removal causes a rebalance of the trie. And since the removal
>>>> is depth first, it creates lots of needless work.
>>>>
>>>> Instead on flush, just walk the trie and prune as we go.
>>>> The implementation is for description only, it probably doesn't work
>>>> yet.
>>>>
>>>>
>>> I don't get it, since the bug reporter mentions that recent kernels print:
>>>
>>> Fix inflate_threshold_root. Now=15 size=11 bits
>>>
>>> Is that what you get with your tests?
>>>
>>> Pawel reports:
>>>
>>> cat /proc/net/fib_triestat
>>> Main: Aver depth: 2.26 Max depth: 6 Leaves: 235924
>>> Internal nodes: 57854 1: 31632 2: 11422 3: 8475 4: 3755 5: 1676 6: 893
>>> 18: 1
>>>
>>> Pointers: 609760 Null ptrs: 315983 Total size: 16240 kB
>>>
>>> The warning messages come from the root node, which cannot be expanded
>>> because it hits MAX_ORDER (on 32-bit x86).
>>>
>>> sizeof(struct tnode) + (sizeof(struct node *) << bits) is rounded up
>>> to 4 << (bits + 1), i.e. 2 << 20 bytes for the 18-bit root.
>>>
>>> For larger allocations Pawel has two choices:
>>>
>>> Change MAX_ORDER from 11 to 13 or 14.
>>> If this machine is a pure router, this change won't have a performance
>>> impact.
>>>
>>> Or (more difficult, but more appropriate for mainline) change
>>> fib_trie.c to use vmalloc() for very big allocations (for the root
>>> only), and vfree().
>>>
>>> Since vfree() cannot be called from an RCU callback, one has to set up a
>>> struct work_struct helper.
>>>
>> Here is a patch (untested, unfortunately) to implement this.
>>
>> [IPV4] fib_trie: root_tnode can benefit from vmalloc()
>>
>> The FIB_TRIE root node can be very large and currently hits the MAX_ORDER
>> limit. It also wastes about 50% of the allocated size, because of the
>> power-of-two rounding of the tnode.
>>
>> A switch to vmalloc() can improve FIB_TRIE performance by allowing the
>> root node to grow past the alloc_pages() limit, while saving memory.
>>
>> Special care must be taken to free such a zone: an RCU handler is not
>> allowed to call vfree(), so we use a worker instead.
>>
>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>>
>>
>
> Rather than switching between three allocation strategies, I would rather
> just have kmalloc and vmalloc.
Yes, probably :)
[IPV4] fib_trie: root_tnode can benefit from vmalloc()

The FIB_TRIE root node can be very large and currently hits the MAX_ORDER
limit. It also wastes about 50% of the allocated size, because of the
power-of-two rounding of the tnode.

A switch to vmalloc() can improve FIB_TRIE performance by allowing the
root node to grow past the alloc_pages() limit, while saving memory.

Special care must be taken to free such a zone: an RCU handler is not
allowed to call vfree(), so we use a worker instead.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
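
To make the "about 50%" figure concrete, here is a small userspace sketch
of the size arithmetic. It is only an illustration, not part of the patch:
the page size, pointer size and header size are assumptions for a 32-bit
x86 build (SKETCH_HDR_SIZE in particular is a made-up stand-in for
sizeof(struct tnode)), and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for a 32-bit x86 build; none of these come from the patch. */
#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_PTR_SIZE  4u
#define SKETCH_HDR_SIZE  32u	/* hypothetical stand-in for sizeof(struct tnode) */

/* Bytes requested for a tnode holding 2^bits child pointers. */
static size_t tnode_bytes(unsigned int bits)
{
	assert(bits < 28);	/* keep the shift well inside 32 bits */
	return SKETCH_HDR_SIZE + ((size_t)SKETCH_PTR_SIZE << bits);
}

/* Smallest order with (PAGE_SIZE << order) >= size, in the spirit of get_order(). */
static unsigned int order_for(size_t size)
{
	unsigned int order = 0;

	while (((size_t)SKETCH_PAGE_SIZE << order) < size)
		order++;
	return order;
}

/* Bytes consumed by a power-of-two page allocation of that order. */
static size_t buddy_bytes(size_t size)
{
	return (size_t)SKETCH_PAGE_SIZE << order_for(size);
}

/*
 * Bytes consumed when the size is only rounded up to whole pages, which is
 * roughly what vmalloc() does (guard page and page table cost ignored).
 */
static size_t paged_bytes(size_t size)
{
	return (size + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE * SKETCH_PAGE_SIZE;
}
```

Under these assumptions an 18-bit root asks for 1048608 bytes. The
power-of-two path needs order 9, i.e. 2097152 bytes, so nearly half of the
allocation is unused, while page-granular rounding needs 257 pages, i.e.
1052672 bytes.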
[-- Attachment #2: trie_vmalloc.patch --]
[-- Type: text/plain, Size: 1837 bytes --]
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 9e491e7..c7d7d9e 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -122,7 +122,10 @@ struct tnode {
 	unsigned char bits;		/* 2log(KEYLENGTH) bits needed */
 	unsigned int full_children;	/* KEYLENGTH bits needed */
 	unsigned int empty_children;	/* KEYLENGTH bits needed */
-	struct rcu_head rcu;
+	union {
+		struct rcu_head rcu;
+		struct tnode *next;
+	};
 	struct node *child[0];
 };
 
@@ -346,18 +349,17 @@ static inline void free_leaf_info(struct leaf_info *leaf)
 
 static struct tnode *tnode_alloc(size_t size)
 {
-	struct page *pages;
-
 	if (size <= PAGE_SIZE)
 		return kzalloc(size, GFP_KERNEL);
 
-	pages = alloc_pages(GFP_KERNEL|__GFP_ZERO, get_order(size));
-	if (!pages)
-		return NULL;
-
-	return page_address(pages);
+	return __vmalloc(size, GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL);
 }
 
+static void fb_worker_func(struct work_struct *work);
+static DECLARE_WORK(fb_vfree_work, fb_worker_func);
+static DEFINE_SPINLOCK(fb_vfree_lock);
+static struct tnode *fb_vfree_list;
+
 static void __tnode_free_rcu(struct rcu_head *head)
 {
 	struct tnode *tn = container_of(head, struct tnode, rcu);
@@ -366,8 +368,28 @@ static void __tnode_free_rcu(struct rcu_head *head)
 
 	if (size <= PAGE_SIZE)
 		kfree(tn);
-	else
-		free_pages((unsigned long)tn, get_order(size));
+	else {
+		spin_lock(&fb_vfree_lock);
+		tn->next = fb_vfree_list;
+		fb_vfree_list = tn;
+		schedule_work(&fb_vfree_work);
+		spin_unlock(&fb_vfree_lock);
+	}
+}
+
+static void fb_worker_func(struct work_struct *work)
+{
+	struct tnode *tn, *next;
+
+	spin_lock_bh(&fb_vfree_lock);
+	tn = fb_vfree_list;
+	fb_vfree_list = NULL;
+	spin_unlock_bh(&fb_vfree_lock);
+	while (tn) {
+		next = tn->next;
+		vfree(tn);
+		tn = next;
+	}
 }
 
 static inline void tnode_free(struct tnode *tn)
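
For readers who want to play with the deferred-free idea outside the
kernel, here is a rough userspace sketch of the same pattern: a context
that may not free pushes the dead object onto a list, reusing the object's
own storage for the link, and a later context detaches the whole list and
frees it. This is not the patch's code. It swaps the spinlock for C11
atomics so that it builds without a threading library, uses free() in
place of vfree(), and struct big_node, defer_free() and drain_deferred()
are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for struct tnode. Once the object is dead, its
 * storage is reused for the list link, like the rcu/next union above.
 */
struct big_node {
	struct big_node *next;
	/* the payload would follow here */
};

/* Head of the pending list; a static atomic pointer starts out NULL. */
static _Atomic(struct big_node *) deferred_list;

/* Called from the context that must not free: just queue the node. */
static void defer_free(struct big_node *n)
{
	struct big_node *head = atomic_load(&deferred_list);

	assert(n != NULL);
	do {
		/* On a failed exchange, head is reloaded and we retry. */
		n->next = head;
	} while (!atomic_compare_exchange_weak(&deferred_list, &head, n));
}

/*
 * Called from a context that is allowed to free: detach the whole list in
 * one step, then walk and free it without touching the shared head again.
 * Returns the number of nodes released.
 */
static size_t drain_deferred(void)
{
	struct big_node *n = atomic_exchange(&deferred_list,
					     (struct big_node *)NULL);
	size_t freed = 0;

	while (n) {
		struct big_node *next = n->next;

		free(n);
		n = next;
		freed++;
	}
	return freed;
}
```

Detaching the whole list before freeing keeps the critical section short,
which is the same reason fb_worker_func() drops fb_vfree_lock before it
calls vfree().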