From: Eric Dumazet <dada1@cosmosbay.com>
To: Eric Dumazet <dada1@cosmosbay.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>,
Robert Olsson <Robert.Olsson@data.slu.se>,
David Miller <davem@davemloft.net>,
netdev@vger.kernel.org
Subject: Re: [RFC] fib_trie: flush improvement
Date: Wed, 02 Apr 2008 16:35:04 +0200
Message-ID: <47F39998.8040605@cosmosbay.com>
In-Reply-To: <47F33D42.9080302@cosmosbay.com>
[-- Attachment #1: Type: text/plain, Size: 2320 bytes --]
Eric Dumazet wrote:
> Stephen Hemminger wrote:
>> This is an attempt to fix the problem described in:
>> http://bugzilla.kernel.org/show_bug.cgi?id=6648
>> I can reproduce this by loading lots and lots of routes and then taking
>> the interface down. This causes all entries in trie to be flushed, but
>> each leaf removal causes a rebalance of the trie. And since the removal
>> is depth first, it creates lots of needless work.
>>
>> Instead, on flush, just walk the trie and prune as we go.
>> The implementation is for description only; it probably doesn't work
>> yet.
>>
>>
>
> I don't get it, since the bug reporter mentions this with recent kernels:
>
> Fix inflate_threshold_root. Now=15 size=11 bits
>
> Is that what you get with your tests?
>
> Pawel reports:
>
> cat /proc/net/fib_triestat
> Main: Aver depth: 2.26 Max depth: 6 Leaves: 235924
> Internal nodes: 57854 1: 31632 2: 11422 3: 8475 4: 3755 5: 1676 6: 893
> 18: 1
>
> Pointers: 609760 Null ptrs: 315983 Total size: 16240 kB
>
> The warning messages come from the root node, which cannot be expanded
> because it hits MAX_ORDER (on 32-bit x86).
>
>
>
> sizeof(struct tnode) + (sizeof(struct node *) << bits) is rounded up
> to 4 << (bits + 1), i.e. 2 << 20 (2 MB) for bits = 18.
>
> For larger allocations, Pawel has two choices:
>
> Change MAX_ORDER from 11 to 13 or 14.
> If this machine is a pure router, this change won't have a performance
> impact.
>
> Or (more difficult, but more appropriate for mainline) change
> fib_trie.c to use vmalloc() for very big allocations (for the root
> only), and vfree().
>
> Since vfree() cannot be called from an RCU callback, one has to set up a
> struct work_struct helper.
>
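To make the rounding quoted above concrete, here is a small user-space sketch of the size arithmetic. PTR_SIZE of 4 matches the 32-bit x86 case under discussion; HDR_SIZE is a made-up stand-in for sizeof(struct tnode), and the helper names are illustrative only, not kernel functions.

```c
#include <stddef.h>

/* Assumptions: 4-byte pointers (32-bit x86) and a hypothetical
 * 32-byte header standing in for sizeof(struct tnode). */
#define PTR_SIZE 4UL
#define HDR_SIZE 32UL

/* Smallest power of two that is >= n (n must be > 0). */
static unsigned long round_up_pow2(unsigned long n)
{
	unsigned long p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Bytes requested for a tnode holding 2^bits child pointers. */
static unsigned long tnode_size(unsigned int bits)
{
	return HDR_SIZE + (PTR_SIZE << bits);
}

/* Bytes lost when the request is rounded up to a power of two. */
static unsigned long tnode_waste(unsigned int bits)
{
	unsigned long want = tnode_size(bits);

	return round_up_pow2(want) - want;
}
```

Under these assumptions, bits = 18 requests 1048608 bytes, which a power-of-two allocator rounds up to 2097152 bytes, so about half of the allocation goes unused.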
Here is a patch (unfortunately untested) to implement this.
[IPV4] fib_trie: root_tnode can benefit from vmalloc()
The FIB_TRIE root node can be very large and currently hits the
MAX_ORDER limit. It also wastes about 50% of the allocated size, because
of the power-of-two rounding of the tnode.
A switch to vmalloc() can improve FIB_TRIE performance by allowing the
root node to grow past the alloc_pages() limit, while preserving memory.
Special care must be taken to free such a zone: as an RCU handler is not
allowed to call vfree(), we use a worker instead.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
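As a rough illustration of the size test the patch relies on, the decision can be modelled in user space as below. This is only a sketch: SKETCH_PAGE_SIZE assumes 4 KB pages, the enum and function names are hypothetical, and the fallback to vmalloc() when alloc_pages() fails is not modelled.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size */

/* Hypothetical labels for the three allocation paths. */
enum alloc_kind { ALLOC_SLAB, ALLOC_PAGES, ALLOC_VMALLOC };

/* Smallest power of two that is >= n (n must be > 0). */
static size_t pow2_ceil(size_t n)
{
	size_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Small nodes come from the slab, nodes whose power-of-two rounding
 * wastes at most 8 pages come from the page allocator, and anything
 * that would waste more goes to vmalloc().
 */
static enum alloc_kind pick_allocator(size_t size)
{
	if (size <= SKETCH_PAGE_SIZE)
		return ALLOC_SLAB;
	if (pow2_ceil(size) - size <= SKETCH_PAGE_SIZE * 8)
		return ALLOC_PAGES;
	return ALLOC_VMALLOC;
}
```

With these assumptions, a request just over 1 MB (such as a root node with 18 bits) would waste almost 1 MB after rounding and therefore takes the vmalloc() path.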
[-- Attachment #2: trie_vmalloc.patch --]
[-- Type: text/plain, Size: 2325 bytes --]
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 9e491e7..871e9e9 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -120,9 +120,13 @@ struct tnode {
 	t_key key;
 	unsigned char pos;		/* 2log(KEYLENGTH) bits needed */
 	unsigned char bits;		/* 2log(KEYLENGTH) bits needed */
+	unsigned char vmalloced;
 	unsigned int full_children;	/* KEYLENGTH bits needed */
 	unsigned int empty_children;	/* KEYLENGTH bits needed */
-	struct rcu_head rcu;
+	union {
+		struct rcu_head rcu;
+		struct tnode *next;
+	};
 	struct node *child[0];
 };
@@ -347,17 +351,31 @@ static inline void free_leaf_info(struct leaf_info *leaf)
 static struct tnode *tnode_alloc(size_t size)
 {
 	struct page *pages;
+	struct tnode *tn;
 	if (size <= PAGE_SIZE)
 		return kzalloc(size, GFP_KERNEL);
-	pages = alloc_pages(GFP_KERNEL|__GFP_ZERO, get_order(size));
-	if (!pages)
-		return NULL;
-
-	return page_address(pages);
+	/*
+	 * Because of power of two requirements of alloc_pages(),
+	 * we prefer vmalloc() in case we waste too much memory.
+	 */
+	if (roundup_pow_of_two(size) - size <= PAGE_SIZE * 8) {
+		pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, get_order(size));
+		if (pages)
+			return page_address(pages);
+	}
+	tn = __vmalloc(size, GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL);
+	if (tn)
+		tn->vmalloced = 1;
+	return tn;
 }
+static void fb_worker_func(struct work_struct *work);
+static DECLARE_WORK(fb_vfree_work, fb_worker_func);
+static DEFINE_SPINLOCK(fb_vfree_lock);
+static struct tnode *fb_vfree_list;
+
 static void __tnode_free_rcu(struct rcu_head *head)
 {
 	struct tnode *tn = container_of(head, struct tnode, rcu);
@@ -366,8 +384,30 @@ static void __tnode_free_rcu(struct rcu_head *head)
 	if (size <= PAGE_SIZE)
 		kfree(tn);
-	else
+	else if (!tn->vmalloced)
 		free_pages((unsigned long)tn, get_order(size));
+	else {
+		spin_lock(&fb_vfree_lock);
+		tn->next = fb_vfree_list;
+		fb_vfree_list = tn;
+		schedule_work(&fb_vfree_work);
+		spin_unlock(&fb_vfree_lock);
+	}
+}
+
+static void fb_worker_func(struct work_struct *work)
+{
+	struct tnode *tn, *next;
+
+	spin_lock_bh(&fb_vfree_lock);
+	tn = fb_vfree_list;
+	fb_vfree_list = NULL;
+	spin_unlock_bh(&fb_vfree_lock);
+	while (tn) {
+		next = tn->next;
+		vfree(tn);
+		tn = next;
+	}
 }
static inline void tnode_free(struct tnode *tn)
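For readers who want to try the deferred-free pattern from the patch outside the kernel, here is a minimal user-space analogue. It is a sketch under stated assumptions: a pthread mutex stands in for the spinlock, free() stands in for vfree(), the caller invokes drain_pending() directly instead of scheduling a work item, and struct big_node and both function names are hypothetical.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct tnode: only the link used while queued. */
struct big_node {
	struct big_node *next;
};

static pthread_mutex_t pending_lock = PTHREAD_MUTEX_INITIALIZER;
static struct big_node *pending_list;

/* Producer side (the RCU callback in the patch): queue the node
 * instead of freeing it in a context where freeing is not allowed. */
static void defer_free(struct big_node *n)
{
	pthread_mutex_lock(&pending_lock);
	n->next = pending_list;
	pending_list = n;
	pthread_mutex_unlock(&pending_lock);
}

/* Consumer side (the work function in the patch): detach the whole
 * list under the lock, then free each node with the lock dropped.
 * Returns the number of nodes freed. */
static unsigned int drain_pending(void)
{
	struct big_node *n, *next;
	unsigned int freed = 0;

	pthread_mutex_lock(&pending_lock);
	n = pending_list;
	pending_list = NULL;
	pthread_mutex_unlock(&pending_lock);

	while (n) {
		next = n->next;
		free(n);
		n = next;
		freed++;
	}
	return freed;
}
```

Detaching the list in one step keeps the critical section short, so producers are never blocked while the (potentially slow) frees run.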