From: Eric Dumazet <dada1@cosmosbay.com>
To: Eric Dumazet <dada1@cosmosbay.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>,
	Robert Olsson <Robert.Olsson@data.slu.se>,
	David Miller <davem@davemloft.net>,
	netdev@vger.kernel.org
Subject: Re: [RFC] fib_trie: flush improvement
Date: Wed, 02 Apr 2008 16:35:04 +0200
Message-ID: <47F39998.8040605@cosmosbay.com>
In-Reply-To: <47F33D42.9080302@cosmosbay.com>

[-- Attachment #1: Type: text/plain, Size: 2320 bytes --]

Eric Dumazet wrote:
> Stephen Hemminger wrote:
>> This is an attempt to fix the problem described in:
>>      http://bugzilla.kernel.org/show_bug.cgi?id=6648
>> I can reproduce this by loading lots and lots of routes and then taking
>> the interface down. This causes all entries in the trie to be flushed, but
>> each leaf removal causes a rebalance of the trie. And since the removal
>> is depth first, it creates lots of needless work.
>>
>> Instead, on flush, just walk the trie and prune as we go.
>> The implementation is for description only; it probably doesn't work
>> yet.
>>
>>   
>
> I don't get it, since the bug reporter mentions this message with recent kernels:
>
> Fix inflate_threshold_root. Now=15 size=11 bits
>
> Is that what you get in your tests?
>
> Pawel reports:
>
> cat /proc/net/fib_triestat
> Main: Aver depth: 2.26 Max depth: 6 Leaves: 235924
> Internal nodes: 57854 1: 31632 2: 11422 3: 8475 4: 3755 5: 1676 6: 893 18: 1
>
> Pointers: 609760 Null ptrs: 315983 Total size: 16240 kB
>
> The warning messages come from the root node, which cannot be expanded
> because it hits MAX_ORDER (on 32-bit x86).
>
>
>
> sizeof(struct tnode) + (sizeof(struct node *) << bits) is rounded up
> to 4 << (bits + 1), i.e. 2 << 20 bytes for the 18-bit root.
>
> For larger allocations, Pawel has two choices:
>
> Change MAX_ORDER from 11 to 13 or 14.
> If this machine is a pure router, this change won't have a performance
> impact.
>
> Or (more difficult, but more appropriate for mainline) change
> fib_trie.c to use vmalloc() for very big allocations (for the root
> only), and vfree().
>
> Since vfree() cannot be called from an RCU callback, one has to set up a
> struct work_struct helper.
>
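
To make the rounding argument above concrete, here is a small userspace
sketch of the arithmetic. It is not kernel code. The 28-byte header and
4-byte pointer size are assumptions for a 32-bit build (the real
sizeof(struct tnode) may differ), and the helper names are made up for
illustration. The last helper mirrors the "wastes at most eight pages"
test that the patch below applies to sizes above PAGE_SIZE.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes requested for a tnode holding 2^bits child pointers.
 * header and ptr_size are passed in because they are assumptions here. */
static uint64_t tnode_bytes(uint64_t header, uint64_t ptr_size, unsigned int bits)
{
	return header + (ptr_size << bits);
}

/* Smallest power of two that is >= n, for n >= 1. */
static uint64_t round_up_pow2(uint64_t n)
{
	uint64_t p = 1;

	assert(n >= 1);
	while (p < n)
		p <<= 1;
	return p;
}

/* Accept the page allocator only when power-of-two rounding
 * wastes at most eight pages; otherwise the caller would fall
 * back to a vmalloc-style allocation. */
static int page_alloc_is_acceptable(uint64_t size, uint64_t page_size)
{
	return round_up_pow2(size) - size <= page_size * 8;
}
```

With the assumed 28-byte header, tnode_bytes(28, 4, 18) is 1048604 bytes,
which rounds up to 2097152 (2 << 20), so roughly half of the allocation
is wasted and page_alloc_is_acceptable() rejects it for 4096-byte pages.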
Here is a patch (untested, unfortunately) to implement this.

[IPV4] fib_trie: root_tnode can benefit from vmalloc()

The FIB_TRIE root node can be very large and currently hits the MAX_ORDER
limit. It also wastes about 50% of the allocated size because of the
power-of-two rounding of the tnode allocation.

Switching to vmalloc() can improve FIB_TRIE performance by allowing the
root node to grow past the alloc_pages() limit, while saving memory.

Special care must be taken to free such a zone: since an RCU handler is
not allowed to call vfree(), we use a worker instead.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>



[-- Attachment #2: trie_vmalloc.patch --]
[-- Type: text/plain, Size: 2325 bytes --]

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 9e491e7..871e9e9 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -120,9 +120,13 @@ struct tnode {
 	t_key key;
 	unsigned char pos;		/* 2log(KEYLENGTH) bits needed */
 	unsigned char bits;		/* 2log(KEYLENGTH) bits needed */
+	unsigned char vmalloced;
 	unsigned int full_children;	/* KEYLENGTH bits needed */
 	unsigned int empty_children;	/* KEYLENGTH bits needed */
-	struct rcu_head rcu;
+	union {
+		struct rcu_head rcu;
+		struct tnode *next;
+	};
 	struct node *child[0];
 };
 
@@ -347,17 +351,31 @@ static inline void free_leaf_info(struct leaf_info *leaf)
 static struct tnode *tnode_alloc(size_t size)
 {
 	struct page *pages;
+	struct tnode *tn;
 
 	if (size <= PAGE_SIZE)
 		return kzalloc(size, GFP_KERNEL);
 
-	pages = alloc_pages(GFP_KERNEL|__GFP_ZERO, get_order(size));
-	if (!pages)
-		return NULL;
-
-	return page_address(pages);
+	/*
+	 * Because of power of two requirements of alloc_pages(),
+	 * we prefer vmalloc() in case we waste too much memory.
+	 */
+	if (roundup_pow_of_two(size) - size <= PAGE_SIZE * 8) {
+		pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, get_order(size));
+		if (pages)
+			return page_address(pages);
+	}
+	tn = __vmalloc(size, GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL);
+	if (tn)
+		tn->vmalloced = 1;
+	return tn;
 }
 
+static void fb_worker_func(struct work_struct *work);
+static DECLARE_WORK(fb_vfree_work, fb_worker_func);
+static DEFINE_SPINLOCK(fb_vfree_lock);
+static struct tnode *fb_vfree_list;
+
 static void __tnode_free_rcu(struct rcu_head *head)
 {
 	struct tnode *tn = container_of(head, struct tnode, rcu);
@@ -366,8 +384,30 @@ static void __tnode_free_rcu(struct rcu_head *head)
 
 	if (size <= PAGE_SIZE)
 		kfree(tn);
-	else
+	else if (!tn->vmalloced)
 		free_pages((unsigned long)tn, get_order(size));
+	else {
+		spin_lock(&fb_vfree_lock);
+		tn->next = fb_vfree_list;
+		fb_vfree_list = tn;
+		schedule_work(&fb_vfree_work);
+		spin_unlock(&fb_vfree_lock);
+	}
+}
+
+static void fb_worker_func(struct work_struct *work)
+{
+	struct tnode *tn, *next;
+
+	spin_lock_bh(&fb_vfree_lock);
+	tn = fb_vfree_list;
+	fb_vfree_list = NULL;
+	spin_unlock_bh(&fb_vfree_lock);
+	while (tn) {
+		next = tn->next;
+		vfree(tn);
+		tn = next;
+	}
 }
 
 static inline void tnode_free(struct tnode *tn)
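
The deferred-free scheme in the patch (push onto a list from the context
that may not free, then detach the whole list and free it from a worker)
can be sketched in userspace as follows. This is only an illustration of
the pattern: a pthread mutex stands in for the spinlock, free() stands in
for vfree(), and the names big_node, defer_free and drain_pending are
hypothetical. Nothing here schedules the drain automatically, whereas the
patch calls schedule_work().

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct big_node {
	struct big_node *next;	/* link used only while awaiting free */
	size_t payload_len;
};

static pthread_mutex_t pending_lock = PTHREAD_MUTEX_INITIALIZER;
static struct big_node *pending_list;

/* Producer side: queue a node instead of freeing it directly. */
static void defer_free(struct big_node *n)
{
	assert(n != NULL);
	pthread_mutex_lock(&pending_lock);
	n->next = pending_list;
	pending_list = n;
	pthread_mutex_unlock(&pending_lock);
}

/* Consumer side: detach the whole list under the lock, then free
 * the nodes with the lock released. Returns how many were freed. */
static size_t drain_pending(void)
{
	struct big_node *n, *next;
	size_t freed = 0;

	pthread_mutex_lock(&pending_lock);
	n = pending_list;
	pending_list = NULL;
	pthread_mutex_unlock(&pending_lock);

	while (n) {
		next = n->next;	/* read the link before the node is gone */
		free(n);
		freed++;
		n = next;
	}
	return freed;
}
```

Detaching the list in one step keeps the lock hold time short and lets
the possibly slow free calls run without blocking producers.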
