netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC] fib_trie: flush improvement
@ 2008-04-02  0:27 Stephen Hemminger
  2008-04-02  8:01 ` Eric Dumazet
  2008-04-02  9:31 ` [RFC] fib_trie: flush improvement Robert Olsson
  0 siblings, 2 replies; 15+ messages in thread
From: Stephen Hemminger @ 2008-04-02  0:27 UTC (permalink / raw)
  To: Robert Olsson, David Miller; +Cc: netdev

This is an attempt to fix the problem described in:
     http://bugzilla.kernel.org/show_bug.cgi?id=6648
I can reproduce this by loading lots and lots of routes and then taking
the interface down. This causes all entries in the trie to be flushed, but
each leaf removal causes a rebalance of the trie. And since the removal
is depth first, it creates lots of needless work.

Instead, on flush, just walk the trie and prune as we go.
The implementation is for description only, it probably doesn't work yet.

--- a/net/ipv4/fib_trie.c	2008-04-01 13:54:53.000000000 -0700
+++ b/net/ipv4/fib_trie.c	2008-04-01 17:19:00.000000000 -0700
@@ -1665,7 +1665,7 @@ static int fn_trie_delete(struct fib_tab
 	return 0;
 }
 
-static int trie_flush_list(struct trie *t, struct list_head *head)
+static int trie_flush_list(struct list_head *head)
 {
 	struct fib_alias *fa, *fa_node;
 	int found = 0;
@@ -1683,24 +1683,74 @@ static int trie_flush_list(struct trie *
 	return found;
 }
 
-static int trie_flush_leaf(struct trie *t, struct leaf *l)
+static int trie_flush_leaf(struct leaf *l)
 {
-	int found = 0;
 	struct hlist_head *lih = &l->list;
 	struct hlist_node *node, *tmp;
 	struct leaf_info *li = NULL;
+	int found = 0;
 
 	hlist_for_each_entry_safe(li, node, tmp, lih, hlist) {
-		found += trie_flush_list(t, &li->falh);
+		found += trie_flush_list(&li->falh);
 
 		if (list_empty(&li->falh)) {
 			hlist_del_rcu(&li->hlist);
 			free_leaf_info(li);
 		}
 	}
+	tnode_free((struct tnode *) l);
+
 	return found;
 }
 
+static int trie_flush(struct tnode *p)
+{
+	struct node *c = NULL;
+	t_key idx = 0;
+	int found = 0;
+
+	for(;;) {
+		for ( ;idx < (1u << p->bits); ++idx) {
+			struct node *c = p->child[idx];
+			if (!c)
+				continue;
+
+			rcu_assign_pointer(p->child[idx], NULL);
+			if (IS_LEAF(c)) {
+				found += trie_flush_leaf((struct leaf *) c);
+				continue;
+			}
+
+			/* Descend one level */
+			p = (struct tnode *) c;
+			idx = 0;
+		}
+
+		/* Node empty, free and go back up */
+		tnode_free((struct tnode *) c);
+		c = (struct node *) p;
+		p = node_parent(c);
+		if (p == NULL)
+			return found; /* at root */
+
+		idx = tkey_extract_bits(c->key, p->pos, p->bits) + 1;
+	}
+}
+
+static int fn_trie_flush(struct fib_table *tb)
+{
+	struct trie *t = (struct trie *) tb->tb_data;
+	struct tnode *n = (struct tnode *) t->trie;
+
+	ASSERT_RTNL();
+	rcu_assign_pointer(t->trie, NULL);
+
+	if (IS_LEAF(n))
+		return trie_flush_leaf((struct leaf *) n);
+	else
+		return trie_flush(n);
+}
+
 /*
  * Scan for the next right leaf starting at node p->child[idx]
  * Since we have back pointer, no recursion necessary.
@@ -1772,30 +1822,6 @@ static struct leaf *trie_leafindex(struc
 }
 
 
-/*
- * Caller must hold RTNL.
- */
-static int fn_trie_flush(struct fib_table *tb)
-{
-	struct trie *t = (struct trie *) tb->tb_data;
-	struct leaf *l, *ll = NULL;
-	int found = 0;
-
-	for (l = trie_firstleaf(t); l; l = trie_nextleaf(l)) {
-		found += trie_flush_leaf(t, l);
-
-		if (ll && hlist_empty(&ll->list))
-			trie_leaf_remove(t, ll);
-		ll = l;
-	}
-
-	if (ll && hlist_empty(&ll->list))
-		trie_leaf_remove(t, ll);
-
-	pr_debug("trie_flush found=%d\n", found);
-	return found;
-}
-
 static void fn_trie_select_default(struct fib_table *tb,
 				   const struct flowi *flp,
 				   struct fib_result *res)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] fib_trie: flush improvement
  2008-04-02  0:27 [RFC] fib_trie: flush improvement Stephen Hemminger
@ 2008-04-02  8:01 ` Eric Dumazet
  2008-04-02 14:35   ` Eric Dumazet
  2008-04-02  9:31 ` [RFC] fib_trie: flush improvement Robert Olsson
  1 sibling, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2008-04-02  8:01 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Robert Olsson, David Miller, netdev

Stephen Hemminger a écrit :
> This is an attempt to fix the problem described in:
>      http://bugzilla.kernel.org/show_bug.cgi?id=6648
> I can reproduce this by loading lots and lots of routes and the taking
> the interface down. This causes all entries in trie to be flushed, but
> each leaf removal causes a rebalance of the trie. And since the removal
> is depth first, it creates lots of needless work.
>
> Instead on flush, just walk the trie and prune as we go.
> The implementation is for description only, it probably doesn't work yet.
>
>   

I dont get it, since the bug reporter mentions with recent kernels :

Fix inflate_threshold_root. Now=15 size=11 bits

Is it what you get with your tests ?

Pawel reports :

cat /proc/net/fib_triestat
Main: Aver depth: 2.26 Max depth: 6 Leaves: 235924
Internal nodes: 57854 1: 31632 2: 11422 3: 8475 4: 3755 5: 1676 6: 893 
18: 1

Pointers: 609760 Null ptrs: 315983 Total size: 16240 kB

warning messages comes from rootnode that cannot be expanded, since it 
hits MAX_ORDER (on a 32bit x86)



(sizeof(struct tnode) + (sizeof(struct node *) << bits);) is rounded to 
4 << (bit + 1), ie 2 << 20

For larger allocations Pawel has two choices :

change MAX_ORDER from 11 to 13 or 14
If this machine is a pure router, this change won't have a performance impact.

Or (more difficult, but more appropriate for mainline) change fib_trie.c 
to use vmalloc() for very big allocations (for the root only), and vfree()

Since vfree() cannot be called from rcu callback, one has to setup a 
struct work_struct helper.







^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC] fib_trie: flush improvement
  2008-04-02  0:27 [RFC] fib_trie: flush improvement Stephen Hemminger
  2008-04-02  8:01 ` Eric Dumazet
@ 2008-04-02  9:31 ` Robert Olsson
  1 sibling, 0 replies; 15+ messages in thread
From: Robert Olsson @ 2008-04-02  9:31 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Robert Olsson, David Miller, netdev


Stephen Hemminger writes:
 > This is an attempt to fix the problem described in:
 >      http://bugzilla.kernel.org/show_bug.cgi?id=6648
 > I can reproduce this by loading lots and lots of routes and the taking
 > the interface down. This causes all entries in trie to be flushed, but
 > each leaf removal causes a rebalance of the trie. And since the removal
 > is depth first, it creates lots of needless work.

 OK! Haven't seen it on our boxes yet. Yes, it's useless work to rebalance
 if we are flushing.

 > Instead on flush, just walk the trie and prune as we go.

 Yes something like that should work. 

 An alternative could be to add an argument to trie_leaf_remove (and 
 fib_insert_node) to control if rebalance should be run. Of course
 from flush we tell trie_leaf_remove not to rebalance.

 In general I'll think the trie should look pretty OK even if we rebalance 
 every 2:nd or 3:rd etc insert/delete. Which could decrease the load and 
 speed up BGP route convergence even more.

 It could be an idea to add some tuning possibilities from userland. For example the 
 node threshold values and possibly a rebalance threshold.

 Cheers
					--ro

 > 
 > --- a/net/ipv4/fib_trie.c	2008-04-01 13:54:53.000000000 -0700
 > +++ b/net/ipv4/fib_trie.c	2008-04-01 17:19:00.000000000 -0700
 > @@ -1665,7 +1665,7 @@ static int fn_trie_delete(struct fib_tab
 >  	return 0;
 >  }
 >  
 > -static int trie_flush_list(struct trie *t, struct list_head *head)
 > +static int trie_flush_list(struct list_head *head)
 >  {
 >  	struct fib_alias *fa, *fa_node;
 >  	int found = 0;
 > @@ -1683,24 +1683,74 @@ static int trie_flush_list(struct trie *
 >  	return found;
 >  }
 >  
 > -static int trie_flush_leaf(struct trie *t, struct leaf *l)
 > +static int trie_flush_leaf(struct leaf *l)
 >  {
 > -	int found = 0;
 >  	struct hlist_head *lih = &l->list;
 >  	struct hlist_node *node, *tmp;
 >  	struct leaf_info *li = NULL;
 > +	int found = 0;
 >  
 >  	hlist_for_each_entry_safe(li, node, tmp, lih, hlist) {
 > -		found += trie_flush_list(t, &li->falh);
 > +		found += trie_flush_list(&li->falh);
 >  
 >  		if (list_empty(&li->falh)) {
 >  			hlist_del_rcu(&li->hlist);
 >  			free_leaf_info(li);
 >  		}
 >  	}
 > +	tnode_free((struct tnode *) l);
 > +
 >  	return found;
 >  }
 >  
 > +static int trie_flush(struct tnode *p)
 > +{
 > +	struct node *c = NULL;
 > +	t_key idx = 0;
 > +	int found = 0;
 > +
 > +	for(;;) {
 > +		for ( ;idx < (1u << p->bits); ++idx) {
 > +			struct node *c = p->child[idx];
 > +			if (!c)
 > +				continue;
 > +
 > +			rcu_assign_pointer(p->child[idx], NULL);
 > +			if (IS_LEAF(c)) {
 > +				found += trie_flush_leaf((struct leaf *) c);
 > +				continue;
 > +			}
 > +
 > +			/* Descend one level */
 > +			p = (struct tnode *) c;
 > +			idx = 0;
 > +		}
 > +
 > +		/* Node empty, free and go back up */
 > +		tnode_free((struct tnode *) c);
 > +		c = (struct node *) p;
 > +		p = node_parent(c);
 > +		if (p == NULL)
 > +			return found; /* at root */
 > +
 > +		idx = tkey_extract_bits(c->key, p->pos, p->bits) + 1;
 > +	}
 > +}
 > +
 > +static int fn_trie_flush(struct fib_table *tb)
 > +{
 > +	struct trie *t = (struct trie *) tb->tb_data;
 > +	struct tnode *n = (struct tnode *) t->trie;
 > +
 > +	ASSERT_RTNL();
 > +	rcu_assign_pointer(t->trie, NULL);
 > +
 > +	if (IS_LEAF(n))
 > +		return trie_flush_leaf((struct leaf *) n);
 > +	else
 > +		return trie_flush(n);
 > +}
 > +
 >  /*
 >   * Scan for the next right leaf starting at node p->child[idx]
 >   * Since we have back pointer, no recursion necessary.
 > @@ -1772,30 +1822,6 @@ static struct leaf *trie_leafindex(struc
 >  }
 >  
 >  
 > -/*
 > - * Caller must hold RTNL.
 > - */
 > -static int fn_trie_flush(struct fib_table *tb)
 > -{
 > -	struct trie *t = (struct trie *) tb->tb_data;
 > -	struct leaf *l, *ll = NULL;
 > -	int found = 0;
 > -
 > -	for (l = trie_firstleaf(t); l; l = trie_nextleaf(l)) {
 > -		found += trie_flush_leaf(t, l);
 > -
 > -		if (ll && hlist_empty(&ll->list))
 > -			trie_leaf_remove(t, ll);
 > -		ll = l;
 > -	}
 > -
 > -	if (ll && hlist_empty(&ll->list))
 > -		trie_leaf_remove(t, ll);
 > -
 > -	pr_debug("trie_flush found=%d\n", found);
 > -	return found;
 > -}
 > -
 >  static void fn_trie_select_default(struct fib_table *tb,
 >  				   const struct flowi *flp,
 >  				   struct fib_result *res)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] fib_trie: flush improvement
  2008-04-02  8:01 ` Eric Dumazet
@ 2008-04-02 14:35   ` Eric Dumazet
  2008-04-02 18:03     ` Stephen Hemminger
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2008-04-02 14:35 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Stephen Hemminger, Robert Olsson, David Miller, netdev

[-- Attachment #1: Type: text/plain, Size: 2320 bytes --]

Eric Dumazet a écrit :
> Stephen Hemminger a écrit :
>> This is an attempt to fix the problem described in:
>>      http://bugzilla.kernel.org/show_bug.cgi?id=6648
>> I can reproduce this by loading lots and lots of routes and the taking
>> the interface down. This causes all entries in trie to be flushed, but
>> each leaf removal causes a rebalance of the trie. And since the removal
>> is depth first, it creates lots of needless work.
>>
>> Instead on flush, just walk the trie and prune as we go.
>> The implementation is for description only, it probably doesn't work 
>> yet.
>>
>>   
>
> I dont get it, since the bug reporter mentions with recent kernels :
>
> Fix inflate_threshold_root. Now=15 size=11 bits
>
> Is it what you get with your tests ?
>
> Pawel reports :
>
> cat /proc/net/fib_triestat
> Main: Aver depth: 2.26 Max depth: 6 Leaves: 235924
> Internal nodes: 57854 1: 31632 2: 11422 3: 8475 4: 3755 5: 1676 6: 893 
> 18: 1
>
> Pointers: 609760 Null ptrs: 315983 Total size: 16240 kB
>
> warning messages comes from rootnode that cannot be expanded, since it 
> hits MAX_ORDER (on a 32bit x86)
>
>
>
> (sizeof(struct tnode) + (sizeof(struct node *) << bits);) is rounded 
> to 4 << (bit + 1), ie 2 << 20
>
> For larger allocations Pawel has two choices :
>
> change MAX_ORDER from 11 to 13 or 14
> If this machine is a pure router, this change wont have performance 
> impact.
>
> Or (more difficult, but more appropriate for mainline) change 
> fib_trie.c to use vmalloc() for very big allocaions (for the root 
> only), and vfree()
>
> Since vfree() cannot be called from rcu callback, one has to setup a 
> struct work_struct helper.
>
Here is a patch (untested unfortunately) to implement this.

[IPV4] fib_trie: root_tnode can benefit of vmalloc()

FIB_TRIE root node can be very large and currently hits MAX_ORDER limit.
It also wastes about 50% of allocated size, because of power of two 
rounding of tnode.

A switch to vmalloc() can improve FIB_TRIE performance by allowing root 
node to grow
past the alloc_pages() limit, while preserving memory.

Special care must be taken to free such zone, as rcu handler is not 
allowed to call vfree(),
we use a worker instead.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>



[-- Attachment #2: trie_vmalloc.patch --]
[-- Type: text/plain, Size: 2325 bytes --]

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 9e491e7..871e9e9 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -120,9 +120,13 @@ struct tnode {
 	t_key key;
 	unsigned char pos;		/* 2log(KEYLENGTH) bits needed */
 	unsigned char bits;		/* 2log(KEYLENGTH) bits needed */
+	unsigned char vmalloced;
 	unsigned int full_children;	/* KEYLENGTH bits needed */
 	unsigned int empty_children;	/* KEYLENGTH bits needed */
-	struct rcu_head rcu;
+	union {
+		struct rcu_head rcu;
+		struct tnode *next;
+	};
 	struct node *child[0];
 };
 
@@ -347,17 +351,31 @@ static inline void free_leaf_info(struct leaf_info *leaf)
 static struct tnode *tnode_alloc(size_t size)
 {
 	struct page *pages;
+	struct tnode *tn;
 
 	if (size <= PAGE_SIZE)
 		return kzalloc(size, GFP_KERNEL);
 
-	pages = alloc_pages(GFP_KERNEL|__GFP_ZERO, get_order(size));
-	if (!pages)
-		return NULL;
-
-	return page_address(pages);
+	/*
+	 * Because of power of two requirements of alloc_pages(),
+	 * we prefer vmalloc() in case we waste too much memory.
+	 */
+	if (roundup_pow_of_two(size) - size <= PAGE_SIZE * 8) {
+		pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, get_order(size));
+		if (pages)
+			return page_address(pages);
+	}
+	tn = __vmalloc(size, GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL);
+	if (tn)
+		tn->vmalloced = 1;
+	return tn;
 }
 
+static void fb_worker_func(struct work_struct *work);
+static DECLARE_WORK(fb_vfree_work, fb_worker_func);
+static DEFINE_SPINLOCK(fb_vfree_lock);
+static struct tnode *fb_vfree_list;
+
 static void __tnode_free_rcu(struct rcu_head *head)
 {
 	struct tnode *tn = container_of(head, struct tnode, rcu);
@@ -366,8 +384,30 @@ static void __tnode_free_rcu(struct rcu_head *head)
 
 	if (size <= PAGE_SIZE)
 		kfree(tn);
-	else
+	else if (!tn->vmalloced)
 		free_pages((unsigned long)tn, get_order(size));
+	else {
+		spin_lock(&fb_vfree_lock);
+		tn->next = fb_vfree_list;
+		fb_vfree_list = tn;
+		schedule_work(&fb_vfree_work);
+		spin_unlock(&fb_vfree_lock);
+	}
+}
+
+static void fb_worker_func(struct work_struct *work)
+{
+	struct tnode *tn, *next;
+
+	spin_lock_bh(&fb_vfree_lock);
+	tn = fb_vfree_list;
+	fb_vfree_list = NULL;
+	spin_unlock_bh(&fb_vfree_lock);
+	while (tn) {
+		next = tn->next;
+		vfree(tn);
+		tn = next;
+	}
 }
 
 static inline void tnode_free(struct tnode *tn)

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC] fib_trie: flush improvement
  2008-04-02 14:35   ` Eric Dumazet
@ 2008-04-02 18:03     ` Stephen Hemminger
  2008-04-02 19:36       ` Eric Dumazet
  0 siblings, 1 reply; 15+ messages in thread
From: Stephen Hemminger @ 2008-04-02 18:03 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Eric Dumazet, Robert Olsson, David Miller, netdev

On Wed, 02 Apr 2008 16:35:04 +0200
Eric Dumazet <dada1@cosmosbay.com> wrote:

> Eric Dumazet a écrit :
> > Stephen Hemminger a écrit :
> >> This is an attempt to fix the problem described in:
> >>      http://bugzilla.kernel.org/show_bug.cgi?id=6648
> >> I can reproduce this by loading lots and lots of routes and the taking
> >> the interface down. This causes all entries in trie to be flushed, but
> >> each leaf removal causes a rebalance of the trie. And since the removal
> >> is depth first, it creates lots of needless work.
> >>
> >> Instead on flush, just walk the trie and prune as we go.
> >> The implementation is for description only, it probably doesn't work 
> >> yet.
> >>
> >>   
> >
> > I dont get it, since the bug reporter mentions with recent kernels :
> >
> > Fix inflate_threshold_root. Now=15 size=11 bits
> >
> > Is it what you get with your tests ?
> >
> > Pawel reports :
> >
> > cat /proc/net/fib_triestat
> > Main: Aver depth: 2.26 Max depth: 6 Leaves: 235924
> > Internal nodes: 57854 1: 31632 2: 11422 3: 8475 4: 3755 5: 1676 6: 893 
> > 18: 1
> >
> > Pointers: 609760 Null ptrs: 315983 Total size: 16240 kB
> >
> > warning messages comes from rootnode that cannot be expanded, since it 
> > hits MAX_ORDER (on a 32bit x86)
> >
> >
> >
> > (sizeof(struct tnode) + (sizeof(struct node *) << bits);) is rounded 
> > to 4 << (bit + 1), ie 2 << 20
> >
> > For larger allocations Pawel has two choices :
> >
> > change MAX_ORDER from 11 to 13 or 14
> > If this machine is a pure router, this change wont have performance 
> > impact.
> >
> > Or (more difficult, but more appropriate for mainline) change 
> > fib_trie.c to use vmalloc() for very big allocaions (for the root 
> > only), and vfree()
> >
> > Since vfree() cannot be called from rcu callback, one has to setup a 
> > struct work_struct helper.
> >
> Here is a patch (untested unfortunatly) to implement this.
> 
> [IPV4] fib_trie: root_tnode can benefit of vmalloc()
> 
> FIB_TRIE root node can be very large and currently hits MAX_ORDER limit.
> It also wastes about 50% of allocated size, because of power of two 
> rounding of tnode.
> 
> A switch to vmalloc() can improve FIB_TRIE performance by allowing root 
> node to grow
> past the alloc_pages() limit, while preserving memory.
> 
> Special care must be taken to free such zone, as rcu handler is not 
> allowed to call vfree(),
> we use a worker instead.
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> 
> 

Rather than switching between three allocation strategies, I would rather
just have kmalloc and vmalloc.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] fib_trie: flush improvement
  2008-04-02 18:03     ` Stephen Hemminger
@ 2008-04-02 19:36       ` Eric Dumazet
  2008-04-04 16:02         ` [RFC] fib_trie: memory waste solutions Stephen Hemminger
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2008-04-02 19:36 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Robert Olsson, David Miller, netdev

[-- Attachment #1: Type: text/plain, Size: 3172 bytes --]

Stephen Hemminger a écrit :
> On Wed, 02 Apr 2008 16:35:04 +0200
> Eric Dumazet <dada1@cosmosbay.com> wrote:
> 
>> Eric Dumazet a écrit :
>>> Stephen Hemminger a écrit :
>>>> This is an attempt to fix the problem described in:
>>>>      http://bugzilla.kernel.org/show_bug.cgi?id=6648
>>>> I can reproduce this by loading lots and lots of routes and the taking
>>>> the interface down. This causes all entries in trie to be flushed, but
>>>> each leaf removal causes a rebalance of the trie. And since the removal
>>>> is depth first, it creates lots of needless work.
>>>>
>>>> Instead on flush, just walk the trie and prune as we go.
>>>> The implementation is for description only, it probably doesn't work 
>>>> yet.
>>>>
>>>>   
>>> I dont get it, since the bug reporter mentions with recent kernels :
>>>
>>> Fix inflate_threshold_root. Now=15 size=11 bits
>>>
>>> Is it what you get with your tests ?
>>>
>>> Pawel reports :
>>>
>>> cat /proc/net/fib_triestat
>>> Main: Aver depth: 2.26 Max depth: 6 Leaves: 235924
>>> Internal nodes: 57854 1: 31632 2: 11422 3: 8475 4: 3755 5: 1676 6: 893 
>>> 18: 1
>>>
>>> Pointers: 609760 Null ptrs: 315983 Total size: 16240 kB
>>>
>>> warning messages comes from rootnode that cannot be expanded, since it 
>>> hits MAX_ORDER (on a 32bit x86)
>>>
>>>
>>>
>>> (sizeof(struct tnode) + (sizeof(struct node *) << bits);) is rounded 
>>> to 4 << (bit + 1), ie 2 << 20
>>>
>>> For larger allocations Pawel has two choices :
>>>
>>> change MAX_ORDER from 11 to 13 or 14
>>> If this machine is a pure router, this change wont have performance 
>>> impact.
>>>
>>> Or (more difficult, but more appropriate for mainline) change 
>>> fib_trie.c to use vmalloc() for very big allocaions (for the root 
>>> only), and vfree()
>>>
>>> Since vfree() cannot be called from rcu callback, one has to setup a 
>>> struct work_struct helper.
>>>
>> Here is a patch (untested unfortunatly) to implement this.
>>
>> [IPV4] fib_trie: root_tnode can benefit of vmalloc()
>>
>> FIB_TRIE root node can be very large and currently hits MAX_ORDER limit.
>> It also wastes about 50% of allocated size, because of power of two 
>> rounding of tnode.
>>
>> A switch to vmalloc() can improve FIB_TRIE performance by allowing root 
>> node to grow
>> past the alloc_pages() limit, while preserving memory.
>>
>> Special care must be taken to free such zone, as rcu handler is not 
>> allowed to call vfree(),
>> we use a worker instead.
>>
>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>>
>>
> 
> Rather than switching between three allocation strategies, I would rather
> just have kmalloc and vmalloc.

Yes, probably :)

[IPV4] fib_trie: root_tnode can benefit of vmalloc()

FIB_TRIE root node can be very large and currently hits MAX_ORDER limit.
It also wastes about 50% of allocated size, because of power of two
rounding of tnode.

A switch to vmalloc() can improve FIB_TRIE performance by allowing root
node to grow past the alloc_pages() limit, while preserving memory.

Special care must be taken to free such zone, as rcu handler is not
allowed to call vfree(), we use a worker instead.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>



[-- Attachment #2: trie_vmalloc.patch --]
[-- Type: text/plain, Size: 1837 bytes --]

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 9e491e7..c7d7d9e 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -122,7 +122,10 @@ struct tnode {
 	unsigned char bits;		/* 2log(KEYLENGTH) bits needed */
 	unsigned int full_children;	/* KEYLENGTH bits needed */
 	unsigned int empty_children;	/* KEYLENGTH bits needed */
-	struct rcu_head rcu;
+	union {
+		struct rcu_head rcu;
+		struct tnode *next;
+	};
 	struct node *child[0];
 };
 
@@ -346,18 +349,17 @@ static inline void free_leaf_info(struct leaf_info *leaf)
 
 static struct tnode *tnode_alloc(size_t size)
 {
-	struct page *pages;
-
 	if (size <= PAGE_SIZE)
 		return kzalloc(size, GFP_KERNEL);
 
-	pages = alloc_pages(GFP_KERNEL|__GFP_ZERO, get_order(size));
-	if (!pages)
-		return NULL;
-
-	return page_address(pages);
+	return __vmalloc(size, GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL);
 }
 
+static void fb_worker_func(struct work_struct *work);
+static DECLARE_WORK(fb_vfree_work, fb_worker_func);
+static DEFINE_SPINLOCK(fb_vfree_lock);
+static struct tnode *fb_vfree_list;
+
 static void __tnode_free_rcu(struct rcu_head *head)
 {
 	struct tnode *tn = container_of(head, struct tnode, rcu);
@@ -366,8 +368,28 @@ static void __tnode_free_rcu(struct rcu_head *head)
 
 	if (size <= PAGE_SIZE)
 		kfree(tn);
-	else
-		free_pages((unsigned long)tn, get_order(size));
+	else {
+		spin_lock(&fb_vfree_lock);
+		tn->next = fb_vfree_list;
+		fb_vfree_list = tn;
+		schedule_work(&fb_vfree_work);
+		spin_unlock(&fb_vfree_lock);
+	}
+}
+
+static void fb_worker_func(struct work_struct *work)
+{
+	struct tnode *tn, *next;
+
+	spin_lock_bh(&fb_vfree_lock);
+	tn = fb_vfree_list;
+	fb_vfree_list = NULL;
+	spin_unlock_bh(&fb_vfree_lock);
+	while (tn) {
+		next = tn->next;
+		vfree(tn);
+		tn = next;
+	}
 }
 
 static inline void tnode_free(struct tnode *tn)

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [RFC] fib_trie: memory waste solutions
  2008-04-02 19:36       ` Eric Dumazet
@ 2008-04-04 16:02         ` Stephen Hemminger
  2008-04-07  6:55           ` Robert Olsson
  2008-04-07 16:46           ` Eric Dumazet
  0 siblings, 2 replies; 15+ messages in thread
From: Stephen Hemminger @ 2008-04-04 16:02 UTC (permalink / raw)
  To: David Miller; +Cc: Eric Dumazet, Robert Olsson, netdev

[-- Attachment #1: Type: text/plain, Size: 576 bytes --]

Eric wisely pointed out that for larger sizes of nodes, the
current allocation in fib_trie wastes lots of memory.  For a sample
of routes extracted from the bugzilla bug the largest node grows
to 2M bytes on 64 bit system. This leads to 2044K of wasted memory.

There are two possible solutions (see attached). One uses vmalloc()
rather than alloc_pages, but has to add complexity on freeing.
The other adds a layer of indirection to the tnode lookup.

Both have been tested on net-2.6.26 with the huge route table.
I slightly prefer the vmalloc version, but both work fine.


[-- Attachment #2: fib-trie-vmalloc.patch --]
[-- Type: text/x-patch, Size: 1992 bytes --]

IPV4: fib_trie use vmalloc for large tnodes

Use vmalloc rather than alloc_pages to avoid wasting memory.
The problem is that tnode structure has a power of 2 sized array,
plus a header. So the current code wastes almost half the memory
allocated because it always needs the next bigger size to hold
that small header.

This is similar to an earlier patch by Eric, but instead of a list
and lock, I used a workqueue to handle the fact that vfree can't
be done in interrupt context.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

---
 net/ipv4/fib_trie.c |   25 +++++++++++++++----------
 1 file changed, 15 insertions(+), 10 deletions(-)

--- a/net/ipv4/fib_trie.c	2008-04-04 08:57:01.000000000 -0700
+++ b/net/ipv4/fib_trie.c	2008-04-04 08:57:03.000000000 -0700
@@ -122,7 +122,10 @@ struct tnode {
 	unsigned char bits;		/* 2log(KEYLENGTH) bits needed */
 	unsigned int full_children;	/* KEYLENGTH bits needed */
 	unsigned int empty_children;	/* KEYLENGTH bits needed */
-	struct rcu_head rcu;
+	union {
+		struct rcu_head rcu;
+		struct work_struct work;
+	};
 	struct node *child[0];
 };
 
@@ -350,16 +353,16 @@ static inline void free_leaf_info(struct
 
 static struct tnode *tnode_alloc(size_t size)
 {
-	struct page *pages;
-
 	if (size <= PAGE_SIZE)
 		return kzalloc(size, GFP_KERNEL);
+	else
+		return __vmalloc(size, GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL);
+}
 
-	pages = alloc_pages(GFP_KERNEL|__GFP_ZERO, get_order(size));
-	if (!pages)
-		return NULL;
-
-	return page_address(pages);
+static void __tnode_vfree(struct work_struct *arg)
+{
+	struct tnode *tn = container_of(arg, struct tnode, work);
+	vfree(tn);
 }
 
 static void __tnode_free_rcu(struct rcu_head *head)
@@ -370,8 +373,10 @@ static void __tnode_free_rcu(struct rcu_
 
 	if (size <= PAGE_SIZE)
 		kfree(tn);
-	else
-		free_pages((unsigned long)tn, get_order(size));
+	else {
+		INIT_WORK(&tn->work, __tnode_vfree);
+		schedule_work(&tn->work);
+	}
 }
 
 static inline void tnode_free(struct tnode *tn)

[-- Attachment #3: fib-trie-split-child.patch --]
[-- Type: text/x-patch, Size: 2794 bytes --]

IPV4: fib_trie split child off from tnode

For large tnode's the power of 2 allocation wastes almost half the space
since the tnode is allocated with the power of 2 sized child pointers.
To solve this add a layer of indirection. For small nodes (the common case)
can just use a single allocation, for larger use pages like before.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>


---
 net/ipv4/fib_trie.c |   43 ++++++++++++++++++++++++++-----------------
 1 file changed, 26 insertions(+), 17 deletions(-)

--- a/net/ipv4/fib_trie.c	2008-04-03 09:09:45.000000000 -0700
+++ b/net/ipv4/fib_trie.c	2008-04-03 22:08:33.000000000 -0700
@@ -120,10 +120,10 @@ struct tnode {
 	t_key key;
 	unsigned char pos;		/* 2log(KEYLENGTH) bits needed */
 	unsigned char bits;		/* 2log(KEYLENGTH) bits needed */
+	struct node **child;
 	unsigned int full_children;	/* KEYLENGTH bits needed */
 	unsigned int empty_children;	/* KEYLENGTH bits needed */
 	struct rcu_head rcu;
-	struct node *child[0];
 };
 
 #ifdef CONFIG_IP_FIB_TRIE_STATS
@@ -348,30 +348,40 @@ static inline void free_leaf_info(struct
 	call_rcu(&leaf->rcu, __leaf_info_free_rcu);
 }
 
-static struct tnode *tnode_alloc(size_t size)
+static struct tnode *tnode_alloc(int bits)
 {
-	struct page *pages;
-
-	if (size <= PAGE_SIZE)
-		return kzalloc(size, GFP_KERNEL);
+	size_t space = sizeof(struct node *) << bits;
+	size_t size = space + sizeof(struct tnode);
+	struct tnode *tn;
 
-	pages = alloc_pages(GFP_KERNEL|__GFP_ZERO, get_order(size));
-	if (!pages)
+	tn = kzalloc(size > PAGE_SIZE ? sizeof(struct tnode) : size, GFP_KERNEL);
+	if (!tn)
 		return NULL;
 
-	return page_address(pages);
+	if (size <= PAGE_SIZE)
+		tn->child = (struct node **) (tn + 1);
+	else {
+		struct page *pages = alloc_pages(GFP_KERNEL|__GFP_ZERO,
+						 get_order(space));
+		if (!pages) {
+			kfree(tn);
+			return NULL;
+		}
+		tn->child = page_address(pages);
+	}
+
+	return tn;
 }
 
 static void __tnode_free_rcu(struct rcu_head *head)
 {
 	struct tnode *tn = container_of(head, struct tnode, rcu);
-	size_t size = sizeof(struct tnode) +
-		      (sizeof(struct node *) << tn->bits);
+	size_t space = sizeof(struct node *) << tn->bits;
+	size_t size = space + sizeof(struct tnode);
 
-	if (size <= PAGE_SIZE)
-		kfree(tn);
-	else
-		free_pages((unsigned long)tn, get_order(size));
+	if (size > PAGE_SIZE)
+		free_pages((unsigned long)tn->child, get_order(space));
+	kfree(tn);
 }
 
 static inline void tnode_free(struct tnode *tn)
@@ -404,8 +414,7 @@ static struct leaf_info *leaf_info_new(i
 
 static struct tnode *tnode_new(t_key key, int pos, int bits)
 {
-	size_t sz = sizeof(struct tnode) + (sizeof(struct node *) << bits);
-	struct tnode *tn = tnode_alloc(sz);
+	struct tnode *tn = tnode_alloc(bits);
 
 	if (tn) {
 		tn->parent = T_TNODE;

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [RFC] fib_trie: memory waste solutions
  2008-04-04 16:02         ` [RFC] fib_trie: memory waste solutions Stephen Hemminger
@ 2008-04-07  6:55           ` Robert Olsson
  2008-04-07  7:58             ` Andi Kleen
  2008-04-07 16:46           ` Eric Dumazet
  1 sibling, 1 reply; 15+ messages in thread
From: Robert Olsson @ 2008-04-07  6:55 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, Eric Dumazet, Robert Olsson, netdev


Stephen Hemminger writes:
 > Eric wisely pointed out that for larger sizes of nodes, the
 > current allocation in fib_trie wastes lots of memory.  For a sample
 > of routes extracted from the bugzilla bug the largest node grows
 > to 2M bytes on 64 bit system. This leads to 2044K of wasted memory.
 > 
 > There are two possible solutions (see attached). One uses vmalloc()
 > rather than alloc_pages, but has to add complexity on freeing.
 > The other adds a layer of indirection to the tnode lookup.
 > 
 > Both have been tested on net-2.6.26 with the huge route table.
 > I slightly prefer the vmalloc version, but both work fine.

 Do we get slower with vmalloc due to TLB-lookups etc? Guess this
 should be investigated.

 http://mail.nl.linux.org/kernelnewbies/2005-12/msg00212.html
 
 Memory savings (different)

 Cheers
					--ro


Current BGP here.

        Aver depth:     2.58
        Max depth:      6
        Leaves:         238903
        Internal nodes: 58049
          1: 30668  2: 11458  3: 9214  4: 3904  5: 1738  6: 721  7: 271  8: 74  16: 1
        Pointers: 464272
Null ptrs: 167321
Total size: 7841  kB


 > 
 > IPV4: fib_trie use vmalloc for large tnodes
 > 
 > Use vmalloc rather than alloc_pages to avoid wasting memory.
 > The problem is that tnode structure has a power of 2 sized array,
 > plus a header. So the current code wastes almost half the memory
 > allocated because it always needs the next bigger size to hold
 > that small header.
 > 
 > This is similar to an earlier patch by Eric, but instead of a list
 > and lock, I used a workqueue to handle the fact that vfree can't
 > be done in interrupt context.
 > 
 > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
 > 
 > ---
 >  net/ipv4/fib_trie.c |   25 +++++++++++++++----------
 >  1 file changed, 15 insertions(+), 10 deletions(-)
 > 
 > --- a/net/ipv4/fib_trie.c	2008-04-04 08:57:01.000000000 -0700
 > +++ b/net/ipv4/fib_trie.c	2008-04-04 08:57:03.000000000 -0700
 > @@ -122,7 +122,10 @@ struct tnode {
 >  	unsigned char bits;		/* 2log(KEYLENGTH) bits needed */
 >  	unsigned int full_children;	/* KEYLENGTH bits needed */
 >  	unsigned int empty_children;	/* KEYLENGTH bits needed */
 > -	struct rcu_head rcu;
 > +	union {
 > +		struct rcu_head rcu;
 > +		struct work_struct work;
 > +	};
 >  	struct node *child[0];
 >  };
 >  
 > @@ -350,16 +353,16 @@ static inline void free_leaf_info(struct
 >  
 >  static struct tnode *tnode_alloc(size_t size)
 >  {
 > -	struct page *pages;
 > -
 >  	if (size <= PAGE_SIZE)
 >  		return kzalloc(size, GFP_KERNEL);
 > +	else
 > +		return __vmalloc(size, GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL);
 > +}
 >  
 > -	pages = alloc_pages(GFP_KERNEL|__GFP_ZERO, get_order(size));
 > -	if (!pages)
 > -		return NULL;
 > -
 > -	return page_address(pages);
 > +static void __tnode_vfree(struct work_struct *arg)
 > +{
 > +	struct tnode *tn = container_of(arg, struct tnode, work);
 > +	vfree(tn);
 >  }
 >  
 >  static void __tnode_free_rcu(struct rcu_head *head)
 > @@ -370,8 +373,10 @@ static void __tnode_free_rcu(struct rcu_
 >  
 >  	if (size <= PAGE_SIZE)
 >  		kfree(tn);
 > -	else
 > -		free_pages((unsigned long)tn, get_order(size));
 > +	else {
 > +		INIT_WORK(&tn->work, __tnode_vfree);
 > +		schedule_work(&tn->work);
 > +	}
 >  }
 >  
 >  static inline void tnode_free(struct tnode *tn)
 > IPV4: fib_trie split child off from tnode
 > 
 > For large tnode's the power of 2 allocation wastes almost half the space
 > since the tnode is allocated with the power of 2 sized child pointers.
 > To solve this add a layer of indirection. For small nodes (the common case)
 > can just use a single allocation, for larger use pages like before.
 > 
 > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
 > 
 > 
 > ---
 >  net/ipv4/fib_trie.c |   43 ++++++++++++++++++++++++++-----------------
 >  1 file changed, 26 insertions(+), 17 deletions(-)
 > 
 > --- a/net/ipv4/fib_trie.c	2008-04-03 09:09:45.000000000 -0700
 > +++ b/net/ipv4/fib_trie.c	2008-04-03 22:08:33.000000000 -0700
 > @@ -120,10 +120,10 @@ struct tnode {
 >  	t_key key;
 >  	unsigned char pos;		/* 2log(KEYLENGTH) bits needed */
 >  	unsigned char bits;		/* 2log(KEYLENGTH) bits needed */
 > +	struct node **child;
 >  	unsigned int full_children;	/* KEYLENGTH bits needed */
 >  	unsigned int empty_children;	/* KEYLENGTH bits needed */
 >  	struct rcu_head rcu;
 > -	struct node *child[0];
 >  };
 >  
 >  #ifdef CONFIG_IP_FIB_TRIE_STATS
 > @@ -348,30 +348,40 @@ static inline void free_leaf_info(struct
 >  	call_rcu(&leaf->rcu, __leaf_info_free_rcu);
 >  }
 >  
 > -static struct tnode *tnode_alloc(size_t size)
 > +static struct tnode *tnode_alloc(int bits)
 >  {
 > -	struct page *pages;
 > -
 > -	if (size <= PAGE_SIZE)
 > -		return kzalloc(size, GFP_KERNEL);
 > +	size_t space = sizeof(struct node *) << bits;
 > +	size_t size = space + sizeof(struct tnode);
 > +	struct tnode *tn;
 >  
 > -	pages = alloc_pages(GFP_KERNEL|__GFP_ZERO, get_order(size));
 > -	if (!pages)
 > +	tn = kzalloc(size > PAGE_SIZE ? sizeof(struct tnode) : size, GFP_KERNEL);
 > +	if (!tn)
 >  		return NULL;
 >  
 > -	return page_address(pages);
 > +	if (size <= PAGE_SIZE)
 > +		tn->child = (struct node **) (tn + 1);
 > +	else {
 > +		struct page *pages = alloc_pages(GFP_KERNEL|__GFP_ZERO,
 > +						 get_order(space));
 > +		if (!pages) {
 > +			kfree(tn);
 > +			return NULL;
 > +		}
 > +		tn->child = page_address(pages);
 > +	}
 > +
 > +	return tn;
 >  }
 >  
 >  static void __tnode_free_rcu(struct rcu_head *head)
 >  {
 >  	struct tnode *tn = container_of(head, struct tnode, rcu);
 > -	size_t size = sizeof(struct tnode) +
 > -		      (sizeof(struct node *) << tn->bits);
 > +	size_t space = sizeof(struct node *) << tn->bits;
 > +	size_t size = space + sizeof(struct tnode);
 >  
 > -	if (size <= PAGE_SIZE)
 > -		kfree(tn);
 > -	else
 > -		free_pages((unsigned long)tn, get_order(size));
 > +	if (size > PAGE_SIZE)
 > +		free_pages((unsigned long)tn->child, get_order(space));
 > +	kfree(tn);
 >  }
 >  
 >  static inline void tnode_free(struct tnode *tn)
 > @@ -404,8 +414,7 @@ static struct leaf_info *leaf_info_new(i
 >  
 >  static struct tnode *tnode_new(t_key key, int pos, int bits)
 >  {
 > -	size_t sz = sizeof(struct tnode) + (sizeof(struct node *) << bits);
 > -	struct tnode *tn = tnode_alloc(sz);
 > +	struct tnode *tn = tnode_alloc(bits);
 >  
 >  	if (tn) {
 >  		tn->parent = T_TNODE;

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] fib_trie: memory waste solutions
  2008-04-07  6:55           ` Robert Olsson
@ 2008-04-07  7:58             ` Andi Kleen
  2008-04-07 14:42               ` Robert Olsson
  0 siblings, 1 reply; 15+ messages in thread
From: Andi Kleen @ 2008-04-07  7:58 UTC (permalink / raw)
  To: Robert Olsson; +Cc: Stephen Hemminger, David Miller, Eric Dumazet, netdev

Robert Olsson <Robert.Olsson@data.slu.se> writes:
>
>  Do we get slower with vmalloc due to TLB-lookups etc? Guess this
>  should be investigated.

In some cases it might even go faster because a lot of x86 CPUs
have far more 4K TLBs than 2M TLBs. vmalloc is just quite expensive
in setup/free time, but that shouldn't be a big issue here.

The memory savings are impressive.

-Andi


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] fib_trie: memory waste solutions
  2008-04-07  7:58             ` Andi Kleen
@ 2008-04-07 14:42               ` Robert Olsson
  2008-04-07 15:15                 ` Andi Kleen
  0 siblings, 1 reply; 15+ messages in thread
From: Robert Olsson @ 2008-04-07 14:42 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Robert Olsson, Stephen Hemminger, David Miller, Eric Dumazet,
	netdev


Andi Kleen writes:

 > >  Do we get slower with vmalloc due to TLB-lookups etc? Guess this
 > >  should be investigated.
 > 
 > In some cases it might even go faster because a lot of x86 CPUs
 > have far more 4K TLBs than 2M TLBs. vmalloc is just quite expensive
 > in setup/free time, but that shouldn't be a big issue here.


 I've done some rDoS testing and the lookup performance is the same or 
 slightly better. So it should be fine.  


 Cheers
					--ro

 

  

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] fib_trie: memory waste solutions
  2008-04-07 14:42               ` Robert Olsson
@ 2008-04-07 15:15                 ` Andi Kleen
  2008-04-07 15:36                   ` Eric Dumazet
  0 siblings, 1 reply; 15+ messages in thread
From: Andi Kleen @ 2008-04-07 15:15 UTC (permalink / raw)
  To: Robert Olsson
  Cc: Andi Kleen, Stephen Hemminger, David Miller, Eric Dumazet, netdev

On Mon, Apr 07, 2008 at 04:42:35PM +0200, Robert Olsson wrote:
> 
> Andi Kleen writes:
> 
>  > >  Do we get slower with vmalloc due to TLB-lookups etc? Guess this
>  > >  should be investigated.
>  > 
>  > In some cases it might even go faster because a lot of x86 CPUs
>  > have far more 4K TLBs than 2M TLBs. vmalloc is just quite expensive
>  > in setup/free time, but that shouldn't be a big issue here.
> 
> 
>  I've done some rDoS testing and the lookup performance is the same or 
>  slightly better. So it should be fine.  

If you want more realistic worst case numbers run something in user space in 
the background that thrashes the TLBs constantly and then see how
the numbers change.

The main advantage of using large pages is that they tend to be separated
from 4K TLBs and since most user space doesn't use large pages it 
gives the kernel an effective private TLB pool.

With vmalloc it will now compete with whatever other TLB pigs are active.

-Andi

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] fib_trie: memory waste solutions
  2008-04-07 15:15                 ` Andi Kleen
@ 2008-04-07 15:36                   ` Eric Dumazet
  0 siblings, 0 replies; 15+ messages in thread
From: Eric Dumazet @ 2008-04-07 15:36 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Robert Olsson, Stephen Hemminger, David Miller, netdev

Andi Kleen a écrit :
> On Mon, Apr 07, 2008 at 04:42:35PM +0200, Robert Olsson wrote:
>   
>> Andi Kleen writes:
>>
>>  > >  Do we get slower with vmalloc due to TLB-lookups etc? Guess this
>>  > >  should be investigated.
>>  > 
>>  > In some cases it might even go faster because a lot of x86 CPUs
>>  > have far more 4K TLBs than 2M TLBs. vmalloc is just quite expensive
>>  > in setup/free time, but that shouldn't be a big issue here.
>>
>>
>>  I've done some rDoS testing and the lookup performance is the same or 
>>  slightly better. So it should be fine.  
>>     
>
> If you want more realistic worst case numbers run something in user space in 
> the background that thrashes the TLBs constantly and then see how
> the numbers change.
>
> The main advantage of using large pages is that they tend to be separated
> from 4K TLBs and since most user space doesn't use large pages it 
> gives the kernel an effective private TLB pool.
>
> With vmalloc it will now compete with whatever other TLB pigs are active.
>   
Yes, but with vmalloc(), NUMA machines have some chance to distribute 
this big area across several nodes

(only if the process currently expanding the fib_trie root node has an 
appropriate numa_policy.... ah well :) :) )





^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] fib_trie: memory waste solutions
  2008-04-04 16:02         ` [RFC] fib_trie: memory waste solutions Stephen Hemminger
  2008-04-07  6:55           ` Robert Olsson
@ 2008-04-07 16:46           ` Eric Dumazet
  2008-04-07 22:48             ` Stephen Hemminger
  1 sibling, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2008-04-07 16:46 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, Robert Olsson, netdev

Stephen Hemminger a écrit :
> Eric wisely pointed out that for larger sizes of nodes, the
> current allocation in fib_trie wastes lots of memory.  For a sample
> of routes extracted from the bugzilla bug the largest node grows
> to 2M bytes on 64 bit system. This leads to 2044K of wasted memory.
>
> There are two possible solutions (see attached). One uses vmalloc()
> rather than alloc_pages, but has to add complexity on freeing.
> The other adds a layer of indirection to the tnode lookup.
>
> Both have been tested on net-2.6.26 with the huge route table.
> I slightly prefer the vmalloc version, but both work fine.
>
>   
> ------------------------------------------------------------------------
>
> IPV4: fib_trie use vmalloc for large tnodes
>
> Use vmalloc rather than alloc_pages to avoid wasting memory.
> The problem is that tnode structure has a power of 2 sized array,
> plus a header. So the current code wastes almost half the memory
> allocated because it always needs the next bigger size to hold
> that small header.
>
> This is similar to an earlier patch by Eric, but instead of a list
> and lock, I used a workqueue to handle the fact that vfree can't
> be done in interrupt context.
>
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
>
> ---
>  net/ipv4/fib_trie.c |   25 +++++++++++++++----------
>  1 file changed, 15 insertions(+), 10 deletions(-)
>
> --- a/net/ipv4/fib_trie.c	2008-04-04 08:57:01.000000000 -0700
> +++ b/net/ipv4/fib_trie.c	2008-04-04 08:57:03.000000000 -0700
> @@ -122,7 +122,10 @@ struct tnode {
>  	unsigned char bits;		/* 2log(KEYLENGTH) bits needed */
>  	unsigned int full_children;	/* KEYLENGTH bits needed */
>  	unsigned int empty_children;	/* KEYLENGTH bits needed */
> -	struct rcu_head rcu;
> +	union {
> +		struct rcu_head rcu;
> +		struct work_struct work;
> +	};
>  	struct node *child[0];
>   
Hum...

I prefer my patch Stephen, as your version enlarges every tnode with an 
embedded "struct work_struct" which can be larger than a "struct rcu_head"

And as your 2nd patch doesn't use vmalloc() at all, we can only gain one 
order for the max bits in root node (19 instead of 18) :
We will hit the 'bug' again in a couple of months, or if router memory 
is somehow fragmented.






^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] fib_trie: memory waste solutions
  2008-04-07 16:46           ` Eric Dumazet
@ 2008-04-07 22:48             ` Stephen Hemminger
  2008-04-10  9:57               ` David Miller
  0 siblings, 1 reply; 15+ messages in thread
From: Stephen Hemminger @ 2008-04-07 22:48 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Stephen Hemminger, David Miller, Robert Olsson, netdev

On Mon, 07 Apr 2008 18:46:29 +0200
Eric Dumazet <dada1@cosmosbay.com> wrote:

> Stephen Hemminger a écrit :
> > Eric wisely pointed out that for larger sizes of nodes, the
> > current allocation in fib_trie wastes lots of memory.  For a sample
> > of routes extracted from the bugzilla bug the largest node grows
> > to 2M bytes on 64 bit system. This leads to 2044K of wasted memory.
> >
> > There are two possible solutions (see attached). One uses vmalloc()
> > rather than alloc_pages, but has to add complexity on freeing.
> > The other adds a layer of indirection to the tnode lookup.
> >
> > Both have been tested on net-2.6.26 with the huge route table.
> > I slightly prefer the vmalloc version, but both work fine.
> >
> >   
> > ------------------------------------------------------------------------
> >
> > IPV4: fib_trie use vmalloc for large tnodes
> >
> > Use vmalloc rather than alloc_pages to avoid wasting memory.
> > The problem is that tnode structure has a power of 2 sized array,
> > plus a header. So the current code wastes almost half the memory
> > allocated because it always needs the next bigger size to hold
> > that small header.
> >
> > This is similar to an earlier patch by Eric, but instead of a list
> > and lock, I used a workqueue to handle the fact that vfree can't
> > be done in interrupt context.
> >
> > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
> >
> > ---
> >  net/ipv4/fib_trie.c |   25 +++++++++++++++----------
> >  1 file changed, 15 insertions(+), 10 deletions(-)
> >
> > --- a/net/ipv4/fib_trie.c	2008-04-04 08:57:01.000000000 -0700
> > +++ b/net/ipv4/fib_trie.c	2008-04-04 08:57:03.000000000 -0700
> > @@ -122,7 +122,10 @@ struct tnode {
> >  	unsigned char bits;		/* 2log(KEYLENGTH) bits needed */
> >  	unsigned int full_children;	/* KEYLENGTH bits needed */
> >  	unsigned int empty_children;	/* KEYLENGTH bits needed */
> > -	struct rcu_head rcu;
> > +	union {
> > +		struct rcu_head rcu;
> > +		struct work_struct work;
> > +	};
> >  	struct node *child[0];
> >   
> Hum...
> 
> I prefer my patch Stephen, as your version enlarges every tnode with an 
> embedded "struct work_struct" which can be larger than a "struct rcu_head"

The number of tnodes is small and the size growth of 2*(unsigned long) is not
worth worrying about. Also, theoretically, my version could have multiple
work elements processed at once.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC] fib_trie: memory waste solutions
  2008-04-07 22:48             ` Stephen Hemminger
@ 2008-04-10  9:57               ` David Miller
  0 siblings, 0 replies; 15+ messages in thread
From: David Miller @ 2008-04-10  9:57 UTC (permalink / raw)
  To: stephen.hemminger; +Cc: dada1, shemminger, Robert.Olsson, netdev

From: Stephen Hemminger <stephen.hemminger@vyatta.com>
Date: Mon, 7 Apr 2008 15:48:00 -0700

> On Mon, 07 Apr 2008 18:46:29 +0200
> Eric Dumazet <dada1@cosmosbay.com> wrote:
> 
> > I prefer my patch Stephen, as your version enlarges every tnode with an 
> > embedded "struct work_struct" which can be larger than a "struct rcu_head"
> 
> The number of tnodes is small and the size growth of 2*(unsigned long) is not
> worth worrying about. Also, theoretically, my version could have multiple
> work elements processed at once.

I pretty much agree with Stephen and thus applied his patch
to net-2.6.26.

We can tweak this further if new data supports that.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2008-04-10  9:57 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-04-02  0:27 [RFC] fib_trie: flush improvement Stephen Hemminger
2008-04-02  8:01 ` Eric Dumazet
2008-04-02 14:35   ` Eric Dumazet
2008-04-02 18:03     ` Stephen Hemminger
2008-04-02 19:36       ` Eric Dumazet
2008-04-04 16:02         ` [RFC] fib_trie: memory waste solutions Stephen Hemminger
2008-04-07  6:55           ` Robert Olsson
2008-04-07  7:58             ` Andi Kleen
2008-04-07 14:42               ` Robert Olsson
2008-04-07 15:15                 ` Andi Kleen
2008-04-07 15:36                   ` Eric Dumazet
2008-04-07 16:46           ` Eric Dumazet
2008-04-07 22:48             ` Stephen Hemminger
2008-04-10  9:57               ` David Miller
2008-04-02  9:31 ` [RFC] fib_trie: flush improvement Robert Olsson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).