* [PATCH] net: implement emergency route cache rebuilds when gc_elasticity is exceeded
@ 2008-09-29 19:12 Neil Horman
2008-09-29 20:22 ` Eric Dumazet
` (3 more replies)
0 siblings, 4 replies; 64+ messages in thread
From: Neil Horman @ 2008-09-29 19:12 UTC (permalink / raw)
To: netdev; +Cc: kuznet, davem, pekkas, jmorris, yoshfuji, kaber, nhorman
Hey all-
We currently have the ability to disable our route cache secret interval
rebuild timer (by setting it to zero), but if we do that, it's possible for an
attacker (if they guess our route cache hash secret) to fill our system with
routes that all hash to the same bucket, destroying our performance. This patch
provides a backstop for that issue. In the event that our rebuild interval is
disabled (or very large), if any hash chain exceeds ip_rt_gc_elasticity, we do
an emergency hash rebuild. During the hash rebuild we:
1) warn the user of the emergency
2) disable the rebuild timer
3) invalidate the route caches
4) re-enable the rebuild timer with its old value
Regards
Neil
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
route.c | 36 +++++++++++++++++++++++++++++++++++-
1 file changed, 35 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 6ee5354..b95e02a 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -145,7 +145,7 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst);
static void ipv4_link_failure(struct sk_buff *skb);
static void ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu);
static int rt_garbage_collect(struct dst_ops *ops);
-
+static void rt_emergency_hash_rebuild(void);
static struct dst_ops ipv4_dst_ops = {
.family = AF_INET,
@@ -1056,6 +1056,15 @@ restart:
*candp = cand->u.dst.rt_next;
rt_free(cand);
}
+ } else if (chain_length > ip_rt_gc_elasticity) {
+ /*
+ * We didn't find a route entry we could safely free,
+ * yet our chain length is over our elasticity value.
+ * Someone may have guessed our hash secret and is artificially
+ * filling up our route cache. Let's do an emergency rebuild
+ * to be safe.
+ */
+ rt_emergency_hash_rebuild();
}
/* Try to bind route to arp only if it is output
@@ -2946,6 +2955,31 @@ static void rt_secret_reschedule(int old)
rtnl_unlock();
}
+static void rt_secret_rebuild_oneshot(void) {
+ struct net *net;
+ int old_interval = ip_rt_secret_interval;
+
+ ip_rt_secret_interval = 0;
+
+ rt_secret_reschedule(old_interval);
+
+ for_each_net(net)
+ rt_cache_invalidate(net);
+
+ ip_rt_secret_interval = old_interval;
+
+ rt_secret_reschedule(0);
+}
+
+static void rt_emergency_hash_rebuild(void) {
+ if (net_ratelimit()) {
+ printk(KERN_WARNING "Route hash chain too long!\n");
+ printk(KERN_WARNING "Adjust your secret_interval!\n");
+ }
+
+ rt_secret_rebuild_oneshot();
+}
+
static int ipv4_sysctl_rt_secret_interval(ctl_table *ctl, int write,
struct file *filp,
void __user *buffer, size_t *lenp,
--
/****************************************************
* Neil Horman <nhorman@tuxdriver.com>
* Software Engineer, Red Hat
****************************************************/
* Re: [PATCH] net: implement emergency route cache rebuilds when gc_elasticity is exceeded
2008-09-29 19:12 [PATCH] net: implement emergency route cache rebuilds when gc_elasticity is exceeded Neil Horman
@ 2008-09-29 20:22 ` Eric Dumazet
2008-09-29 20:27 ` Neil Horman
2008-09-30 14:08 ` David Miller
2008-09-30 14:08 ` David Miller
` (2 subsequent siblings)
3 siblings, 2 replies; 64+ messages in thread
From: Eric Dumazet @ 2008-09-29 20:22 UTC (permalink / raw)
To: Neil Horman; +Cc: netdev, kuznet, davem, pekkas, jmorris, yoshfuji, kaber
Neil Horman wrote:
> Hey all-
> We currently have the ability to disable our route cache secret interval
> rebuild timer (by setting it to zero), but if we do that its possible for an
> attacker (if they guess our route cache hash secret, to fill our system with
> routes that all hash to the same bucket, destroying our performance. This patch
> provides a backstop for that issues. In the event that our rebuild interval is
> disabled (or very large), if any hash chain exceeds ip_rt_gc_elasticity, we do
> an emergency hash rebuild. During the hash rebuild we:
> 1) warn the user of the emergency
> 2) disable the rebuild timer
> 3) invalidate the route caches
> 4) re-enable the rebuild timer with its old value
>
> Regards
> Neil
This does not sound good at all to me.
1) Don't set ip_rt_secret_interval to zero; this is plain silly, since
you give attackers infinite time to break your machine.
To quote Herbert (who allowed to set this interval to 0)
"Let me first state that disabling the route cache hash rebuild
should not be done without extensive analysis on the risk profile
and careful deliberation.
However, there are times when this can be done safely or for
testing. For example, when you have mechanisms for ensuring
that offending parties do not exist in your network."
2) Many machines have ip_rt_gc_elasticity set to 2,
because they have a huge hash table, but low chain depths.
>
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
>
>
> route.c | 36 +++++++++++++++++++++++++++++++++++-
> 1 file changed, 35 insertions(+), 1 deletion(-)
>
>
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 6ee5354..b95e02a 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -145,7 +145,7 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst);
> static void ipv4_link_failure(struct sk_buff *skb);
> static void ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu);
> static int rt_garbage_collect(struct dst_ops *ops);
> -
> +static void rt_emergency_hash_rebuild(void);
>
> static struct dst_ops ipv4_dst_ops = {
> .family = AF_INET,
> @@ -1056,6 +1056,15 @@ restart:
> *candp = cand->u.dst.rt_next;
> rt_free(cand);
> }
> + } else if (chain_length > ip_rt_gc_elasticity) {
> + /*
> + * We didn't find a route entry we could safely free,
> + * yet our chain length is over our elasticity value.
> + * Someone may have guessed our hash secret and is artifically
> + * filling up our route cache. Lets do an emergency rebuild
> + * to be safe
> + */
> + rt_emergency_hash_rebuild();
> }
>
> /* Try to bind route to arp only if it is output
> @@ -2946,6 +2955,31 @@ static void rt_secret_reschedule(int old)
> rtnl_unlock();
> }
>
> +static void rt_secret_rebuild_oneshot(void) {
> + struct net *net;
> + int old_interval = ip_rt_secret_interval;
> +
> + ip_rt_secret_interval = 0;
> +
> + rt_secret_reschedule(old_interval);
> +
> + for_each_net(net)
> + rt_cache_invalidate(net);
> +
> + ip_rt_secret_interval = old_interval;
> +
> + rt_secret_reschedule(0);
> +}
> +
> +static void rt_emergency_hash_rebuild(void) {
> + if (net_ratelimit()) {
> + printk(KERN_WARNING "Route hash chain too long!\n");
> + printk(KERN_WARNING "Adjust your secret_interval!\n");
> + }
> +
> + rt_secret_rebuild_oneshot();
> +}
> +
> static int ipv4_sysctl_rt_secret_interval(ctl_table *ctl, int write,
> struct file *filp,
> void __user *buffer, size_t *lenp,
* Re: [PATCH] net: implement emergency route cache rebuilds when gc_elasticity is exceeded
2008-09-29 20:22 ` Eric Dumazet
@ 2008-09-29 20:27 ` Neil Horman
2008-09-29 21:00 ` Eric Dumazet
2008-09-30 14:08 ` David Miller
1 sibling, 1 reply; 64+ messages in thread
From: Neil Horman @ 2008-09-29 20:27 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, kuznet, davem, pekkas, jmorris, yoshfuji, kaber
On Mon, Sep 29, 2008 at 10:22:03PM +0200, Eric Dumazet wrote:
> Neil Horman a écrit :
>> Hey all-
>> We currently have the ability to disable our route cache secret interval
>> rebuild timer (by setting it to zero), but if we do that its possible for an
>> attacker (if they guess our route cache hash secret, to fill our system with
>> routes that all hash to the same bucket, destroying our performance. This patch
>> provides a backstop for that issues. In the event that our rebuild interval is
>> disabled (or very large), if any hash chain exceeds ip_rt_gc_elasticity, we do
>> an emergency hash rebuild. During the hash rebuild we:
>> 1) warn the user of the emergency
>> 2) disable the rebuild timer
>> 3) invalidate the route caches
>> 4) re-enable the rebuild timer with its old value
>>
>> Regards
>> Neil
>
> This sounds not good at all to me.
>
> 1) Dont set ip_rt_secret_interval to zero, this is plain silly, since
> you give attackers infinite time to break your machine.
>
> To quote Herbert (who allowed to set this interval to 0)
>
> "Let me first state that disabling the route cache hash rebuild
> should not be done without extensive analysis on the risk profile
> and careful deliberation.
>
> However, there are times when this can be done safely or for
> testing. For example, when you have mechanisms for ensuring
> that offending parties do not exist in your network."
>
That's really rather the motivation behind this. The patch that Herbert
submitted with that commit explicitly lets one disable the rebuild timer. I
agree it's stupid to do that, but we added code to allow it. This provides a
patch to help people who are victimized because they've done exactly this
(additionally providing them a warning to stop doing it).
>
> 2) Many machines have ip_rt_gc_elasticity set to 2,
> because they have a huge hash table, but low chain depths.
Ok, that seems reasonable, and this isn't going to disallow that. By the same
reasoning, people who have huge hash tables and low chain depths wouldn't
want their low chain length violated, would they? This patch will warn
them if their assumptions are being violated.
Neil
--
/****************************************************
* Neil Horman <nhorman@tuxdriver.com>
* Software Engineer, Red Hat
****************************************************/
* Re: [PATCH] net: implement emergency route cache rebuilds when gc_elasticity is exceeded
2008-09-29 20:27 ` Neil Horman
@ 2008-09-29 21:00 ` Eric Dumazet
2008-09-29 22:38 ` Neil Horman
0 siblings, 1 reply; 64+ messages in thread
From: Eric Dumazet @ 2008-09-29 21:00 UTC (permalink / raw)
To: Neil Horman; +Cc: netdev, kuznet, davem, pekkas, jmorris, yoshfuji, kaber
Neil Horman wrote:
> On Mon, Sep 29, 2008 at 10:22:03PM +0200, Eric Dumazet wrote:
>> Neil Horman a écrit :
>>> Hey all-
>>> We currently have the ability to disable our route cache secret interval
>>> rebuild timer (by setting it to zero), but if we do that its possible for an
>>> attacker (if they guess our route cache hash secret, to fill our system with
>>> routes that all hash to the same bucket, destroying our performance. This patch
>>> provides a backstop for that issues. In the event that our rebuild interval is
>>> disabled (or very large), if any hash chain exceeds ip_rt_gc_elasticity, we do
>>> an emergency hash rebuild. During the hash rebuild we:
>>> 1) warn the user of the emergency
>>> 2) disable the rebuild timer
>>> 3) invalidate the route caches
>>> 4) re-enable the rebuild timer with its old value
>>>
>>> Regards
>>> Neil
>> This sounds not good at all to me.
>>
>> 1) Dont set ip_rt_secret_interval to zero, this is plain silly, since
>> you give attackers infinite time to break your machine.
>>
>> To quote Herbert (who allowed to set this interval to 0)
>>
>> "Let me first state that disabling the route cache hash rebuild
>> should not be done without extensive analysis on the risk profile
>> and careful deliberation.
>>
>> However, there are times when this can be done safely or for
>> testing. For example, when you have mechanisms for ensuring
>> that offending parties do not exist in your network."
>>
> Thats really rather the motivation behind this. The patch that Herbert
> submitted with that commit explicitly lets one disable their rebuild timer. I
> agree its stupid to do that, but we added code to allow it. This provides a
> patch to help people who are victimized because they've done exactly this
> (additionaly providing them a warning to stop doing it).
Strange... many kernel parameters can be set to hazardous values that make the machine unusable...
ip_rt_gc_interval can also be set to a very large value: no more route cache gc.
>
>
>> 2) Many machines have ip_rt_gc_elasticity set to 2,
>> because they have a huge hash table, but low chain depths.
> Ok, that seem reasonable, and this isn't going to disallow that. By the same
> resoning, people who have huge hash tables, and low chain depths won't
> want their low chain length being violated, would they? This patch will warn
> them if their assumptions are being violated.
Warn only? If I read your patch correctly, you do more than just warn in this case
(you invalidate the cache for each struct net, potentially wrapping rt_genid).
When you have 2^20 slots in the route cache hash table, you don't care if a few slots have 3 or 4 elements.
And the chance is very high that more than one slot has 3 or even 4 elements, no attacker needed.
Now if you change your code to something like
if (unlikely(chain_length > some_quite_big_number &&
ip_rt_secret_interval == 0)) {
do_something();
}
some_quite_big_number could be >= 30 or something...
then it might be OK (at least it won't break common setups)
* Re: [PATCH] net: implement emergency route cache rebuilds when gc_elasticity is exceeded
2008-09-29 21:00 ` Eric Dumazet
@ 2008-09-29 22:38 ` Neil Horman
2008-09-30 6:02 ` Eric Dumazet
0 siblings, 1 reply; 64+ messages in thread
From: Neil Horman @ 2008-09-29 22:38 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, kuznet, davem, pekkas, jmorris, yoshfuji, kaber
On Mon, Sep 29, 2008 at 11:00:35PM +0200, Eric Dumazet wrote:
> Neil Horman a écrit :
>> On Mon, Sep 29, 2008 at 10:22:03PM +0200, Eric Dumazet wrote:
>>> Neil Horman a écrit :
>>>> Hey all-
>>>> We currently have the ability to disable our route cache secret interval
>>>> rebuild timer (by setting it to zero), but if we do that its possible for an
>>>> attacker (if they guess our route cache hash secret, to fill our system with
>>>> routes that all hash to the same bucket, destroying our performance. This patch
>>>> provides a backstop for that issues. In the event that our rebuild interval is
>>>> disabled (or very large), if any hash chain exceeds ip_rt_gc_elasticity, we do
>>>> an emergency hash rebuild. During the hash rebuild we:
>>>> 1) warn the user of the emergency
>>>> 2) disable the rebuild timer
>>>> 3) invalidate the route caches
>>>> 4) re-enable the rebuild timer with its old value
>>>>
>>>> Regards
>>>> Neil
>>> This sounds not good at all to me.
>>>
>>> 1) Dont set ip_rt_secret_interval to zero, this is plain silly, since
>>> you give attackers infinite time to break your machine.
>>>
>>> To quote Herbert (who allowed to set this interval to 0)
>>>
>>> "Let me first state that disabling the route cache hash rebuild
>>> should not be done without extensive analysis on the risk profile
>>> and careful deliberation.
>>>
>>> However, there are times when this can be done safely or for
>>> testing. For example, when you have mechanisms for ensuring
>>> that offending parties do not exist in your network."
>>>
>> Thats really rather the motivation behind this. The patch that Herbert
>> submitted with that commit explicitly lets one disable their rebuild timer. I
>> agree its stupid to do that, but we added code to allow it. This provides a
>> patch to help people who are victimized because they've done exactly this
>> (additionaly providing them a warning to stop doing it).
>
> Strange... many kernel parameters can be set to hazardous values that make machine unusable...
>
> ip_rt_gc_interval can also be set to a very large value : No more route cache gc
>
If you want to talk philosophy, then I accept your premise: we can tune the
kernel in ways that make it unusable. Does that mean we should avoid doing
things to prevent that and to maintain usability? I think this patch accomplishes
that goal in its small way.
>>
>>
>>> 2) Many machines have ip_rt_gc_elasticity set to 2,
>>> because they have a huge hash table, but low chain depths.
>> Ok, that seem reasonable, and this isn't going to disallow that. By the same
>> resoning, people who have huge hash tables, and low chain depths won't
>> want their low chain length being violated, would they? This patch will warn
>> them if their assumptions are being violated.
>
> Warn only ? If I read your patch, you not only warn in this case.
No, not warn only; warn and correct is the clear intent here
> (you invalidate cache for each struct net, potentially wraping rt_genid)
>
If you overflow your elasticity 2^16 times, yes (I think rt_genid is a 16-bit
value, but I don't have the kernel in front of me). I would hope that would be
enough warnings to make a sysadmin address the problem.
> When you have 2^20 slots in route cache hash table, you dont care if few slots have 3 or 4 elements.
> And chance is very high that more than one slot has 3 or even 4 elements, no need for an attacker.
Ok, then I would assert that if you're ok with a few chains getting to be X entries in
length, you should set your elasticity to X.
>
> Now if you change your code to something like
>
> if (unlikely(chain_length > some_quite_big_number &&
> ip_rt_secret_interval == 0)) {
> do_something();
> }
>
> some_quite_big_number could be >= 30 or something...
>
I'd be ok with that, if some others chimed in with consensus on the matter. I
felt that we already had a value defining what chain length should be
(ip_rt_gc_elasticity is already used in comparisons on chain length in
rt_intern_hash, so I was keeping with that). But if some others speak up and
support a new sysctl, I can get behind that
> then it might be OK (at least it wont break common setups)
>
I don't think it will break many setups; the default value for this sysctl is 8.
If someone has set it lower and sees a warning, they won't lose their system
altogether; they'll just see a momentary reduction in throughput, and a
warning to increase the value they have set. It's going too far to say anything
will 'break'.
Thanks & Regards
Neil
>
>
>
--
/****************************************************
* Neil Horman <nhorman@tuxdriver.com>
* Software Engineer, Red Hat
****************************************************/
* Re: [PATCH] net: implement emergency route cache rebuilds when gc_elasticity is exceeded
2008-09-29 22:38 ` Neil Horman
@ 2008-09-30 6:02 ` Eric Dumazet
2008-09-30 11:23 ` Neil Horman
2008-09-30 14:10 ` David Miller
0 siblings, 2 replies; 64+ messages in thread
From: Eric Dumazet @ 2008-09-30 6:02 UTC (permalink / raw)
To: Neil Horman; +Cc: netdev, kuznet, davem, pekkas, jmorris, yoshfuji, kaber
Neil Horman wrote:
> On Mon, Sep 29, 2008 at 11:00:35PM +0200, Eric Dumazet wrote:
>> Neil Horman a écrit :
>>> On Mon, Sep 29, 2008 at 10:22:03PM +0200, Eric Dumazet wrote:
>>>> Neil Horman a écrit :
>>>>> Hey all-
>>>>> We currently have the ability to disable our route cache secret interval
>>>>> rebuild timer (by setting it to zero), but if we do that its possible for an
>>>>> attacker (if they guess our route cache hash secret, to fill our system with
>>>>> routes that all hash to the same bucket, destroying our performance. This patch
>>>>> provides a backstop for that issues. In the event that our rebuild interval is
>>>>> disabled (or very large), if any hash chain exceeds ip_rt_gc_elasticity, we do
>>>>> an emergency hash rebuild. During the hash rebuild we:
>>>>> 1) warn the user of the emergency
>>>>> 2) disable the rebuild timer
>>>>> 3) invalidate the route caches
>>>>> 4) re-enable the rebuild timer with its old value
>>>>>
>>>>> Regards
>>>>> Neil
>>>> This sounds not good at all to me.
>>>>
>>>> 1) Dont set ip_rt_secret_interval to zero, this is plain silly, since
>>>> you give attackers infinite time to break your machine.
>>>>
>>>> To quote Herbert (who allowed to set this interval to 0)
>>>>
>>>> "Let me first state that disabling the route cache hash rebuild
>>>> should not be done without extensive analysis on the risk profile
>>>> and careful deliberation.
>>>>
>>>> However, there are times when this can be done safely or for
>>>> testing. For example, when you have mechanisms for ensuring
>>>> that offending parties do not exist in your network."
>>>>
>>> Thats really rather the motivation behind this. The patch that Herbert
>>> submitted with that commit explicitly lets one disable their rebuild timer. I
>>> agree its stupid to do that, but we added code to allow it. This provides a
>>> patch to help people who are victimized because they've done exactly this
>>> (additionaly providing them a warning to stop doing it).
>> Strange... many kernel parameters can be set to hazardous values that make machine unusable...
>>
>> ip_rt_gc_interval can also be set to a very large value : No more route cache gc
>>
> If you want to talk philosophy, then I accept your premise: we can tune the
> kernel in ways that make it unusable. Does that mean we should avoid doing
> things to prevent that and maintain usibility. Ithink this patch accomplishes
> that goal in its small way.
>
>>>
>>>> 2) Many machines have ip_rt_gc_elasticity set to 2,
>>>> because they have a huge hash table, but low chain depths.
>>> Ok, that seem reasonable, and this isn't going to disallow that. By the same
>>> resoning, people who have huge hash tables, and low chain depths won't
>>> want their low chain length being violated, would they? This patch will warn
>>> them if their assumptions are being violated.
>> Warn only ? If I read your patch, you not only warn in this case.
> No, not warn only, Warn and correct is the clear intent here
>
>> (you invalidate cache for each struct net, potentially wraping rt_genid)
>>
> If you overflow your elasticity 2^16 times, yes (I think rt_genid is a 16 bit
> value, but I don't have the kernel in front of me). I would hope that would be
> enough warnings to make a sysadmin address the problem
>
>> When you have 2^20 slots in route cache hash table, you dont care if few slots have 3 or 4 elements.
>> And chance is very high that more than one slot has 3 or even 4 elements, no need for an attacker.
> Ok, then I would assert if your ok with a few chains getting to be X entries in
> length, then you should set your elasticity to be X.
Nope. We set elasticity to 2 when we want to trigger the code starting at line 1048:
if (cand) {
if (chain_length > ip_rt_gc_elasticity) {
*candp = cand->u.dst.rt_next;
rt_free(cand);
}
}
That is, freeing one entry if chain length is > 2 AND one entry is freeable.
This means that *some* chains eventually can have live entries and length > 2
without any problem. This is not an emergency case. I have seen on real
machines some chains hitting a length of 13 (elasticity=2); that was normal traffic,
and the rt cache was in equilibrium.
Your patch adds to the "if (chain_length > elasticity)" case:
if no entry is freeable in this slot, invalidate *all* of the cache and print a warning.
Invalidating live entries puts high pressure on dst_garbage.list.
Having 1,000,000 entries in this list is not very cool, and should be done
only if really necessary.
When you know you have about 1,000,000 live entries in the rt cache, you
can safely size your hash table to 2^20 slots, with elasticity set to 2,
so that the average chain length is <= 2, reducing cache misses at lookup time.
That is quite important to be able to handle 100,000 packets per second.
>
>> Now if you change your code to something like
>>
>> if (unlikely(chain_length > some_quite_big_number &&
>> ip_rt_secret_interval == 0)) {
>> do_something();
>> }
>>
>> some_quite_big_number could be >= 30 or something...
>>
> I'd be ok with that, if some others chimed in with consensus on the matter. I
> felt that we had a value definining what chain length should be
> (ip_rt_gc_elasticity is already used in comparison on chain length in
> rt_intern_hash, so I was keeping with that). But if some others speak up and
> support a new sysctl, I can get behind that
I said 30, but could have said 100. No need for a sysctl.
If only one chain is really that long (and attacked), all its entries are hot.
If many chains are hit, then the other rt...sysctl params will control and limit cache growth.
>
>> then it might be OK (at least it wont break common setups)
>>
> I don't think it will break many setups, the default value is 8 for this sysctl.
> If someone has set it lower, and sees a warning, they won't loose their system
> altogether, they'll just see a momentary reduction in throughput, and a
> warning to increase the value they have set. Its going to far to say anything
> will 'break'.
I spent many days making sure the route cache doesn't crash some big machines I have around,
and I have a feeling your patch will make them crawl again, unless the sysadmin is smart
enough to change the rt sysctls once again.
So my replies are not just random things from me.
When a machine is targeted by a DDoS attack, about all the slots of the hash table
are fully loaded (ie chain length >= elasticity). We don't need to invalidate
the cache, but to find an equilibrium, with small adjustments.
Thank you
* Re: [PATCH] net: implement emergency route cache rebuilds when gc_elasticity is exceeded
2008-09-30 6:02 ` Eric Dumazet
@ 2008-09-30 11:23 ` Neil Horman
2008-09-30 14:10 ` David Miller
1 sibling, 0 replies; 64+ messages in thread
From: Neil Horman @ 2008-09-30 11:23 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, kuznet, davem, pekkas, jmorris, yoshfuji, kaber
On Tue, Sep 30, 2008 at 08:02:44AM +0200, Eric Dumazet wrote:
> Neil Horman a écrit :
>> On Mon, Sep 29, 2008 at 11:00:35PM +0200, Eric Dumazet wrote:
>>> Neil Horman a écrit :
>>>> On Mon, Sep 29, 2008 at 10:22:03PM +0200, Eric Dumazet wrote:
>>>>> Neil Horman a écrit :
>>>>>> Hey all-
>>>>>> We currently have the ability to disable our route cache secret interval
>>>>>> rebuild timer (by setting it to zero), but if we do that its possible for an
>>>>>> attacker (if they guess our route cache hash secret, to fill our system with
>>>>>> routes that all hash to the same bucket, destroying our performance. This patch
>>>>>> provides a backstop for that issues. In the event that our rebuild interval is
>>>>>> disabled (or very large), if any hash chain exceeds ip_rt_gc_elasticity, we do
>>>>>> an emergency hash rebuild. During the hash rebuild we:
>>>>>> 1) warn the user of the emergency
>>>>>> 2) disable the rebuild timer
>>>>>> 3) invalidate the route caches
>>>>>> 4) re-enable the rebuild timer with its old value
>>>>>>
>>>>>> Regards
>>>>>> Neil
>>>>> This sounds not good at all to me.
>>>>>
>>>>> 1) Dont set ip_rt_secret_interval to zero, this is plain silly, since
>>>>> you give attackers infinite time to break your machine.
>>>>>
>>>>> To quote Herbert (who allowed to set this interval to 0)
>>>>>
>>>>> "Let me first state that disabling the route cache hash rebuild
>>>>> should not be done without extensive analysis on the risk profile
>>>>> and careful deliberation.
>>>>>
>>>>> However, there are times when this can be done safely or for
>>>>> testing. For example, when you have mechanisms for ensuring
>>>>> that offending parties do not exist in your network."
>>>>>
>>>> Thats really rather the motivation behind this. The patch that Herbert
>>>> submitted with that commit explicitly lets one disable their rebuild timer. I
>>>> agree its stupid to do that, but we added code to allow it. This provides a
>>>> patch to help people who are victimized because they've done exactly this
>>>> (additionaly providing them a warning to stop doing it).
>>> Strange... many kernel parameters can be set to hazardous values that make machine unusable...
>>>
>>> ip_rt_gc_interval can also be set to a very large value : No more route cache gc
>>>
>> If you want to talk philosophy, then I accept your premise: we can tune the
>> kernel in ways that make it unusable. Does that mean we should avoid doing
>> things to prevent that and maintain usibility. Ithink this patch accomplishes
>> that goal in its small way.
>>
>>>>
>>>>> 2) Many machines have ip_rt_gc_elasticity set to 2,
>>>>> because they have a huge hash table, but low chain depths.
>>>> Ok, that seem reasonable, and this isn't going to disallow that. By the same
>>>> resoning, people who have huge hash tables, and low chain depths won't
>>>> want their low chain length being violated, would they? This patch will warn
>>>> them if their assumptions are being violated.
>>> Warn only ? If I read your patch, you not only warn in this case.
>> No, not warn only, Warn and correct is the clear intent here
>>
>>> (you invalidate cache for each struct net, potentially wraping rt_genid)
>>>
>> If you overflow your elasticity 2^16 times, yes (I think rt_genid is a 16 bit
>> value, but I don't have the kernel in front of me). I would hope that would be
>> enough warnings to make a sysadmin address the problem
>>
>>> When you have 2^20 slots in route cache hash table, you dont care if few slots have 3 or 4 elements.
>>> And chance is very high that more than one slot has 3 or even 4 elements, no need for an attacker.
>> Ok, then I would assert if your ok with a few chains getting to be X entries in
>> length, then you should set your elasticity to be X.
>
> Nope. We set elasticity to 2 when we want to trigger code starting at line 1048
>
> if (cand) {
> if (chain_length > ip_rt_gc_elasticity) {
> *candp = cand->u.dst.rt_next;
> rt_free(cand);
> }
> }
>
> That is, freeing one entry if chain length is > 2 AND one entry is freeable.
> This means that *some* chains eventually can have live entries and length > 2
> without any problem. This is not an emergency case. I have seen on real
> machines some chains hitting length of 13 (elasticity=2), that was normal
> traffic,
> and rt cache was on equilibrium.
>
> Your patch adds to the "if (chain_length > elasticity)" case :
>
Not quite; the above is in an else clause to if (cand), that is to say, it runs when there
is no freeable entry and our chain length is greater than our elasticity, which
I believe is an emergency case.
> If no entry is freeable in this slot, invalidate *all* cache and put a warning.
>
> Invalidating live entries is puting a high pressure on dst_garbage.list.
> Having 1.000.000 entries in this list is not very cool, and should be done
> only if really necessary.
>
> When you know you have about 1.000.000 live entries in rt cache, you
> can safely make your hash table sized to 2^20 slots, and elasticity set to 2
> so that average chain length is <= 2, reducing cache misses at lookup time.
>
> That is quite important to be able to handle 100.000 packets per second
>
I agree, but this should not affect that ability (although I don't have the
equipment to check such high throughput). If you do, I would appreciate the
test.
>>
>>> Now if you change your code to something like
>>>
>>> if (unlikely(chain_length > some_quite_big_number &&
>>> ip_rt_secret_interval == 0)) {
>>> do_something();
>>> }
>>>
>>> some_quite_big_number could be >= 30 or something...
>>>
>> I'd be ok with that, if some others chimed in with consensus on the matter. I
>> felt that we had a value definining what chain length should be
>> (ip_rt_gc_elasticity is already used in comparison on chain length in
>> rt_intern_hash, so I was keeping with that). But if some others speak up and
>> support a new sysctl, I can get behind that
>
> I said 30, but could have said 100. No need for a sysctl.
> If only one chain is really that long (and attacked), all its entries are hot.
> If many chains are hit, then other rt...sysctl_params will control and limit cache grow.
>
>>
>>> then it might be OK (at least it wont break common setups)
>>>
>> I don't think it will break many setups; the default value is 8 for this sysctl.
>> If someone has set it lower and sees a warning, they won't lose their system
>> altogether, they'll just see a momentary reduction in throughput, and a
>> warning to increase the value they have set. It's going too far to say anything
>> will 'break'.
>
> I spent many days so that the route cache doesn't crash some big machines I have around,
> and I have a feeling your patch will make them crawl again, unless the sysadmin is smart
> enough to change the rt sysctls once again.
> So my replies are not just random things from me.
>
Again, if you have the ability to run such a high volume test and demonstrate
that this will happen, I'll gladly rescind the patch and go back to the drawing
board, but I think you're wrong.
Regards
Neil
--
/****************************************************
* Neil Horman <nhorman@tuxdriver.com>
* Software Engineer, Red Hat
****************************************************/
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-09-29 19:12 [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded Neil Horman
2008-09-29 20:22 ` Eric Dumazet
@ 2008-09-30 14:08 ` David Miller
2008-09-30 17:47 ` Eric Dumazet
2008-10-05 3:26 ` Herbert Xu
2008-09-30 14:17 ` Denis V. Lunev
2008-10-05 3:17 ` Herbert Xu
3 siblings, 2 replies; 64+ messages in thread
From: David Miller @ 2008-09-30 14:08 UTC (permalink / raw)
To: nhorman; +Cc: netdev, kuznet, pekkas, jmorris, yoshfuji, kaber
From: Neil Horman <nhorman@tuxdriver.com>
Date: Mon, 29 Sep 2008 15:12:54 -0400
> We currently have the ability to disable our route cache secret interval
> rebuild timer (by setting it to zero), but if we do that its possible for an
> attacker (if they guess our route cache hash secret, to fill our system with
> routes that all hash to the same bucket, destroying our performance. This patch
> provides a backstop for that issues. In the event that our rebuild interval is
> disabled (or very large), if any hash chain exceeds ip_rt_gc_elasticity, we do
> an emergency hash rebuild. During the hash rebuild we:
> 1) warn the user of the emergency
> 2) disable the rebuild timer
> 3) invalidate the route caches
> 4) re-enable the rebuild timer with its old value
I just want to clarify what my intentions were when I spoke
with Neil about this stuff last week.
The idea is that we can by default not rebuild the secret
at all.
And only when we notice that chains are growing larger than
"(NUM_RTCACHE_ENTRIES / NUM_HASH_CHAINS) * N" do we
do this secret rebuild and flush, where N is some
constant or configurable value; the GC elasticity is one
example.
Normally this whole hash secret business is totally unnecessary and
there is zero reason to do it until we notice there is actually some
kind of deep hash chain growth problem.
It's expensive, since we flush the whole routing cache, so doing it
every so often by default makes no sense, and it is causing
performance problems for people.
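The trigger David describes can be sketched in userspace as follows (hypothetical function and parameter names; `n` stands in for the constant or configurable multiplier, of which gc_elasticity is one example):

```c
#include <assert.h>

/* Rebuild only when a single chain grows past n times the average
 * load (entries / chains); a healthy hash table stays well under this. */
int chain_needs_rebuild(unsigned chain_len, unsigned entries,
                        unsigned nchains, unsigned n)
{
    return chain_len > (entries / nchains) * n;
}
```

With this shape, a fully loaded but evenly distributed cache never triggers; only a chain disproportionately longer than the average does.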
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-09-29 20:22 ` Eric Dumazet
2008-09-29 20:27 ` Neil Horman
@ 2008-09-30 14:08 ` David Miller
1 sibling, 0 replies; 64+ messages in thread
From: David Miller @ 2008-09-30 14:08 UTC (permalink / raw)
To: dada1; +Cc: nhorman, netdev, kuznet, pekkas, jmorris, yoshfuji, kaber
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Mon, 29 Sep 2008 22:22:03 +0200
> This sounds not good at all to me.
>
> 1) Dont set ip_rt_secret_interval to zero, this is plain silly, since
> you give attackers infinite time to break your machine.
>
It makes a ton of sense, even by default, if we set things up so that
we turn it back on when necessary.
And that's the final intended idea, to do exactly that.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-09-30 6:02 ` Eric Dumazet
2008-09-30 11:23 ` Neil Horman
@ 2008-09-30 14:10 ` David Miller
2008-09-30 17:16 ` Eric Dumazet
1 sibling, 1 reply; 64+ messages in thread
From: David Miller @ 2008-09-30 14:10 UTC (permalink / raw)
To: dada1; +Cc: nhorman, netdev, kuznet, pekkas, jmorris, yoshfuji, kaber
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Tue, 30 Sep 2008 08:02:44 +0200
> When a machine is targeted by a DDOS attack, about all slots of the
> hash table are fully loaded (ie chain length >= elasticity). We dont
> need to invalidate the cache, but find an equilibrium, with small
> adjustements.
Sure, but it is possible to determine that some hash chains
are unevenly growing out of control compared to others,
and that is the algorithm that Neil is trying to discover.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-09-29 19:12 [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded Neil Horman
2008-09-29 20:22 ` Eric Dumazet
2008-09-30 14:08 ` David Miller
@ 2008-09-30 14:17 ` Denis V. Lunev
2008-09-30 14:35 ` Neil Horman
2008-10-05 3:17 ` Herbert Xu
3 siblings, 1 reply; 64+ messages in thread
From: Denis V. Lunev @ 2008-09-30 14:17 UTC (permalink / raw)
To: Neil Horman; +Cc: netdev, kuznet, davem, pekkas, jmorris, yoshfuji, kaber
The patch is wrong with respect to
for_each_net
usage.
It should be used under rtnl. It is not held at the moment :(
Regards,
Den
On Mon, 2008-09-29 at 15:12 -0400, Neil Horman wrote:
> Hey all-
> We currently have the ability to disable our route cache secret interval
> rebuild timer (by setting it to zero), but if we do that its possible for an
> attacker (if they guess our route cache hash secret, to fill our system with
> routes that all hash to the same bucket, destroying our performance. This patch
> provides a backstop for that issues. In the event that our rebuild interval is
> disabled (or very large), if any hash chain exceeds ip_rt_gc_elasticity, we do
> an emergency hash rebuild. During the hash rebuild we:
> 1) warn the user of the emergency
> 2) disable the rebuild timer
> 3) invalidate the route caches
> 4) re-enable the rebuild timer with its old value
>
> Regards
> Neil
>
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
>
>
> route.c | 36 +++++++++++++++++++++++++++++++++++-
> 1 file changed, 35 insertions(+), 1 deletion(-)
>
>
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 6ee5354..b95e02a 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -145,7 +145,7 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst);
> static void ipv4_link_failure(struct sk_buff *skb);
> static void ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu);
> static int rt_garbage_collect(struct dst_ops *ops);
> -
> +static void rt_emergency_hash_rebuild(void);
>
> static struct dst_ops ipv4_dst_ops = {
> .family = AF_INET,
> @@ -1056,6 +1056,15 @@ restart:
> *candp = cand->u.dst.rt_next;
> rt_free(cand);
> }
> + } else if (chain_length > ip_rt_gc_elasticity) {
> + /*
> + * We didn't find a route entry we could safely free,
> + * yet our chain length is over our elasticity value.
> + * Someone may have guessed our hash secret and is artifically
> + * filling up our route cache. Lets do an emergency rebuild
> + * to be safe
> + */
> + rt_emergency_hash_rebuild();
> }
>
> /* Try to bind route to arp only if it is output
> @@ -2946,6 +2955,31 @@ static void rt_secret_reschedule(int old)
> rtnl_unlock();
> }
>
> +static void rt_secret_rebuild_oneshot(void) {
> + struct net *net;
> + int old_interval = ip_rt_secret_interval;
> +
> + ip_rt_secret_interval = 0;
> +
> + rt_secret_reschedule(old_interval);
> +
> + for_each_net(net)
> + rt_cache_invalidate(net);
> +
> + ip_rt_secret_interval = old_interval;
> +
> + rt_secret_reschedule(0);
> +}
> +
> +static void rt_emergency_hash_rebuild(void) {
> + if (net_ratelimit()) {
> + printk(KERN_WARNING "Route hash chain too long!\n");
> + printk(KERN_WARNING "Adjust your secret_interval!\n");
> + }
> +
> + rt_secret_rebuild_oneshot();
> +}
> +
> static int ipv4_sysctl_rt_secret_interval(ctl_table *ctl, int write,
> struct file *filp,
> void __user *buffer, size_t *lenp,
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-09-30 14:17 ` Denis V. Lunev
@ 2008-09-30 14:35 ` Neil Horman
2008-09-30 14:49 ` Denis V. Lunev
0 siblings, 1 reply; 64+ messages in thread
From: Neil Horman @ 2008-09-30 14:35 UTC (permalink / raw)
To: Denis V. Lunev; +Cc: netdev, kuznet, davem, pekkas, jmorris, yoshfuji, kaber
On Tue, Sep 30, 2008 at 06:17:12PM +0400, Denis V. Lunev wrote:
> The patch is wrong in respect to
> for_each_net
> usage.
>
> It should be used under rtnl. It is not held at the moment :(
>
Thank you, I'll be sure to take that into consideration once we determine
the proper algorithm to implement here.
Although, I used the rt_secret_rebuild timer function as my guide here, and it
doesn't hold the rtnl lock either; or is that because it's running in a softirq
context?
Neil
> Regards,
> Den
>
> On Mon, 2008-09-29 at 15:12 -0400, Neil Horman wrote:
> > Hey all-
> > We currently have the ability to disable our route cache secret interval
> > rebuild timer (by setting it to zero), but if we do that its possible for an
> > attacker (if they guess our route cache hash secret, to fill our system with
> > routes that all hash to the same bucket, destroying our performance. This patch
> > provides a backstop for that issues. In the event that our rebuild interval is
> > disabled (or very large), if any hash chain exceeds ip_rt_gc_elasticity, we do
> > an emergency hash rebuild. During the hash rebuild we:
> > 1) warn the user of the emergency
> > 2) disable the rebuild timer
> > 3) invalidate the route caches
> > 4) re-enable the rebuild timer with its old value
> >
> > Regards
> > Neil
> >
> > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> >
> >
> > route.c | 36 +++++++++++++++++++++++++++++++++++-
> > 1 file changed, 35 insertions(+), 1 deletion(-)
> >
> >
> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > index 6ee5354..b95e02a 100644
> > --- a/net/ipv4/route.c
> > +++ b/net/ipv4/route.c
> > @@ -145,7 +145,7 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst);
> > static void ipv4_link_failure(struct sk_buff *skb);
> > static void ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu);
> > static int rt_garbage_collect(struct dst_ops *ops);
> > -
> > +static void rt_emergency_hash_rebuild(void);
> >
> > static struct dst_ops ipv4_dst_ops = {
> > .family = AF_INET,
> > @@ -1056,6 +1056,15 @@ restart:
> > *candp = cand->u.dst.rt_next;
> > rt_free(cand);
> > }
> > + } else if (chain_length > ip_rt_gc_elasticity) {
> > + /*
> > + * We didn't find a route entry we could safely free,
> > + * yet our chain length is over our elasticity value.
> > + * Someone may have guessed our hash secret and is artifically
> > + * filling up our route cache. Lets do an emergency rebuild
> > + * to be safe
> > + */
> > + rt_emergency_hash_rebuild();
> > }
> >
> > /* Try to bind route to arp only if it is output
> > @@ -2946,6 +2955,31 @@ static void rt_secret_reschedule(int old)
> > rtnl_unlock();
> > }
> >
> > +static void rt_secret_rebuild_oneshot(void) {
> > + struct net *net;
> > + int old_interval = ip_rt_secret_interval;
> > +
> > + ip_rt_secret_interval = 0;
> > +
> > + rt_secret_reschedule(old_interval);
> > +
> > + for_each_net(net)
> > + rt_cache_invalidate(net);
> > +
> > + ip_rt_secret_interval = old_interval;
> > +
> > + rt_secret_reschedule(0);
> > +}
> > +
> > +static void rt_emergency_hash_rebuild(void) {
> > + if (net_ratelimit()) {
> > + printk(KERN_WARNING "Route hash chain too long!\n");
> > + printk(KERN_WARNING "Adjust your secret_interval!\n");
> > + }
> > +
> > + rt_secret_rebuild_oneshot();
> > +}
> > +
> > static int ipv4_sysctl_rt_secret_interval(ctl_table *ctl, int write,
> > struct file *filp,
> > void __user *buffer, size_t *lenp,
>
>
--
/****************************************************
* Neil Horman <nhorman@tuxdriver.com>
* Software Engineer, Red Hat
****************************************************/
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-09-30 14:35 ` Neil Horman
@ 2008-09-30 14:49 ` Denis V. Lunev
0 siblings, 0 replies; 64+ messages in thread
From: Denis V. Lunev @ 2008-09-30 14:49 UTC (permalink / raw)
To: Neil Horman; +Cc: netdev, kuznet, davem, pekkas, jmorris, yoshfuji, kaber
On Tue, 2008-09-30 at 10:35 -0400, Neil Horman wrote:
> On Tue, Sep 30, 2008 at 06:17:12PM +0400, Denis V. Lunev wrote:
> > The patch is wrong in respect to
> > for_each_net
> > usage.
> >
> > It should be used under rtnl. It is not held at the moment :(
> >
> Thank you, I'll be sure to take that into consideration when we determine what
> the proper algorithm to implement here is.
>
> Although, I used the rt_secret_rebuild timer function as my guide here, and it
> doesn't hold the rtnl lock either, or is that because its running in a softirq
> context?
No, there is no lock because there is no iteration over the net namespace list;
each namespace has its own timer.
The simplest approach I see right now is to re-iterate over the current dst
cache chain and mark only those namespaces. This is at least fair, as
only one namespace can be attacked (not the entire node).
All the locking for this is in place.
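A rough userspace model of Denis's suggestion (all types and names here are hypothetical; the real code would walk rtable entries under the per-bucket hash lock):

```c
#include <stddef.h>

struct net { int flush_pending; };
struct rt_entry { struct net *net; struct rt_entry *next; };

/* Mark for invalidation only the namespaces that actually own entries
 * on the overlong chain, instead of flushing every namespace on the
 * system. */
void mark_chain_namespaces(struct rt_entry *chain)
{
    for (; chain; chain = chain->next)
        chain->net->flush_pending = 1;
}
```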
Regards,
Den
> Neil
>
> > Regards,
> > Den
> >
> > On Mon, 2008-09-29 at 15:12 -0400, Neil Horman wrote:
> > > Hey all-
> > > We currently have the ability to disable our route cache secret interval
> > > rebuild timer (by setting it to zero), but if we do that its possible for an
> > > attacker (if they guess our route cache hash secret, to fill our system with
> > > routes that all hash to the same bucket, destroying our performance. This patch
> > > provides a backstop for that issues. In the event that our rebuild interval is
> > > disabled (or very large), if any hash chain exceeds ip_rt_gc_elasticity, we do
> > > an emergency hash rebuild. During the hash rebuild we:
> > > 1) warn the user of the emergency
> > > 2) disable the rebuild timer
> > > 3) invalidate the route caches
> > > 4) re-enable the rebuild timer with its old value
> > >
> > > Regards
> > > Neil
> > >
> > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > >
> > >
> > > route.c | 36 +++++++++++++++++++++++++++++++++++-
> > > 1 file changed, 35 insertions(+), 1 deletion(-)
> > >
> > >
> > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > > index 6ee5354..b95e02a 100644
> > > --- a/net/ipv4/route.c
> > > +++ b/net/ipv4/route.c
> > > @@ -145,7 +145,7 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst);
> > > static void ipv4_link_failure(struct sk_buff *skb);
> > > static void ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu);
> > > static int rt_garbage_collect(struct dst_ops *ops);
> > > -
> > > +static void rt_emergency_hash_rebuild(void);
> > >
> > > static struct dst_ops ipv4_dst_ops = {
> > > .family = AF_INET,
> > > @@ -1056,6 +1056,15 @@ restart:
> > > *candp = cand->u.dst.rt_next;
> > > rt_free(cand);
> > > }
> > > + } else if (chain_length > ip_rt_gc_elasticity) {
> > > + /*
> > > + * We didn't find a route entry we could safely free,
> > > + * yet our chain length is over our elasticity value.
> > > + * Someone may have guessed our hash secret and is artifically
> > > + * filling up our route cache. Lets do an emergency rebuild
> > > + * to be safe
> > > + */
> > > + rt_emergency_hash_rebuild();
> > > }
> > >
> > > /* Try to bind route to arp only if it is output
> > > @@ -2946,6 +2955,31 @@ static void rt_secret_reschedule(int old)
> > > rtnl_unlock();
> > > }
> > >
> > > +static void rt_secret_rebuild_oneshot(void) {
> > > + struct net *net;
> > > + int old_interval = ip_rt_secret_interval;
> > > +
> > > + ip_rt_secret_interval = 0;
> > > +
> > > + rt_secret_reschedule(old_interval);
> > > +
> > > + for_each_net(net)
> > > + rt_cache_invalidate(net);
> > > +
> > > + ip_rt_secret_interval = old_interval;
> > > +
> > > + rt_secret_reschedule(0);
> > > +}
> > > +
> > > +static void rt_emergency_hash_rebuild(void) {
> > > + if (net_ratelimit()) {
> > > + printk(KERN_WARNING "Route hash chain too long!\n");
> > > + printk(KERN_WARNING "Adjust your secret_interval!\n");
> > > + }
> > > +
> > > + rt_secret_rebuild_oneshot();
> > > +}
> > > +
> > > static int ipv4_sysctl_rt_secret_interval(ctl_table *ctl, int write,
> > > struct file *filp,
> > > void __user *buffer, size_t *lenp,
> >
> >
>
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-09-30 14:10 ` David Miller
@ 2008-09-30 17:16 ` Eric Dumazet
2008-09-30 18:42 ` Neil Horman
` (2 more replies)
0 siblings, 3 replies; 64+ messages in thread
From: Eric Dumazet @ 2008-09-30 17:16 UTC (permalink / raw)
To: David Miller
Cc: nhorman, netdev, kuznet, pekkas, jmorris, yoshfuji, kaber,
Evgeniy Polyakov
David Miller wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Tue, 30 Sep 2008 08:02:44 +0200
>
>> When a machine is targeted by a DDOS attack, about all slots of the
>> hash table are fully loaded (ie chain length >= elasticity). We dont
>> need to invalidate the cache, but find an equilibrium, with small
>> adjustements.
>
> Sure, but it is possible to determine that some hash chains
> are unevenly growing out of control compared to others,
> and that is the algorithm that Neil is trying to discover.
>
>
No problem, but my suggestion to use a threshold separate from elasticity
was apparently not taken into consideration.
I ran an experiment on a big stable machine with 2^19 rtcache slots,
scanning all chains, and found *many* of them having length > elasticity,
the maximum being 13. I am sure that's allowed by the laws of statistics.
(average chain length: 3.55)
In order to avoid unnecessary cache invalidation, we need some
calculation from a statistics expert.
Given rt_hash_size and elasticity (or rt_max_size), compute the "maximum reasonable"
chain length, i.e. some number X where probability(chain_length < X) > 0.9999.
(CCed Evgeniy Polyakov :) )
MemTotal: 32963064 kB
8 CPUS
/proc/sys/net/ipv4/route/max_size:8388608 (default at boot time)
/proc/sys/net/ipv4/route/gc_thresh:2000000
/proc/sys/net/ipv4/route/gc_elasticity:4
/proc/sys/net/ipv4/route/gc_interval:1
Linux version 2.6.24.5
cat /proc/net/sockstat
sockets: used 957514
TCP: inuse 963998 orphan 7100 tw 10393 alloc 964646 mem 376389
rtstat -c5 -i5 (minus first sample)
rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|
entries| in_hit|in_slow_|in_slow_|in_no_ro| in_brd|in_marti|in_marti| out_hit|out_slow|out_slow|gc_total|gc_ignor|gc_goal_|gc_dst_o|in_hlist|out_hlis|
| | tot| mc| ute| | an_dst| an_src| | _tot| _mc| | ed| miss| verflow| _search|t_search|
1862477| 23758| 4400| 0| 0| 0| 0| 0| 4142| 1134| 0| 0| 0| 0| 0| 45754| 11785|
1863310| 24794| 4504| 0| 0| 0| 0| 0| 4089| 1183| 0| 0| 0| 0| 0| 47558| 11640|
1863604| 24183| 4383| 0| 0| 0| 0| 0| 3910| 1061| 0| 0| 0| 0| 0| 46002| 10819|
1864473| 23899| 4510| 0| 0| 0| 0| 0| 4113| 1193| 0| 0| 0| 0| 0| 46451| 11798|
grep ip_dst_cache /proc/slabinfo
ip_dst_cache 1863543 1868660 384 10 1 : tunables 0 0 0 : slabdata 186866 186866 0
On this machine, a single "ip route flush cache" is risky
(commit 29e75252da20f3ab9e132c68c9aed156b87beae6, "[IPV4] route cache: Introduce
rt_genid for smooth cache invalidation", not yet included in this kernel).
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-09-30 14:08 ` David Miller
@ 2008-09-30 17:47 ` Eric Dumazet
2008-10-05 3:26 ` Herbert Xu
1 sibling, 0 replies; 64+ messages in thread
From: Eric Dumazet @ 2008-09-30 17:47 UTC (permalink / raw)
To: David Miller; +Cc: nhorman, netdev, kuznet, pekkas, jmorris, yoshfuji, kaber
David Miller wrote:
> From: Neil Horman <nhorman@tuxdriver.com>
> Date: Mon, 29 Sep 2008 15:12:54 -0400
>
>> We currently have the ability to disable our route cache secret interval
>> rebuild timer (by setting it to zero), but if we do that its possible for an
>> attacker (if they guess our route cache hash secret, to fill our system with
>> routes that all hash to the same bucket, destroying our performance. This patch
>> provides a backstop for that issues. In the event that our rebuild interval is
>> disabled (or very large), if any hash chain exceeds ip_rt_gc_elasticity, we do
>> an emergency hash rebuild. During the hash rebuild we:
>> 1) warn the user of the emergency
>> 2) disable the rebuild timer
>> 3) invalidate the route caches
>> 4) re-enable the rebuild timer with its old value
>
> I just want to clarify what my intentions were when I spoke
> with Neil about this stuff last week.
>
> The idea is that we can by default not rebuild the secret
> at all.
>
> And only when we notice that chains are growing larger than
> "(NUM_RTCACHE_ENTRIES / NUM_HASH_CHAINS) * N", only then
> do we do this secret rebuild and flush. Where N is some
> constant of configurable value, the GC elasticity is some
> example.
>
> Normally this whole hash secret business is totally unnecessary and
> there is zero reason to do it until we notice there is actually some
> kind of deep hash chain growth problem.
>
> It's expensive, we flush the whole routing cache, so doing it
> every so often by default makes no sense and it is causing
> performance problems for people.
Intentions are very good, thanks for clarifying and letting us know.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-09-30 17:16 ` Eric Dumazet
@ 2008-09-30 18:42 ` Neil Horman
2008-10-02 7:16 ` Evgeniy Polyakov
2008-10-01 18:08 ` Neil Horman
2008-10-02 7:13 ` Evgeniy Polyakov
2 siblings, 1 reply; 64+ messages in thread
From: Neil Horman @ 2008-09-30 18:42 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, netdev, kuznet, pekkas, jmorris, yoshfuji, kaber,
Evgeniy Polyakov
On Tue, Sep 30, 2008 at 07:16:50PM +0200, Eric Dumazet wrote:
> David Miller wrote:
>> From: Eric Dumazet <dada1@cosmosbay.com>
>> Date: Tue, 30 Sep 2008 08:02:44 +0200
>>
>>> When a machine is targeted by a DDOS attack, about all slots of the
>>> hash table are fully loaded (ie chain length >= elasticity). We dont
>>> need to invalidate the cache, but find an equilibrium, with small
>>> adjustements.
>>
>> Sure, but it is possible to determine that some hash chains
>> are unevenly growing out of control compared to others,
>> and that is the algorithm that Neil is trying to discover.
>>
>>
>
> No problem, but my suggestion to use a separate threshold than elasticity
> was apparently not taken into consideration.
>
> I ran an experiment on a big stable machine with 2^19 rtcache slots,
> scanning all chains and found *many* of them having length > elasticity,
> maximum being 13. I am sure its allowed by statistics laws.
>
> (average chain length : 3.55)
>
> In order to avoid unecessary cache invalidation, we need some
> calculation from a statistics expert.
>
> Given rt_hash_size and elasticity (or rt_max_size), compute the "maximum
> reasonable" chain length, ie some X number where probability(chain_length
> < X) > 0.9999
>
I think what you're looking for here is a simple standard deviation, isn't it?
Compute the mean chain length, average the squared deviations of each chain
from that mean, and take the square root. Any individual chain longer than the
mean chain length + 1 standard deviation can be considered an 'outlier' and can
therefore trigger a rebuild of the table for that net namespace.
I fully realize that that's easier said than done, but does that seem about
right? If so, I can start working on trying to build something to accomplish
that.
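The mean-plus-one-standard-deviation cutoff can be modeled in userspace roughly as below (hypothetical helper names; note that in C, squaring must be written `x * x`, since `^` is the XOR operator):

```c
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Mean of the non-zero chain lengths plus one standard deviation,
 * mirroring the proposed rt_chain_length_max computation. */
int chain_length_max(const int *len, int nbuckets)
{
    long long total = 0, var = 0;
    int i, nz = 0, mean, sd;

    for (i = 0; i < nbuckets; i++)
        if (len[i]) {
            nz++;
            total += len[i];
        }
    if (!nz)
        return 0;
    mean = DIV_ROUND_UP(total, nz);
    for (i = 0; i < nbuckets; i++)
        if (len[i])
            var += (long long)(len[i] - mean) * (len[i] - mean);
    var = DIV_ROUND_UP(var, nz);
    /* smallest sd with sd * sd > var, i.e. a rounded-up square root */
    for (sd = 1; (long long)sd * sd <= var; sd++)
        ;
    return mean + sd;
}
```

For bucket lengths {0, 2, 4, 0, 6}: mean = 4, variance rounds up to 3, sd = 2, so the cutoff is 6.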
Regards
Neil
--
/****************************************************
* Neil Horman <nhorman@tuxdriver.com>
* Software Engineer, Red Hat
****************************************************/
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-09-30 17:16 ` Eric Dumazet
2008-09-30 18:42 ` Neil Horman
@ 2008-10-01 18:08 ` Neil Horman
2008-10-02 5:01 ` Bill Fink
2008-10-02 7:13 ` Evgeniy Polyakov
2 siblings, 1 reply; 64+ messages in thread
From: Neil Horman @ 2008-10-01 18:08 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, netdev, kuznet, pekkas, jmorris, yoshfuji, kaber,
Evgeniy Polyakov
Hey all-
Since Eric mentioned the use of statistical analysis to determine if
hash chains were growing improperly, I thought I would take a stab at such an
approach. I'm no statistics expert, but it would seem to me that simply
computing the standard deviation of all the non-zero chain lengths would give a
good guide point for determining whether we need to invalidate our hash table. I've
written the implementation below. I've not tested it yet (I'll be doing that
over the next few days), but I wanted to post it to get feedback from you
all on the subject. The highlights are:
1) We add a counter to the rt_hash_bucket structure to track the length of each
individual chain. I realize this is sub-optimal, as it potentially adds a lot of
memory to the hash table as a whole (4 bytes * number of hash buckets). I'm afraid
I've not come up with a better way to track that yet. We also track the total
number of route entries and the number of non-zero-length chains. Lastly, we
also maintain a global maximum chain length, which defines the longest chain we
will tolerate in the route table. This patch defines it as the mean chain
length plus one standard deviation.
2) The above new counters are updated as we add/delete entries from the hash
table
3) In rt_intern_hash, if we don't find an entry to delete while walking a given
chain that is over its elasticity, we check the chain length against the maximum
length we defined in 1. If it exceeds the max length, we recompute the
mean and standard deviation, and check the length again (see
rt_hash_sd_update). If the length is still longer than the new max length, we
execute an emergency hash rebuild.
Again, I've not started testing it, but if this works, we should be able to
eliminate the need for periodic cache rebuilds entirely, replacing them with
on-demand detection. I could see a use for adding a control to tweak the
number of standard deviations allowed from the mean, for environments
where route entries just happen to hash to the same bucket, but I figure, let
me get this working properly first.
I'll be doing basic testing over the next few days. Any thoughts & comments are
appreciated, as well as testing in more strenuous, high route entry/high
throughput environments.
Thanks & Regards
Neil
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
route.c | 102 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 100 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 6ee5354..8e59900 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -145,6 +145,7 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst);
static void ipv4_link_failure(struct sk_buff *skb);
static void ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu);
static int rt_garbage_collect(struct dst_ops *ops);
+static void rt_emergency_hash_rebuild(struct net *net);
static struct dst_ops ipv4_dst_ops = {
@@ -200,7 +201,14 @@ const __u8 ip_tos2prio[16] = {
struct rt_hash_bucket {
struct rtable *chain;
+ atomic_t chain_length;
};
+
+atomic_t rt_hash_total_count;
+atomic_t rt_hash_nz_count;
+
+static int rt_chain_length_max;
+
#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || \
defined(CONFIG_PROVE_LOCKING)
/*
@@ -601,6 +609,53 @@ static inline int ip_rt_proc_init(void)
}
#endif /* CONFIG_PROC_FS */
+static void rt_hash_sd_update(void)
+{
+ int temp, i;
+ unsigned long long sd;
+ int average = atomic_read(&rt_hash_total_count);
+ int nzcount = atomic_read(&rt_hash_nz_count);
+
+ average = DIV_ROUND_UP(average, nzcount);
+
+ sd = 0;
+ for (i = 0; i < (1 << rt_hash_log); i++) {
+ temp = atomic_read(&rt_hash_table[i].chain_length);
+ if (unlikely(temp))
+ sd += (temp - average) * (temp - average);
+ }
+
+ sd = DIV_ROUND_UP(sd, nzcount);
+
+ /*
+ * Note: 46340 is the largest int whose square fits in 32 bits
+ */
+ for (i = 1; i < 46340; i++)
+ if ((i * i) > sd)
+ break;
+
+ /*
+ * The maximum allowable length of a hash
+ * chain is the average plus one standard
+ * deviation
+ */
+ rt_chain_length_max = average + sd;
+}
+
+static inline void rt_hash_sd_dec(int hash)
+{
+ atomic_dec(&rt_hash_table[hash].chain_length);
+ if (atomic_dec_and_test(&rt_hash_total_count))
+ atomic_dec(&rt_hash_nz_count);
+}
+
+static inline void rt_hash_sd_inc(int hash)
+{
+ atomic_inc(&rt_hash_table[hash].chain_length);
+ if (atomic_inc_return(&rt_hash_total_count) == 1)
+ atomic_inc(&rt_hash_nz_count);
+}
+
static inline void rt_free(struct rtable *rt)
{
call_rcu_bh(&rt->u.dst.rcu_head, dst_rcu_free);
@@ -731,6 +786,7 @@ static void rt_do_flush(int process_context)
} else {
*prev = next;
rt_free(p);
+ rt_hash_sd_dec(i);
}
}
}
@@ -744,7 +800,9 @@ static void rt_do_flush(int process_context)
for (; rth != tail; rth = next) {
next = rth->u.dst.rt_next;
rt_free(rth);
+ rt_hash_sd_dec(i);
}
+
}
}
@@ -777,6 +835,7 @@ static void rt_check_expire(void)
if (rt_is_expired(rth)) {
*rthp = rth->u.dst.rt_next;
rt_free(rth);
+ rt_hash_sd_dec(i);
continue;
}
if (rth->u.dst.expires) {
@@ -795,6 +854,7 @@ static void rt_check_expire(void)
/* Cleanup aged off entries. */
*rthp = rth->u.dst.rt_next;
rt_free(rth);
+ rt_hash_sd_dec(i);
}
spin_unlock_bh(rt_hash_lock_addr(i));
}
@@ -927,6 +987,7 @@ static int rt_garbage_collect(struct dst_ops *ops)
}
*rthp = rth->u.dst.rt_next;
rt_free(rth);
+ rt_hash_sd_dec(i);
goal--;
}
spin_unlock_bh(rt_hash_lock_addr(k));
@@ -991,7 +1052,6 @@ static int rt_intern_hash(unsigned hash, struct rtable *rt, struct rtable **rp)
int attempts = !in_softirq();
restart:
- chain_length = 0;
min_score = ~(u32)0;
cand = NULL;
candp = NULL;
@@ -1004,6 +1064,7 @@ restart:
if (rt_is_expired(rth)) {
*rthp = rth->u.dst.rt_next;
rt_free(rth);
+ rt_hash_sd_dec(hash);
continue;
}
if (compare_keys(&rth->fl, &rt->fl) && compare_netns(rth, rt)) {
@@ -1040,11 +1101,11 @@ restart:
}
}
- chain_length++;
rthp = &rth->u.dst.rt_next;
}
+ chain_length = atomic_read(&rt_hash_table[hash].chain_length);
if (cand) {
/* ip_rt_gc_elasticity used to be average length of chain
* length, when exceeded gc becomes really aggressive.
@@ -1055,6 +1116,23 @@ restart:
if (chain_length > ip_rt_gc_elasticity) {
*candp = cand->u.dst.rt_next;
rt_free(cand);
+ rt_hash_sd_dec(hash);
+ }
+ } else {
+ if (chain_length && (chain_length > rt_chain_length_max)) {
+ /*
+ * This chain is longer than our computed
+ * max allowable length, lets recompute to
+ * be sure its still up to date
+ */
+ rt_hash_sd_update();
+ if (chain_length > rt_chain_length_max) {
+ /*
+ * We're still too long; do an emergency rebuild
+ */
+ rt_emergency_hash_rebuild(dev_net(rt->u.dst.dev));
+ }
+
}
}
@@ -1106,6 +1184,7 @@ restart:
#endif
rt_hash_table[hash].chain = rt;
spin_unlock_bh(rt_hash_lock_addr(hash));
+ rt_hash_sd_inc(hash);
*rp = rt;
return 0;
}
@@ -1180,6 +1259,7 @@ static void rt_del(unsigned hash, struct rtable *rt)
if (aux == rt || rt_is_expired(aux)) {
*rthp = aux->u.dst.rt_next;
rt_free(aux);
+ rt_hash_sd_dec(hash);
continue;
}
rthp = &aux->u.dst.rt_next;
@@ -2946,6 +3026,24 @@ static void rt_secret_reschedule(int old)
rtnl_unlock();
}
+static void rt_secret_rebuild_oneshot(struct net *net) {
+ del_timer_sync(&net->ipv4.rt_secret_timer);
+ rt_cache_invalidate(net);
+ if (ip_rt_secret_interval) {
+ net->ipv4.rt_secret_timer.expires += ip_rt_secret_interval;
+ add_timer(&net->ipv4.rt_secret_timer);
+ }
+}
+
+static void rt_emergency_hash_rebuild(struct net *net) {
+ if (net_ratelimit()) {
+ printk(KERN_WARNING "Route hash chain too long!\n");
+ printk(KERN_WARNING "Adjust your secret_interval!\n");
+ }
+
+ rt_secret_rebuild_oneshot(net);
+}
+
static int ipv4_sysctl_rt_secret_interval(ctl_table *ctl, int write,
struct file *filp,
void __user *buffer, size_t *lenp,
--
/****************************************************
* Neil Horman <nhorman@tuxdriver.com>
* Software Engineer, Red Hat
****************************************************/
^ permalink raw reply related [flat|nested] 64+ messages in thread
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-10-01 18:08 ` Neil Horman
@ 2008-10-02 5:01 ` Bill Fink
2008-10-02 6:56 ` Eric Dumazet
0 siblings, 1 reply; 64+ messages in thread
From: Bill Fink @ 2008-10-02 5:01 UTC (permalink / raw)
To: Neil Horman
Cc: Eric Dumazet, David Miller, netdev, kuznet, pekkas, jmorris,
yoshfuji, kaber, Evgeniy Polyakov
On Wed, 1 Oct 2008, Neil Horman wrote:
> Hey all-
> Since Eric mentioned the use of statistical analysis to determine if
> hash chains were growing improperly, I thought I would take a stab at such an
> approach. I'm no statistics expert, but it would seem to me that simply
> computing the standard deviation of all the non-zero chain lengths would give a
> good guide point to determine if we needed to invalidate our hash table. I've
> written the below implementation. I've not tested it (I'll be doing that here
> shortly for the next few days), but I wanted to post it to get feedback from you
> all on the subject. The highlights are:
>
> 1) We add a counter to rt_hash_bucket structure to track the length of each
> individual chain. I realize this is sub-optimal, as it adds potentially lots of
> memory to hash table as a whole (4 bytes * number of hash buckets). I'm afraid
> I've not come up with a better way to track that yet. We also track the total
> number of route entries and the number of non-zero length chains. Lastly we
> also maintain a global maximum chain length which defines the longest chain we
> will tolerate in the route table. This patch defines it as the mean chain
> length plus one standard deviation.
I believe the general rule of thumb for something like this is at
least two standard deviations. For a normal distribution, one standard
deviation covers about 68 % of the sample universe, while two standard
deviations covers about 95 % (three standard deviations covers 99.73 %).
See the Wikipedia entry:
http://en.wikipedia.org/wiki/Standard_deviation
-Bill
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-10-02 5:01 ` Bill Fink
@ 2008-10-02 6:56 ` Eric Dumazet
2008-10-02 8:15 ` Eric Dumazet
0 siblings, 1 reply; 64+ messages in thread
From: Eric Dumazet @ 2008-10-02 6:56 UTC (permalink / raw)
To: Bill Fink
Cc: Neil Horman, David Miller, netdev, kuznet, pekkas, jmorris,
yoshfuji, kaber, Evgeniy Polyakov
Bill Fink a écrit :
> On Wed, 1 Oct 2008, Neil Horman wrote:
>
>> Hey all-
>> Since Eric mentioned the use of statistical analysis to determine if
>> hash chains were growing improperly, I thought I would take a stab at such an
>> approach. I'm no statistics expert, but it would seem to me that simply
>> computing the standard deviation of all the non-zero chain lengths would give a
>> good guide point to determine if we needed to invalidate our hash table. I've
>> written the below implementation. I've not tested it (I'll be doing that here
>> shortly for the next few days), but I wanted to post it to get feedback from you
>> all on the subject. The highlights are:
>>
>> 1) We add a counter to rt_hash_bucket structure to track the length of each
>> individual chain. I realize this is sub-optimal, as it adds potentially lots of
>> memory to hash table as a whole (4 bytes * number of hash buckets). I'm afraid
>> I've not come up with a better way to track that yet. We also track the total
>> number of route entries and the number of non-zero length chains. Lastly we
>> also maintain a global maximum chain length which defines the longest chain we
>> will tolerate in the route table. This patch defines it as the mean chain
>> length plus one standard deviation.
>
> I believe the general rule of thumb for something like this is at
> least two standard deviations. For a normal distribution, one standard
> deviation covers about 68 % of the sample universe, while two standard
> deviations covers about 95 % (three standard deviations covers 99.73 %).
> See the Wikipedia entry:
>
> http://en.wikipedia.org/wiki/Standard_deviation
>
Thanks, Bill, for the pointer; this is the trick.
I believe we should target the 4σ (99.993666%) case.
But we don't need to actually compute the standard deviation at runtime, only find an (upper) approximation of it.
For elasticity=4 and 512*1024 samples (mean < 4), I guess 4σ can be approximated by 20 or something.
Thank you
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-09-30 17:16 ` Eric Dumazet
2008-09-30 18:42 ` Neil Horman
2008-10-01 18:08 ` Neil Horman
@ 2008-10-02 7:13 ` Evgeniy Polyakov
2 siblings, 0 replies; 64+ messages in thread
From: Evgeniy Polyakov @ 2008-10-02 7:13 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, nhorman, netdev, kuznet, pekkas, jmorris, yoshfuji,
kaber
Hi.
On Tue, Sep 30, 2008 at 07:16:50PM +0200, Eric Dumazet (dada1@cosmosbay.com) wrote:
> No problem, but my suggestion to use a separate threshold than elasticity
> was apparently not taken into consideration.
>
> I ran an experiment on a big stable machine with 2^19 rtcache slots,
> scanning all chains and found *many* of them having length > elasticity,
> maximum being 13. I am sure its allowed by statistics laws.
>
> (average chain length : 3.55)
>
> In order to avoid unnecessary cache invalidation, we need some
> calculation from a statistics expert.
>
> Given rt_hash_size and elasticity (or rt_max_size), compute the "maximum
> reasonable" chain length, ie some X number where probability(chain_length <
> X) > 0.9999
>
> (CCed Evgeniy Polyakov :) )
:)
If it is a hash table with fixed size, and the hash is produced as
some function result modulo the size of the table, then there will always
be fluctuations with longer chain lengths than 'usual', since, for example,
the hash function produces a result in a 2^32 field, while the hash table size
is only 2^20 or something smaller than the maximum hash result. When I
analyzed Jenkins' hash I mistakenly concluded that such spikes are a result
of the hash distribution itself and not of the final modulo operation.
Plus, it is always a Gaussian distribution, so there will be (although,
depending on parameters, the amount may be small enough) entries with longer
and shorter than average chains.
I tested list walking speed compared to rb-tree and trie on P4 machines
without prefetch, and lists with up to 10 elements are processed faster
than anything else, so if your average chain lengths are less than 10
elements, you are in a good condition.
Also I have a tricky algorithm to automatically scale an RCU-protected hash
table (with additional overhead at scale time, though I believe it
is not a problem), which just waits to be implemented in netchannels, so
the long chain length problem can be fixed, if the algorithm works. OK, I've
finished my advertisement rants :)
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-09-30 18:42 ` Neil Horman
@ 2008-10-02 7:16 ` Evgeniy Polyakov
2008-10-02 13:14 ` Neil Horman
0 siblings, 1 reply; 64+ messages in thread
From: Evgeniy Polyakov @ 2008-10-02 7:16 UTC (permalink / raw)
To: Neil Horman
Cc: Eric Dumazet, David Miller, netdev, kuznet, pekkas, jmorris,
yoshfuji, kaber
Hi.
On Tue, Sep 30, 2008 at 02:42:49PM -0400, Neil Horman (nhorman@tuxdriver.com) wrote:
> I think what you're looking for here is a simple standard deviation, isn't it?
> Compute the mean chain length, sum the squares of the deviations of each chain
> and take the square root. Any individual chain longer than the mean chain
> length + 1 standard deviation can be considered an 'outlier' and therefore
> trigger a rebuild of the table for that net namespace.
>
> I full well realize that that's easier said than done, but does that seem about
> right? If so, I can start working on trying to build something to accomplish
> that.
You are right, but in the kernel it may not be that simple to get a square
root... We can probably play with logarithms, though.
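One way to "play with logarithms" here (a standalone sketch, not from any posted patch; the function name is hypothetical): since log2(sqrt(x)) = log2(x)/2, shifting 1 left by half the integer log2 gives a cheap power-of-two estimate of sqrt(x), correct to within a factor of 2, with no floating point at all.

```c
#include <assert.h>

/* Power-of-two square-root estimate via integer log2:
 * returns 1 << (floor(log2(x)) / 2), which is within a
 * factor of 2 of sqrt(x) for any x >= 1. */
unsigned int sqrt_estimate(unsigned long x)
{
	unsigned int log2 = 0;

	while (x >> (log2 + 1))
		log2++;			/* floor(log2(x)) */
	return 1u << (log2 / 2);
}
```
For a chain-length variance this coarse bound may already be enough, since the threshold only needs an upper approximation of σ.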
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-10-02 6:56 ` Eric Dumazet
@ 2008-10-02 8:15 ` Eric Dumazet
2008-10-02 14:20 ` Eric Dumazet
2008-10-03 0:31 ` Neil Horman
0 siblings, 2 replies; 64+ messages in thread
From: Eric Dumazet @ 2008-10-02 8:15 UTC (permalink / raw)
To: Bill Fink
Cc: Neil Horman, David Miller, netdev, kuznet, pekkas, jmorris,
yoshfuji, kaber, Evgeniy Polyakov
Eric Dumazet a écrit :
> Bill Fink a écrit :
>> On Wed, 1 Oct 2008, Neil Horman wrote:
>>
>>> Hey all-
>>> Since Eric mentioned the use of statistical analysis to determine if
>>> hash chains were growing improperly, I thought I would take a stab at
>>> such an
>>> approach. I'm no statistics expert, but it would seem to me that simply
>>> computing the standard deviation of all the non-zero chain lengths
>>> would give a
>>> good guide point to determine if we needed to invalidate our hash
>>> table. I've
>>> written the below implementation. I've not tested it (I'll be doing
>>> that here
>>> shortly for the next few days), but I wanted to post it to get
>>> feedback from you
>>> all on the subject. The highlights are:
>>>
>>> 1) We add a counter to rt_hash_bucket structure to track the length
>>> of each
>>> individual chain. I realize this is sub-optimal, as it adds
>>> potentially lots of
>>> memory to hash table as a whole (4 bytes * number of hash buckets).
>>> I'm afraid
>>> I've not come up with a better way to track that yet. We also track
>>> the total
>>> number of route entries and the number of non-zero length chains.
>>> Lastly we
>>> also maintain a global maximum chain length which defines the longest
>>> chain we
>>> will tolerate in the route table. This patch defines it as the mean
>>> chain
>>> length plus one standard deviation.
>>
>> I believe the general rule of thumb for something like this is at
>> least two standard deviations. For a normal distribution, one standard
>> deviation covers about 68 % of the sample universe, while two standard
>> deviations covers about 95 % (three standard deviations covers 99.73 %).
>> See the Wikipedia entry:
>>
>> http://en.wikipedia.org/wiki/Standard_deviation
>>
>
> Thanks Bill for the pointer, this is the trick.
>
> I believe we should target "4σ 99.993666% " case.
>
> But we dont need to really compute Standard deviation at runtime, only
> find an (upper) approximation of it.
>
> For elasticity=4 and 512*1024 samples (mean < 4), I guess 4σ can be
> approximated by 20 or something.
>
A good estimate of the standard deviation could be computed for free
in rt_check_expire() (each run scans 20% of the hash table with the
default timeout & ip_rt_gc_interval tunables).
We could update the 4σ estimate in this function, every minute (ip_rt_gc_interval).
At softirq time we can then detect that a particular hash chain is
longer than the 4σ estimate and trigger an appropriate action.
This action is to: flush the table, and while we do that, expand the hash table
if its current size is under ip_rt_max_size/elasticity...
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-10-02 7:16 ` Evgeniy Polyakov
@ 2008-10-02 13:14 ` Neil Horman
0 siblings, 0 replies; 64+ messages in thread
From: Neil Horman @ 2008-10-02 13:14 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Eric Dumazet, David Miller, netdev, kuznet, pekkas, jmorris,
yoshfuji, kaber
On Thu, Oct 02, 2008 at 11:16:31AM +0400, Evgeniy Polyakov wrote:
> Hi.
>
> On Tue, Sep 30, 2008 at 02:42:49PM -0400, Neil Horman (nhorman@tuxdriver.com) wrote:
> > I think what you're looking for here is a simple standard deviation, isn't it?
> > Compute the mean chain legnth, sum the squares of the deviations of each chain
> > and take the square root. Any individual chain longer than the mean chain
> > length + 1 standard deviation can be considered an 'outlier' and therefore
> > trigger a rebuild of the table for that net namespace.
> >
> > I full well realize that thats easier said than done, but does that seem about
> > right? If so, I can start working on trying to build something to accomplish
> > that.
>
> You are right, but in kernel it may not that simple to get a square
> root... We can probably play with logarithms though.
>
In truth, I don't need a square root operation (at least I don't think I do).
I'm able to compute the variance rather easily with a walk of the hash table;
then I approximate the square root with a for loop in which I square the
iterator and compare it to the variance. The iterator whose square is
closest to the variance is taken as my standard deviation. As mentioned
previously, 4*sd should cover almost all of my sample universe, identifying any
remaining outliers as the trigger for a rebuild.
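The iterate-and-square approximation described above can be sketched in plain C (a standalone sketch; `isqrt_approx` is a hypothetical name, not from the patch):

```c
#include <assert.h>

/* Smallest i such that i*i >= variance, i.e. ceil(sqrt(variance)):
 * a safe upper approximation of the standard deviation needing no
 * floating point and no sqrt(). */
unsigned int isqrt_approx(unsigned long long variance)
{
	unsigned int i;

	/* 65535 is the largest value whose square fits in 32 bits;
	 * chain-length variances are far smaller in practice. */
	for (i = 0; i < 65536; i++)
		if ((unsigned long long)i * i >= variance)
			break;
	return i;
}
```
Rounding up rather than taking the closest square errs on the side of a slightly larger threshold, which only makes rebuilds rarer.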
Neil
--
/****************************************************
* Neil Horman <nhorman@tuxdriver.com>
* Software Engineer, Red Hat
****************************************************/
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-10-02 8:15 ` Eric Dumazet
@ 2008-10-02 14:20 ` Eric Dumazet
2008-10-03 0:31 ` Neil Horman
1 sibling, 0 replies; 64+ messages in thread
From: Eric Dumazet @ 2008-10-02 14:20 UTC (permalink / raw)
To: Bill Fink
Cc: Neil Horman, David Miller, netdev, kuznet, pekkas, jmorris,
yoshfuji, kaber, Evgeniy Polyakov
Eric Dumazet a écrit :
> Eric Dumazet a écrit :
>> Bill Fink a écrit :
>>> I believe the general rule of thumb for something like this is at
>>> least two standard deviations. For a normal distribution, one standard
>>> deviation covers about 68 % of the sample universe, while two standard
>>> deviations covers about 95 % (three standard deviations covers 99.73 %).
>>> See the Wikipedia entry:
>>>
>>> http://en.wikipedia.org/wiki/Standard_deviation
>>>
>>
>> Thanks Bill for the pointer, this is the trick.
>>
>> I believe we should target "4σ 99.993666% " case.
>>
>> But we dont need to really compute Standard deviation at runtime, only
>> find an (upper) approximation of it.
>>
>> For elasticity=4 and 512*1024 samples (mean < 4), I guess 4σ can be
>> approximated by 20 or something.
>>
>
> Good estimation of Standard Deviation could be computed for free
> in rt_check_expire(). (each runs scans 20% of hash table with default
> tunables timeout & ip_rt_gc_interval)
>
> We could update 4σ estimation in this function, every minute
> (ip_rt_gc_interval)
>
> At softirq time we then can detect a particular hash chain is
> longer than 4σ estimation and trigger an appropriate action.
>
> This action is to : flush table, and while we do that, expand hash table
> if its current size is under ip_rt_max_size/elasticity...
>
I ran my little dirty program that reads /proc/kcore to explore the rt
hash table on a busy server again, and found:
hptr=0xffff81082ec00000 hsize=524288
total=2299242 dst entries
4306 chains of length 0
23807 chains of length 1
60119 chains of length 2
95112 chains of length 3
106364 chains of length 4
92307 chains of length 5
65743 chains of length 6
39710 chains of length 7
20997 chains of length 8
15775 chains of length 9
39 chains of length 10
7 chains of length 11
2 chains of length 12
avg = 4.38546
sd = 1.94614
X = avg + 4*sd = 12.17
I tried various elasticity settings and found little variation in X.
This is because a lot of entries are in use (refcnt>1) and cannot
be freed, regardless of elasticity.
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-10-02 8:15 ` Eric Dumazet
2008-10-02 14:20 ` Eric Dumazet
@ 2008-10-03 0:31 ` Neil Horman
2008-10-03 20:36 ` Neil Horman
1 sibling, 1 reply; 64+ messages in thread
From: Neil Horman @ 2008-10-03 0:31 UTC (permalink / raw)
To: Eric Dumazet
Cc: Bill Fink, David Miller, netdev, kuznet, pekkas, jmorris,
yoshfuji, kaber, Evgeniy Polyakov
On Thu, Oct 02, 2008 at 10:15:38AM +0200, Eric Dumazet wrote:
> Eric Dumazet a écrit :
>> Bill Fink a écrit :
>>> On Wed, 1 Oct 2008, Neil Horman wrote:
>>>
>>>> Hey all-
>>>> Since Eric mentioned the use of statistical analysis to determine if
>>>> hash chains were growing improperly, I thought I would take a stab
>>>> at such an
>>>> approach. I'm no statistics expert, but it would seem to me that simply
>>>> computing the standard deviation of all the non-zero chain lengths
>>>> would give a
>>>> good guide point to determine if we needed to invalidate our hash
>>>> table. I've
>>>> written the below implementation. I've not tested it (I'll be
>>>> doing that here
>>>> shortly for the next few days), but I wanted to post it to get
>>>> feedback from you
>>>> all on the subject. The highlights are:
>>>>
>>>> 1) We add a counter to rt_hash_bucket structure to track the length
>>>> of each
>>>> individual chain. I realize this is sub-optimal, as it adds
>>>> potentially lots of
>>>> memory to hash table as a whole (4 bytes * number of hash buckets).
>>>> I'm afraid
>>>> I've not come up with a better way to track that yet. We also
>>>> track the total
>>>> number of route entries and the number of non-zero length chains.
>>>> Lastly we
>>>> also maintain a global maximum chain length which defines the
>>>> longest chain we
>>>> will tolerate in the route table. This patch defines it as the
>>>> mean chain
>>>> length plus one standard deviation.
>>>
>>> I believe the general rule of thumb for something like this is at
>>> least two standard deviations. For a normal distribution, one standard
>>> deviation covers about 68 % of the sample universe, while two standard
>>> deviations covers about 95 % (three standard deviations covers 99.73 %).
>>> See the Wikipedia entry:
>>>
>>> http://en.wikipedia.org/wiki/Standard_deviation
>>>
>>
>> Thanks Bill for the pointer, this is the trick.
>>
>> I believe we should target "4σ 99.993666% " case.
>>
>> But we dont need to really compute Standard deviation at runtime, only
>> find an (upper) approximation of it.
>>
>> For elasticity=4 and 512*1024 samples (mean < 4), I guess 4σ can be
>> approximated by 20 or something.
>>
>
> Good estimation of Standard Deviation could be computed for free
> in rt_check_expire(). (each runs scans 20% of hash table with default
> tunables timeout & ip_rt_gc_interval)
>
> We could update 4σ estimation in this function, every minute (ip_rt_gc_interval)
>
> At softirq time we then can detect a particular hash chain is
> longer than 4σ estimation and trigger an appropriate action.
>
> This action is to : flush table, and while we do that, expand hash table
> if its current size is under ip_rt_max_size/elasticity...
>
That is a good idea. I'm still testing the last patch I posted, and it's going
pretty well (I did find a few bugs, so I'll be reposting when I'm done). Doing
the update on demand seems to be pretty quick, but I'll rewrite to gather
stats on how long it takes specifically, and then rewrite/compare against your
suggested implementation above. It seems like the above would save space in the
hash table, as I could eliminate the per-bucket chain_length counter.
On the down side, though, I don't know if a 1-minute garbage collection interval
would allow too much of a window to grow a chain in (i.e. we could check chain
lengths against a too-small standard deviation value and trigger a cache
rebuild unnecessarily). Just thinking out loud; I've not got any evidence to
support or contradict that.
Best
Neil
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-10-03 0:31 ` Neil Horman
@ 2008-10-03 20:36 ` Neil Horman
2008-10-06 10:49 ` Eric Dumazet
0 siblings, 1 reply; 64+ messages in thread
From: Neil Horman @ 2008-10-03 20:36 UTC (permalink / raw)
To: Eric Dumazet
Cc: Bill Fink, David Miller, netdev, kuznet, pekkas, jmorris,
yoshfuji, kaber, Evgeniy Polyakov
Hey all-
So, I've been doing some testing here with this patch, and am
comfortable that the sd estimation is working reasonably well. For a hash table
with an average chain length of 1, it computes the standard deviation to be 2,
which gives us a max chain length of 9 (4*sd + avg), and it manages to do that
in about 7 jiffies over about 524000 hash buckets. I'm reasonably pleased with
that speed, I think, and after thinking about it, I like this implementation
somewhat better, as it doesn't create a window in which chains can be
artificially overrun (until the next gc round), although I'm happy to hear
arguments against my implementation. Anyhow, here it is; comments and thoughts
welcome!
Thanks & Regards
Neil
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
route.c | 121 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 118 insertions(+), 3 deletions(-)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 6ee5354..4f8c5b5 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -145,6 +145,7 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst);
static void ipv4_link_failure(struct sk_buff *skb);
static void ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu);
static int rt_garbage_collect(struct dst_ops *ops);
+static void rt_emergency_hash_rebuild(struct net *net);
static struct dst_ops ipv4_dst_ops = {
@@ -200,7 +201,14 @@ const __u8 ip_tos2prio[16] = {
struct rt_hash_bucket {
struct rtable *chain;
+ atomic_t chain_length;
};
+
+atomic_t rt_hash_total_count;
+atomic_t rt_hash_nz_count;
+
+static int rt_chain_length_max;
+
#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || \
defined(CONFIG_PROVE_LOCKING)
/*
@@ -601,6 +609,68 @@ static inline int ip_rt_proc_init(void)
}
#endif /* CONFIG_PROC_FS */
+static void rt_hash_sd_update(void)
+{
+ int temp, i;
+ unsigned long long sd;
+ int average = atomic_read(&rt_hash_total_count);
+ int nzcount = atomic_read(&rt_hash_nz_count);
+
+ /*
+ * Don't divide by zero
+ */
+ if (!nzcount)
+ return;
+
+ average = DIV_ROUND_UP(average, nzcount);
+
+ sd = 0;
+ for (i = 0; i < (1 << rt_hash_log); i++) {
+ temp = atomic_read(&rt_hash_table[i].chain_length);
+ /*
+ * Don't count zero entries, as most of the table
+ * will likely be empty. We don't want to unfairly
+ * bias our average chain length down so far
+ */
+ if (unlikely(temp))
+ sd += (temp - average) * (temp - average);
+ }
+
+ sd = DIV_ROUND_UP(sd, nzcount);
+
+ /*
+ * Note: 46340 is the largest int whose square fits in 31 bits
+ */
+ for (i = 1; i < 46340; i++)
+ if ((i * i) > sd)
+ break;
+
+ /*
+ * The maximum allowable length of a hash
+ * chain is the average plus 4 standard
+ * deviations. That should cover about
+ * 99.9936% of our samples in a normal
+ * distribution. Anything outside that
+ * can be considered an outlier and cause
+ * for a rebuild
+ */
+ rt_chain_length_max = average + (i * 4);
+}
+
+static inline void rt_hash_sd_dec(int hash)
+{
+ if (atomic_dec_and_test(&rt_hash_table[hash].chain_length))
+ atomic_dec(&rt_hash_nz_count);
+ atomic_dec(&rt_hash_total_count);
+}
+
+static inline void rt_hash_sd_inc(int hash)
+{
+ if (atomic_inc_return(&rt_hash_table[hash].chain_length) == 1)
+ atomic_inc(&rt_hash_nz_count);
+ atomic_inc(&rt_hash_total_count);
+}
+
static inline void rt_free(struct rtable *rt)
{
call_rcu_bh(&rt->u.dst.rcu_head, dst_rcu_free);
@@ -731,6 +801,7 @@ static void rt_do_flush(int process_context)
} else {
*prev = next;
rt_free(p);
+ rt_hash_sd_dec(i);
}
}
}
@@ -744,7 +815,9 @@ static void rt_do_flush(int process_context)
for (; rth != tail; rth = next) {
next = rth->u.dst.rt_next;
rt_free(rth);
+ rt_hash_sd_dec(i);
}
+
}
}
@@ -777,6 +850,7 @@ static void rt_check_expire(void)
if (rt_is_expired(rth)) {
*rthp = rth->u.dst.rt_next;
rt_free(rth);
+ rt_hash_sd_dec(i);
continue;
}
if (rth->u.dst.expires) {
@@ -795,6 +869,7 @@ static void rt_check_expire(void)
/* Cleanup aged off entries. */
*rthp = rth->u.dst.rt_next;
rt_free(rth);
+ rt_hash_sd_dec(i);
}
spin_unlock_bh(rt_hash_lock_addr(i));
}
@@ -927,11 +1002,14 @@ static int rt_garbage_collect(struct dst_ops *ops)
}
*rthp = rth->u.dst.rt_next;
rt_free(rth);
+ rt_hash_sd_dec(i);
goal--;
}
spin_unlock_bh(rt_hash_lock_addr(k));
- if (goal <= 0)
+ if (goal <= 0) {
+ rt_hash_sd_update();
break;
+ }
}
rover = k;
@@ -991,7 +1069,6 @@ static int rt_intern_hash(unsigned hash, struct rtable *rt, struct rtable **rp)
int attempts = !in_softirq();
restart:
- chain_length = 0;
min_score = ~(u32)0;
cand = NULL;
candp = NULL;
@@ -1004,6 +1081,7 @@ restart:
if (rt_is_expired(rth)) {
*rthp = rth->u.dst.rt_next;
rt_free(rth);
+ rt_hash_sd_dec(hash);
continue;
}
if (compare_keys(&rth->fl, &rt->fl) && compare_netns(rth, rt)) {
@@ -1040,11 +1118,11 @@ restart:
}
}
- chain_length++;
rthp = &rth->u.dst.rt_next;
}
+ chain_length = atomic_read(&rt_hash_table[hash].chain_length);
if (cand) {
/* ip_rt_gc_elasticity used to be average length of chain
* length, when exceeded gc becomes really aggressive.
@@ -1055,6 +1133,23 @@ restart:
if (chain_length > ip_rt_gc_elasticity) {
*candp = cand->u.dst.rt_next;
rt_free(cand);
+ rt_hash_sd_dec(hash);
+ }
+ } else {
+ if (chain_length && (chain_length > rt_chain_length_max)) {
+ /*
+ * This chain is longer than our computed
+ * max allowable length; let's recompute to
+ * be sure it's still up to date
+ */
+ rt_hash_sd_update();
+ if (chain_length > rt_chain_length_max) {
+ /*
+ * We're still too long; do an emergency rebuild
+ */
+ rt_emergency_hash_rebuild(dev_net(rt->u.dst.dev));
+ }
+
}
}
@@ -1106,6 +1201,7 @@ restart:
#endif
rt_hash_table[hash].chain = rt;
spin_unlock_bh(rt_hash_lock_addr(hash));
+ rt_hash_sd_inc(hash);
*rp = rt;
return 0;
}
@@ -1180,6 +1276,7 @@ static void rt_del(unsigned hash, struct rtable *rt)
if (aux == rt || rt_is_expired(aux)) {
*rthp = aux->u.dst.rt_next;
rt_free(aux);
+ rt_hash_sd_dec(hash);
continue;
}
rthp = &aux->u.dst.rt_next;
@@ -2946,6 +3043,24 @@ static void rt_secret_reschedule(int old)
rtnl_unlock();
}
+static void rt_secret_rebuild_oneshot(struct net *net) {
+ del_timer_sync(&net->ipv4.rt_secret_timer);
+ rt_cache_invalidate(net);
+ if (ip_rt_secret_interval) {
+ net->ipv4.rt_secret_timer.expires += ip_rt_secret_interval;
+ add_timer(&net->ipv4.rt_secret_timer);
+ }
+}
+
+static void rt_emergency_hash_rebuild(struct net *net) {
+ if (net_ratelimit()) {
+ printk(KERN_WARNING "Route hash chain too long!\n");
+ printk(KERN_WARNING "Adjust your secret_interval!\n");
+ }
+
+ rt_secret_rebuild_oneshot(net);
+}
+
static int ipv4_sysctl_rt_secret_interval(ctl_table *ctl, int write,
struct file *filp,
void __user *buffer, size_t *lenp,
--
/****************************************************
* Neil Horman <nhorman@tuxdriver.com>
* Software Engineer, Red Hat
****************************************************/
^ permalink raw reply related [flat|nested] 64+ messages in thread
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-09-29 19:12 [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded Neil Horman
` (2 preceding siblings ...)
2008-09-30 14:17 ` Denis V. Lunev
@ 2008-10-05 3:17 ` Herbert Xu
2008-10-05 3:20 ` Herbert Xu
3 siblings, 1 reply; 64+ messages in thread
From: Herbert Xu @ 2008-10-05 3:17 UTC (permalink / raw)
To: Neil Horman
Cc: netdev, kuznet, davem, pekkas, jmorris, yoshfuji, kaber, nhorman
Neil Horman <nhorman@tuxdriver.com> wrote:
> We currently have the ability to disable our route cache secret interval
> rebuild timer (by setting it to zero), but if we do that its possible for an
> attacker (if they guess our route cache hash secret, to fill our system with
> routes that all hash to the same bucket, destroying our performance. This patch
This is completely bogus. We never allow any chain to grow beyond
the elasticity. So in the worst case we just bypass the route cache.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-10-05 3:17 ` Herbert Xu
@ 2008-10-05 3:20 ` Herbert Xu
2008-10-06 0:52 ` Neil Horman
0 siblings, 1 reply; 64+ messages in thread
From: Herbert Xu @ 2008-10-05 3:20 UTC (permalink / raw)
To: Neil Horman; +Cc: netdev, kuznet, davem, pekkas, jmorris, yoshfuji, kaber
On Sun, Oct 05, 2008 at 11:17:27AM +0800, Herbert Xu wrote:
> Neil Horman <nhorman@tuxdriver.com> wrote:
> > We currently have the ability to disable our route cache secret interval
> > rebuild timer (by setting it to zero), but if we do that its possible for an
> > attacker (if they guess our route cache hash secret, to fill our system with
> > routes that all hash to the same bucket, destroying our performance. This patch
>
> This is completely bogus. We never allow any chain to grow beyond
> the elasticity. So in the worst case we just bypass the route cache.
OK we don't actually enforce that as it stands, but perhaps we
should have a maximum chain length that is enforced.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-09-30 14:08 ` David Miller
2008-09-30 17:47 ` Eric Dumazet
@ 2008-10-05 3:26 ` Herbert Xu
2008-10-05 4:45 ` Andrew Dickinson
1 sibling, 1 reply; 64+ messages in thread
From: Herbert Xu @ 2008-10-05 3:26 UTC (permalink / raw)
To: David Miller
Cc: nhorman, netdev, kuznet, pekkas, jmorris, yoshfuji, kaber, whydna
David Miller <davem@davemloft.net> wrote:
>
> The idea is that we can by default not rebuild the secret
> at all.
Actually Andrew Dickson <whydna@whydna.net> came up with this idea
quite a while ago: Keep the rehash interval but do nothing until
some chain hits a specified length. This is quite similar to
what is being discussed here.
Andrew, could you post the patch please?
In addition to this, we should probably enforce that limit as
well by simply not adding the newly created entry or deleting
one forcibly.
Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 64+ messages in thread
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-10-05 3:26 ` Herbert Xu
@ 2008-10-05 4:45 ` Andrew Dickinson
2008-10-05 17:34 ` David Miller
0 siblings, 1 reply; 64+ messages in thread
From: Andrew Dickinson @ 2008-10-05 4:45 UTC (permalink / raw)
To: Herbert Xu
Cc: David Miller, nhorman, netdev, kuznet, pekkas, jmorris, yoshfuji,
kaber
Here's the patch that Herbert's referring to. The basic idea is that
we have a flag which indicates whether or not we need to invalidate
the route cache. If any chain exceeds gc_elasticity, we set the flag
and reschedule the timer. In the worst-case, we'll invalidate the
route cache once every secret_interval; in the best-case, we never
invalidate the cache.
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index a6ed838..82baf68 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -48,6 +48,7 @@ struct netns_ipv4 {
int sysctl_icmp_errors_use_inbound_ifaddr;
struct timer_list rt_secret_timer;
+ int rt_secret_flag;
atomic_t rt_genid;
};
#endif
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index e91bafe..83a1b43 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -837,13 +837,49 @@ void rt_cache_flush(struct net *net, int delay)
}
/*
- * We change rt_genid and let gc do the cleanup
+ * We set rt_secret_flag indicating that we can invalidate the cache if needed.
*/
static void rt_secret_rebuild(unsigned long __net)
{
struct net *net = (struct net *)__net;
- rt_cache_invalidate(net);
- mod_timer(&net->ipv4.rt_secret_timer, jiffies + ip_rt_secret_interval);
+ net->ipv4.rt_secret_flag = 1;
+}
+
+static void rt_secret_reschedule(int old)
+{
+ struct net *net;
+ int new = ip_rt_secret_interval;
+ int diff = new - old;
+
+ if (!diff)
+ return;
+
+ rtnl_lock();
+ for_each_net(net) {
+ int deleted = del_timer_sync(&net->ipv4.rt_secret_timer);
+
+ if (!new) {
+ net->ipv4.rt_secret_flag = 0;
+ continue;
+ }
+
+ if(net->ipv4.rt_secret_flag)
+ continue;
+
+ if (old && deleted) {
+ long time = net->ipv4.rt_secret_timer.expires - jiffies;
+
+ if (time <= 0 || (time += diff) <= 0)
+ time = 0;
+
+ net->ipv4.rt_secret_timer.expires = time;
+ } else
+ net->ipv4.rt_secret_timer.expires = new;
+
+ net->ipv4.rt_secret_timer.expires += jiffies;
+ add_timer(&net->ipv4.rt_secret_timer);
+ }
+ rtnl_unlock();
}
/*
@@ -1045,17 +1081,19 @@ restart:
rthp = &rth->u.dst.rt_next;
}
- if (cand) {
- /* ip_rt_gc_elasticity used to be average length of chain
- * length, when exceeded gc becomes really aggressive.
- *
- * The second limit is less certain. At the moment it allows
- * only 2 entries per bucket. We will see.
- */
- if (chain_length > ip_rt_gc_elasticity) {
+ if (chain_length > ip_rt_gc_elasticity) {
+ struct net *net = dev_net(rth->u.dst.dev);
+
+ if (cand) {
*candp = cand->u.dst.rt_next;
rt_free(cand);
}
+
+ if (net->ipv4.rt_secret_flag &&
+ xchg(&net->ipv4.rt_secret_flag, 0)) {
+ rt_cache_invalidate(net);
+ rt_secret_reschedule(0);
+ }
}
/* Try to bind route to arp only if it is output
@@ -2914,38 +2952,6 @@ static int
ipv4_sysctl_rtcache_flush_strategy(ctl_table *table,
return 0;
}
-static void rt_secret_reschedule(int old)
-{
- struct net *net;
- int new = ip_rt_secret_interval;
- int diff = new - old;
-
- if (!diff)
- return;
-
- rtnl_lock();
- for_each_net(net) {
- int deleted = del_timer_sync(&net->ipv4.rt_secret_timer);
-
- if (!new)
- continue;
-
- if (deleted) {
- long time = net->ipv4.rt_secret_timer.expires - jiffies;
-
- if (time <= 0 || (time += diff) <= 0)
- time = 0;
-
- net->ipv4.rt_secret_timer.expires = time;
- } else
- net->ipv4.rt_secret_timer.expires = new;
-
- net->ipv4.rt_secret_timer.expires += jiffies;
- add_timer(&net->ipv4.rt_secret_timer);
- }
- rtnl_unlock();
-}
-
static int ipv4_sysctl_rt_secret_interval(ctl_table *ctl, int write,
struct file *filp,
void __user *buffer, size_t *lenp,
@@ -3200,6 +3206,8 @@ static __net_init int
rt_secret_timer_init(struct net *net)
(int) ((num_physpages ^ (num_physpages>>8)) ^
(jiffies ^ (jiffies >> 7))));
+ net->ipv4.rt_secret_flag = 0;
+
net->ipv4.rt_secret_timer.function = rt_secret_rebuild;
net->ipv4.rt_secret_timer.data = (unsigned long)net;
init_timer_deferrable(&net->ipv4.rt_secret_timer);
On Sat, Oct 4, 2008 at 8:26 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> David Miller <davem@davemloft.net> wrote:
>>
>> The idea is that we can by default not rebuild the secret
>> at all.
>
> Actually Andrew Dickson <whydna@whydna.net> came up with this idea
> quite a while ago: Keep the rehash interval but do nothing until
> some chain hits a specified length. This is quite similar to
> what is being discussed here.
>
> Andrew, could you post the patch please?
>
> In addition to this, we should probably enforce that limit as
> well by simply not adding the newly created entry or deleting
> one forcibly.
>
> Thanks,
> --
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
>
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-10-05 4:45 ` Andrew Dickinson
@ 2008-10-05 17:34 ` David Miller
2008-10-05 18:06 ` Andrew Dickinson
2008-10-06 4:21 ` Herbert Xu
0 siblings, 2 replies; 64+ messages in thread
From: David Miller @ 2008-10-05 17:34 UTC (permalink / raw)
To: whydna; +Cc: herbert, nhorman, netdev, kuznet, pekkas, jmorris, yoshfuji,
kaber
From: "Andrew Dickinson" <whydna@whydna.net>
Date: Sat, 4 Oct 2008 21:45:27 -0700
> Here's the patch that Herbert's referring to. The basic idea is that
> we have a flag which indicates whether or not we need to invalidate
> the route cache. If any chain exceeds gc_elasticity, we set the flag
> and reschedule the timer. In the worst-case, we'll invalidate the
> route cache once every secret_interval; in the best-case, we never
> invalidate the cache.
This is a very interesting patch and idea, but...
Eric showed clearly that on a completely normal well loaded
system, the chain lengths exceed the elasticity all the time
and it's not like these are entries we can get rid of because
their refcounts are all > 1
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-10-05 17:34 ` David Miller
@ 2008-10-05 18:06 ` Andrew Dickinson
2008-10-06 4:21 ` Herbert Xu
1 sibling, 0 replies; 64+ messages in thread
From: Andrew Dickinson @ 2008-10-05 18:06 UTC (permalink / raw)
To: David Miller
Cc: herbert, nhorman, netdev, kuznet, pekkas, jmorris, yoshfuji,
kaber
I've got another patch that takes a different approach... Instead of
disabling the secret_interval timer or trying to heuristically guess
when we're under attack, we continue to invalidate the cache; we just
invalidate it with kid gloves instead of a sledgehammer.
Like we do today, we continue to update the genid every time the
secret_interval timer expires. Instead of simply creating a new value
(and thus invalidating the entire cache), we keep a short history of
genid values (I'm thinking on the order of 2-4 previous values). In
rt_intern_hash(), when we do the check to see if we already have an
existing hash entry, we'll check each of the previous genid versions
(hence the desire to keep the history short) before declaring it as
not there. If we do find the entry in the hash with an older genid
value, we'll re-bucket it into the correct location for the latest
genid.
Basically, we're allowing entries to continue to exist in the hash
after the route cache has been invalidated (they can still be pruned
by GC). Happy to send the patch along if you'd like, although I'm not
as confident that this approach is really desirable.
-A
On Sun, Oct 5, 2008 at 10:34 AM, David Miller <davem@davemloft.net> wrote:
> From: "Andrew Dickinson" <whydna@whydna.net>
> Date: Sat, 4 Oct 2008 21:45:27 -0700
>
>> Here's the patch that Herbert's referring to. The basic idea is that
>> we have a flag which indicates whether or not we need to invalidate
>> the route cache. If any chain exceeds gc_elasticity, we set the flag
>> and reschedule the timer. In the worst-case, we'll invalidate the
>> route cache once every secret_interval; in the best-case, we never
>> invalidate the cache.
>
> This is a very interesting patch and idea, but...
>
> Eric showed clearly that on a completely normal well loaded
> system, the chain lengths exceed the elasticity all the time
> and it's not like these are entries we can get rid of because
> their refcounts are all > 1
>
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-10-05 3:20 ` Herbert Xu
@ 2008-10-06 0:52 ` Neil Horman
0 siblings, 0 replies; 64+ messages in thread
From: Neil Horman @ 2008-10-06 0:52 UTC (permalink / raw)
To: Herbert Xu; +Cc: netdev, kuznet, davem, pekkas, jmorris, yoshfuji, kaber
On Sun, Oct 05, 2008 at 11:20:47AM +0800, Herbert Xu wrote:
> On Sun, Oct 05, 2008 at 11:17:27AM +0800, Herbert Xu wrote:
> > Neil Horman <nhorman@tuxdriver.com> wrote:
> > > We currently have the ability to disable our route cache secret interval
> > > rebuild timer (by setting it to zero), but if we do that its possible for an
> > > attacker (if they guess our route cache hash secret, to fill our system with
> > > routes that all hash to the same bucket, destroying our performance. This patch
> >
> > This is completely bogus. We never allow any chain to grow beyond
> > the elasticity. So in the worst case we just bypass the route cache.
>
> OK we don't actually enforce that as it stands, but perhaps we
> should have a maximum chain length that is enforced.
>
Thank you :). I was just about to respond with a note asking you for a
reference to your previous assertion. We definitely don't enforce that now.
I agree we should have a maximum chain length that we enforce. Unfortunately,
according to Eric, the gc_elasticity value can't be it, because even on nominal
systems we routinely have a limited number of chains that violate the
elasticity, and that should be expected. That's why my latest patch tries to
cover that by computing the standard deviation of the set of chains and allowing
chains within avg+4*SD to exist. With that allowance, we should be able to
remove the periodic rehash entirely.
Best
Neil
> Cheers,
> --
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
>
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-10-05 17:34 ` David Miller
2008-10-05 18:06 ` Andrew Dickinson
@ 2008-10-06 4:21 ` Herbert Xu
2008-10-06 10:50 ` Neil Horman
1 sibling, 1 reply; 64+ messages in thread
From: Herbert Xu @ 2008-10-06 4:21 UTC (permalink / raw)
To: David Miller
Cc: whydna, nhorman, netdev, kuznet, pekkas, jmorris, yoshfuji, kaber
On Sun, Oct 05, 2008 at 10:34:54AM -0700, David Miller wrote:
>
> Eric showed clearly that on a completely normal well loaded
> system, the chain lengths exceed the elasticity all the time
> and it's not like these are entries we can get rid of because
> their refcounts are all > 1
I think there are two orthogonal issues here.
1) The way we count the chain length is wrong. There are keys
which do not form part of the hash computation. Entries that
only differ by them will always end up in the same bucket.
We should count all entries that only differ by those keys as
a single entry for the purposes of detecting an attack.
FWIW we could even reorganise the storage inside a bucket such
that it is a 2-level list where the first level only contained
entries that differ by saddr/daddr.
2) What do we do when we get a long chain just after a rehash.
This is an indication that the attacker has more knowledge about
us than we expected. Continuing to rehash is probably not going
to help.
We need to decide whether we care about this scenario.
If yes, then we'll need to come up with a way to bypass the
route cache, or at least act as if it was bypassed.
If not, then we can just kill the rehash interval altogether.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-10-03 20:36 ` Neil Horman
@ 2008-10-06 10:49 ` Eric Dumazet
2008-10-06 13:14 ` Neil Horman
2008-10-06 20:54 ` Neil Horman
0 siblings, 2 replies; 64+ messages in thread
From: Eric Dumazet @ 2008-10-06 10:49 UTC (permalink / raw)
To: Neil Horman
Cc: Bill Fink, David Miller, netdev, kuznet, pekkas, jmorris,
yoshfuji, kaber, Evgeniy Polyakov
[-- Attachment #1: Type: text/plain, Size: 3369 bytes --]
Neil Horman a écrit :
> Hey all-
> So, I've been doing some testing here with this patch, and am
> comfortable that the sd estimation is working reasonably well. For a hash table
> with an average chain length of 1, it computes the stadard deviation to be 2,
> which gives us a max chain length of 9 (4*sd + avg), and it manages to do that
> in about 7 jiffies over about 524000 hash buckets. I'm reasonably pleased with
> that speed I think, and after thinking about it, I like this implementation
> somewhat better, as it doesn't create a window in which chains can be
> artifically overrun (until the nect gc round) (although I'm happy to hear
> arguments against my implementation). Anywho here it is, comments and thoughts
> welcome!
>
> Thanks & Regards
> Neil
>
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
>
>
> route.c | 121 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 118 insertions(+), 3 deletions(-)
>
>
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 6ee5354..4f8c5b5 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -145,6 +145,7 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst);
> static void ipv4_link_failure(struct sk_buff *skb);
> static void ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu);
> static int rt_garbage_collect(struct dst_ops *ops);
> +static void rt_emergency_hash_rebuild(struct net *net);
>
>
> static struct dst_ops ipv4_dst_ops = {
> @@ -200,7 +201,14 @@ const __u8 ip_tos2prio[16] = {
>
> struct rt_hash_bucket {
> struct rtable *chain;
> + atomic_t chain_length;
> };
> +
> +atomic_t rt_hash_total_count;
> +atomic_t rt_hash_nz_count;
> +
> +static int rt_chain_length_max;
> +
> #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || \
> defined(CONFIG_PROVE_LOCKING)
> /*
> @@ -601,6 +609,68 @@ static inline int ip_rt_proc_init(void)
> }
> #endif /* CONFIG_PROC_FS */
>
> +static void rt_hash_sd_update(void)
> +{
> + int temp, i;
> + unsigned long long sd;
> + int average = atomic_read(&rt_hash_total_count);
> + int nzcount = atomic_read(&rt_hash_nz_count);
> +
> + /*
> + * Don't divide by zero
> + */
> + if (!nzcount)
> + return;
> +
> + average = DIV_ROUND_UP(average, nzcount);
> +
> + sd = 0;
> + for (i = 0; i < (1 << rt_hash_log); i++) {
> + temp = atomic_read(&rt_hash_table[i].chain_length);
> + /*
> + * Don't count zero entries, as most of the table
> + * will likely be empty. We don't want to unfairly
> + * bias our average chain length down so far
> + */
Empty chains should be accounted for, or average and standard
deviation are not correct.
> + if (unlikely(temp))
> + sd += (temp-average)^2;
Out of curiosity, what do you expect to do here ?
(temp-average) XOR 2
or
(temp-average) * (temp-average)
Also, your computations use integer arithmetic.
If avg = 2.5 and sd = 1.9, avg+4*sd you'll find 6 instead of 10
Anyway, we wont add so many atomic operations and double
hash table size in order to be able to compute sd.
If we really want to be smart, we can have a pretty good
estimate of average and sd for free in rt_check_expire()
Something like this untested patch. (We should make sure
we dont overflow sum2 for example)
[-- Attachment #2: avg_sd.patch --]
[-- Type: text/plain, Size: 2353 bytes --]
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 6ee5354..85182d9 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -125,6 +125,7 @@ static int ip_rt_redirect_silence __read_mostly = ((HZ / 50) << (9 + 1));
static int ip_rt_error_cost __read_mostly = HZ;
static int ip_rt_error_burst __read_mostly = 5 * HZ;
static int ip_rt_gc_elasticity __read_mostly = 8;
+static int rt_chain_length_max __read_mostly = 32;
static int ip_rt_mtu_expires __read_mostly = 10 * 60 * HZ;
static int ip_rt_min_pmtu __read_mostly = 512 + 20 + 20;
static int ip_rt_min_advmss __read_mostly = 256;
@@ -748,11 +749,24 @@ static void rt_do_flush(int process_context)
}
}
+/*
+ * While freeing expired entries, we compute average chain length
+ * and standard deviation, using fixed-point arithmetic.
+ * This to have an estimation of rt_chain_length_max
+ * rt_chain_length_max = max(elasticity, AVG + 4*SD)
+ * We use 3 bits for fractional part, and 29 (or 61) for magnitude.
+ */
+
+#define FRACT_BITS 3
+#define ONE (1UL << FRACT_BITS)
+
static void rt_check_expire(void)
{
static unsigned int rover;
unsigned int i = rover, goal;
struct rtable *rth, **rthp;
+ unsigned long sum = 0, sum2 = 0;
+ unsigned long length, samples = 0;
u64 mult;
mult = ((u64)ip_rt_gc_interval) << rt_hash_log;
@@ -770,8 +784,10 @@ static void rt_check_expire(void)
if (need_resched())
cond_resched();
+ samples++;
if (*rthp == NULL)
continue;
+ length = 0;
spin_lock_bh(rt_hash_lock_addr(i));
while ((rth = *rthp) != NULL) {
if (rt_is_expired(rth)) {
@@ -784,11 +800,13 @@ static void rt_check_expire(void)
if (time_before_eq(jiffies, rth->u.dst.expires)) {
tmo >>= 1;
rthp = &rth->u.dst.rt_next;
+ length += ONE;
continue;
}
} else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout)) {
tmo >>= 1;
rthp = &rth->u.dst.rt_next;
+ length += ONE;
continue;
}
@@ -797,6 +815,15 @@ static void rt_check_expire(void)
rt_free(rth);
}
spin_unlock_bh(rt_hash_lock_addr(i));
+ sum += length;
+ sum2 += length*length;
+ }
+ if (samples) {
+ unsigned long avg = sum / samples;
+ unsigned long sd = int_sqrt(sum2 / samples - avg*avg);
+ rt_chain_length_max = max_t(unsigned long,
+ ip_rt_gc_elasticity,
+ (avg + 4*sd) >> FRACT_BITS);
}
rover = i;
}
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-10-06 4:21 ` Herbert Xu
@ 2008-10-06 10:50 ` Neil Horman
2008-10-06 11:02 ` Herbert Xu
0 siblings, 1 reply; 64+ messages in thread
From: Neil Horman @ 2008-10-06 10:50 UTC (permalink / raw)
To: Herbert Xu
Cc: David Miller, whydna, netdev, kuznet, pekkas, jmorris, yoshfuji,
kaber
On Mon, Oct 06, 2008 at 12:21:08PM +0800, Herbert Xu wrote:
> On Sun, Oct 05, 2008 at 10:34:54AM -0700, David Miller wrote:
> >
> > Eric showed clearly that on a completely normal well loaded
> > system, the chain lengths exceed the elasticity all the time
> > and it's not like these are entries we can get rid of because
> > their refcounts are all > 1
>
> I think there are two orthogonal issues here.
>
> 1) The way we count the chain length is wrong. There are keys
> which do not form part of the hash computation. Entries that
> only differ by them will always end up in the same bucket.
>
> We should count all entries that only differ by those keys as
> a single entry for the purposes of detecting an attack.
>
> FWIW we could even reorganise the storage inside a bucket such
> that it is a 2-level list where the first level only contained
> entries that differ by saddr/daddr.
>
I'm not sure I follow what you're saying here. I understand that some keys will
wind up hashing to the same bucket, but from what I see, a change to the saddr
and daddr parameters to rt_hash will change what bucket you hash to. What am
I missing?
> 2) What do we do when we get a long chain just after a rehash.
>
> This is an indication that the attacker has more knowledge about
> us than we expected. Continuing to rehash is probably no going
> to help.
>
Seems like it might be ambiguous to me. Perhaps we just got a series of
collisions in the first few entries after a rebuild? I don't know, I'm just
playing devil's advocate.
> We need to decide whether we care about this scenario.
>
I expect we should.
> If yes, then we'll need to come up with a way to bypass the
> route cache, or at least act as if it was bypassed.
>
Why don't we just add a count of the number of times we call
rt_emergency_hash_rebuild? If we cross a threshold on that count (or perhaps a
rate determined by jiffies since the last emergency rebuild), we can set a flag
to always return a failed lookup from the cache, so as to force routing into
the slow path.
Does that seem reasonable to you?
Best
Neil
--
/****************************************************
* Neil Horman <nhorman@tuxdriver.com>
* Software Engineer, Red Hat
****************************************************/
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-10-06 10:50 ` Neil Horman
@ 2008-10-06 11:02 ` Herbert Xu
2008-10-06 12:43 ` Neil Horman
0 siblings, 1 reply; 64+ messages in thread
From: Herbert Xu @ 2008-10-06 11:02 UTC (permalink / raw)
To: Neil Horman
Cc: David Miller, whydna, netdev, kuznet, pekkas, jmorris, yoshfuji,
kaber
On Mon, Oct 06, 2008 at 06:50:22AM -0400, Neil Horman wrote:
>
> > 1) The way we count the chain length is wrong. There are keys
> > which do not form part of the hash computation. Entries that
> > only differ by them will always end up in the same bucket.
>
> I'm not sure I follow what your saying here. I understand that some keys will
> wind up hashing to the same bucket, but from what I see a change to the saddr
> and daddr parameters to rt_hash, will change what bucket you hash too. What am
> I missing?
We've got keys other than saddr/daddr, for example, all route
entries that differ only by TOS hash to the same bucket. There
are quite a few possible TOS values.
Have a look at "ip r l c" (ip route list cache).
> > 2) What do we do when we get a long chain just after a rehash.
> >
> > This is an indication that the attacker has more knowledge about
> > us than we expected. Continuing to rehash is probably no going
> > to help.
>
> Seems like it might be ambiguous to me. perhaps we just got a series of
> collisions in the firs few entries after a rebuild? I dont know, Im just
> playing devils advocate.
Have you computed the probability of that? If the limit is sensible
this should be extremely unlikely. Note that I'm also assuming
that you've got 1) solved as otherwise all bets are off.
Assuming that we are counting it correctly, and that the length
limit is set to a sensible value, then the probability of the
attacker reaching that limit soon after a rehash is very small.
Therefore we can deduce that if it does happen then chances are
that our RNG has been compromised and as such rehashing again
will not help.
> Why don't we just add a count to the number of times we call
> rt_emergency_hash_rebuild? If we cross a threshold on that count (or perhaps a
> rate determined by jiffies since the last emergency rebuild), we can set a flag
> to not always return a failed lookup in the cache, so as to force routing into
> the slow path.
Really, if we're hitting this when nobody is attacking us then
something is screwed up, notably 1) as it stands.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-10-06 11:02 ` Herbert Xu
@ 2008-10-06 12:43 ` Neil Horman
0 siblings, 0 replies; 64+ messages in thread
From: Neil Horman @ 2008-10-06 12:43 UTC (permalink / raw)
To: Herbert Xu
Cc: David Miller, whydna, netdev, kuznet, pekkas, jmorris, yoshfuji,
kaber
On Mon, Oct 06, 2008 at 07:02:49PM +0800, Herbert Xu wrote:
> On Mon, Oct 06, 2008 at 06:50:22AM -0400, Neil Horman wrote:
> >
> > > 1) The way we count the chain length is wrong. There are keys
> > > which do not form part of the hash computation. Entries that
> > > only differ by them will always end up in the same bucket.
> >
> > I'm not sure I follow what your saying here. I understand that some keys will
> > wind up hashing to the same bucket, but from what I see a change to the saddr
> > and daddr parameters to rt_hash, will change what bucket you hash too. What am
> > I missing?
>
> We've got keys other than saddr/daddr, for example, all route
> entries that differ only by TOS hash to the same bucket. There
> are quite a few possible TOS values.
>
> Have a look at ip r l c.
Ok, that makes a bit more sense. Thank you. I agree that in that case we
probably do need to bias our chain length accounting such that entries that
differ only by TOS or other fields that do not affect the hash value only get
counted once.
>
> > > 2) What do we do when we get a long chain just after a rehash.
> > >
> > > This is an indication that the attacker has more knowledge about
> > > us than we expected. Continuing to rehash is probably no going
> > > to help.
> >
> > Seems like it might be ambiguous to me. perhaps we just got a series of
> > collisions in the firs few entries after a rebuild? I dont know, Im just
> > playing devils advocate.
>
> Have you computed the probability of that? If the limit is sensible
> this should be extremely unlikely. Note that I'm also assuming
> that you've got 1) solved as otherwise all bets are off.
>
> Assuming that we are counting it correctly, and that the length
> limit is set to a sensible value, then the probability of the
> attacker reaching that limit soon after a rehash is very small.
>
> Therefore we can deduce that if it does happen then chances are
> that our RNG has been compromised and as such rehashing again
> will not help.
>
Like I said, I'm just playing devil's advocate on this. Clearly the hash
function should be such that the probability that an early data set hashes to
the same chain is very low.
> > Why don't we just add a count to the number of times we call
> > rt_emergency_hash_rebuild? If we cross a threshold on that count (or perhaps a
> > rate determined by jiffies since the last emergency rebuild), we can set a flag
> > to not always return a failed lookup in the cache, so as to force routing into
> > the slow path.
>
> Really, if we're hitting this when nobody is attacking us then
> something is screwed up, notably 1) as it stands.
>
That's rather the point though. If we're not being attacked, then we'll never
hit this (ideally).
I'll post a new patch when I figure out how best to handle entries which differ
by a field irrelevant to the hash function.
Neil
> Cheers,
> --
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
>
--
/****************************************************
* Neil Horman <nhorman@tuxdriver.com>
* Software Engineer, Red Hat
****************************************************/
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-10-06 10:49 ` Eric Dumazet
@ 2008-10-06 13:14 ` Neil Horman
2008-10-06 20:54 ` Neil Horman
1 sibling, 0 replies; 64+ messages in thread
From: Neil Horman @ 2008-10-06 13:14 UTC (permalink / raw)
To: Eric Dumazet
Cc: Bill Fink, David Miller, netdev, kuznet, pekkas, jmorris,
yoshfuji, kaber, Evgeniy Polyakov
On Mon, Oct 06, 2008 at 12:49:33PM +0200, Eric Dumazet wrote:
>
> Empty chains should be accounted for, or average and standard
> deviation are not correct.
Fair enough, although for such a large hash table I would assume that this
would unfairly bias our average toward zero; then again, as I think about it,
the standard deviation would make up for that.
>
>> + if (unlikely(temp))
>> + sd += (temp-average)^2;
>
> Out of curiosity, what do you expect to do here ?
>
Doh! That shows off how stupid I can be, apparently. I was doing too much at
once, and was thinking exponentiation there rather than bitwise XOR. My bad.
> (temp-average) XOR 2
> or (temp-average) * (temp-average)
>
> Also, your computations use integer arithmetic.
>
> If avg = 2.5 and sd = 1.9, avg+4*sd you'll find 6 instead of 10
>
Yes, but I think we're going to have to tolerate some error, since we're using
integer arithmetic here.
> Anyway, we wont add so many atomic operations and double
> hash table size in order to be able to compute sd.
>
That I have to agree with. The growth in size, at the very least, doesn't look
acceptable.
> If we really want to be smart, we can have a pretty good
> estimate of average and sd for free in rt_check_expire()
>
> Something like this untested patch. (We should make sure
> we dont overflow sum2 for example)
>
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 6ee5354..85182d9 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -125,6 +125,7 @@ static int ip_rt_redirect_silence __read_mostly = ((HZ / 50) << (9 + 1));
> static int ip_rt_error_cost __read_mostly = HZ;
> static int ip_rt_error_burst __read_mostly = 5 * HZ;
> static int ip_rt_gc_elasticity __read_mostly = 8;
> +static int rt_chain_length_max __read_mostly = 32;
> static int ip_rt_mtu_expires __read_mostly = 10 * 60 * HZ;
> static int ip_rt_min_pmtu __read_mostly = 512 + 20 + 20;
> static int ip_rt_min_advmss __read_mostly = 256;
> @@ -748,11 +749,24 @@ static void rt_do_flush(int process_context)
> }
> }
>
> +/*
> + * While freeing expired entries, we compute average chain length
> + * and standard deviation, using fixed-point arithmetic.
> + * This to have an estimation of rt_chain_length_max
> + * rt_chain_length_max = max(elasticity, AVG + 4*SD)
> + * We use 3 bits for fractional part, and 29 (or 61) for magnitude.
> + */
> +
> +#define FRACT_BITS 3
> +#define ONE (1UL << FRACT_BITS)
> +
> static void rt_check_expire(void)
> {
> static unsigned int rover;
> unsigned int i = rover, goal;
> struct rtable *rth, **rthp;
> + unsigned long sum = 0, sum2 = 0;
> + unsigned long length, samples = 0;
> u64 mult;
>
> mult = ((u64)ip_rt_gc_interval) << rt_hash_log;
> @@ -770,8 +784,10 @@ static void rt_check_expire(void)
> if (need_resched())
> cond_resched();
>
> + samples++;
> if (*rthp == NULL)
> continue;
> + length = 0;
> spin_lock_bh(rt_hash_lock_addr(i));
> while ((rth = *rthp) != NULL) {
> if (rt_is_expired(rth)) {
> @@ -784,11 +800,13 @@ static void rt_check_expire(void)
> if (time_before_eq(jiffies, rth->u.dst.expires)) {
> tmo >>= 1;
> rthp = &rth->u.dst.rt_next;
> + length += ONE;
> continue;
> }
> } else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout)) {
> tmo >>= 1;
> rthp = &rth->u.dst.rt_next;
> + length += ONE;
> continue;
> }
>
> @@ -797,6 +815,15 @@ static void rt_check_expire(void)
> rt_free(rth);
> }
> spin_unlock_bh(rt_hash_lock_addr(i));
> + sum += length;
> + sum2 += length*length;
> + }
> + if (samples) {
> + unsigned long avg = sum / samples;
> + unsigned long sd = int_sqrt(sum2 / samples - avg*avg);
> + rt_chain_length_max = max_t(unsigned long,
> + ip_rt_gc_elasticity,
> + (avg + 4*sd) >> FRACT_BITS);
Oh! That's pretty nifty; I hadn't considered breaking the integer up like that
to avoid round-off error.
Although, like computing sd during garbage collection, this method leaves a hole
during which the addition of several entries to one chain may produce a false
positive before the next run of rt_check_expire. Or is the likelihood of that
low enough that it's not particularly relevant?
Regards
Neil
--
/****************************************************
* Neil Horman <nhorman@tuxdriver.com>
* Software Engineer, Red Hat
****************************************************/
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-10-06 10:49 ` Eric Dumazet
2008-10-06 13:14 ` Neil Horman
@ 2008-10-06 20:54 ` Neil Horman
2008-10-06 21:21 ` Eric Dumazet
1 sibling, 1 reply; 64+ messages in thread
From: Neil Horman @ 2008-10-06 20:54 UTC (permalink / raw)
To: Eric Dumazet
Cc: Bill Fink, David Miller, netdev, kuznet, pekkas, jmorris,
yoshfuji, kaber, Evgeniy Polyakov
On Mon, Oct 06, 2008 at 12:49:33PM +0200, Eric Dumazet wrote:
> Neil Horman wrote:
>> Hey all-
>> So, I've been doing some testing here with this patch, and am
>> comfortable that the sd estimation is working reasonably well. For a hash table
>> with an average chain length of 1, it computes the stadard deviation to be 2,
>> which gives us a max chain length of 9 (4*sd + avg), and it manages to do that
>> in about 7 jiffies over about 524000 hash buckets. I'm reasonably pleased with
>> that speed I think, and after thinking about it, I like this implementation
>> somewhat better, as it doesn't create a window in which chains can be
>> artifically overrun (until the nect gc round) (although I'm happy to hear
>> arguments against my implementation). Anywho here it is, comments and thoughts
>> welcome!
>>
>> Thanks & Regards
>> Neil
>>
>> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
>>
>>
>> route.c | 121 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 118 insertions(+), 3 deletions(-)
>>
>>
>> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
>> index 6ee5354..4f8c5b5 100644
>> --- a/net/ipv4/route.c
>> +++ b/net/ipv4/route.c
>> @@ -145,6 +145,7 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst);
>> static void ipv4_link_failure(struct sk_buff *skb);
>> static void ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu);
>> static int rt_garbage_collect(struct dst_ops *ops);
>> +static void rt_emergency_hash_rebuild(struct net *net);
>> static struct dst_ops ipv4_dst_ops = {
>> @@ -200,7 +201,14 @@ const __u8 ip_tos2prio[16] = {
>> struct rt_hash_bucket {
>> struct rtable *chain;
>> + atomic_t chain_length;
>> };
>> +
>> +atomic_t rt_hash_total_count;
>> +atomic_t rt_hash_nz_count;
>> +
>> +static int rt_chain_length_max;
>> +
>> #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || \
>> defined(CONFIG_PROVE_LOCKING)
>> /*
>> @@ -601,6 +609,68 @@ static inline int ip_rt_proc_init(void)
>> }
>> #endif /* CONFIG_PROC_FS */
>> +static void rt_hash_sd_update(void)
>> +{
>> + int temp, i;
>> + unsigned long long sd;
>> + int average = atomic_read(&rt_hash_total_count);
>> + int nzcount = atomic_read(&rt_hash_nz_count);
>> +
>> + /*
>> + * Don't divide by zero
>> + */
>> + if (!nzcount)
>> + return;
>> +
>> + average = DIV_ROUND_UP(average, nzcount);
>> +
>> + sd = 0;
>> + for (i = 0; i < (1 << rt_hash_log); i++) {
>> + temp = atomic_read(&rt_hash_table[i].chain_length);
>> + /*
>> + * Don't count zero entries, as most of the table
>> + * will likely be empty. We don't want to unfairly
>> + * bias our average chain length down so far
>> + */
>
> Empty chains should be accounted for, or average and standard
> deviation are not correct.
>
>> + if (unlikely(temp))
>> + sd += (temp-average)^2;
>
> Out of curiosity, what do you expect to do here ?
>
> (temp-average) XOR 2
> or (temp-average) * (temp-average)
>
> Also, your computations use integer arithmetic.
>
> If avg = 2.5 and sd = 1.9, avg+4*sd you'll find 6 instead of 10
>
> Anyway, we wont add so many atomic operations and double
> hash table size in order to be able to compute sd.
>
> If we really want to be smart, we can have a pretty good
> estimate of average and sd for free in rt_check_expire()
>
> Something like this untested patch. (We should make sure
> we dont overflow sum2 for example)
>
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 6ee5354..85182d9 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -125,6 +125,7 @@ static int ip_rt_redirect_silence __read_mostly = ((HZ / 50) << (9 + 1));
> static int ip_rt_error_cost __read_mostly = HZ;
> static int ip_rt_error_burst __read_mostly = 5 * HZ;
> static int ip_rt_gc_elasticity __read_mostly = 8;
> +static int rt_chain_length_max __read_mostly = 32;
> static int ip_rt_mtu_expires __read_mostly = 10 * 60 * HZ;
> static int ip_rt_min_pmtu __read_mostly = 512 + 20 + 20;
> static int ip_rt_min_advmss __read_mostly = 256;
> @@ -748,11 +749,24 @@ static void rt_do_flush(int process_context)
> }
> }
>
> +/*
> + * While freeing expired entries, we compute average chain length
> + * and standard deviation, using fixed-point arithmetic.
> + * This to have an estimation of rt_chain_length_max
> + * rt_chain_length_max = max(elasticity, AVG + 4*SD)
> + * We use 3 bits for frational part, and 29 (or 61) for magnitude.
> + */
> +
> +#define FRACT_BITS 3
> +#define ONE (1UL << FRACT_BITS)
> +
> static void rt_check_expire(void)
> {
> static unsigned int rover;
> unsigned int i = rover, goal;
> struct rtable *rth, **rthp;
> + unsigned long sum = 0, sum2 = 0;
> + unsigned long length, samples = 0;
> u64 mult;
>
> mult = ((u64)ip_rt_gc_interval) << rt_hash_log;
> @@ -770,8 +784,10 @@ static void rt_check_expire(void)
> if (need_resched())
> cond_resched();
>
> + samples++;
> if (*rthp == NULL)
> continue;
> + length = 0;
> spin_lock_bh(rt_hash_lock_addr(i));
> while ((rth = *rthp) != NULL) {
> if (rt_is_expired(rth)) {
> @@ -784,11 +800,13 @@ static void rt_check_expire(void)
> if (time_before_eq(jiffies, rth->u.dst.expires)) {
> tmo >>= 1;
> rthp = &rth->u.dst.rt_next;
> + length += ONE;
> continue;
> }
> } else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout)) {
> tmo >>= 1;
> rthp = &rth->u.dst.rt_next;
> + length += ONE;
> continue;
> }
>
> @@ -797,6 +815,15 @@ static void rt_check_expire(void)
> rt_free(rth);
> }
> spin_unlock_bh(rt_hash_lock_addr(i));
> + sum += length;
> + sum2 += length*length;
> + }
> + if (samples) {
> + unsigned long avg = sum / samples;
> + unsigned long sd = int_sqrt(sum2 / samples - avg*avg);
> + rt_chain_length_max = max_t(unsigned long,
> + ip_rt_gc_elasticity,
> + (avg + 4*sd) >> FRACT_BITS);
So, I've been playing with this patch, and I've not figured out exactly what's
bothering me yet, since the math seems right, but something doesn't seem right
about the outcome of this algorithm. I've tested with my local system, and all
works well, because the route cache is well behaved, and the sd value always
works out to be very small, so ip_rt_gc_elasticity is used. So I've been
working through some scenarios by hand to see what this looks like using larger
numbers. If I assume ip_rt_gc_interval is 60, and rt_hash_log is 17, my sample
count here is 7864320 samples per run. If within that sample 393216 (about 4%)
of the buckets have one entry on the chain, and all the rest are zeros, my hand
calculations result in a standard deviation of approximately 140 and an average
of .4. That implies that in that sample set any one chain could be almost 500
entries long before it triggered a cache rebuild. Does that seem reasonable?
Best
Neil
--
/****************************************************
* Neil Horman <nhorman@tuxdriver.com>
* Software Engineer, Red Hat
****************************************************/
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-10-06 20:54 ` Neil Horman
@ 2008-10-06 21:21 ` Eric Dumazet
2008-10-06 22:52 ` Neil Horman
0 siblings, 1 reply; 64+ messages in thread
From: Eric Dumazet @ 2008-10-06 21:21 UTC (permalink / raw)
To: Neil Horman
Cc: Bill Fink, David Miller, netdev, kuznet, pekkas, jmorris,
yoshfuji, kaber, Evgeniy Polyakov
Neil Horman wrote:
> So, I've been playing with this patch, and I've not figured out eactly whats
> bothering me yet, since the math seems right, but something doesn't seem right
> about the outcome of this algorithm. I've tested with my local system, and all
> works well, because the route cache is well behaved, and the sd value always
> works out to be very small, so ip_rt_gc_elasticity is used. So I've been
> working through some scenarios by hand to see what this looks like using larger
> numbers. If i assume ip_rt_gc_interval is 60, and rt_hash_log is 17, my sample
> count here is 7864320 samples per run. If within that sample 393216 (about 4%)
> of the buckets have one entry on the chain, and all the rest are zeros, my hand
> calculations result in a standard deviation of approximately 140 and an average
> of .4. That imples that in that sample set any one chain could be almost 500
> entires long before it triggered a cache rebuld. Does that seem reasonable?
>
if rt_hash_log is 17, and interval is 60, then you should scan
(60 << 17)/300 slots. That's 26214 slots. (ie 20% of the 2^17 slots)
I have no idea how you can have sd = 140, even if scaled by (1 << 3)
with slots being empty or with one entry only...
If 4% of your slots have one element, then average length is 0.04 :)
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-10-06 21:21 ` Eric Dumazet
@ 2008-10-06 22:52 ` Neil Horman
2008-10-07 5:13 ` Eric Dumazet
0 siblings, 1 reply; 64+ messages in thread
From: Neil Horman @ 2008-10-06 22:52 UTC (permalink / raw)
To: Eric Dumazet
Cc: Bill Fink, David Miller, netdev, kuznet, pekkas, jmorris,
yoshfuji, kaber, Evgeniy Polyakov
On Mon, Oct 06, 2008 at 11:21:38PM +0200, Eric Dumazet wrote:
> Neil Horman wrote:
>> So, I've been playing with this patch, and I've not figured out eactly whats
>> bothering me yet, since the math seems right, but something doesn't seem right
>> about the outcome of this algorithm. I've tested with my local system, and all
>> works well, because the route cache is well behaved, and the sd value always
>> works out to be very small, so ip_rt_gc_elasticity is used. So I've been
>> working through some scenarios by hand to see what this looks like using larger
>> numbers. If i assume ip_rt_gc_interval is 60, and rt_hash_log is 17, my sample
>> count here is 7864320 samples per run. If within that sample 393216 (about 4%)
>> of the buckets have one entry on the chain, and all the rest are zeros, my hand
>> calculations result in a standard deviation of approximately 140 and an average
>> of .4. That imples that in that sample set any one chain could be almost 500
>> entires long before it triggered a cache rebuld. Does that seem reasonable?
>>
>
> if rt_hash_log is 17, and interval is 60, then you should scan (60 <<
> 17)/300 slots. That's 26214 slots. (ie 20% of the 2^17 slots)
>
> I have no idea how you can have sd = 140, even if scaled by (1 << 3)
> with slots being empty or with one entry only...
>
I don't either, that was my concern :).
> If 4% of your slots have one element, then average length is 0.04 :)
>
Yes, and the average worked out properly, which is why I was concerned.
If you take an even simpler case, like you state above (I admit I missed the
/300 part of the sample, but no matter).
samples = 26214
Assume each sample has a chain length of 1
sum = 26214 * (1 << 3) = 209712
sum2 = sum * sum = 209712 * 209712 = 43979122944
avg = sum / samples = 209712 / 26214 = 8 (correct)
sd = sqrt(sum2 / samples - avg*avg) = sqrt(43979122944/26214 - 64) = 1295
sd >> 3 = 1295.23 >> 3 = 161
Clearly, given the assumption that every chain in the sample set has 1 entry,
giving us an average of one, the traditional method of computing standard
deviation should have yielded an sd of 0 exactly, since every sample was
precisely the average. However, the math above gives us something significantly
larger. I'm hoping I missed something, but I don't think I have.
Neil
>
>
>
>
>
--
/****************************************************
* Neil Horman <nhorman@tuxdriver.com>
* Software Engineer, Red Hat
****************************************************/
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-10-06 22:52 ` Neil Horman
@ 2008-10-07 5:13 ` Eric Dumazet
2008-10-07 10:54 ` Neil Horman
2008-10-13 18:26 ` Neil Horman
0 siblings, 2 replies; 64+ messages in thread
From: Eric Dumazet @ 2008-10-07 5:13 UTC (permalink / raw)
To: Neil Horman
Cc: Bill Fink, David Miller, netdev, kuznet, pekkas, jmorris,
yoshfuji, kaber, Evgeniy Polyakov
Neil Horman wrote:
> On Mon, Oct 06, 2008 at 11:21:38PM +0200, Eric Dumazet wrote:
>> Neil Horman wrote:
>>> So, I've been playing with this patch, and I've not figured out eactly whats
>>> bothering me yet, since the math seems right, but something doesn't seem right
>>> about the outcome of this algorithm. I've tested with my local system, and all
>>> works well, because the route cache is well behaved, and the sd value always
>>> works out to be very small, so ip_rt_gc_elasticity is used. So I've been
>>> working through some scenarios by hand to see what this looks like using larger
>>> numbers. If i assume ip_rt_gc_interval is 60, and rt_hash_log is 17, my sample
>>> count here is 7864320 samples per run. If within that sample 393216 (about 4%)
>>> of the buckets have one entry on the chain, and all the rest are zeros, my hand
>>> calculations result in a standard deviation of approximately 140 and an average
>>> of .4. That imples that in that sample set any one chain could be almost 500
>>> entires long before it triggered a cache rebuld. Does that seem reasonable?
>>>
>> if rt_hash_log is 17, and interval is 60, then you should scan (60 <<
>> 17)/300 slots. That's 26214 slots. (ie 20% of the 2^17 slots)
>>
>> I have no idea how you can have sd = 140, even if scaled by (1 << 3)
>> with slots being empty or with one entry only...
>>
> I don't either, that was my concern :).
>
>> If 4% of your slots have one element, then average length is 0.04 :)
>>
> Yes, and the average worked out properly, which is why I was concerned.
>
> If you take an even simpler case, like you state above (I admit I miseed the
> /300 part of the sample, but no matter).
>
> samples = 26214
> Assume each sample has a chain length of 1
>
> sum = 26214 * (1 << 3) = 209712
> sum2 = sum * sum = s09712 * 209712 = 43979122944
> avg = sum / samples = 209712 / 26214 = 8 (correct)
> sd = sqrt(sum2 / samples - avg*avg) = sqrt(43979122944/26214 - 64) = 1295
> sd >> 3 = 1295.23 >> 3 = 161
>
>
> Clearly, given the assumption that every chain in the sample set has 1 entry,
> giving us an average of one, the traditional method of computing standard
> deviation should have yielded an sd of 0 exactly, since every sample was
> precisely the average. However, the math above gives us something significantly
> larger. I'm hoping I missed something, but I don't think I have.
>
Famous last sentence ;)
You made some errors in your hand calculations.
sum2 is the sum of squares. It's not sum * sum.
If all slots have one entry, all "lengths" are (1<<3),
and their 'square scaled by 6' is (1 << 6) . So sum2 = 26214 * (1 << 6) = 1677696
avg = sum / samples = 209712 / 26214 = (1 << 3)
sd = sqrt(sum2 / samples - avg*avg) = sqrt(64 - 64) = 0 (this is what we expected)
Now if you have 4 % of slots with one entry, and 96 % that are empty,
you should have
/* real maths */
avg = 0.04
sd = sqrt(0.04 - 0.04*0.04) = sqrt(0.0384) = 0.195959
avg + 4*sd = 0.82
/* fixed point math */
sum = 0.04 * 26214 * (1<<3) = 1048 * (1<<3) = 8384
sum2 = 1048 * (1 << 6) = 67072
avg << 3 = 8384/26214 = 0 (with 3 bits for fractional part, we do have rounding error)
sd << 3 = sqrt(67072/26214 - 0) = 1
(avg + 4*sd) << 3 = 4 -> final result is 4>>3 = 0 (expected)
Now if 50% of slots have one entry, we get :
/* real maths */
avg = 0.5
sd = sqrt(0.5 - 0.5*0.5) = sqrt(0.25) = 0.5
avg + 4*sd = 2.5
/* fixed point math */
sum = 0.5 * 26214 * (1<<3) = 104856
sum2 = 13107 * (1<<6) = 838848
avg << 3 = 104856/26214 = 4
sd << 3 = sqrt(838848/26214 - 4*4) = sqrt(32 - 16) = 4
(avg + 4*sd) << 3 = 20 -> final result is 20>>3 = 2 (expected)
Hope this helps
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-10-07 5:13 ` Eric Dumazet
@ 2008-10-07 10:54 ` Neil Horman
2008-10-13 18:26 ` Neil Horman
1 sibling, 0 replies; 64+ messages in thread
From: Neil Horman @ 2008-10-07 10:54 UTC (permalink / raw)
To: Eric Dumazet
Cc: Bill Fink, David Miller, netdev, kuznet, pekkas, jmorris,
yoshfuji, kaber, Evgeniy Polyakov
On Tue, Oct 07, 2008 at 07:13:29AM +0200, Eric Dumazet wrote:
> Neil Horman wrote:
>> On Mon, Oct 06, 2008 at 11:21:38PM +0200, Eric Dumazet wrote:
>> Neil Horman wrote:
>>>> So, I've been playing with this patch, and I've not figured out eactly whats
>>>> bothering me yet, since the math seems right, but something doesn't seem right
>>>> about the outcome of this algorithm. I've tested with my local system, and all
>>>> works well, because the route cache is well behaved, and the sd value always
>>>> works out to be very small, so ip_rt_gc_elasticity is used. So I've been
>>>> working through some scenarios by hand to see what this looks like using larger
>>>> numbers. If i assume ip_rt_gc_interval is 60, and rt_hash_log is 17, my sample
>>>> count here is 7864320 samples per run. If within that sample 393216 (about 4%)
>>>> of the buckets have one entry on the chain, and all the rest are zeros, my hand
>>>> calculations result in a standard deviation of approximately 140 and an average
>>>> of .4. That imples that in that sample set any one chain could be almost 500
>>>> entires long before it triggered a cache rebuld. Does that seem reasonable?
>>>>
>>> if rt_hash_log is 17, and interval is 60, then you should scan (60 <<
>>> 17)/300 slots. That's 26214 slots. (ie 20% of the 2^17 slots)
>>>
>>> I have no idea how you can have sd = 140, even if scaled by (1 << 3)
>>> with slots being empty or with one entry only...
>>>
>> I don't either, that was my concern :).
>>
>>> If 4% of your slots have one element, then average length is 0.04 :)
>>>
>> Yes, and the average worked out properly, which is why I was concerned.
>>
>> If you take an even simpler case, like you state above (I admit I miseed the
>> /300 part of the sample, but no matter).
>>
>> samples = 26214
>> Assume each sample has a chain length of 1
>>
>> sum = 26214 * (1 << 3) = 209712
>> sum2 = sum * sum = s09712 * 209712 = 43979122944
>> avg = sum / samples = 209712 / 26214 = 8 (correct)
>> sd = sqrt(sum2 / samples - avg*avg) = sqrt(43979122944/26214 - 64) = 1295
>> sd >> 3 = 1295.23 >> 3 = 161
>>
>>
>> Clearly, given the assumption that every chain in the sample set has 1 entry,
>> giving us an average of one, the traditional method of computing standard
>> deviation should have yielded an sd of 0 exactly, since every sample was
>> precisely the average. However, the math above gives us something significantly
>> larger. I'm hoping I missed something, but I don't think I have.
>>
>
> Famous last sentence ;)
>
> You made some errors in your hand calculations.
> sum2 is the sum of squares. Its not sum * sum.
>
> If all slots have one entry, all "lengths" are (1<<3), and their 'square
> scaled by 6' is (1 << 6) . So sum2 = 26214 * (1 << 6) = 1677696
> avg = sum / samples = 209712 / 26214 = (1 << 3)
> sd = sqrt(sum2 / samples - avg*avg) = sqrt(64 - 64) = 0 (this is what we expected)
>
> Now if you have 4 % of slots with one entry, and 96 % that are empty,
> you should have /* real maths */
> avg = 0.04
> sd = sqrt(0.04 - 0.04*0.04) = sqrt(0.0384) = 0.195959
> avg + 4*sd = 0.82
>
> /* fixed point math */
> sum = 0.04 * 26214 * (1<<3) = 1048 * (1<<3) = 8384 sum2 = 1048 * (1 << 6)
> = 67072
> avg << 3 = 8384/26214 = 0 (with 3 bits for fractional part, we do have rounding error)
> sd << 3 = sqrt(67072/26214 - 0) = 1
> (avg + 4*sd) << 3 = 4 -> final result is 4>>3 = 0 (expected)
>
> Now if 50% of slots have one entry, we get :
> /* real maths */
> avg = 0.5
> sd = sqrt(0.5 - 0.5*0.5) = sqrt(0.25) = 0.5
> avg + 4*sd = 2.5
>
> /* fixed point math */
> sum = 0.5 * 26214 * (1<<3) = 104856
> sum2 = 13107 * (1<<6) = 838848
> avg << 3 = 104856/26214 = 4
> sd << 3 = sqrt(838848/26214 - 4*4) = sqrt(32 - 16) = 4 (avg + 4*sd) << 3
> = 20 -> final result is 20>>3 = 2 (expected)
>
>
> Hope this helps
>
Gaaa! Yes, embarrassing as it may be, that clarifies it for me. Sorry, I missed
the fact that sum2 is the sum of squares rather than the square of the sum.
Stupid of me. Ok, so this actually works rather well then. I'll keep testing.
Thanks!
Neil
>
>
>
--
/****************************************************
* Neil Horman <nhorman@tuxdriver.com>
* Software Engineer, Red Hat
****************************************************/
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-10-07 5:13 ` Eric Dumazet
2008-10-07 10:54 ` Neil Horman
@ 2008-10-13 18:26 ` Neil Horman
2008-10-16 6:55 ` David Miller
1 sibling, 1 reply; 64+ messages in thread
From: Neil Horman @ 2008-10-13 18:26 UTC (permalink / raw)
To: Eric Dumazet
Cc: Bill Fink, David Miller, netdev, kuznet, pekkas, jmorris,
yoshfuji, kaber, Evgeniy Polyakov
On Tue, Oct 07, 2008 at 07:13:29AM +0200, Eric Dumazet wrote:
<lots of prior commentary snipped>
Ok, after my previous bone-headedness, I've seen the light and concluded that
the standard deviation computation that Eric devised works quite well. I've
spent the past week adding to and enhancing his code to come up with the patch
below. It's Eric's at the core, to which I've added the following:
1) The actual test in rt_intern_hash to detect chains with outlier lengths,
which will trigger an emergency cache rebuild, should an over-long chain be found
2) I've added some bits to organize chains in such a way that entries on the same
chain which differ only by values that do not affect the hash computation are
grouped together in contiguous segments. This lets us count hash input value
transitions rather than simple lengths. This addresses Herbert's concern that we
might have a valid, non-malignant route cache with only one or two long chains
(the case in which several QOS types were used on the same route would be one
such example). So in this patch routes in the cache with the same hash inputs
are counted only once
3) This patch keeps 2 extra values in the ipv4 net namespace structure. One is
a sysctl that defines the maximum allowable number of route cache emergency
rebuilds that we will allow, and the other is a counter of the number of times
we have rebuilt under duress. If we cross that sysctl boundary for a given
namespace, we discontinue using the route cache for that namespace entirely
until such time as the sysadmin resets the count via a reboot or ups the maximum
rebuild count via the sysctl. This addresses Herbert's other concern that we
might find ourselves constantly rebuilding should we come under an especially
sophisticated attack.
I've done some rudimentary testing on it here, so more aggressive testing, review
comments, etc. would be greatly appreciated.
If this meets everyone's approval I think we can follow up with a patch to remove
the secret interval code entirely.
Thanks & Regards!
Neil
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
include/linux/sysctl.h | 1
include/net/netns/ipv4.h | 2
kernel/sysctl_check.c | 1
net/ipv4/route.c | 109 ++++++++++++++++++++++++++++++++++++++++++++-
net/ipv4/sysctl_net_ipv4.c | 12 ++++
5 files changed, 123 insertions(+), 2 deletions(-)
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index d0437f3..481aa44 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -435,6 +435,7 @@ enum
NET_TCP_ALLOWED_CONG_CONTROL=123,
NET_TCP_MAX_SSTHRESH=124,
NET_TCP_FRTO_RESPONSE=125,
+ NET_IPV4_RT_CACHE_REBUILD_COUNT=126,
};
enum {
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index ece1c92..977f482 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -49,6 +49,8 @@ struct netns_ipv4 {
int sysctl_icmp_ratelimit;
int sysctl_icmp_ratemask;
int sysctl_icmp_errors_use_inbound_ifaddr;
+ int sysctl_rt_cache_rebuild_count;
+ int current_rt_cache_rebuild_count;
struct timer_list rt_secret_timer;
atomic_t rt_genid;
diff --git a/kernel/sysctl_check.c b/kernel/sysctl_check.c
index c35da23..eb9fb57 100644
--- a/kernel/sysctl_check.c
+++ b/kernel/sysctl_check.c
@@ -389,6 +389,7 @@ static const struct trans_ctl_table trans_net_ipv4_table[] = {
{ NET_TCP_ALLOWED_CONG_CONTROL, "tcp_allowed_congestion_control" },
{ NET_TCP_MAX_SSTHRESH, "tcp_max_ssthresh" },
{ NET_TCP_FRTO_RESPONSE, "tcp_frto_response" },
+ { NET_IPV4_RT_CACHE_REBUILD_COUNT, "rt_cache_rebuild_count" },
{ 2088 /* NET_IPQ_QMAX */, "ip_queue_maxlen" },
{}
};
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index a6d7c58..f20299c 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -129,6 +129,7 @@ static int ip_rt_mtu_expires __read_mostly = 10 * 60 * HZ;
static int ip_rt_min_pmtu __read_mostly = 512 + 20 + 20;
static int ip_rt_min_advmss __read_mostly = 256;
static int ip_rt_secret_interval __read_mostly = 10 * 60 * HZ;
+static int rt_chain_length_max __read_mostly = 8;
static void rt_worker_func(struct work_struct *work);
static DECLARE_DELAYED_WORK(expires_work, rt_worker_func);
@@ -145,6 +146,7 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst);
static void ipv4_link_failure(struct sk_buff *skb);
static void ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu);
static int rt_garbage_collect(struct dst_ops *ops);
+static void rt_emergency_hash_rebuild(struct net *net);
static struct dst_ops ipv4_dst_ops = {
@@ -201,6 +203,7 @@ const __u8 ip_tos2prio[16] = {
struct rt_hash_bucket {
struct rtable *chain;
};
+
#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || \
defined(CONFIG_PROVE_LOCKING)
/*
@@ -674,6 +677,19 @@ static inline u32 rt_score(struct rtable *rt)
return score;
}
+static inline int rt_caching(struct net *net)
+{
+ return net->ipv4.current_rt_cache_rebuild_count <=
+ net->ipv4.sysctl_rt_cache_rebuild_count;
+}
+
+static inline int compare_hash_inputs(struct flowi *fl1, struct flowi *fl2)
+{
+ return (__force u32)(((fl1->nl_u.ip4_u.daddr ^ fl2->nl_u.ip4_u.daddr) |
+ (fl1->nl_u.ip4_u.saddr ^ fl2->nl_u.ip4_u.saddr) |
+ (fl1->iif ^ fl2->iif)) == 0);
+}
+
static inline int compare_keys(struct flowi *fl1, struct flowi *fl2)
{
return ((__force u32)((fl1->nl_u.ip4_u.daddr ^ fl2->nl_u.ip4_u.daddr) |
@@ -753,11 +769,23 @@ static void rt_do_flush(int process_context)
}
}
+/*
+ * While freeing expired entries, we compute average chain length
+ * and standard deviation, using fixed-point arithmetic.
+ * This is to have an estimation of rt_chain_length_max:
+ * rt_chain_length_max = max(elasticity, AVG + 4*SD)
+ * We use 3 bits for fractional part, and 29 (or 61) for magnitude.
+ */
+
+#define FRACT_BITS 3
+#define ONE (1UL << FRACT_BITS)
+
static void rt_check_expire(void)
{
static unsigned int rover;
unsigned int i = rover, goal;
struct rtable *rth, **rthp;
+ unsigned long length;
u64 mult;
mult = ((u64)ip_rt_gc_interval) << rt_hash_log;
@@ -789,11 +817,24 @@ static void rt_check_expire(void)
if (time_before_eq(jiffies, rth->u.dst.expires)) {
tmo >>= 1;
rthp = &rth->u.dst.rt_next;
+ /*
+ * Only increment our chain length if the hash
+ * inputs from this and the next entry are not
+ * the same; we only want to count entries on
+ * the chain with the same hash inputs once,
+ * so that multiple entries for different QOS
+ * levels and other non-hash-affecting attributes
+ * don't unfairly skew the length computation
+ */
+ if (*rthp && !compare_hash_inputs(&(*rthp)->fl, &rth->fl))
+ length += ONE;
continue;
}
} else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout)) {
tmo >>= 1;
rthp = &rth->u.dst.rt_next;
+ if (*rthp && !compare_hash_inputs(&(*rthp)->fl, &rth->fl))
+ length += ONE;
continue;
}
@@ -851,6 +892,24 @@ static void rt_secret_rebuild(unsigned long __net)
mod_timer(&net->ipv4.rt_secret_timer, jiffies + ip_rt_secret_interval);
}
+static void rt_secret_rebuild_oneshot(struct net *net) {
+ del_timer_sync(&net->ipv4.rt_secret_timer);
+ rt_cache_invalidate(net);
+ if (ip_rt_secret_interval) {
+ net->ipv4.rt_secret_timer.expires += ip_rt_secret_interval;
+ add_timer(&net->ipv4.rt_secret_timer);
+ }
+}
+
+static void rt_emergency_hash_rebuild(struct net *net) {
+ if (net_ratelimit()) {
+ printk(KERN_WARNING "Route hash chain too long!\n");
+ printk(KERN_WARNING "Adjust your secret_interval!\n");
+ }
+
+ rt_secret_rebuild_oneshot(net);
+}
+
/*
Short description of GC goals.
@@ -989,6 +1048,7 @@ out: return 0;
static int rt_intern_hash(unsigned hash, struct rtable *rt, struct rtable **rp)
{
struct rtable *rth, **rthp;
+ struct rtable *rthi;
unsigned long now;
struct rtable *cand, **candp;
u32 min_score;
@@ -1002,7 +1062,13 @@ restart:
candp = NULL;
now = jiffies;
+ if (!rt_caching(dev_net(rt->u.dst.dev))) {
+ rt_drop(rt);
+ return 0;
+ }
+
rthp = &rt_hash_table[hash].chain;
+ rthi = NULL;
spin_lock_bh(rt_hash_lock_addr(hash));
while ((rth = *rthp) != NULL) {
@@ -1048,6 +1114,17 @@ restart:
chain_length++;
rthp = &rth->u.dst.rt_next;
+
+ /*
+ * Check to see if the next entry in the chain
+ * contains the same hash input values as rt. If it does,
+ * this is where we will insert into the list, instead of
+ * at the head. This groups entries that differ by aspects not
+ * relevant to the hash function together, which we use to adjust
+ * our chain length
+ */
+ if (*rthp && compare_hash_inputs(&(*rthp)->fl, &rt->fl))
+ rthi = rth;
}
if (cand) {
@@ -1061,6 +1138,15 @@ restart:
*candp = cand->u.dst.rt_next;
rt_free(cand);
}
+ } else {
+ if (chain_length > rt_chain_length_max) {
+ int num = ++dev_net(rt->u.dst.dev)->ipv4.current_rt_cache_rebuild_count;
+ if (!rt_caching(dev_net(rt->u.dst.dev))) {
+ printk(KERN_WARNING "%s: %d rebuilds is over limit, route caching disabled\n",
+ rt->u.dst.dev->name, num);
+ }
+ rt_emergency_hash_rebuild(dev_net(rt->u.dst.dev));
+ }
}
/* Try to bind route to arp only if it is output
@@ -1098,7 +1184,11 @@ restart:
}
}
- rt->u.dst.rt_next = rt_hash_table[hash].chain;
+ if (rthi)
+ rt->u.dst.rt_next = rthi->u.dst.rt_next;
+ else
+ rt->u.dst.rt_next = rt_hash_table[hash].chain;
+
#if RT_CACHE_DEBUG >= 2
if (rt->u.dst.rt_next) {
struct rtable *trt;
@@ -1109,7 +1199,10 @@ restart:
printk("\n");
}
#endif
- rt_hash_table[hash].chain = rt;
+ if (rthi)
+ rthi->u.dst.rt_next = rt;
+ else
+ rt_hash_table[hash].chain = rt;
spin_unlock_bh(rt_hash_lock_addr(hash));
*rp = rt;
return 0;
@@ -1212,6 +1305,9 @@ void ip_rt_redirect(__be32 old_gw, __be32 daddr, __be32 new_gw,
|| ipv4_is_zeronet(new_gw))
goto reject_redirect;
+ if (!rt_caching(net))
+ goto reject_redirect;
+
if (!IN_DEV_SHARED_MEDIA(in_dev)) {
if (!inet_addr_onlink(in_dev, new_gw, old_gw))
goto reject_redirect;
@@ -2125,6 +2221,10 @@ int ip_route_input(struct sk_buff *skb, __be32 daddr, __be32 saddr,
struct net *net;
net = dev_net(dev);
+
+ if (!rt_caching(net))
+ goto skip_cache;
+
tos &= IPTOS_RT_MASK;
hash = rt_hash(daddr, saddr, iif, rt_genid(net));
@@ -2149,6 +2249,7 @@ int ip_route_input(struct sk_buff *skb, __be32 daddr, __be32 saddr,
}
rcu_read_unlock();
+skip_cache:
/* Multicast recognition logic is moved from route cache to here.
The problem was that too many Ethernet cards have broken/missing
hardware multicast filters :-( As result the host on multicasting
@@ -2534,6 +2635,9 @@ int __ip_route_output_key(struct net *net, struct rtable **rp,
unsigned hash;
struct rtable *rth;
+ if (!rt_caching(net))
+ goto slow_output;
+
hash = rt_hash(flp->fl4_dst, flp->fl4_src, flp->oif, rt_genid(net));
rcu_read_lock_bh();
@@ -2558,6 +2662,7 @@ int __ip_route_output_key(struct net *net, struct rtable **rp,
}
rcu_read_unlock_bh();
+slow_output:
return ip_route_output_slow(net, rp, flp);
}
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 276d047..0f77361 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -795,6 +795,14 @@ static struct ctl_table ipv4_net_table[] = {
.mode = 0644,
.proc_handler = &proc_dointvec
},
+ {
+ .ctl_name = NET_IPV4_RT_CACHE_REBUILD_COUNT,
+ .procname = "rt_cache_rebuild_count",
+ .data = &init_net.ipv4.sysctl_rt_cache_rebuild_count,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec
+ },
{ }
};
@@ -827,8 +835,12 @@ static __net_init int ipv4_sysctl_init_net(struct net *net)
&net->ipv4.sysctl_icmp_ratelimit;
table[5].data =
&net->ipv4.sysctl_icmp_ratemask;
+ table[6].data =
+ &net->ipv4.sysctl_rt_cache_rebuild_count;
}
+ net->ipv4.sysctl_rt_cache_rebuild_count = 4;
+
net->ipv4.ipv4_hdr = register_net_sysctl_table(net,
net_ipv4_ctl_path, table);
if (net->ipv4.ipv4_hdr == NULL)
--
/****************************************************
* Neil Horman <nhorman@tuxdriver.com>
* Software Engineer, Red Hat
****************************************************/
^ permalink raw reply related [flat|nested] 64+ messages in thread
* Re: [PATCH] net: implement emergency route cache rebuilds when gc_elasticity is exceeded
2008-10-13 18:26 ` Neil Horman
@ 2008-10-16 6:55 ` David Miller
2008-10-16 9:19 ` Eric Dumazet
2008-10-16 11:41 ` Neil Horman
0 siblings, 2 replies; 64+ messages in thread
From: David Miller @ 2008-10-16 6:55 UTC (permalink / raw)
To: nhorman
Cc: dada1, billfink, netdev, kuznet, pekkas, jmorris, yoshfuji, kaber,
johnpol
From: Neil Horman <nhorman@tuxdriver.com>
Date: Mon, 13 Oct 2008 14:26:55 -0400
> If this meets everyone's approval I think we can follow up with a
> patch to remove the secret interval code entirely.
This patch looks pretty good to me.
Just some minor coding style nits:
> +static void rt_secret_rebuild_oneshot(struct net *net) {
Opening brace on new line please.
> +static void rt_emergency_hash_rebuild(struct net *net) {
Likewise.
* Re: [PATCH] net: implement emergency route cache rebuilds when gc_elasticity is exceeded
2008-10-16 6:55 ` David Miller
@ 2008-10-16 9:19 ` Eric Dumazet
2008-10-16 21:18 ` David Miller
2008-10-16 11:41 ` Neil Horman
1 sibling, 1 reply; 64+ messages in thread
From: Eric Dumazet @ 2008-10-16 9:19 UTC (permalink / raw)
To: David Miller
Cc: nhorman, billfink, netdev, kuznet, pekkas, jmorris, yoshfuji,
kaber, johnpol
[-- Attachment #1: Type: text/plain, Size: 930 bytes --]
David Miller wrote:
> From: Neil Horman <nhorman@tuxdriver.com>
> Date: Mon, 13 Oct 2008 14:26:55 -0400
>
>> If this meets everyone's approval I think we can follow up with a
>> patch to remove the secret interval code entirely.
>
> This patch looks pretty good to me.
>
> Just some minor coding style nits:
>
>> +static void rt_secret_rebuild_oneshot(struct net *net) {
>
> Opening brace on new line please.
>
>> +static void rt_emergency_hash_rebuild(struct net *net) {
>
> Likewise.
>
>
While browsing Neil's patch, I found a missing rcu_assign_pointer()
in the current code. I added a comment similar to the other rcu_assign_pointer()
calls in the rt_intern_hash() function.
[PATCH] ip: adds a missing rcu_assign_pointer()
rt_intern_hash() is doing an update of an RCU-guarded hash chain
without using rcu_assign_pointer() or an equivalent barrier.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
[-- Attachment #2: ipv4_route.patch --]
[-- Type: text/plain, Size: 496 bytes --]
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index a6d7c58..377c6f9 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1109,7 +1109,12 @@ restart:
printk("\n");
}
#endif
- rt_hash_table[hash].chain = rt;
+ /*
+ * Since lookup is lockfree, we must make sure
+ * previous writes to rt are committed to memory
+ * before making rt visible to other CPUs.
+ */
+ rcu_assign_pointer(rt_hash_table[hash].chain, rt);
spin_unlock_bh(rt_hash_lock_addr(hash));
*rp = rt;
return 0;
* Re: [PATCH] net: implement emergency route cache rebuilds when gc_elasticity is exceeded
2008-10-16 6:55 ` David Miller
2008-10-16 9:19 ` Eric Dumazet
@ 2008-10-16 11:41 ` Neil Horman
2008-10-16 12:25 ` Eric Dumazet
1 sibling, 1 reply; 64+ messages in thread
From: Neil Horman @ 2008-10-16 11:41 UTC (permalink / raw)
To: David Miller
Cc: dada1, billfink, netdev, kuznet, pekkas, jmorris, yoshfuji, kaber,
johnpol
On Wed, Oct 15, 2008 at 11:55:56PM -0700, David Miller wrote:
> From: Neil Horman <nhorman@tuxdriver.com>
> Date: Mon, 13 Oct 2008 14:26:55 -0400
>
> > If this meets everyone's approval I think we can follow up with a
> > patch to remove the secret interval code entirely.
>
> This patch looks pretty good to me.
>
> Just some minor coding style nits:
>
> > +static void rt_secret_rebuild_oneshot(struct net *net) {
>
> Opening brace on new line please.
>
> > +static void rt_emergency_hash_rebuild(struct net *net) {
>
> Likewise.
>
Thanks Dave, new patch, with those nits fixed up. I also cleaned up a few
checkpatch errors (all trailing whitespace and 80 col errors)
Best
Neil
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
include/linux/sysctl.h | 1
include/net/netns/ipv4.h | 2
kernel/sysctl_check.c | 1
net/ipv4/route.c | 117 ++++++++++++++++++++++++++++++++++++++++++++-
net/ipv4/sysctl_net_ipv4.c | 12 ++++
5 files changed, 131 insertions(+), 2 deletions(-)
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index d0437f3..481aa44 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -435,6 +435,7 @@ enum
NET_TCP_ALLOWED_CONG_CONTROL=123,
NET_TCP_MAX_SSTHRESH=124,
NET_TCP_FRTO_RESPONSE=125,
+ NET_IPV4_RT_CACHE_REBUILD_COUNT=126,
};
enum {
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index a6ed838..4fef762 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -46,6 +46,8 @@ struct netns_ipv4 {
int sysctl_icmp_ratelimit;
int sysctl_icmp_ratemask;
int sysctl_icmp_errors_use_inbound_ifaddr;
+ int sysctl_rt_cache_rebuild_count;
+ int current_rt_cache_rebuild_count;
struct timer_list rt_secret_timer;
atomic_t rt_genid;
diff --git a/kernel/sysctl_check.c b/kernel/sysctl_check.c
index c35da23..eb9fb57 100644
--- a/kernel/sysctl_check.c
+++ b/kernel/sysctl_check.c
@@ -389,6 +389,7 @@ static const struct trans_ctl_table trans_net_ipv4_table[] = {
{ NET_TCP_ALLOWED_CONG_CONTROL, "tcp_allowed_congestion_control" },
{ NET_TCP_MAX_SSTHRESH, "tcp_max_ssthresh" },
{ NET_TCP_FRTO_RESPONSE, "tcp_frto_response" },
+ { NET_IPV4_RT_CACHE_REBUILD_COUNT, "rt_cache_rebuild_count" },
{ 2088 /* NET_IPQ_QMAX */, "ip_queue_maxlen" },
{}
};
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 6ee5354..623b633 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -129,6 +129,7 @@ static int ip_rt_mtu_expires __read_mostly = 10 * 60 * HZ;
static int ip_rt_min_pmtu __read_mostly = 512 + 20 + 20;
static int ip_rt_min_advmss __read_mostly = 256;
static int ip_rt_secret_interval __read_mostly = 10 * 60 * HZ;
+static int rt_chain_length_max __read_mostly = 8;
static void rt_worker_func(struct work_struct *work);
static DECLARE_DELAYED_WORK(expires_work, rt_worker_func);
@@ -145,6 +146,7 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst);
static void ipv4_link_failure(struct sk_buff *skb);
static void ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu);
static int rt_garbage_collect(struct dst_ops *ops);
+static void rt_emergency_hash_rebuild(struct net *net);
static struct dst_ops ipv4_dst_ops = {
@@ -201,6 +203,7 @@ const __u8 ip_tos2prio[16] = {
struct rt_hash_bucket {
struct rtable *chain;
};
+
#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || \
defined(CONFIG_PROVE_LOCKING)
/*
@@ -669,6 +672,19 @@ static inline u32 rt_score(struct rtable *rt)
return score;
}
+static inline int rt_caching(struct net *net)
+{
+ return net->ipv4.current_rt_cache_rebuild_count <=
+ net->ipv4.sysctl_rt_cache_rebuild_count;
+}
+
+static inline int compare_hash_inputs(struct flowi *fl1, struct flowi *fl2)
+{
+ return (__force u32)(((fl1->nl_u.ip4_u.daddr ^ fl2->nl_u.ip4_u.daddr) |
+ (fl1->nl_u.ip4_u.saddr ^ fl2->nl_u.ip4_u.saddr) |
+ (fl1->iif ^ fl2->iif)) == 0);
+}
+
static inline int compare_keys(struct flowi *fl1, struct flowi *fl2)
{
return ((__force u32)((fl1->nl_u.ip4_u.daddr ^ fl2->nl_u.ip4_u.daddr) |
@@ -748,11 +764,23 @@ static void rt_do_flush(int process_context)
}
}
+/*
+ * While freeing expired entries, we compute average chain length
+ * and standard deviation, using fixed-point arithmetic.
+ * This gives an estimation of rt_chain_length_max
+ * rt_chain_length_max = max(elasticity, AVG + 4*SD)
+ * We use 3 bits for the fractional part, and 29 (or 61) for magnitude.
+ */
+
+#define FRACT_BITS 3
+#define ONE (1UL << FRACT_BITS)
+
static void rt_check_expire(void)
{
static unsigned int rover;
unsigned int i = rover, goal;
struct rtable *rth, **rthp;
+ unsigned long length;
u64 mult;
mult = ((u64)ip_rt_gc_interval) << rt_hash_log;
@@ -784,11 +812,29 @@ static void rt_check_expire(void)
if (time_before_eq(jiffies, rth->u.dst.expires)) {
tmo >>= 1;
rthp = &rth->u.dst.rt_next;
+ /*
+ * Only bump our length if the hash
+ * inputs on entries n and n+1 are not
+ * the same; we only count entries on
+ * a chain with equal hash inputs once,
+ * so that entries for different QoS
+ * levels and other non-hash input
+ * attributes don't unfairly skew
+ * the length computation
+ */
+ if (*rthp &&
+ !compare_hash_inputs(&(*rthp)->fl,
+ &rth->fl))
+ length += ONE;
continue;
}
} else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout)) {
tmo >>= 1;
rthp = &rth->u.dst.rt_next;
+ if (*rthp &&
+ !compare_hash_inputs(&(*rthp)->fl,
+ &rth->fl))
+ length += ONE;
continue;
}
@@ -846,6 +892,26 @@ static void rt_secret_rebuild(unsigned long __net)
mod_timer(&net->ipv4.rt_secret_timer, jiffies + ip_rt_secret_interval);
}
+static void rt_secret_rebuild_oneshot(struct net *net)
+{
+ del_timer_sync(&net->ipv4.rt_secret_timer);
+ rt_cache_invalidate(net);
+ if (ip_rt_secret_interval) {
+ net->ipv4.rt_secret_timer.expires += ip_rt_secret_interval;
+ add_timer(&net->ipv4.rt_secret_timer);
+ }
+}
+
+static void rt_emergency_hash_rebuild(struct net *net)
+{
+ if (net_ratelimit()) {
+ printk(KERN_WARNING "Route hash chain too long!\n");
+ printk(KERN_WARNING "Adjust your secret_interval!\n");
+ }
+
+ rt_secret_rebuild_oneshot(net);
+}
+
/*
Short description of GC goals.
@@ -984,6 +1050,7 @@ out: return 0;
static int rt_intern_hash(unsigned hash, struct rtable *rt, struct rtable **rp)
{
struct rtable *rth, **rthp;
+ struct rtable *rthi;
unsigned long now;
struct rtable *cand, **candp;
u32 min_score;
@@ -997,7 +1064,13 @@ restart:
candp = NULL;
now = jiffies;
+ if (!rt_caching(dev_net(rt->u.dst.dev))) {
+ rt_drop(rt);
+ return 0;
+ }
+
rthp = &rt_hash_table[hash].chain;
+ rthi = NULL;
spin_lock_bh(rt_hash_lock_addr(hash));
while ((rth = *rthp) != NULL) {
@@ -1043,6 +1116,17 @@ restart:
chain_length++;
rthp = &rth->u.dst.rt_next;
+
+ /*
+ * Check to see if the next entry in the chain
+ * contains the same hash input values as rt. If it does,
+ * this is where we will insert into the list, instead of
+ * at the head. This groups entries that differ by aspects not
+ * relevant to the hash function together, which we use to adjust
+ * our chain length
+ */
+ if (*rthp && compare_hash_inputs(&(*rthp)->fl, &rt->fl))
+ rthi = rth;
}
if (cand) {
@@ -1056,6 +1140,16 @@ restart:
*candp = cand->u.dst.rt_next;
rt_free(cand);
}
+ } else {
+ if (chain_length > rt_chain_length_max) {
+ struct net *net = dev_net(rt->u.dst.dev);
+ int num = ++net->ipv4.current_rt_cache_rebuild_count;
+ if (!rt_caching(dev_net(rt->u.dst.dev))) {
+ printk(KERN_WARNING "%s: %d rebuilds is over limit, route caching disabled\n",
+ rt->u.dst.dev->name, num);
+ }
+ rt_emergency_hash_rebuild(dev_net(rt->u.dst.dev));
+ }
}
/* Try to bind route to arp only if it is output
@@ -1093,7 +1187,11 @@ restart:
}
}
- rt->u.dst.rt_next = rt_hash_table[hash].chain;
+ if (rthi)
+ rt->u.dst.rt_next = rthi->u.dst.rt_next;
+ else
+ rt->u.dst.rt_next = rt_hash_table[hash].chain;
+
#if RT_CACHE_DEBUG >= 2
if (rt->u.dst.rt_next) {
struct rtable *trt;
@@ -1104,7 +1202,10 @@ restart:
printk("\n");
}
#endif
- rt_hash_table[hash].chain = rt;
+ if (rthi)
+ rthi->u.dst.rt_next = rt;
+ else
+ rt_hash_table[hash].chain = rt;
spin_unlock_bh(rt_hash_lock_addr(hash));
*rp = rt;
return 0;
@@ -1207,6 +1308,9 @@ void ip_rt_redirect(__be32 old_gw, __be32 daddr, __be32 new_gw,
|| ipv4_is_zeronet(new_gw))
goto reject_redirect;
+ if (!rt_caching(net))
+ goto reject_redirect;
+
if (!IN_DEV_SHARED_MEDIA(in_dev)) {
if (!inet_addr_onlink(in_dev, new_gw, old_gw))
goto reject_redirect;
@@ -2120,6 +2224,10 @@ int ip_route_input(struct sk_buff *skb, __be32 daddr, __be32 saddr,
struct net *net;
net = dev_net(dev);
+
+ if (!rt_caching(net))
+ goto skip_cache;
+
tos &= IPTOS_RT_MASK;
hash = rt_hash(daddr, saddr, iif, rt_genid(net));
@@ -2144,6 +2252,7 @@ int ip_route_input(struct sk_buff *skb, __be32 daddr, __be32 saddr,
}
rcu_read_unlock();
+skip_cache:
/* Multicast recognition logic is moved from route cache to here.
The problem was that too many Ethernet cards have broken/missing
hardware multicast filters :-( As result the host on multicasting
@@ -2523,6 +2632,9 @@ int __ip_route_output_key(struct net *net, struct rtable **rp,
unsigned hash;
struct rtable *rth;
+ if (!rt_caching(net))
+ goto slow_output;
+
hash = rt_hash(flp->fl4_dst, flp->fl4_src, flp->oif, rt_genid(net));
rcu_read_lock_bh();
@@ -2547,6 +2659,7 @@ int __ip_route_output_key(struct net *net, struct rtable **rp,
}
rcu_read_unlock_bh();
+slow_output:
return ip_route_output_slow(net, rp, flp);
}
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index e0689fd..6d9ab73 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -798,6 +798,14 @@ static struct ctl_table ipv4_net_table[] = {
.mode = 0644,
.proc_handler = &proc_dointvec
},
+ {
+ .ctl_name = NET_IPV4_RT_CACHE_REBUILD_COUNT,
+ .procname = "rt_cache_rebuild_count",
+ .data = &init_net.ipv4.sysctl_rt_cache_rebuild_count,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec
+ },
{ }
};
@@ -830,8 +838,12 @@ static __net_init int ipv4_sysctl_init_net(struct net *net)
&net->ipv4.sysctl_icmp_ratelimit;
table[5].data =
&net->ipv4.sysctl_icmp_ratemask;
+ table[6].data =
+ &net->ipv4.sysctl_rt_cache_rebuild_count;
}
+ net->ipv4.sysctl_rt_cache_rebuild_count = 4;
+
net->ipv4.ipv4_hdr = register_net_sysctl_table(net,
net_ipv4_ctl_path, table);
if (net->ipv4.ipv4_hdr == NULL)
--
/****************************************************
* Neil Horman <nhorman@tuxdriver.com>
* Software Engineer, Red Hat
****************************************************/
* Re: [PATCH] net: implement emergency route cache rebuilds when gc_elasticity is exceeded
2008-10-16 11:41 ` Neil Horman
@ 2008-10-16 12:25 ` Eric Dumazet
2008-10-16 16:36 ` Neil Horman
0 siblings, 1 reply; 64+ messages in thread
From: Eric Dumazet @ 2008-10-16 12:25 UTC (permalink / raw)
To: Neil Horman
Cc: David Miller, billfink, netdev, kuznet, pekkas, jmorris, yoshfuji,
kaber, johnpol
Neil Horman wrote:
> On Wed, Oct 15, 2008 at 11:55:56PM -0700, David Miller wrote:
>> From: Neil Horman <nhorman@tuxdriver.com>
>> Date: Mon, 13 Oct 2008 14:26:55 -0400
>>
>>> If this meets everyone's approval I think we can follow up with a
>>> patch to remove the secret interval code entirely.
>> This patch looks pretty good to me.
>>
>> Just some minor coding style nits:
>>
>>> +static void rt_secret_rebuild_oneshot(struct net *net) {
>> Opening brace on new line please.
>>
>>> +static void rt_emergency_hash_rebuild(struct net *net) {
>> Likewise.
>>
>
> Thanks Dave, new patch, with those nits fixed up. I also cleaned up a few
> checkpatch errors (all trailing whitespace and 80 col errors)
>
> Best
> Neil
>
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
>
> +/*
> + * While freeing expired entries, we compute average chain length
> + * and standard deviation, using fixed-point arithmetic.
> + * This gives an estimation of rt_chain_length_max
> + * rt_chain_length_max = max(elasticity, AVG + 4*SD)
> + * We use 3 bits for the fractional part, and 29 (or 61) for magnitude.
> + */
> +
> +#define FRACT_BITS 3
> +#define ONE (1UL << FRACT_BITS)
> +
> static void rt_check_expire(void)
> {
> static unsigned int rover;
> unsigned int i = rover, goal;
> struct rtable *rth, **rthp;
> + unsigned long length;
> u64 mult;
>
> mult = ((u64)ip_rt_gc_interval) << rt_hash_log;
> @@ -784,11 +812,29 @@ static void rt_check_expire(void)
> if (time_before_eq(jiffies, rth->u.dst.expires)) {
> tmo >>= 1;
> rthp = &rth->u.dst.rt_next;
> + /*
> + * Only bump our length if the hash
> + * inputs on entries n and n+1 are not
> + * the same; we only count entries on
> + * a chain with equal hash inputs once,
> + * so that entries for different QoS
> + * levels and other non-hash input
> + * attributes don't unfairly skew
> + * the length computation
> + */
> + if (*rthp &&
> + !compare_hash_inputs(&(*rthp)->fl,
> + &rth->fl))
> + length += ONE;
> continue;
> }
> } else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout)) {
> tmo >>= 1;
> rthp = &rth->u.dst.rt_next;
> + if (*rthp &&
> + !compare_hash_inputs(&(*rthp)->fl,
> + &rth->fl))
> + length += ONE;
> continue;
> }
Incomplete patch?
You added a 'length' variable and update it, but nowhere initialize or read it?
Some way to change rt_chain_length_max is needed, via sysctl or dynamically...
* Re: [PATCH] net: implement emergency route cache rebuilds when gc_elasticity is exceeded
2008-10-16 12:25 ` Eric Dumazet
@ 2008-10-16 16:36 ` Neil Horman
2008-10-16 23:35 ` Neil Horman
0 siblings, 1 reply; 64+ messages in thread
From: Neil Horman @ 2008-10-16 16:36 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, billfink, netdev, kuznet, pekkas, jmorris, yoshfuji,
kaber, johnpol
On Thu, Oct 16, 2008 at 02:25:47PM +0200, Eric Dumazet wrote:
> Neil Horman wrote:
>> On Wed, Oct 15, 2008 at 11:55:56PM -0700, David Miller wrote:
>>> From: Neil Horman <nhorman@tuxdriver.com>
>>> Date: Mon, 13 Oct 2008 14:26:55 -0400
>>>
>>>> If this meets everyone's approval I think we can follow up with a
>>>> patch to remove the secret interval code entirely.
>>> This patch looks pretty good to me.
>>>
>>> Just some minor coding style nits:
>>>
>>>> +static void rt_secret_rebuild_oneshot(struct net *net) {
>>> Opening brace on new line please.
>>>
>>>> +static void rt_emergency_hash_rebuild(struct net *net) {
>>> Likewise.
>>>
>>
>> Thanks Dave, new patch, with those nits fixed up. I also cleaned up a few
>> checkpatch errors (all trailing whitespace and 80 col errors)
>>
>> Best
>> Neil
>>
>> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
>
>> +/*
>> + * While freeing expired entries, we compute average chain length
>> + * and standard deviation, using fixed-point arithmetic.
>> + * This gives an estimation of rt_chain_length_max
>> + * rt_chain_length_max = max(elasticity, AVG + 4*SD)
>> + * We use 3 bits for the fractional part, and 29 (or 61) for magnitude.
>> + */
>> +
>> +#define FRACT_BITS 3
>> +#define ONE (1UL << FRACT_BITS)
>> +
>> static void rt_check_expire(void)
>> {
>> static unsigned int rover;
>> unsigned int i = rover, goal;
>> struct rtable *rth, **rthp;
>> + unsigned long length;
>> u64 mult;
>> mult = ((u64)ip_rt_gc_interval) << rt_hash_log;
>> @@ -784,11 +812,29 @@ static void rt_check_expire(void)
>> if (time_before_eq(jiffies, rth->u.dst.expires)) {
>> tmo >>= 1;
>> rthp = &rth->u.dst.rt_next;
>> + /*
>> + * Only bump our length if the hash
>> + * inputs on entries n and n+1 are not
>> + * the same; we only count entries on
>> + * a chain with equal hash inputs once,
>> + * so that entries for different QoS
>> + * levels and other non-hash input
>> + * attributes don't unfairly skew
>> + * the length computation
>> + */
>> + if (*rthp &&
>> + !compare_hash_inputs(&(*rthp)->fl,
>> + &rth->fl))
>> + length += ONE;
>> continue;
>> }
>> } else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout)) {
>> tmo >>= 1;
>> rthp = &rth->u.dst.rt_next;
>> + if (*rthp &&
>> + !compare_hash_inputs(&(*rthp)->fl,
>> + &rth->fl))
>> + length += ONE;
>> continue;
>> }
>
> Incomplete patch?
>
Yeah, that was quite stupid of me. I rescind this, and I'll post a patch with the
missing chunk later tonight after I spin/test it.
> You added a 'length' variable and update it, but nowhere initialize or read it?
>
> Some way to change rt_chain_length_max is needed, via sysctl or dynamically...
I don't really think so, since that's computed on every run through rt_check_expire anyway.
Thanks!
Neil
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
* Re: [PATCH] net: implement emergency route cache rebuilds when gc_elasticity is exceeded
2008-10-16 9:19 ` Eric Dumazet
@ 2008-10-16 21:18 ` David Miller
0 siblings, 0 replies; 64+ messages in thread
From: David Miller @ 2008-10-16 21:18 UTC (permalink / raw)
To: dada1
Cc: nhorman, billfink, netdev, kuznet, pekkas, jmorris, yoshfuji,
kaber, johnpol
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Thu, 16 Oct 2008 11:19:38 +0200
> While browsing Neil's patch, I found a missing rcu_assign_pointer()
> in the current code. I added a comment similar to the other rcu_assign_pointer()
> calls in the rt_intern_hash() function.
>
> [PATCH] ip: adds a missing rcu_assign_pointer()
>
> rt_intern_hash() is doing an update of an RCU-guarded hash chain
> without using rcu_assign_pointer() or an equivalent barrier.
>
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Applied, thanks Eric.
* Re: [PATCH] net: implement emergency route cache rebuilds when gc_elasticity is exceeded
2008-10-16 16:36 ` Neil Horman
@ 2008-10-16 23:35 ` Neil Horman
2008-10-17 4:53 ` Eric Dumazet
` (2 more replies)
0 siblings, 3 replies; 64+ messages in thread
From: Neil Horman @ 2008-10-16 23:35 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, billfink, netdev, kuznet, pekkas, jmorris, yoshfuji,
kaber, johnpol
On Thu, Oct 16, 2008 at 12:36:44PM -0400, Neil Horman wrote:
> On Thu, Oct 16, 2008 at 02:25:47PM +0200, Eric Dumazet wrote:
> > Neil Horman wrote:
> Yeah, that was quite stupid of me. I rescind this, and I'll post a patch with the
> missing chunk later tonight after I spin/test it.
>
OK, here's a new patch, same as before, but with proper initialization added for
the stack variables, as well as the missing chunk that actually computes the
standard deviation and maximum chain length. Built/tested by me successfully.
Sorry for the prior noise. Please review/ack.
Regards
Neil
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
include/linux/sysctl.h | 1
include/net/netns/ipv4.h | 2
kernel/sysctl_check.c | 1
net/ipv4/route.c | 130 ++++++++++++++++++++++++++++++++++++++++++++-
net/ipv4/sysctl_net_ipv4.c | 12 ++++
5 files changed, 144 insertions(+), 2 deletions(-)
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index d0437f3..481aa44 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -435,6 +435,7 @@ enum
NET_TCP_ALLOWED_CONG_CONTROL=123,
NET_TCP_MAX_SSTHRESH=124,
NET_TCP_FRTO_RESPONSE=125,
+ NET_IPV4_RT_CACHE_REBUILD_COUNT=126,
};
enum {
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index a6ed838..4fef762 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -46,6 +46,8 @@ struct netns_ipv4 {
int sysctl_icmp_ratelimit;
int sysctl_icmp_ratemask;
int sysctl_icmp_errors_use_inbound_ifaddr;
+ int sysctl_rt_cache_rebuild_count;
+ int current_rt_cache_rebuild_count;
struct timer_list rt_secret_timer;
atomic_t rt_genid;
diff --git a/kernel/sysctl_check.c b/kernel/sysctl_check.c
index c35da23..eb9fb57 100644
--- a/kernel/sysctl_check.c
+++ b/kernel/sysctl_check.c
@@ -389,6 +389,7 @@ static const struct trans_ctl_table trans_net_ipv4_table[] = {
{ NET_TCP_ALLOWED_CONG_CONTROL, "tcp_allowed_congestion_control" },
{ NET_TCP_MAX_SSTHRESH, "tcp_max_ssthresh" },
{ NET_TCP_FRTO_RESPONSE, "tcp_frto_response" },
+ { NET_IPV4_RT_CACHE_REBUILD_COUNT, "rt_cache_rebuild_count" },
{ 2088 /* NET_IPQ_QMAX */, "ip_queue_maxlen" },
{}
};
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 6ee5354..0ff28c4 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -129,6 +129,7 @@ static int ip_rt_mtu_expires __read_mostly = 10 * 60 * HZ;
static int ip_rt_min_pmtu __read_mostly = 512 + 20 + 20;
static int ip_rt_min_advmss __read_mostly = 256;
static int ip_rt_secret_interval __read_mostly = 10 * 60 * HZ;
+static int rt_chain_length_max __read_mostly = 8;
static void rt_worker_func(struct work_struct *work);
static DECLARE_DELAYED_WORK(expires_work, rt_worker_func);
@@ -145,6 +146,7 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst);
static void ipv4_link_failure(struct sk_buff *skb);
static void ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu);
static int rt_garbage_collect(struct dst_ops *ops);
+static void rt_emergency_hash_rebuild(struct net *net);
static struct dst_ops ipv4_dst_ops = {
@@ -201,6 +203,7 @@ const __u8 ip_tos2prio[16] = {
struct rt_hash_bucket {
struct rtable *chain;
};
+
#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || \
defined(CONFIG_PROVE_LOCKING)
/*
@@ -669,6 +672,19 @@ static inline u32 rt_score(struct rtable *rt)
return score;
}
+static inline int rt_caching(struct net *net)
+{
+ return net->ipv4.current_rt_cache_rebuild_count <=
+ net->ipv4.sysctl_rt_cache_rebuild_count;
+}
+
+static inline int compare_hash_inputs(struct flowi *fl1, struct flowi *fl2)
+{
+ return (__force u32)(((fl1->nl_u.ip4_u.daddr ^ fl2->nl_u.ip4_u.daddr) |
+ (fl1->nl_u.ip4_u.saddr ^ fl2->nl_u.ip4_u.saddr) |
+ (fl1->iif ^ fl2->iif)) == 0);
+}
+
static inline int compare_keys(struct flowi *fl1, struct flowi *fl2)
{
return ((__force u32)((fl1->nl_u.ip4_u.daddr ^ fl2->nl_u.ip4_u.daddr) |
@@ -748,11 +764,24 @@ static void rt_do_flush(int process_context)
}
}
+/*
+ * While freeing expired entries, we compute average chain length
+ * and standard deviation, using fixed-point arithmetic.
+ * This gives an estimation of rt_chain_length_max
+ * rt_chain_length_max = max(elasticity, AVG + 4*SD)
+ * We use 3 bits for the fractional part, and 29 (or 61) for magnitude.
+ */
+
+#define FRACT_BITS 3
+#define ONE (1UL << FRACT_BITS)
+
static void rt_check_expire(void)
{
static unsigned int rover;
unsigned int i = rover, goal;
struct rtable *rth, **rthp;
+ unsigned long length = 0, samples = 0;
+ unsigned long sum = 0, sum2 = 0;
u64 mult;
mult = ((u64)ip_rt_gc_interval) << rt_hash_log;
@@ -761,6 +790,7 @@ static void rt_check_expire(void)
goal = (unsigned int)mult;
if (goal > rt_hash_mask)
goal = rt_hash_mask + 1;
+ length = 0;
for (; goal > 0; goal--) {
unsigned long tmo = ip_rt_gc_timeout;
@@ -770,6 +800,8 @@ static void rt_check_expire(void)
if (need_resched())
cond_resched();
+ samples++;
+
if (*rthp == NULL)
continue;
spin_lock_bh(rt_hash_lock_addr(i));
@@ -784,11 +816,29 @@ static void rt_check_expire(void)
if (time_before_eq(jiffies, rth->u.dst.expires)) {
tmo >>= 1;
rthp = &rth->u.dst.rt_next;
+ /*
+ * Only bump our length if the hash
+ * inputs on entries n and n+1 are not
+ * the same; we only count entries on
+ * a chain with equal hash inputs once,
+ * so that entries for different QoS
+ * levels and other non-hash input
+ * attributes don't unfairly skew
+ * the length computation
+ */
+ if (*rthp &&
+ !compare_hash_inputs(&(*rthp)->fl,
+ &rth->fl))
+ length += ONE;
continue;
}
} else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout)) {
tmo >>= 1;
rthp = &rth->u.dst.rt_next;
+ if (*rthp &&
+ !compare_hash_inputs(&(*rthp)->fl,
+ &rth->fl))
+ length += ONE;
continue;
}
@@ -797,6 +847,15 @@ static void rt_check_expire(void)
rt_free(rth);
}
spin_unlock_bh(rt_hash_lock_addr(i));
+ sum += length;
+ sum2 += length*length;
+ }
+ if (samples) {
+ unsigned long avg = sum / samples;
+ unsigned long sd = int_sqrt(sum2 / samples - avg*avg);
+ rt_chain_length_max = max_t(unsigned long,
+ ip_rt_gc_elasticity,
+ (avg + 4*sd) >> FRACT_BITS);
}
rover = i;
}
@@ -846,6 +905,26 @@ static void rt_secret_rebuild(unsigned long __net)
mod_timer(&net->ipv4.rt_secret_timer, jiffies + ip_rt_secret_interval);
}
+static void rt_secret_rebuild_oneshot(struct net *net)
+{
+ del_timer_sync(&net->ipv4.rt_secret_timer);
+ rt_cache_invalidate(net);
+ if (ip_rt_secret_interval) {
+ net->ipv4.rt_secret_timer.expires += ip_rt_secret_interval;
+ add_timer(&net->ipv4.rt_secret_timer);
+ }
+}
+
+static void rt_emergency_hash_rebuild(struct net *net)
+{
+ if (net_ratelimit()) {
+ printk(KERN_WARNING "Route hash chain too long!\n");
+ printk(KERN_WARNING "Adjust your secret_interval!\n");
+ }
+
+ rt_secret_rebuild_oneshot(net);
+}
+
/*
Short description of GC goals.
@@ -984,6 +1063,7 @@ out: return 0;
static int rt_intern_hash(unsigned hash, struct rtable *rt, struct rtable **rp)
{
struct rtable *rth, **rthp;
+ struct rtable *rthi;
unsigned long now;
struct rtable *cand, **candp;
u32 min_score;
@@ -997,7 +1077,13 @@ restart:
candp = NULL;
now = jiffies;
+ if (!rt_caching(dev_net(rt->u.dst.dev))) {
+ rt_drop(rt);
+ return 0;
+ }
+
rthp = &rt_hash_table[hash].chain;
+ rthi = NULL;
spin_lock_bh(rt_hash_lock_addr(hash));
while ((rth = *rthp) != NULL) {
@@ -1043,6 +1129,17 @@ restart:
chain_length++;
rthp = &rth->u.dst.rt_next;
+
+ /*
+ * Check to see if the next entry in the chain
+ * contains the same hash input values as rt. If it does,
+ * this is where we will insert into the list, instead of
+ * at the head. This groups entries that differ by aspects not
+ * relevant to the hash function together, which we use to adjust
+ * our chain length
+ */
+ if (*rthp && compare_hash_inputs(&(*rthp)->fl, &rt->fl))
+ rthi = rth;
}
if (cand) {
@@ -1056,6 +1153,16 @@ restart:
*candp = cand->u.dst.rt_next;
rt_free(cand);
}
+ } else {
+ if (chain_length > rt_chain_length_max) {
+ struct net *net = dev_net(rt->u.dst.dev);
+ int num = ++net->ipv4.current_rt_cache_rebuild_count;
+ if (!rt_caching(dev_net(rt->u.dst.dev))) {
+ printk(KERN_WARNING "%s: %d rebuilds is over limit, route caching disabled\n",
+ rt->u.dst.dev->name, num);
+ }
+ rt_emergency_hash_rebuild(dev_net(rt->u.dst.dev));
+ }
}
/* Try to bind route to arp only if it is output
@@ -1093,7 +1200,11 @@ restart:
}
}
- rt->u.dst.rt_next = rt_hash_table[hash].chain;
+ if (rthi)
+ rt->u.dst.rt_next = rthi->u.dst.rt_next;
+ else
+ rt->u.dst.rt_next = rt_hash_table[hash].chain;
+
#if RT_CACHE_DEBUG >= 2
if (rt->u.dst.rt_next) {
struct rtable *trt;
@@ -1104,7 +1215,10 @@ restart:
printk("\n");
}
#endif
- rt_hash_table[hash].chain = rt;
+ if (rthi)
+ rthi->u.dst.rt_next = rt;
+ else
+ rt_hash_table[hash].chain = rt;
spin_unlock_bh(rt_hash_lock_addr(hash));
*rp = rt;
return 0;
@@ -1207,6 +1321,9 @@ void ip_rt_redirect(__be32 old_gw, __be32 daddr, __be32 new_gw,
|| ipv4_is_zeronet(new_gw))
goto reject_redirect;
+ if (!rt_caching(net))
+ goto reject_redirect;
+
if (!IN_DEV_SHARED_MEDIA(in_dev)) {
if (!inet_addr_onlink(in_dev, new_gw, old_gw))
goto reject_redirect;
@@ -2120,6 +2237,10 @@ int ip_route_input(struct sk_buff *skb, __be32 daddr, __be32 saddr,
struct net *net;
net = dev_net(dev);
+
+ if (!rt_caching(net))
+ goto skip_cache;
+
tos &= IPTOS_RT_MASK;
hash = rt_hash(daddr, saddr, iif, rt_genid(net));
@@ -2144,6 +2265,7 @@ int ip_route_input(struct sk_buff *skb, __be32 daddr, __be32 saddr,
}
rcu_read_unlock();
+skip_cache:
/* Multicast recognition logic is moved from route cache to here.
The problem was that too many Ethernet cards have broken/missing
hardware multicast filters :-( As result the host on multicasting
@@ -2523,6 +2645,9 @@ int __ip_route_output_key(struct net *net, struct rtable **rp,
unsigned hash;
struct rtable *rth;
+ if (!rt_caching(net))
+ goto slow_output;
+
hash = rt_hash(flp->fl4_dst, flp->fl4_src, flp->oif, rt_genid(net));
rcu_read_lock_bh();
@@ -2547,6 +2672,7 @@ int __ip_route_output_key(struct net *net, struct rtable **rp,
}
rcu_read_unlock_bh();
+slow_output:
return ip_route_output_slow(net, rp, flp);
}
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index e0689fd..6d9ab73 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -798,6 +798,14 @@ static struct ctl_table ipv4_net_table[] = {
.mode = 0644,
.proc_handler = &proc_dointvec
},
+ {
+ .ctl_name = NET_IPV4_RT_CACHE_REBUILD_COUNT,
+ .procname = "rt_cache_rebuild_count",
+ .data = &init_net.ipv4.sysctl_rt_cache_rebuild_count,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec
+ },
{ }
};
@@ -830,8 +838,12 @@ static __net_init int ipv4_sysctl_init_net(struct net *net)
&net->ipv4.sysctl_icmp_ratelimit;
table[5].data =
&net->ipv4.sysctl_icmp_ratemask;
+ table[6].data =
+ &net->ipv4.sysctl_rt_cache_rebuild_count;
}
+ net->ipv4.sysctl_rt_cache_rebuild_count = 4;
+
net->ipv4.ipv4_hdr = register_net_sysctl_table(net,
net_ipv4_ctl_path, table);
if (net->ipv4.ipv4_hdr == NULL)
^ permalink raw reply related [flat|nested] 64+ messages in thread
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-10-16 23:35 ` Neil Horman
@ 2008-10-17 4:53 ` Eric Dumazet
2008-10-17 5:23 ` David Miller
2008-10-17 5:03 ` Stephen Hemminger
2008-10-17 5:06 ` Stephen Hemminger
2 siblings, 1 reply; 64+ messages in thread
From: Eric Dumazet @ 2008-10-17 4:53 UTC (permalink / raw)
To: Neil Horman
Cc: David Miller, billfink, netdev, kuznet, pekkas, jmorris, yoshfuji,
kaber, johnpol
Neil Horman wrote:
> On Thu, Oct 16, 2008 at 12:36:44PM -0400, Neil Horman wrote:
>> On Thu, Oct 16, 2008 at 02:25:47PM +0200, Eric Dumazet wrote:
>>> Neil Horman wrote:
>> Yeah, that was quite stupid of me. I rescind this, and I'll post a patch with the
>> missing chunk later tonight after I spin/test it.
>>
>
>
> Ok, here's a new patch, same as before, but with proper initialization added
> for the stack variables, as well as the missing chunk that actually computes
> the standard deviation and maximum chain length. Built/tested by me
> successfully. Sorry for the prior noise. Please review/ack.
>
> Regards
> Neil
OK
First, please include the description every time you submit a patch.
An extensive description is probably the most important part of a patch;
many people will read only the description.
>
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
>
>
> include/linux/sysctl.h | 1
> include/net/netns/ipv4.h | 2
> kernel/sysctl_check.c | 1
> net/ipv4/route.c | 130 ++++++++++++++++++++++++++++++++++++++++++++-
> net/ipv4/sysctl_net_ipv4.c | 12 ++++
> 5 files changed, 144 insertions(+), 2 deletions(-)
>
>
> diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
> index d0437f3..481aa44 100644
> --- a/include/linux/sysctl.h
> +++ b/include/linux/sysctl.h
> @@ -435,6 +435,7 @@ enum
> NET_TCP_ALLOWED_CONG_CONTROL=123,
> NET_TCP_MAX_SSTHRESH=124,
> NET_TCP_FRTO_RESPONSE=125,
> + NET_IPV4_RT_CACHE_REBUILD_COUNT=126,
> };
>
> enum {
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index a6ed838..4fef762 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -46,6 +46,8 @@ struct netns_ipv4 {
> int sysctl_icmp_ratelimit;
> int sysctl_icmp_ratemask;
> int sysctl_icmp_errors_use_inbound_ifaddr;
> + int sysctl_rt_cache_rebuild_count;
> + int current_rt_cache_rebuild_count;
>
> struct timer_list rt_secret_timer;
> atomic_t rt_genid;
> diff --git a/kernel/sysctl_check.c b/kernel/sysctl_check.c
> index c35da23..eb9fb57 100644
> --- a/kernel/sysctl_check.c
> +++ b/kernel/sysctl_check.c
> @@ -389,6 +389,7 @@ static const struct trans_ctl_table trans_net_ipv4_table[] = {
> { NET_TCP_ALLOWED_CONG_CONTROL, "tcp_allowed_congestion_control" },
> { NET_TCP_MAX_SSTHRESH, "tcp_max_ssthresh" },
> { NET_TCP_FRTO_RESPONSE, "tcp_frto_response" },
> + { NET_IPV4_RT_CACHE_REBUILD_COUNT, "rt_cache_rebuild_count" },
> { 2088 /* NET_IPQ_QMAX */, "ip_queue_maxlen" },
> {}
> };
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 6ee5354..0ff28c4 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -129,6 +129,7 @@ static int ip_rt_mtu_expires __read_mostly = 10 * 60 * HZ;
> static int ip_rt_min_pmtu __read_mostly = 512 + 20 + 20;
> static int ip_rt_min_advmss __read_mostly = 256;
> static int ip_rt_secret_interval __read_mostly = 10 * 60 * HZ;
> +static int rt_chain_length_max __read_mostly = 8;
Some machines come to life in hostile environments, receive lots of UDP
messages from many hosts, and may hit this limit before rt_check_expire() has
started its first run. Please consider the default route cache settings, which
allow 16 elements per chain on average. And because of the avg + 4*sd formula,
a default of 20 would be better.
>
> static void rt_worker_func(struct work_struct *work);
> static DECLARE_DELAYED_WORK(expires_work, rt_worker_func);
> @@ -145,6 +146,7 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst);
> static void ipv4_link_failure(struct sk_buff *skb);
> static void ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu);
> static int rt_garbage_collect(struct dst_ops *ops);
> +static void rt_emergency_hash_rebuild(struct net *net);
>
>
> static struct dst_ops ipv4_dst_ops = {
> @@ -201,6 +203,7 @@ const __u8 ip_tos2prio[16] = {
> struct rt_hash_bucket {
> struct rtable *chain;
> };
> +
> #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || \
> defined(CONFIG_PROVE_LOCKING)
> /*
> @@ -669,6 +672,19 @@ static inline u32 rt_score(struct rtable *rt)
> return score;
> }
>
> +static inline int rt_caching(struct net *net)
> +{
> + return net->ipv4.current_rt_cache_rebuild_count <=
> + net->ipv4.sysctl_rt_cache_rebuild_count;
> +}
> +
> +static inline int compare_hash_inputs(struct flowi *fl1, struct flowi *fl2)
> +{
> + return (__force u32)(((fl1->nl_u.ip4_u.daddr ^ fl2->nl_u.ip4_u.daddr) |
> + (fl1->nl_u.ip4_u.saddr ^ fl2->nl_u.ip4_u.saddr) |
> + (fl1->iif ^ fl2->iif)) == 0);
> +}
> +
> static inline int compare_keys(struct flowi *fl1, struct flowi *fl2)
> {
> return ((__force u32)((fl1->nl_u.ip4_u.daddr ^ fl2->nl_u.ip4_u.daddr) |
> @@ -748,11 +764,24 @@ static void rt_do_flush(int process_context)
> }
> }
>
> +/*
> + * While freeing expired entries, we compute average chain length
> + * and standard deviation, using fixed-point arithmetic.
> + * This to have an estimation of rt_chain_length_max
> + * rt_chain_length_max = max(elasticity, AVG + 4*SD)
> + * We use 3 bits for frational part, and 29 (or 61) for magnitude.
> + */
> +
> +#define FRACT_BITS 3
> +#define ONE (1UL << FRACT_BITS)
> +
> static void rt_check_expire(void)
> {
> static unsigned int rover;
> unsigned int i = rover, goal;
> struct rtable *rth, **rthp;
> + unsigned long length = 0, samples = 0;
> + unsigned long sum = 0, sum2 = 0;
> u64 mult;
>
> mult = ((u64)ip_rt_gc_interval) << rt_hash_log;
> @@ -761,6 +790,7 @@ static void rt_check_expire(void)
> goal = (unsigned int)mult;
> if (goal > rt_hash_mask)
> goal = rt_hash_mask + 1;
> + length = 0;
> for (; goal > 0; goal--) {
> unsigned long tmo = ip_rt_gc_timeout;
>
> @@ -770,6 +800,8 @@ static void rt_check_expire(void)
> if (need_resched())
> cond_resched();
>
> + samples++;
> +
> if (*rthp == NULL)
> continue;
> spin_lock_bh(rt_hash_lock_addr(i));
> @@ -784,11 +816,29 @@ static void rt_check_expire(void)
> if (time_before_eq(jiffies, rth->u.dst.expires)) {
> tmo >>= 1;
> rthp = &rth->u.dst.rt_next;
> + /*
> + * Only bump our length if the hash
> + * inputs on entries n and n+1 are not
> + * the same, we only count entries on
> + * a chain with equal hash inputs once
> + * so that entries for different QOS
> + * levels, and other non-hash input
> + * attributes don't unfairly skew
> + * the length computation
> + */
> + if (*rthp &&
> + !compare_hash_inputs(&(*rthp)->fl,
> + &rth->fl))
> + length += ONE;
Here you ignore chains with one element. This introduces a bias into the
average and sd computation.
A correct test would be:
if ((*rthp == NULL) ||
!compare_hash_inputs(&(*rthp)->fl,&rth->fl))
length += ONE;
> continue;
> }
> } else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout)) {
> tmo >>= 1;
> rthp = &rth->u.dst.rt_next;
> + if (*rthp &&
> + !compare_hash_inputs(&(*rthp)->fl,
> + &rth->fl))
> + length += ONE;
same remark here
> continue;
> }
>
> @@ -797,6 +847,15 @@ static void rt_check_expire(void)
> rt_free(rth);
> }
> spin_unlock_bh(rt_hash_lock_addr(i));
> + sum += length;
> + sum2 += length*length;
> + }
> + if (samples) {
> + unsigned long avg = sum / samples;
> + unsigned long sd = int_sqrt(sum2 / samples - avg*avg);
> + rt_chain_length_max = max_t(unsigned long,
> + ip_rt_gc_elasticity,
> + (avg + 4*sd) >> FRACT_BITS);
> }
> rover = i;
> }
> @@ -846,6 +905,26 @@ static void rt_secret_rebuild(unsigned long __net)
> mod_timer(&net->ipv4.rt_secret_timer, jiffies + ip_rt_secret_interval);
> }
>
> +static void rt_secret_rebuild_oneshot(struct net *net)
> +{
> + del_timer_sync(&net->ipv4.rt_secret_timer);
> + rt_cache_invalidate(net);
> + if (ip_rt_secret_interval) {
> + net->ipv4.rt_secret_timer.expires += ip_rt_secret_interval;
> + add_timer(&net->ipv4.rt_secret_timer);
> + }
> +}
> +
> +static void rt_emergency_hash_rebuild(struct net *net)
> +{
> + if (net_ratelimit()) {
> + printk(KERN_WARNING "Route hash chain too long!\n");
> + printk(KERN_WARNING "Adjust your secret_interval!\n");
> + }
> +
> + rt_secret_rebuild_oneshot(net);
> +}
> +
> /*
> Short description of GC goals.
>
> @@ -984,6 +1063,7 @@ out: return 0;
> static int rt_intern_hash(unsigned hash, struct rtable *rt, struct rtable **rp)
> {
> struct rtable *rth, **rthp;
> + struct rtable *rthi;
> unsigned long now;
> struct rtable *cand, **candp;
> u32 min_score;
> @@ -997,7 +1077,13 @@ restart:
> candp = NULL;
> now = jiffies;
>
> + if (!rt_caching(dev_net(rt->u.dst.dev))) {
> + rt_drop(rt);
It would be nice to add a new counter that one can check in /proc/net/stat/rt_cache:
RT_CACHE_STAT_INC(notcached)
> + return 0;
> + }
> +
> rthp = &rt_hash_table[hash].chain;
> + rthi = NULL;
>
> spin_lock_bh(rt_hash_lock_addr(hash));
> while ((rth = *rthp) != NULL) {
> @@ -1043,6 +1129,17 @@ restart:
> chain_length++;
>
> rthp = &rth->u.dst.rt_next;
> +
> + /*
> + * check to see if the next entry in the chain
> + * contains the same hash input values as rt. If it does
> + * This is where we will insert into the list, instead of
> + * at the head. This groups entries that differ by aspects not
> + * relvant to the hash function together, which we use to adjust
> + * our chain length
> + */
> + if (*rthp && compare_hash_inputs(&(*rthp)->fl, &rt->fl))
> + rthi = rth;
> }
>
> if (cand) {
> @@ -1056,6 +1153,16 @@ restart:
> *candp = cand->u.dst.rt_next;
> rt_free(cand);
> }
> + } else {
> + if (chain_length > rt_chain_length_max) {
> + struct net *net = dev_net(rt->u.dst.dev);
> + int num = ++net->ipv4.current_rt_cache_rebuild_count;
> + if (!rt_caching(dev_net(rt->u.dst.dev))) {
> + printk(KERN_WARNING "%s: %d rebuilds is over limit, route caching disabled\n",
> + rt->u.dst.dev->name, num);
> + }
> + rt_emergency_hash_rebuild(dev_net(rt->u.dst.dev));
> + }
> }
>
> /* Try to bind route to arp only if it is output
> @@ -1093,7 +1200,11 @@ restart:
> }
> }
>
> - rt->u.dst.rt_next = rt_hash_table[hash].chain;
> + if (rthi)
> + rt->u.dst.rt_next = rthi->u.dst.rt_next;
> + else
> + rt->u.dst.rt_next = rt_hash_table[hash].chain;
> +
> #if RT_CACHE_DEBUG >= 2
> if (rt->u.dst.rt_next) {
> struct rtable *trt;
> @@ -1104,7 +1215,10 @@ restart:
> printk("\n");
> }
> #endif
> - rt_hash_table[hash].chain = rt;
Please respin your patch against the latest tree; this misses my last patch
about the memory barrier...
Ah, maybe David has not committed it yet...
> + if (rthi)
> + rthi->u.dst.rt_next = rt;
> + else
> + rt_hash_table[hash].chain = rt;
> spin_unlock_bh(rt_hash_lock_addr(hash));
> *rp = rt;
> return 0;
> @@ -1207,6 +1321,9 @@ void ip_rt_redirect(__be32 old_gw, __be32 daddr, __be32 new_gw,
> || ipv4_is_zeronet(new_gw))
> goto reject_redirect;
>
> + if (!rt_caching(net))
> + goto reject_redirect;
> +
> if (!IN_DEV_SHARED_MEDIA(in_dev)) {
> if (!inet_addr_onlink(in_dev, new_gw, old_gw))
> goto reject_redirect;
> @@ -2120,6 +2237,10 @@ int ip_route_input(struct sk_buff *skb, __be32 daddr, __be32 saddr,
> struct net *net;
>
> net = dev_net(dev);
> +
> + if (!rt_caching(net))
> + goto skip_cache;
> +
> tos &= IPTOS_RT_MASK;
> hash = rt_hash(daddr, saddr, iif, rt_genid(net));
>
> @@ -2144,6 +2265,7 @@ int ip_route_input(struct sk_buff *skb, __be32 daddr, __be32 saddr,
> }
> rcu_read_unlock();
>
> +skip_cache:
> /* Multicast recognition logic is moved from route cache to here.
> The problem was that too many Ethernet cards have broken/missing
> hardware multicast filters :-( As result the host on multicasting
> @@ -2523,6 +2645,9 @@ int __ip_route_output_key(struct net *net, struct rtable **rp,
> unsigned hash;
> struct rtable *rth;
>
> + if (!rt_caching(net))
> + goto slow_output;
> +
> hash = rt_hash(flp->fl4_dst, flp->fl4_src, flp->oif, rt_genid(net));
>
> rcu_read_lock_bh();
> @@ -2547,6 +2672,7 @@ int __ip_route_output_key(struct net *net, struct rtable **rp,
> }
> rcu_read_unlock_bh();
>
> +slow_output:
> return ip_route_output_slow(net, rp, flp);
> }
>
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index e0689fd..6d9ab73 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -798,6 +798,14 @@ static struct ctl_table ipv4_net_table[] = {
> .mode = 0644,
> .proc_handler = &proc_dointvec
> },
> + {
> + .ctl_name = NET_IPV4_RT_CACHE_REBUILD_COUNT,
> + .procname = "rt_cache_rebuild_count",
> + .data = &init_net.ipv4.sysctl_rt_cache_rebuild_count,
> + .maxlen = sizeof(int),
> + .mode = 0644,
> + .proc_handler = &proc_dointvec
> + },
You should describe this in Documentation/networking/ip-sysctl.txt
> { }
> };
>
> @@ -830,8 +838,12 @@ static __net_init int ipv4_sysctl_init_net(struct net *net)
> &net->ipv4.sysctl_icmp_ratelimit;
> table[5].data =
> &net->ipv4.sysctl_icmp_ratemask;
> + table[6].data =
> + &net->ipv4.sysctl_rt_cache_rebuild_count;
> }
>
> + net->ipv4.sysctl_rt_cache_rebuild_count = 4;
> +
> net->ipv4.ipv4_hdr = register_net_sysctl_table(net,
> net_ipv4_ctl_path, table);
> if (net->ipv4.ipv4_hdr == NULL)
>
>
Thank you
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-10-16 23:35 ` Neil Horman
2008-10-17 4:53 ` Eric Dumazet
@ 2008-10-17 5:03 ` Stephen Hemminger
2008-10-17 5:06 ` Stephen Hemminger
2 siblings, 0 replies; 64+ messages in thread
From: Stephen Hemminger @ 2008-10-17 5:03 UTC (permalink / raw)
To: Neil Horman
Cc: Eric Dumazet, David Miller, billfink, netdev, kuznet, pekkas,
jmorris, yoshfuji, kaber, johnpol
On Thu, 16 Oct 2008 19:35:17 -0400
Neil Horman <nhorman@tuxdriver.com> wrote:
> On Thu, Oct 16, 2008 at 12:36:44PM -0400, Neil Horman wrote:
> > On Thu, Oct 16, 2008 at 02:25:47PM +0200, Eric Dumazet wrote:
> > > Neil Horman wrote:
> > Yeah, that was quite stupid of me. I rescind this, and I'll post a patch with the
> > missing chunk later tonight after I spin/test it.
> >
>
>
> Ok, here's a new patch, same as before, but with proper initialization added
> for the stack variables, as well as the missing chunk that actually computes
> the standard deviation and maximum chain length. Built/tested by me
> successfully. Sorry for the prior noise. Please review/ack.
>
> Regards
> Neil
>
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
>
>
> include/linux/sysctl.h | 1
> include/net/netns/ipv4.h | 2
> kernel/sysctl_check.c | 1
> net/ipv4/route.c | 130 ++++++++++++++++++++++++++++++++++++++++++++-
> net/ipv4/sysctl_net_ipv4.c | 12 ++++
> 5 files changed, 144 insertions(+), 2 deletions(-)
>
>
> diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
> index d0437f3..481aa44 100644
> --- a/include/linux/sysctl.h
> +++ b/include/linux/sysctl.h
> @@ -435,6 +435,7 @@ enum
> NET_TCP_ALLOWED_CONG_CONTROL=123,
> NET_TCP_MAX_SSTHRESH=124,
> NET_TCP_FRTO_RESPONSE=125,
> + NET_IPV4_RT_CACHE_REBUILD_COUNT=126,
> };
>
Don't add new sysctl numbers; use CTL_UNNUMBERED.
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-10-16 23:35 ` Neil Horman
2008-10-17 4:53 ` Eric Dumazet
2008-10-17 5:03 ` Stephen Hemminger
@ 2008-10-17 5:06 ` Stephen Hemminger
2008-10-17 10:39 ` Neil Horman
2 siblings, 1 reply; 64+ messages in thread
From: Stephen Hemminger @ 2008-10-17 5:06 UTC (permalink / raw)
To: Neil Horman
Cc: Eric Dumazet, David Miller, billfink, netdev, kuznet, pekkas,
jmorris, yoshfuji, kaber, johnpol
> +static inline int rt_caching(struct net *net)
> +{
> + return net->ipv4.current_rt_cache_rebuild_count <=
> + net->ipv4.sysctl_rt_cache_rebuild_count;
> +}
Why not?
static inline bool rt_caching(const struct net *)
> +static inline int compare_hash_inputs(struct flowi *fl1, struct flowi *fl2)
> +{
> + return (__force u32)(((fl1->nl_u.ip4_u.daddr ^ fl2->nl_u.ip4_u.daddr) |
> + (fl1->nl_u.ip4_u.saddr ^ fl2->nl_u.ip4_u.saddr) |
> + (fl1->iif ^ fl2->iif)) == 0);
> +}
static inline bool compare_hash_inputs(const struct flowi *fl1, const struct flowi *fl2)
...
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-10-17 4:53 ` Eric Dumazet
@ 2008-10-17 5:23 ` David Miller
0 siblings, 0 replies; 64+ messages in thread
From: David Miller @ 2008-10-17 5:23 UTC (permalink / raw)
To: dada1
Cc: nhorman, billfink, netdev, kuznet, pekkas, jmorris, yoshfuji,
kaber, johnpol
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Fri, 17 Oct 2008 06:53:54 +0200
> Please respin your patch to last tree, this misses my last patch about
> memory barrier...
>
> Ah, maybe David did not committed it yet...
It's there in net-2.6
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-10-17 5:06 ` Stephen Hemminger
@ 2008-10-17 10:39 ` Neil Horman
[not found] ` <48F8806A.6090306@cosmosbay.com>
0 siblings, 1 reply; 64+ messages in thread
From: Neil Horman @ 2008-10-17 10:39 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Eric Dumazet, David Miller, billfink, netdev, kuznet, pekkas,
jmorris, yoshfuji, kaber, johnpol
On Thu, Oct 16, 2008 at 10:06:24PM -0700, Stephen Hemminger wrote:
> > +static inline int rt_caching(struct net *net)
> > +{
> > + return net->ipv4.current_rt_cache_rebuild_count <=
> > + net->ipv4.sysctl_rt_cache_rebuild_count;
> > +}
>
> Why not?
>
I assume you're asking why we use these values to determine whether this
namespace should use the route cache? The answer is this thread. Ideally we
should never rebuild the route cache unless something is giving the cache an
uneven distribution. The consensus is that if we rebuild too many times, it's
an indicator that someone knows enough about our hashing algorithm to
maliciously force the creation of routes in the cache on the same chain,
regardless of how many times we rebuild it. In that case the best thing to do
is skip the use of the cache entirely.
Neil
--
/****************************************************
* Neil Horman <nhorman@tuxdriver.com>
* Software Engineer, Red Hat
****************************************************/
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
[not found] ` <48F8AFBE.5080503@cosmosbay.com>
@ 2008-10-17 20:44 ` Neil Horman
2008-10-18 0:54 ` Neil Horman
0 siblings, 1 reply; 64+ messages in thread
From: Neil Horman @ 2008-10-17 20:44 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, billfink, netdev, kuznet, pekkas, jmorris, yoshfuji,
kaber, johnpol, Stephen Hemminger, nhorman
Ok, here we go again. A new patch is attached, same as before, but it replaces
the int return from rt_caching with a bool type, and migrates the added sysctl
to the CTL_UNNUMBERED interface.
This is a patch to provide on-demand route cache rebuilding. Currently, our
route cache is rebuilt periodically regardless of need. This introduces
unneeded periodic latency. This patch offers a better approach. Using code
provided by Eric Dumazet, we compute the standard deviation of the average
hash bucket chain length while running rt_check_expire. Should any given chain
grow longer than the average plus 4 standard deviations, we trigger an
emergency hash table rebuild for that net namespace. This allows the common
case, in which chains are well behaved and do not grow unevenly, to incur no
latency at all, while systems that may be under malicious attack rebuild only
when the attack is detected. This patch takes two other factors into
account:
1) chains with multiple entries that differ by attributes not affecting the
hash value are only counted once, so as not to unduly bias the system toward
rebuilding if features like QOS are heavily used
2) if rebuilding crosses a certain threshold (adjustable via the sysctl added
in this patch), route caching is disabled entirely for that net namespace,
since constant rebuilding is less efficient than no caching at all
Tested successfully by me.
Regards
Neil
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
include/net/netns/ipv4.h | 2
net/ipv4/route.c | 130 ++++++++++++++++++++++++++++++++++++++++++++-
net/ipv4/sysctl_net_ipv4.c | 12 ++++
3 files changed, 142 insertions(+), 2 deletions(-)
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index a6ed838..4fef762 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -46,6 +46,8 @@ struct netns_ipv4 {
int sysctl_icmp_ratelimit;
int sysctl_icmp_ratemask;
int sysctl_icmp_errors_use_inbound_ifaddr;
+ int sysctl_rt_cache_rebuild_count;
+ int current_rt_cache_rebuild_count;
struct timer_list rt_secret_timer;
atomic_t rt_genid;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 6ee5354..11e2ad8 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -129,6 +129,7 @@ static int ip_rt_mtu_expires __read_mostly = 10 * 60 * HZ;
static int ip_rt_min_pmtu __read_mostly = 512 + 20 + 20;
static int ip_rt_min_advmss __read_mostly = 256;
static int ip_rt_secret_interval __read_mostly = 10 * 60 * HZ;
+static int rt_chain_length_max __read_mostly = 8;
static void rt_worker_func(struct work_struct *work);
static DECLARE_DELAYED_WORK(expires_work, rt_worker_func);
@@ -145,6 +146,7 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst);
static void ipv4_link_failure(struct sk_buff *skb);
static void ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu);
static int rt_garbage_collect(struct dst_ops *ops);
+static void rt_emergency_hash_rebuild(struct net *net);
static struct dst_ops ipv4_dst_ops = {
@@ -201,6 +203,7 @@ const __u8 ip_tos2prio[16] = {
struct rt_hash_bucket {
struct rtable *chain;
};
+
#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || \
defined(CONFIG_PROVE_LOCKING)
/*
@@ -669,6 +672,19 @@ static inline u32 rt_score(struct rtable *rt)
return score;
}
+static inline bool rt_caching(struct net *net)
+{
+ return net->ipv4.current_rt_cache_rebuild_count <=
+ net->ipv4.sysctl_rt_cache_rebuild_count;
+}
+
+static inline int compare_hash_inputs(struct flowi *fl1, struct flowi *fl2)
+{
+ return (__force u32)(((fl1->nl_u.ip4_u.daddr ^ fl2->nl_u.ip4_u.daddr) |
+ (fl1->nl_u.ip4_u.saddr ^ fl2->nl_u.ip4_u.saddr) |
+ (fl1->iif ^ fl2->iif)) == 0);
+}
+
static inline int compare_keys(struct flowi *fl1, struct flowi *fl2)
{
return ((__force u32)((fl1->nl_u.ip4_u.daddr ^ fl2->nl_u.ip4_u.daddr) |
@@ -748,11 +764,24 @@ static void rt_do_flush(int process_context)
}
}
+/*
+ * While freeing expired entries, we compute average chain length
+ * and standard deviation, using fixed-point arithmetic.
+ * This to have an estimation of rt_chain_length_max
+ * rt_chain_length_max = max(elasticity, AVG + 4*SD)
+ * We use 3 bits for frational part, and 29 (or 61) for magnitude.
+ */
+
+#define FRACT_BITS 3
+#define ONE (1UL << FRACT_BITS)
+
static void rt_check_expire(void)
{
static unsigned int rover;
unsigned int i = rover, goal;
struct rtable *rth, **rthp;
+ unsigned long length = 0, samples = 0;
+ unsigned long sum = 0, sum2 = 0;
u64 mult;
mult = ((u64)ip_rt_gc_interval) << rt_hash_log;
@@ -761,6 +790,7 @@ static void rt_check_expire(void)
goal = (unsigned int)mult;
if (goal > rt_hash_mask)
goal = rt_hash_mask + 1;
+ length = 0;
for (; goal > 0; goal--) {
unsigned long tmo = ip_rt_gc_timeout;
@@ -770,6 +800,8 @@ static void rt_check_expire(void)
if (need_resched())
cond_resched();
+ samples++;
+
if (*rthp == NULL)
continue;
spin_lock_bh(rt_hash_lock_addr(i));
@@ -784,11 +816,29 @@ static void rt_check_expire(void)
if (time_before_eq(jiffies, rth->u.dst.expires)) {
tmo >>= 1;
rthp = &rth->u.dst.rt_next;
+ /*
+ * Only bump our length if the hash
+ * inputs on entries n and n+1 are not
+ * the same, we only count entries on
+ * a chain with equal hash inputs once
+ * so that entries for different QOS
+ * levels, and other non-hash input
+ * attributes don't unfairly skew
+ * the length computation
+ */
+ if (*rthp &&
+ !compare_hash_inputs(&(*rthp)->fl,
+ &rth->fl))
+ length += ONE;
continue;
}
} else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout)) {
tmo >>= 1;
rthp = &rth->u.dst.rt_next;
+ if (*rthp &&
+ !compare_hash_inputs(&(*rthp)->fl,
+ &rth->fl))
+ length += ONE;
continue;
}
@@ -797,6 +847,15 @@ static void rt_check_expire(void)
rt_free(rth);
}
spin_unlock_bh(rt_hash_lock_addr(i));
+ sum += length;
+ sum2 += length*length;
+ }
+ if (samples) {
+ unsigned long avg = sum / samples;
+ unsigned long sd = int_sqrt(sum2 / samples - avg*avg);
+ rt_chain_length_max = max_t(unsigned long,
+ ip_rt_gc_elasticity,
+ (avg + 4*sd) >> FRACT_BITS);
}
rover = i;
}
@@ -846,6 +905,26 @@ static void rt_secret_rebuild(unsigned long __net)
mod_timer(&net->ipv4.rt_secret_timer, jiffies + ip_rt_secret_interval);
}
+static void rt_secret_rebuild_oneshot(struct net *net)
+{
+ del_timer_sync(&net->ipv4.rt_secret_timer);
+ rt_cache_invalidate(net);
+ if (ip_rt_secret_interval) {
+ net->ipv4.rt_secret_timer.expires += ip_rt_secret_interval;
+ add_timer(&net->ipv4.rt_secret_timer);
+ }
+}
+
+static void rt_emergency_hash_rebuild(struct net *net)
+{
+ if (net_ratelimit()) {
+ printk(KERN_WARNING "Route hash chain too long!\n");
+ printk(KERN_WARNING "Adjust your secret_interval!\n");
+ }
+
+ rt_secret_rebuild_oneshot(net);
+}
+
/*
Short description of GC goals.
@@ -984,6 +1063,7 @@ out: return 0;
static int rt_intern_hash(unsigned hash, struct rtable *rt, struct rtable **rp)
{
struct rtable *rth, **rthp;
+ struct rtable *rthi;
unsigned long now;
struct rtable *cand, **candp;
u32 min_score;
@@ -997,7 +1077,13 @@ restart:
candp = NULL;
now = jiffies;
+ if (!rt_caching(dev_net(rt->u.dst.dev))) {
+ rt_drop(rt);
+ return 0;
+ }
+
rthp = &rt_hash_table[hash].chain;
+ rthi = NULL;
spin_lock_bh(rt_hash_lock_addr(hash));
while ((rth = *rthp) != NULL) {
@@ -1043,6 +1129,17 @@ restart:
chain_length++;
rthp = &rth->u.dst.rt_next;
+
+ /*
+ * check to see if the next entry in the chain
+ * contains the same hash input values as rt. If it does
+ * This is where we will insert into the list, instead of
+ * at the head. This groups entries that differ by aspects not
+ * relvant to the hash function together, which we use to adjust
+ * our chain length
+ */
+ if (*rthp && compare_hash_inputs(&(*rthp)->fl, &rt->fl))
+ rthi = rth;
}
if (cand) {
@@ -1056,6 +1153,16 @@ restart:
*candp = cand->u.dst.rt_next;
rt_free(cand);
}
+ } else {
+ if (chain_length > rt_chain_length_max) {
+ struct net *net = dev_net(rt->u.dst.dev);
+ int num = ++net->ipv4.current_rt_cache_rebuild_count;
+ if (!rt_caching(dev_net(rt->u.dst.dev))) {
+ printk(KERN_WARNING "%s: %d rebuilds is over limit, route caching disabled\n",
+ rt->u.dst.dev->name, num);
+ }
+ rt_emergency_hash_rebuild(dev_net(rt->u.dst.dev));
+ }
}
/* Try to bind route to arp only if it is output
@@ -1093,7 +1200,11 @@ restart:
}
}
- rt->u.dst.rt_next = rt_hash_table[hash].chain;
+ if (rthi)
+ rt->u.dst.rt_next = rthi->u.dst.rt_next;
+ else
+ rt->u.dst.rt_next = rt_hash_table[hash].chain;
+
#if RT_CACHE_DEBUG >= 2
if (rt->u.dst.rt_next) {
struct rtable *trt;
@@ -1104,7 +1215,10 @@ restart:
printk("\n");
}
#endif
- rt_hash_table[hash].chain = rt;
+ if (rthi)
+ rthi->u.dst.rt_next = rt;
+ else
+ rt_hash_table[hash].chain = rt;
spin_unlock_bh(rt_hash_lock_addr(hash));
*rp = rt;
return 0;
@@ -1207,6 +1321,9 @@ void ip_rt_redirect(__be32 old_gw, __be32 daddr, __be32 new_gw,
|| ipv4_is_zeronet(new_gw))
goto reject_redirect;
+ if (!rt_caching(net))
+ goto reject_redirect;
+
if (!IN_DEV_SHARED_MEDIA(in_dev)) {
if (!inet_addr_onlink(in_dev, new_gw, old_gw))
goto reject_redirect;
@@ -2120,6 +2237,10 @@ int ip_route_input(struct sk_buff *skb, __be32 daddr, __be32 saddr,
struct net *net;
net = dev_net(dev);
+
+ if (!rt_caching(net))
+ goto skip_cache;
+
tos &= IPTOS_RT_MASK;
hash = rt_hash(daddr, saddr, iif, rt_genid(net));
@@ -2144,6 +2265,7 @@ int ip_route_input(struct sk_buff *skb, __be32 daddr, __be32 saddr,
}
rcu_read_unlock();
+skip_cache:
/* Multicast recognition logic is moved from route cache to here.
The problem was that too many Ethernet cards have broken/missing
hardware multicast filters :-( As result the host on multicasting
@@ -2523,6 +2645,9 @@ int __ip_route_output_key(struct net *net, struct rtable **rp,
unsigned hash;
struct rtable *rth;
+ if (!rt_caching(net))
+ goto slow_output;
+
hash = rt_hash(flp->fl4_dst, flp->fl4_src, flp->oif, rt_genid(net));
rcu_read_lock_bh();
@@ -2547,6 +2672,7 @@ int __ip_route_output_key(struct net *net, struct rtable **rp,
}
rcu_read_unlock_bh();
+slow_output:
return ip_route_output_slow(net, rp, flp);
}
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index e0689fd..1e33009 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -798,6 +798,14 @@ static struct ctl_table ipv4_net_table[] = {
.mode = 0644,
.proc_handler = &proc_dointvec
},
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "rt_cache_rebuild_count",
+ .data = &init_net.ipv4.sysctl_rt_cache_rebuild_count,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec
+ },
{ }
};
@@ -830,8 +838,12 @@ static __net_init int ipv4_sysctl_init_net(struct net *net)
&net->ipv4.sysctl_icmp_ratelimit;
table[5].data =
&net->ipv4.sysctl_icmp_ratemask;
+ table[6].data =
+ &net->ipv4.sysctl_rt_cache_rebuild_count;
}
+ net->ipv4.sysctl_rt_cache_rebuild_count = 4;
+
net->ipv4.ipv4_hdr = register_net_sysctl_table(net,
net_ipv4_ctl_path, table);
if (net->ipv4.ipv4_hdr == NULL)
--
/****************************************************
* Neil Horman <nhorman@tuxdriver.com>
* Software Engineer, Red Hat
****************************************************/
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-10-17 20:44 ` Neil Horman
@ 2008-10-18 0:54 ` Neil Horman
2008-10-18 4:36 ` Eric Dumazet
0 siblings, 1 reply; 64+ messages in thread
From: Neil Horman @ 2008-10-18 0:54 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, billfink, netdev, kuznet, pekkas, jmorris, yoshfuji,
kaber, johnpol, Stephen Hemminger
Sorry for the additional noise, but Eric just pointed out that I'd missed an
email from him, and consequently a few comments. I've made the appropriate
updates in this patch, which is otherwise unchanged. The only comment I
skipped was the request for an additional stat in /proc/net/stat/rt_cache for
a per-net rebuild count. I figure that's a good point at which to break and
submit a follow-on patch.
This is a patch to provide on-demand route cache rebuilding. Currently, our
route cache is rebuilt periodically regardless of need. This introduces
unneeded periodic latency. This patch offers a better approach. Using code
provided by Eric Dumazet, we compute the standard deviation of the average
hash bucket chain length while running rt_check_expire. Should any given chain
grow longer than the average plus 4 standard deviations, we trigger an
emergency hash table rebuild for that net namespace. This allows the common
case, in which chains are well behaved and do not grow unevenly, to incur no
latency at all, while systems that may be under malicious attack rebuild only
when the attack is detected. This patch takes two other factors into
account:
1) chains with multiple entries that differ by attributes not affecting the
hash value are only counted once, so as not to unduly bias the system toward
rebuilding if features like QOS are heavily used
2) if rebuilding crosses a certain threshold (adjustable via the sysctl added
in this patch), route caching is disabled entirely for that net namespace,
since constant rebuilding is less efficient than no caching at all
Tested successfully by me.
Regards
Neil
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Documentation/networking/ip-sysctl.txt | 4 +
include/net/netns/ipv4.h | 2
net/ipv4/route.c | 130 ++++++++++++++++++++++++++++++++-
net/ipv4/sysctl_net_ipv4.c | 12 +++
4 files changed, 146 insertions(+), 2 deletions(-)
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index d849326..78a3b43 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -27,6 +27,10 @@ min_adv_mss - INTEGER
The advertised MSS depends on the first hop route MTU, but will
never be lower than this setting.
+rt_cache_rebuild_count - INTEGER
+ The per net-namespace count of the number of times the route cache
+ has had to be rebuilt for that namespace
+
IP Fragmentation:
ipfrag_high_thresh - INTEGER
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index a6ed838..4fef762 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -46,6 +46,8 @@ struct netns_ipv4 {
int sysctl_icmp_ratelimit;
int sysctl_icmp_ratemask;
int sysctl_icmp_errors_use_inbound_ifaddr;
+ int sysctl_rt_cache_rebuild_count;
+ int current_rt_cache_rebuild_count;
struct timer_list rt_secret_timer;
atomic_t rt_genid;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 6ee5354..97dd9ec 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -129,6 +129,7 @@ static int ip_rt_mtu_expires __read_mostly = 10 * 60 * HZ;
static int ip_rt_min_pmtu __read_mostly = 512 + 20 + 20;
static int ip_rt_min_advmss __read_mostly = 256;
static int ip_rt_secret_interval __read_mostly = 10 * 60 * HZ;
+static int rt_chain_length_max __read_mostly = 20;
static void rt_worker_func(struct work_struct *work);
static DECLARE_DELAYED_WORK(expires_work, rt_worker_func);
@@ -145,6 +146,7 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst);
static void ipv4_link_failure(struct sk_buff *skb);
static void ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu);
static int rt_garbage_collect(struct dst_ops *ops);
+static void rt_emergency_hash_rebuild(struct net *net);
static struct dst_ops ipv4_dst_ops = {
@@ -201,6 +203,7 @@ const __u8 ip_tos2prio[16] = {
struct rt_hash_bucket {
struct rtable *chain;
};
+
#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || \
defined(CONFIG_PROVE_LOCKING)
/*
@@ -669,6 +672,19 @@ static inline u32 rt_score(struct rtable *rt)
return score;
}
+static inline bool rt_caching(struct net *net)
+{
+ return net->ipv4.current_rt_cache_rebuild_count <=
+ net->ipv4.sysctl_rt_cache_rebuild_count;
+}
+
+static inline int compare_hash_inputs(struct flowi *fl1, struct flowi *fl2)
+{
+ return (__force u32)(((fl1->nl_u.ip4_u.daddr ^ fl2->nl_u.ip4_u.daddr) |
+ (fl1->nl_u.ip4_u.saddr ^ fl2->nl_u.ip4_u.saddr) |
+ (fl1->iif ^ fl2->iif)) == 0);
+}
+
static inline int compare_keys(struct flowi *fl1, struct flowi *fl2)
{
return ((__force u32)((fl1->nl_u.ip4_u.daddr ^ fl2->nl_u.ip4_u.daddr) |
@@ -748,11 +764,24 @@ static void rt_do_flush(int process_context)
}
}
+/*
+ * While freeing expired entries, we compute average chain length
+ * and standard deviation, using fixed-point arithmetic.
+ * This is to have an estimate of rt_chain_length_max:
+ * rt_chain_length_max = max(elasticity, AVG + 4*SD)
+ * We use 3 bits for the fractional part, and 29 (or 61) for magnitude.
+ */
+
+#define FRACT_BITS 3
+#define ONE (1UL << FRACT_BITS)
+
static void rt_check_expire(void)
{
static unsigned int rover;
unsigned int i = rover, goal;
struct rtable *rth, **rthp;
+ unsigned long length = 0, samples = 0;
+ unsigned long sum = 0, sum2 = 0;
u64 mult;
mult = ((u64)ip_rt_gc_interval) << rt_hash_log;
@@ -761,6 +790,7 @@ static void rt_check_expire(void)
goal = (unsigned int)mult;
if (goal > rt_hash_mask)
goal = rt_hash_mask + 1;
+ length = 0;
for (; goal > 0; goal--) {
unsigned long tmo = ip_rt_gc_timeout;
@@ -770,6 +800,8 @@ static void rt_check_expire(void)
if (need_resched())
cond_resched();
+ samples++;
+
if (*rthp == NULL)
continue;
spin_lock_bh(rt_hash_lock_addr(i));
@@ -784,11 +816,29 @@ static void rt_check_expire(void)
if (time_before_eq(jiffies, rth->u.dst.expires)) {
tmo >>= 1;
rthp = &rth->u.dst.rt_next;
+ /*
+ * Only bump our length if the hash
+ * inputs on entries n and n+1 are not
+ * the same; we only count entries on
+ * a chain with equal hash inputs once,
+ * so that entries for different QOS
+ * levels and other non-hash-input
+ * attributes don't unfairly skew
+ * the length computation
+ */
+ if ((*rthp == NULL) ||
+ !compare_hash_inputs(&(*rthp)->fl,
+ &rth->fl))
+ length += ONE;
continue;
}
} else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout)) {
tmo >>= 1;
rthp = &rth->u.dst.rt_next;
+ if ((*rthp == NULL) ||
+ !compare_hash_inputs(&(*rthp)->fl,
+ &rth->fl))
+ length += ONE;
continue;
}
@@ -797,6 +847,15 @@ static void rt_check_expire(void)
rt_free(rth);
}
spin_unlock_bh(rt_hash_lock_addr(i));
+ sum += length;
+ sum2 += length*length;
+ }
+ if (samples) {
+ unsigned long avg = sum / samples;
+ unsigned long sd = int_sqrt(sum2 / samples - avg*avg);
+ rt_chain_length_max = max_t(unsigned long,
+ ip_rt_gc_elasticity,
+ (avg + 4*sd) >> FRACT_BITS);
}
rover = i;
}
@@ -846,6 +905,26 @@ static void rt_secret_rebuild(unsigned long __net)
mod_timer(&net->ipv4.rt_secret_timer, jiffies + ip_rt_secret_interval);
}
+static void rt_secret_rebuild_oneshot(struct net *net)
+{
+ del_timer_sync(&net->ipv4.rt_secret_timer);
+ rt_cache_invalidate(net);
+ if (ip_rt_secret_interval) {
+ net->ipv4.rt_secret_timer.expires += ip_rt_secret_interval;
+ add_timer(&net->ipv4.rt_secret_timer);
+ }
+}
+
+static void rt_emergency_hash_rebuild(struct net *net)
+{
+ if (net_ratelimit()) {
+ printk(KERN_WARNING "Route hash chain too long!\n");
+ printk(KERN_WARNING "Adjust your secret_interval!\n");
+ }
+
+ rt_secret_rebuild_oneshot(net);
+}
+
/*
Short description of GC goals.
@@ -984,6 +1063,7 @@ out: return 0;
static int rt_intern_hash(unsigned hash, struct rtable *rt, struct rtable **rp)
{
struct rtable *rth, **rthp;
+ struct rtable *rthi;
unsigned long now;
struct rtable *cand, **candp;
u32 min_score;
@@ -997,7 +1077,13 @@ restart:
candp = NULL;
now = jiffies;
+ if (!rt_caching(dev_net(rt->u.dst.dev))) {
+ rt_drop(rt);
+ return 0;
+ }
+
rthp = &rt_hash_table[hash].chain;
+ rthi = NULL;
spin_lock_bh(rt_hash_lock_addr(hash));
while ((rth = *rthp) != NULL) {
@@ -1043,6 +1129,17 @@ restart:
chain_length++;
rthp = &rth->u.dst.rt_next;
+
+ /*
+ * Check to see if the next entry in the chain
+ * contains the same hash input values as rt. If it
+ * does, this is where we will insert into the list,
+ * instead of at the head. This groups entries that
+ * differ by aspects not relevant to the hash function
+ * together, which we use to adjust our chain length
+ */
+ if (*rthp && compare_hash_inputs(&(*rthp)->fl, &rt->fl))
+ rthi = rth;
}
if (cand) {
@@ -1056,6 +1153,16 @@ restart:
*candp = cand->u.dst.rt_next;
rt_free(cand);
}
+ } else {
+ if (chain_length > rt_chain_length_max) {
+ struct net *net = dev_net(rt->u.dst.dev);
+ int num = ++net->ipv4.current_rt_cache_rebuild_count;
+ if (!rt_caching(dev_net(rt->u.dst.dev))) {
+ printk(KERN_WARNING "%s: %d rebuilds is over limit, route caching disabled\n",
+ rt->u.dst.dev->name, num);
+ }
+ rt_emergency_hash_rebuild(dev_net(rt->u.dst.dev));
+ }
}
/* Try to bind route to arp only if it is output
@@ -1093,7 +1200,11 @@ restart:
}
}
- rt->u.dst.rt_next = rt_hash_table[hash].chain;
+ if (rthi)
+ rt->u.dst.rt_next = rthi->u.dst.rt_next;
+ else
+ rt->u.dst.rt_next = rt_hash_table[hash].chain;
+
#if RT_CACHE_DEBUG >= 2
if (rt->u.dst.rt_next) {
struct rtable *trt;
@@ -1104,7 +1215,10 @@ restart:
printk("\n");
}
#endif
- rt_hash_table[hash].chain = rt;
+ if (rthi)
+ rthi->u.dst.rt_next = rt;
+ else
+ rt_hash_table[hash].chain = rt;
spin_unlock_bh(rt_hash_lock_addr(hash));
*rp = rt;
return 0;
@@ -1207,6 +1321,9 @@ void ip_rt_redirect(__be32 old_gw, __be32 daddr, __be32 new_gw,
|| ipv4_is_zeronet(new_gw))
goto reject_redirect;
+ if (!rt_caching(net))
+ goto reject_redirect;
+
if (!IN_DEV_SHARED_MEDIA(in_dev)) {
if (!inet_addr_onlink(in_dev, new_gw, old_gw))
goto reject_redirect;
@@ -2120,6 +2237,10 @@ int ip_route_input(struct sk_buff *skb, __be32 daddr, __be32 saddr,
struct net *net;
net = dev_net(dev);
+
+ if (!rt_caching(net))
+ goto skip_cache;
+
tos &= IPTOS_RT_MASK;
hash = rt_hash(daddr, saddr, iif, rt_genid(net));
@@ -2144,6 +2265,7 @@ int ip_route_input(struct sk_buff *skb, __be32 daddr, __be32 saddr,
}
rcu_read_unlock();
+skip_cache:
/* Multicast recognition logic is moved from route cache to here.
The problem was that too many Ethernet cards have broken/missing
hardware multicast filters :-( As result the host on multicasting
@@ -2523,6 +2645,9 @@ int __ip_route_output_key(struct net *net, struct rtable **rp,
unsigned hash;
struct rtable *rth;
+ if (!rt_caching(net))
+ goto slow_output;
+
hash = rt_hash(flp->fl4_dst, flp->fl4_src, flp->oif, rt_genid(net));
rcu_read_lock_bh();
@@ -2547,6 +2672,7 @@ int __ip_route_output_key(struct net *net, struct rtable **rp,
}
rcu_read_unlock_bh();
+slow_output:
return ip_route_output_slow(net, rp, flp);
}
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index e0689fd..1e33009 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -798,6 +798,14 @@ static struct ctl_table ipv4_net_table[] = {
.mode = 0644,
.proc_handler = &proc_dointvec
},
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "rt_cache_rebuild_count",
+ .data = &init_net.ipv4.sysctl_rt_cache_rebuild_count,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec
+ },
{ }
};
@@ -830,8 +838,12 @@ static __net_init int ipv4_sysctl_init_net(struct net *net)
&net->ipv4.sysctl_icmp_ratelimit;
table[5].data =
&net->ipv4.sysctl_icmp_ratemask;
+ table[6].data =
+ &net->ipv4.sysctl_rt_cache_rebuild_count;
}
+ net->ipv4.sysctl_rt_cache_rebuild_count = 4;
+
net->ipv4.ipv4_hdr = register_net_sysctl_table(net,
net_ipv4_ctl_path, table);
if (net->ipv4.ipv4_hdr == NULL)
* Re: [PATCH] net: implement emergency route cache rebuilds when gc_elasticity is exceeded
2008-10-18 0:54 ` Neil Horman
@ 2008-10-18 4:36 ` Eric Dumazet
2008-10-18 13:30 ` Neil Horman
2008-10-20 0:07 ` Neil Horman
0 siblings, 2 replies; 64+ messages in thread
From: Eric Dumazet @ 2008-10-18 4:36 UTC (permalink / raw)
To: Neil Horman
Cc: David Miller, billfink, netdev, kuznet, pekkas, jmorris, yoshfuji,
kaber, johnpol, Stephen Hemminger
Neil Horman wrote:
> Sorry for the additional noise, but Eric just pointed out that I'd missed an
> email from him and consequently a few comments. I've made the appropriate
> updates in this patch, which is otherwise unchanged. The only comment I'd
> skipped was the request for an additional stat in /proc/net/stat/rt_cache for a
> per-net rebuild count. I figure that's a good breaking point at which to submit
> an additional follow-on patch.
OK, yet an admin cannot know if their route cache is disabled or not...
Right, that can be addressed in a follow-up patch.
>
> This is a patch to provide on-demand route cache rebuilding. Currently, our
> route cache is rebuilt periodically regardless of need. This introduces
> unneeded periodic latency. This patch offers a better approach. Using code
> provided by Eric Dumazet, we compute the standard deviation of the average hash
> bucket chain length while running rt_check_expire. Should any given chain
> length grow larger than the average plus 4 standard deviations, we trigger an
> emergency hash table rebuild for that net namespace. This allows the common
> case, in which chains are well behaved and do not grow unevenly, to incur no
> latency at all, while systems that may be under malicious attack only rebuild
> when the attack is detected. This patch takes 2 other factors into
> account:
> 1) chains with multiple entries that differ by attributes that do not affect the
> hash value are only counted once, so as not to unduly bias the system toward
> rebuilding if features like QOS are heavily used
> 2) if rebuilding crosses a certain threshold (which is adjustable via the
> sysctl added in this patch), route caching is disabled entirely for that net
> namespace, since constant rebuilding is less efficient than no caching at all
>
> Tested successfully by me.
>
> Regards
> Neil
>
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
OK, it's almost done, Neil :)
Please rebase your patch against the latest net-2.6 tree, which includes my previous patch:
http://git2.kernel.org/?p=linux/kernel/git/davem/net-2.6.git;a=commitdiff;h=00269b54edbf25f3bb0dccb558ae23a6fc77ed86
Please read *all* of this mail to catch all final comments, in order to avoid upsetting
netdev readers with small details. You can also submit it privately to me, if you wish.
Please add my Signed-off-by after yours
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> +static inline bool rt_caching(struct net *net)
As Stephen pointed out, you *should* add a const qualifier here to "struct net *net",
-> static inline bool rt_caching(const struct net *net)
> +{
> + return net->ipv4.current_rt_cache_rebuild_count <=
> + net->ipv4.sysctl_rt_cache_rebuild_count;
> +}
> +
> +static inline int compare_hash_inputs(struct flowi *fl1, struct flowi *fl2)
const qualifiers too, please.
> +{
> + return (__force u32)(((fl1->nl_u.ip4_u.daddr ^ fl2->nl_u.ip4_u.daddr) |
> + (fl1->nl_u.ip4_u.saddr ^ fl2->nl_u.ip4_u.saddr) |
> + (fl1->iif ^ fl2->iif)) == 0);
> +}
> +
> static inline int compare_keys(struct flowi *fl1, struct flowi *fl2)
Previous code can be cleaned up in a follow-up patch.
New code is not forced to follow old code that is not "clean by modern standards" :)
> {
> return ((__force u32)((fl1->nl_u.ip4_u.daddr ^ fl2->nl_u.ip4_u.daddr) |
> @@ -748,11 +764,24 @@ static void rt_do_flush(int process_context)
> }
> }
>
... snip to ease your job ...
> #endif
> - rt_hash_table[hash].chain = rt;
> + if (rthi)
> + rthi->u.dst.rt_next = rt;
> + else
> + rt_hash_table[hash].chain = rt;
> spin_unlock_bh(rt_hash_lock_addr(hash));
Based on the latest tree, this should be something like:
> - rcu_assign_pointer(rt_hash_table[hash].chain, rt);
> + if (rthi)
> + rcu_assign_pointer(rthi->u.dst.rt_next, rt);
> + else
> + rcu_assign_pointer(rt_hash_table[hash].chain, rt);
> spin_unlock_bh(rt_hash_lock_addr(hash));
Thanks
* Re: [PATCH] net: implement emergency route cache rebuilds when gc_elasticity is exceeded
2008-10-18 4:36 ` Eric Dumazet
@ 2008-10-18 13:30 ` Neil Horman
2008-10-20 0:07 ` Neil Horman
1 sibling, 0 replies; 64+ messages in thread
From: Neil Horman @ 2008-10-18 13:30 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, billfink, netdev, kuznet, pekkas, jmorris, yoshfuji,
kaber, johnpol, Stephen Hemminger
On Sat, Oct 18, 2008 at 06:36:10AM +0200, Eric Dumazet wrote:
> Neil Horman wrote:
>> Sorry for the additional noise, but Eric just pointed out that I'd missed an
>> email from him and consequently a few comments. I've made the appropriate
>> updates in this patch, which is otherwise unchanged. The only comment I'd
>> skipped was the request for an additional stat in /proc/net/stat/rt_cache for a
>> per-net rebuild count. I figure that's a good breaking point at which to submit
>> an additional follow-on patch.
>
> OK, yet an admin cannot know if their route cache is disabled or not...
> Right, that can be addressed in a follow-up patch.
>
>>
>> This is a patch to provide on-demand route cache rebuilding. Currently, our
>> route cache is rebuilt periodically regardless of need. This introduces
>> unneeded periodic latency. This patch offers a better approach. Using code
>> provided by Eric Dumazet, we compute the standard deviation of the average hash
>> bucket chain length while running rt_check_expire. Should any given chain
>> length grow larger than the average plus 4 standard deviations, we trigger an
>> emergency hash table rebuild for that net namespace. This allows the common
>> case, in which chains are well behaved and do not grow unevenly, to incur no
>> latency at all, while systems that may be under malicious attack only rebuild
>> when the attack is detected. This patch takes 2 other factors into
>> account:
>> 1) chains with multiple entries that differ by attributes that do not affect the
>> hash value are only counted once, so as not to unduly bias the system toward
>> rebuilding if features like QOS are heavily used
>> 2) if rebuilding crosses a certain threshold (which is adjustable via the
>> sysctl added in this patch), route caching is disabled entirely for that net
>> namespace, since constant rebuilding is less efficient than no caching at all
>>
>> Tested successfully by me.
>>
>> Regards
>> Neil
>>
>> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
>
> OK, it's almost done, Neil :)
>
Copy that :). I'll take the weekend to go over this and repost it with _all_
your suggestions Monday afternoon-ish.
Thanks for all your input :)
Neil
--
/****************************************************
* Neil Horman <nhorman@tuxdriver.com>
* Software Engineer, Red Hat
****************************************************/
* Re: [PATCH] net: implement emergency route cache rebuilds when gc_elasticity is exceeded
2008-10-18 4:36 ` Eric Dumazet
2008-10-18 13:30 ` Neil Horman
@ 2008-10-20 0:07 ` Neil Horman
2008-10-20 8:12 ` Eric Dumazet
1 sibling, 1 reply; 64+ messages in thread
From: Neil Horman @ 2008-10-20 0:07 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Miller, billfink, netdev, kuznet, pekkas, jmorris, yoshfuji,
kaber, johnpol, Stephen Hemminger, nhorman
Ok, one more revision. It takes Eric's final suggestions into account. Eric, as
per your request, I've added your Signed-off-by line to it.
This is a patch to provide on-demand route cache rebuilding. Currently, our
route cache is rebuilt periodically regardless of need. This introduces
unneeded periodic latency. This patch offers a better approach. Using code
provided by Eric Dumazet, we compute the standard deviation of the average hash
bucket chain length while running rt_check_expire. Should any given chain
length grow larger than the average plus 4 standard deviations, we trigger an
emergency hash table rebuild for that net namespace. This allows the common
case, in which chains are well behaved and do not grow unevenly, to incur no
latency at all, while systems that may be under malicious attack only rebuild
when the attack is detected. This patch takes 2 other factors into
account:
1) chains with multiple entries that differ by attributes that do not affect the
hash value are only counted once, so as not to unduly bias the system toward
rebuilding if features like QOS are heavily used
2) if rebuilding crosses a certain threshold (which is adjustable via the
sysctl added in this patch), route caching is disabled entirely for that net
namespace, since constant rebuilding is less efficient than no caching at all
Tested successfully by me.
Regards
Neil
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Documentation/networking/ip-sysctl.txt | 6 +
include/net/netns/ipv4.h | 2
net/ipv4/route.c | 132 ++++++++++++++++++++++++++++++++-
net/ipv4/sysctl_net_ipv4.c | 12 +++
4 files changed, 150 insertions(+), 2 deletions(-)
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index d849326..c771278 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -27,6 +27,12 @@ min_adv_mss - INTEGER
The advertised MSS depends on the first hop route MTU, but will
never be lower than this setting.
+rt_cache_rebuild_count - INTEGER
+ The per net-namespace route cache emergency rebuild threshold.
+ Any net-namespace having its route cache rebuilt due to
+ a hash bucket chain being too long more than this many times
+ will have its route caching disabled
+
IP Fragmentation:
ipfrag_high_thresh - INTEGER
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index ece1c92..977f482 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -49,6 +49,8 @@ struct netns_ipv4 {
int sysctl_icmp_ratelimit;
int sysctl_icmp_ratemask;
int sysctl_icmp_errors_use_inbound_ifaddr;
+ int sysctl_rt_cache_rebuild_count;
+ int current_rt_cache_rebuild_count;
struct timer_list rt_secret_timer;
atomic_t rt_genid;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 8d23cc7..c132132 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -129,6 +129,7 @@ static int ip_rt_mtu_expires __read_mostly = 10 * 60 * HZ;
static int ip_rt_min_pmtu __read_mostly = 512 + 20 + 20;
static int ip_rt_min_advmss __read_mostly = 256;
static int ip_rt_secret_interval __read_mostly = 10 * 60 * HZ;
+static int rt_chain_length_max __read_mostly = 20;
static void rt_worker_func(struct work_struct *work);
static DECLARE_DELAYED_WORK(expires_work, rt_worker_func);
@@ -145,6 +146,7 @@ static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst);
static void ipv4_link_failure(struct sk_buff *skb);
static void ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu);
static int rt_garbage_collect(struct dst_ops *ops);
+static void rt_emergency_hash_rebuild(struct net *net);
static struct dst_ops ipv4_dst_ops = {
@@ -201,6 +203,7 @@ const __u8 ip_tos2prio[16] = {
struct rt_hash_bucket {
struct rtable *chain;
};
+
#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || \
defined(CONFIG_PROVE_LOCKING)
/*
@@ -674,6 +677,20 @@ static inline u32 rt_score(struct rtable *rt)
return score;
}
+static inline bool rt_caching(const struct net *net)
+{
+ return net->ipv4.current_rt_cache_rebuild_count <=
+ net->ipv4.sysctl_rt_cache_rebuild_count;
+}
+
+static inline bool compare_hash_inputs(const struct flowi *fl1,
+ const struct flowi *fl2)
+{
+ return (__force u32)(((fl1->nl_u.ip4_u.daddr ^ fl2->nl_u.ip4_u.daddr) |
+ (fl1->nl_u.ip4_u.saddr ^ fl2->nl_u.ip4_u.saddr) |
+ (fl1->iif ^ fl2->iif)) == 0);
+}
+
static inline int compare_keys(struct flowi *fl1, struct flowi *fl2)
{
return ((__force u32)((fl1->nl_u.ip4_u.daddr ^ fl2->nl_u.ip4_u.daddr) |
@@ -753,11 +770,24 @@ static void rt_do_flush(int process_context)
}
}
+/*
+ * While freeing expired entries, we compute average chain length
+ * and standard deviation, using fixed-point arithmetic.
+ * This is to have an estimate of rt_chain_length_max:
+ * rt_chain_length_max = max(elasticity, AVG + 4*SD)
+ * We use 3 bits for the fractional part, and 29 (or 61) for magnitude.
+ */
+
+#define FRACT_BITS 3
+#define ONE (1UL << FRACT_BITS)
+
static void rt_check_expire(void)
{
static unsigned int rover;
unsigned int i = rover, goal;
struct rtable *rth, **rthp;
+ unsigned long length = 0, samples = 0;
+ unsigned long sum = 0, sum2 = 0;
u64 mult;
mult = ((u64)ip_rt_gc_interval) << rt_hash_log;
@@ -766,6 +796,7 @@ static void rt_check_expire(void)
goal = (unsigned int)mult;
if (goal > rt_hash_mask)
goal = rt_hash_mask + 1;
+ length = 0;
for (; goal > 0; goal--) {
unsigned long tmo = ip_rt_gc_timeout;
@@ -775,6 +806,8 @@ static void rt_check_expire(void)
if (need_resched())
cond_resched();
+ samples++;
+
if (*rthp == NULL)
continue;
spin_lock_bh(rt_hash_lock_addr(i));
@@ -789,11 +822,29 @@ static void rt_check_expire(void)
if (time_before_eq(jiffies, rth->u.dst.expires)) {
tmo >>= 1;
rthp = &rth->u.dst.rt_next;
+ /*
+ * Only bump our length if the hash
+ * inputs on entries n and n+1 are not
+ * the same; we only count entries on
+ * a chain with equal hash inputs once,
+ * so that entries for different QOS
+ * levels and other non-hash-input
+ * attributes don't unfairly skew
+ * the length computation
+ */
+ if ((*rthp == NULL) ||
+ !compare_hash_inputs(&(*rthp)->fl,
+ &rth->fl))
+ length += ONE;
continue;
}
} else if (!rt_may_expire(rth, tmo, ip_rt_gc_timeout)) {
tmo >>= 1;
rthp = &rth->u.dst.rt_next;
+ if ((*rthp == NULL) ||
+ !compare_hash_inputs(&(*rthp)->fl,
+ &rth->fl))
+ length += ONE;
continue;
}
@@ -802,6 +853,15 @@ static void rt_check_expire(void)
rt_free(rth);
}
spin_unlock_bh(rt_hash_lock_addr(i));
+ sum += length;
+ sum2 += length*length;
+ }
+ if (samples) {
+ unsigned long avg = sum / samples;
+ unsigned long sd = int_sqrt(sum2 / samples - avg*avg);
+ rt_chain_length_max = max_t(unsigned long,
+ ip_rt_gc_elasticity,
+ (avg + 4*sd) >> FRACT_BITS);
}
rover = i;
}
@@ -851,6 +911,26 @@ static void rt_secret_rebuild(unsigned long __net)
mod_timer(&net->ipv4.rt_secret_timer, jiffies + ip_rt_secret_interval);
}
+static void rt_secret_rebuild_oneshot(struct net *net)
+{
+ del_timer_sync(&net->ipv4.rt_secret_timer);
+ rt_cache_invalidate(net);
+ if (ip_rt_secret_interval) {
+ net->ipv4.rt_secret_timer.expires += ip_rt_secret_interval;
+ add_timer(&net->ipv4.rt_secret_timer);
+ }
+}
+
+static void rt_emergency_hash_rebuild(struct net *net)
+{
+ if (net_ratelimit()) {
+ printk(KERN_WARNING "Route hash chain too long!\n");
+ printk(KERN_WARNING "Adjust your secret_interval!\n");
+ }
+
+ rt_secret_rebuild_oneshot(net);
+}
+
/*
Short description of GC goals.
@@ -989,6 +1069,7 @@ out: return 0;
static int rt_intern_hash(unsigned hash, struct rtable *rt, struct rtable **rp)
{
struct rtable *rth, **rthp;
+ struct rtable *rthi;
unsigned long now;
struct rtable *cand, **candp;
u32 min_score;
@@ -1002,7 +1083,13 @@ restart:
candp = NULL;
now = jiffies;
+ if (!rt_caching(dev_net(rt->u.dst.dev))) {
+ rt_drop(rt);
+ return 0;
+ }
+
rthp = &rt_hash_table[hash].chain;
+ rthi = NULL;
spin_lock_bh(rt_hash_lock_addr(hash));
while ((rth = *rthp) != NULL) {
@@ -1048,6 +1135,17 @@ restart:
chain_length++;
rthp = &rth->u.dst.rt_next;
+
+ /*
+ * Check to see if the next entry in the chain
+ * contains the same hash input values as rt. If it
+ * does, this is where we will insert into the list,
+ * instead of at the head. This groups entries that
+ * differ by aspects not relevant to the hash function
+ * together, which we use to adjust our chain length
+ */
+ if (*rthp && compare_hash_inputs(&(*rthp)->fl, &rt->fl))
+ rthi = rth;
}
if (cand) {
@@ -1061,6 +1159,16 @@ restart:
*candp = cand->u.dst.rt_next;
rt_free(cand);
}
+ } else {
+ if (chain_length > rt_chain_length_max) {
+ struct net *net = dev_net(rt->u.dst.dev);
+ int num = ++net->ipv4.current_rt_cache_rebuild_count;
+ if (!rt_caching(dev_net(rt->u.dst.dev))) {
+ printk(KERN_WARNING "%s: %d rebuilds is over limit, route caching disabled\n",
+ rt->u.dst.dev->name, num);
+ }
+ rt_emergency_hash_rebuild(dev_net(rt->u.dst.dev));
+ }
}
/* Try to bind route to arp only if it is output
@@ -1098,7 +1206,11 @@ restart:
}
}
- rt->u.dst.rt_next = rt_hash_table[hash].chain;
+ if (rthi)
+ rt->u.dst.rt_next = rthi->u.dst.rt_next;
+ else
+ rt->u.dst.rt_next = rt_hash_table[hash].chain;
+
#if RT_CACHE_DEBUG >= 2
if (rt->u.dst.rt_next) {
struct rtable *trt;
@@ -1114,7 +1226,11 @@ restart:
* previous writes to rt are comitted to memory
* before making rt visible to other CPUS.
*/
- rcu_assign_pointer(rt_hash_table[hash].chain, rt);
+ if (rthi)
+ rcu_assign_pointer(rthi->u.dst.rt_next, rt);
+ else
+ rcu_assign_pointer(rt_hash_table[hash].chain, rt);
+
spin_unlock_bh(rt_hash_lock_addr(hash));
*rp = rt;
return 0;
@@ -1217,6 +1333,9 @@ void ip_rt_redirect(__be32 old_gw, __be32 daddr, __be32 new_gw,
|| ipv4_is_zeronet(new_gw))
goto reject_redirect;
+ if (!rt_caching(net))
+ goto reject_redirect;
+
if (!IN_DEV_SHARED_MEDIA(in_dev)) {
if (!inet_addr_onlink(in_dev, new_gw, old_gw))
goto reject_redirect;
@@ -2130,6 +2249,10 @@ int ip_route_input(struct sk_buff *skb, __be32 daddr, __be32 saddr,
struct net *net;
net = dev_net(dev);
+
+ if (!rt_caching(net))
+ goto skip_cache;
+
tos &= IPTOS_RT_MASK;
hash = rt_hash(daddr, saddr, iif, rt_genid(net));
@@ -2154,6 +2277,7 @@ int ip_route_input(struct sk_buff *skb, __be32 daddr, __be32 saddr,
}
rcu_read_unlock();
+skip_cache:
/* Multicast recognition logic is moved from route cache to here.
The problem was that too many Ethernet cards have broken/missing
hardware multicast filters :-( As result the host on multicasting
@@ -2539,6 +2663,9 @@ int __ip_route_output_key(struct net *net, struct rtable **rp,
unsigned hash;
struct rtable *rth;
+ if (!rt_caching(net))
+ goto slow_output;
+
hash = rt_hash(flp->fl4_dst, flp->fl4_src, flp->oif, rt_genid(net));
rcu_read_lock_bh();
@@ -2563,6 +2690,7 @@ int __ip_route_output_key(struct net *net, struct rtable **rp,
}
rcu_read_unlock_bh();
+slow_output:
return ip_route_output_slow(net, rp, flp);
}
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 276d047..0d0fd64 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -795,6 +795,14 @@ static struct ctl_table ipv4_net_table[] = {
.mode = 0644,
.proc_handler = &proc_dointvec
},
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "rt_cache_rebuild_count",
+ .data = &init_net.ipv4.sysctl_rt_cache_rebuild_count,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec
+ },
{ }
};
@@ -827,8 +835,12 @@ static __net_init int ipv4_sysctl_init_net(struct net *net)
&net->ipv4.sysctl_icmp_ratelimit;
table[5].data =
&net->ipv4.sysctl_icmp_ratemask;
+ table[6].data =
+ &net->ipv4.sysctl_rt_cache_rebuild_count;
}
+ net->ipv4.sysctl_rt_cache_rebuild_count = 4;
+
net->ipv4.ipv4_hdr = register_net_sysctl_table(net,
net_ipv4_ctl_path, table);
if (net->ipv4.ipv4_hdr == NULL)
--
/****************************************************
* Neil Horman <nhorman@tuxdriver.com>
* Software Engineer, Red Hat
****************************************************/
* Re: [PATCH] net: implement emergency route cache rebuilds when gc_elasticity is exceeded
2008-10-20 0:07 ` Neil Horman
@ 2008-10-20 8:12 ` Eric Dumazet
2008-10-27 19:28 ` David Miller
0 siblings, 1 reply; 64+ messages in thread
From: Eric Dumazet @ 2008-10-20 8:12 UTC (permalink / raw)
To: Neil Horman
Cc: David Miller, billfink, netdev, kuznet, pekkas, jmorris, yoshfuji,
kaber, johnpol, Stephen Hemminger
Neil Horman wrote:
> Ok, one more revision. It takes Eric's final suggestions into account. Eric, as
> per your request, I've added your Signed-off-by line to it.
>
>
> > This is a patch to provide on-demand route cache rebuilding. Currently, our
> > route cache is rebuilt periodically regardless of need. This introduces
> > unneeded periodic latency. This patch offers a better approach. Using code
> > provided by Eric Dumazet, we compute the standard deviation of the average hash
> > bucket chain length while running rt_check_expire. Should any given chain
> > length grow larger than the average plus 4 standard deviations, we trigger an
> > emergency hash table rebuild for that net namespace. This allows the common
> > case, in which chains are well behaved and do not grow unevenly, to incur no
> > latency at all, while systems that may be under malicious attack only rebuild
> > when the attack is detected. This patch takes 2 other factors into
> > account:
> > 1) chains with multiple entries that differ by attributes that do not affect the
> > hash value are only counted once, so as not to unduly bias the system toward
> > rebuilding if features like QOS are heavily used
> > 2) if rebuilding crosses a certain threshold (which is adjustable via the
> > sysctl added in this patch), route caching is disabled entirely for that net
> > namespace, since constant rebuilding is less efficient than no caching at all
>
> Tested successfully by me.
>
> Regards
> Neil
>
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>
Thanks Neil for your patience, this patch is fine.
* Re: [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded
2008-10-20 8:12 ` Eric Dumazet
@ 2008-10-27 19:28 ` David Miller
0 siblings, 0 replies; 64+ messages in thread
From: David Miller @ 2008-10-27 19:28 UTC (permalink / raw)
To: dada1
Cc: nhorman, billfink, netdev, kuznet, pekkas, jmorris, yoshfuji,
kaber, johnpol, shemminger
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Mon, 20 Oct 2008 10:12:51 +0200
> Neil Horman wrote:
> > Ok, one more revision. It takes Eric's final suggestions into account. Eric, as
> > per your request, I've added your Signed-off-by line to it.
> > This is a patch to provide on demand route cache rebuilding.
...
> Thanks Neil for your patience, this patch is fine.
Applied, thanks everyone.
end of thread, other threads:[~2008-10-27 19:29 UTC | newest]
Thread overview: 64+ messages
-- links below jump to the message on this page --
2008-09-29 19:12 [PATCH] net: implement emergency route cache rebulds when gc_elasticity is exceeded Neil Horman
2008-09-29 20:22 ` Eric Dumazet
2008-09-29 20:27 ` Neil Horman
2008-09-29 21:00 ` Eric Dumazet
2008-09-29 22:38 ` Neil Horman
2008-09-30 6:02 ` Eric Dumazet
2008-09-30 11:23 ` Neil Horman
2008-09-30 14:10 ` David Miller
2008-09-30 17:16 ` Eric Dumazet
2008-09-30 18:42 ` Neil Horman
2008-10-02 7:16 ` Evgeniy Polyakov
2008-10-02 13:14 ` Neil Horman
2008-10-01 18:08 ` Neil Horman
2008-10-02 5:01 ` Bill Fink
2008-10-02 6:56 ` Eric Dumazet
2008-10-02 8:15 ` Eric Dumazet
2008-10-02 14:20 ` Eric Dumazet
2008-10-03 0:31 ` Neil Horman
2008-10-03 20:36 ` Neil Horman
2008-10-06 10:49 ` Eric Dumazet
2008-10-06 13:14 ` Neil Horman
2008-10-06 20:54 ` Neil Horman
2008-10-06 21:21 ` Eric Dumazet
2008-10-06 22:52 ` Neil Horman
2008-10-07 5:13 ` Eric Dumazet
2008-10-07 10:54 ` Neil Horman
2008-10-13 18:26 ` Neil Horman
2008-10-16 6:55 ` David Miller
2008-10-16 9:19 ` Eric Dumazet
2008-10-16 21:18 ` David Miller
2008-10-16 11:41 ` Neil Horman
2008-10-16 12:25 ` Eric Dumazet
2008-10-16 16:36 ` Neil Horman
2008-10-16 23:35 ` Neil Horman
2008-10-17 4:53 ` Eric Dumazet
2008-10-17 5:23 ` David Miller
2008-10-17 5:03 ` Stephen Hemminger
2008-10-17 5:06 ` Stephen Hemminger
2008-10-17 10:39 ` Neil Horman
[not found] ` <48F8806A.6090306@cosmosbay.com>
[not found] ` <20081017152328.GB23591@hmsreliant.think-freely.org>
[not found] ` <48F8AFBE.5080503@cosmosbay.com>
2008-10-17 20:44 ` Neil Horman
2008-10-18 0:54 ` Neil Horman
2008-10-18 4:36 ` Eric Dumazet
2008-10-18 13:30 ` Neil Horman
2008-10-20 0:07 ` Neil Horman
2008-10-20 8:12 ` Eric Dumazet
2008-10-27 19:28 ` David Miller
2008-10-02 7:13 ` Evgeniy Polyakov
2008-09-30 14:08 ` David Miller
2008-09-30 14:08 ` David Miller
2008-09-30 17:47 ` Eric Dumazet
2008-10-05 3:26 ` Herbert Xu
2008-10-05 4:45 ` Andrew Dickinson
2008-10-05 17:34 ` David Miller
2008-10-05 18:06 ` Andrew Dickinson
2008-10-06 4:21 ` Herbert Xu
2008-10-06 10:50 ` Neil Horman
2008-10-06 11:02 ` Herbert Xu
2008-10-06 12:43 ` Neil Horman
2008-09-30 14:17 ` Denis V. Lunev
2008-09-30 14:35 ` Neil Horman
2008-09-30 14:49 ` Denis V. Lunev
2008-10-05 3:17 ` Herbert Xu
2008-10-05 3:20 ` Herbert Xu
2008-10-06 0:52 ` Neil Horman