* Problem with route cache
@ 2010-02-08 13:16 Paweł Staszewski
  2010-02-08 13:28 ` Eric Dumazet
  0 siblings, 1 reply; 22+ messages in thread

From: Paweł Staszewski @ 2010-02-08 13:16 UTC (permalink / raw)
To: Linux Network Development list

Hello

For some time I have had a problem with the route cache in Linux. This is the info I get in dmesg:

Route hash chain too long!
Adjust your secret_interval!
Route hash chain too long!
Adjust your secret_interval!
Route hash chain too long!
Adjust your secret_interval!
Route hash chain too long!
Adjust your secret_interval!
Route hash chain too long!
Adjust your secret_interval!
Route hash chain too long!
Adjust your secret_interval!
Route hash chain too long!
Adjust your secret_interval!
Route hash chain too long!
Adjust your secret_interval!
vlan0811: 9 rebuilds is over limit, route caching disabled
Route hash chain too long!
Adjust your secret_interval!

The problem is that changing net.ipv4.route.secret_interval changes nothing -- no matter whether I set secret_interval from the default 3600 down to 2 or up to 10000, I always get the same messages and the route cache gets disabled. I also changed net.ipv4.rt_cache_rebuild_count from the default 4 to 9, and then to 12, but that changed nothing either.

The machine showing this is:
2x Intel(R) Xeon(R) CPU X5450 @ 3.00GHz
12GB of RAM

Network controllers:
04:00.0 Ethernet controller: Intel Corporation 82573E Gigabit Ethernet Controller (Copper) (rev 03)
05:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller

The odd thing is that I need to set the affinity of my NICs to "ff", because when I bind a network card to a single CPU, that CPU is always at 100%.

Router traffic is about 700Mbit/s RX + 700Mbit/s TX on eth0 (with 2 tagged VLANs) and the same traffic on eth1, which is untagged. This traffic is forwarded by this router.
Simple topology:

Clients <- ibgp -> [ vlan0811@eth0 + vlan0508@eth0 - BGP process - eth1 ] <- ebgp -> Internet Providers

Some info about fib_trie:

cat /proc/net/fib_triestat
Basic info: size of leaf: 40 bytes, size of tnode: 56 bytes.
Main:
        Aver depth:     2.58
        Max depth:      6
        Leaves:         297506
        Prefixes:       312472
        Internal nodes: 69968
          1: 35673  2: 14840  3: 10980  4: 4729  5: 2315  6: 956  7: 364  8: 109  9: 1  16: 1
        Pointers: 570018
        Null ptrs: 202545
        Total size: 36990  kB

Counters:
---------
gets = 2797832581
backtracks = 149015808
semantic match passed = 2789993308
semantic match miss = 766703
null node hit = 860359377
skipped node resize = 0

Local:
        Aver depth:     3.66
        Max depth:      4
        Leaves:         15
        Prefixes:       16
        Internal nodes: 11
          1: 8  2: 3
        Pointers: 28
        Null ptrs: 3
        Total size: 3  kB

Counters:
---------
gets = 2792656726
backtracks = 2185449412
semantic match passed = 5895311
semantic match miss = 0
null node hit = 818902
skipped node resize = 0

And interrupts:

cat /proc/interrupts
            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
  0:          80         21         32         27         38         39         38         34   IO-APIC-edge      timer
  1:           0          1          0          0          0          0          1          0   IO-APIC-edge      i8042
  9:           0          0          0          0          0          0          0          0   IO-APIC-fasteoi   acpi
 14:           0          0          0          0          0          0          0          0   IO-APIC-edge      ide0
 15:           0          0          0          0          0          0          0          0   IO-APIC-edge      ide1
 20:           0          0          0          0          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
 21:           0          0          0          0          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb6
 22:           0          0          0          0          0          0          0          0   IO-APIC-fasteoi   ehci_hcd:usb1, uhci_hcd:usb5
 23:           0          0          0          0          0          0          0          0   IO-APIC-fasteoi   ehci_hcd:usb2, uhci_hcd:usb4
 29:       73903        136        135        135        134        138        136        136   PCI-MSI-edge      ahci
 30:   101136854  101509497  101627297  100701508  101684077  101870217  101154425  100604465   PCI-MSI-edge      eth0
 31:    91772747   92037311   92317341   92231065   91248484   91062342   91778136   92328099   PCI-MSI-edge      eth1
 32:           0          0          0          0          0          1          1          0   PCI-MSI-edge
 33:           0          1          0          0          0          0          0          1   PCI-MSI-edge
 34:           0          0          1          0          0          1          0          0   PCI-MSI-edge
 35:           1          0          0          1          0          0          0          0   PCI-MSI-edge
 36:           0          0          0          0          1          0          0          1   PCI-MSI-edge
 37:           1          0          0          0          0          0          1          0   PCI-MSI-edge
 38:           0          0          0          1          1          0          0          0   PCI-MSI-edge
 39:           0          1          1          0          0          0          0          0   PCI-MSI-edge
NMI:           0          0          0          0          0          0          0          0   Non-maskable interrupts
LOC:   407733660  395964025  431438426  402307801  434522968  420170984  400450633  390318324   Local timer interrupts
SPU:           0          0          0          0          0          0          0          0   Spurious interrupts
PMI:           0          0          0          0          0          0          0          0   Performance monitoring interrupts
PND:           0          0          0          0          0          0          0          0   Performance pending work
RES:       14378      15781      19276       5691      14761      13579      16846      15629   Rescheduling interrupts
CAL:         378        383         92         86        364        354         92         89   Function call interrupts
TLB:         551        577        433        272        602        543        329        683   TLB shootdowns
ERR:           0
MIS:           0

Regards
Pawel

^ permalink raw reply	[flat|nested] 22+ messages in thread
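[Editor's note: Paweł's remark about needing an affinity mask of "ff" is visible in the /proc/interrupts output above: the eth0 (IRQ 30) counts are nearly identical on all eight CPUs. A quick sketch, with the counts copied from the output above; the interpretation threshold is illustrative only:]

```shell
#!/bin/sh
# Per-CPU interrupt counts for eth0 (IRQ 30), copied from /proc/interrupts above.
# Compute the ratio of the busiest CPU to the least busy one; 1.000 = even spread.
imbalance=$(printf '%s\n' 101136854 101509497 101627297 100701508 \
                          101684077 101870217 101154425 100604465 |
    awk 'NR==1{max=$1;min=$1}
         {if($1>max)max=$1; if($1<min)min=$1}
         END{printf "%.3f", max/min}')
echo "eth0 IRQ imbalance: $imbalance"   # ~1.013, i.e. within ~1.3% of even
```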
* Re: Problem with route cache
  2010-02-08 13:16 Problem with route cache Paweł Staszewski
@ 2010-02-08 13:28 ` Eric Dumazet
  2010-02-08 13:33   ` Paweł Staszewski
  0 siblings, 1 reply; 22+ messages in thread

From: Eric Dumazet @ 2010-02-08 13:28 UTC (permalink / raw)
To: Paweł Staszewski; +Cc: Linux Network Development list

On Monday, 8 February 2010 at 14:16 +0100, Paweł Staszewski wrote:
> Hello
>
> For some time I have had a problem with the route cache in Linux.
> This is the info I get in dmesg:
>
> Route hash chain too long!
> Adjust your secret_interval!
> [...]
> vlan0811: 9 rebuilds is over limit, route caching disabled
>
> The problem is that changing net.ipv4.route.secret_interval changes
> nothing -- no matter whether I set secret_interval from the default
> 3600 down to 2 or up to 10000, I always get the same messages and the
> route cache gets disabled. I also changed
> net.ipv4.rt_cache_rebuild_count from the default 4 to 9, and then to
> 12, but that changed nothing either.
>
> The machine showing this is:
> 2x Intel(R) Xeon(R) CPU X5450 @ 3.00GHz
> 12GB of RAM

Are you running a 64bit kernel ?
What is your kernel version ?

Please send :

# grep . /proc/sys/net/ipv4/route/*
# rtstat -c10 -i1

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Problem with route cache
  2010-02-08 13:28 ` Eric Dumazet
@ 2010-02-08 13:33   ` Paweł Staszewski
  2010-02-08 13:51     ` Eric Dumazet
  0 siblings, 1 reply; 22+ messages in thread

From: Paweł Staszewski @ 2010-02-08 13:33 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Linux Network Development list

[-- Attachment #1: Type: text/plain, Size: 2784 bytes --]

On 2010-02-08 14:28, Eric Dumazet wrote:
> On Monday, 8 February 2010 at 14:16 +0100, Paweł Staszewski wrote:
>> [...]
>
> Are you running a 64bit kernel ?
> What is your kernel version ?
>
> Please send :
>
> # grep . /proc/sys/net/ipv4/route/*
> # rtstat -c10 -i1

Yes, this is an x86_64 kernel.
I ran kernels 2.6.32.2 / 2.6.32.7 and now 2.6.33-rc6-git5, and the same thing happens on all of them.

grep . /proc/sys/net/ipv4/route/*
/proc/sys/net/ipv4/route/error_burst:1250
/proc/sys/net/ipv4/route/error_cost:250
grep: /proc/sys/net/ipv4/route/flush: Permission denied
/proc/sys/net/ipv4/route/gc_elasticity:2
/proc/sys/net/ipv4/route/gc_interval:2
/proc/sys/net/ipv4/route/gc_min_interval:0
/proc/sys/net/ipv4/route/gc_min_interval_ms:500
/proc/sys/net/ipv4/route/gc_thresh:65535
/proc/sys/net/ipv4/route/gc_timeout:300
/proc/sys/net/ipv4/route/max_size:524288
/proc/sys/net/ipv4/route/min_adv_mss:256
/proc/sys/net/ipv4/route/min_pmtu:552
/proc/sys/net/ipv4/route/mtu_expires:600
/proc/sys/net/ipv4/route/redirect_load:5
/proc/sys/net/ipv4/route/redirect_number:9
/proc/sys/net/ipv4/route/redirect_silence:5120
/proc/sys/net/ipv4/route/secret_interval:2

This does not happen all the time. I see these messages only during "internet rush hours" - then there is about 700Mbit/s TX + 700Mbit/s RX of forwarded traffic.

[-- Attachment #2: rtstat.txt --]
[-- Type: text/plain, Size: 2039 bytes --]

rtstat -c10 -i1
rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|
 entries|  in_hit|in_slow_|in_slow_|in_no_ro|  in_brd|in_marti|in_marti| out_hit|out_slow|out_slow|gc_total|gc_ignor|gc_goal_|gc_dst_o|in_hlist|out_hlis|
        |        |     tot|      mc|     ute|        |  an_dst|  an_src|        |    _tot|     _mc|        |      ed|    miss| verflow| _search|t_search|
   12082|3440217296|6456413873|       0|  623094|     294|       0|       3|  735116| 5701062|       0|261260739|261040365|       0|       0|654617961|  179044|
   10037|       0|  152142|       0|       7|       0|       0|       0|       0|     123|       0|       0|       0|       0|       0|       0|       0|
   12032|       0|  155770|       0|       4|       0|       0|       0|       0|     122|       0|       0|       0|       0|       0|       0|       0|
   10991|       0|  161040|       0|       7|       0|       0|       0|       0|     129|       0|       0|       0|       0|       0|       0|       0|
    9898|       0|  155503|       0|       9|       0|       0|       0|       0|     125|       0|       0|       0|       0|       0|       0|       0|
   12553|       0|  157455|       0|       6|       0|       0|       0|       0|     129|       0|       0|       0|       0|       0|       0|       0|
   10983|       0|  157742|       0|       9|       0|       0|       0|       0|     128|       0|       0|       0|       0|       0|       0|       0|
    9375|       0|  158226|       0|       8|       0|       0|       0|       0|     115|       0|       0|       0|       0|       0|       0|       0|
   11929|       0|  159342|       0|      10|       0|       0|       0|       0|     130|       0|       0|       0|       0|       0|       0|       0|
   11046|       0|  158015|       0|       8|       0|       0|       0|       0|     126|       0|       0|       0|       0|       0|       0|       0|

^ permalink raw reply	[flat|nested] 22+ messages in thread
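[Editor's note: the first rtstat row above holds cumulative counters, and they already hint at the problem: comparing in_hit (cache hits) against in_slow_tot (slow-path lookups) gives a poor hit ratio, and the later per-second rows show in_hit stuck at 0 once caching is disabled. A quick check using the numbers from the output above:]

```shell
#!/bin/sh
# Cumulative counters from the first rtstat row above
in_hit=3440217296        # input packets resolved from the route cache
in_slow_tot=6456413873   # input packets that took the slow path (cache miss)

# Hit ratio as an integer percentage (awk used because sh arithmetic
# overflows/has no floats on some shells)
ratio=$(awk -v h="$in_hit" -v s="$in_slow_tot" \
    'BEGIN{printf "%.0f", 100*h/(h+s)}')
echo "route cache hit ratio: ${ratio}%"   # about 35% - quite poor
```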
* Re: Problem with route cache
  2010-02-08 13:33 ` Paweł Staszewski
@ 2010-02-08 13:51   ` Eric Dumazet
  2010-02-08 13:59     ` Paweł Staszewski
  0 siblings, 1 reply; 22+ messages in thread

From: Eric Dumazet @ 2010-02-08 13:51 UTC (permalink / raw)
To: Paweł Staszewski; +Cc: Linux Network Development list

On Monday, 8 February 2010 at 14:33 +0100, Paweł Staszewski wrote:
> Yes, this is an x86_64 kernel.
> [...]
> /proc/sys/net/ipv4/route/gc_elasticity:2
> /proc/sys/net/ipv4/route/gc_interval:2
> /proc/sys/net/ipv4/route/gc_thresh:65535
> /proc/sys/net/ipv4/route/max_size:524288
> /proc/sys/net/ipv4/route/secret_interval:2
> [...]
> This does not happen all the time. I see these messages only during
> "internet rush hours" - then there is about 700Mbit/s TX + 700Mbit/s
> RX of forwarded traffic.

I don't understand your settings, they are very very small for your
setup. You want to flush the cache every 2 seconds...

With 12GB of ram, you could have

/proc/sys/net/ipv4/route/gc_thresh:524288
/proc/sys/net/ipv4/route/max_size:8388608
/proc/sys/net/ipv4/route/secret_interval:3600
/proc/sys/net/ipv4/route/gc_elasticity:4
/proc/sys/net/ipv4/route/gc_interval:1

That would allow about 2 million entries in your route cache, using 768
Mbytes of ram, and a good cache hit ratio.

^ permalink raw reply	[flat|nested] 22+ messages in thread
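[Editor's note: Eric's figures are self-consistent if one assumes roughly 384 bytes per cached route entry on x86_64; that per-entry size is an assumption here (it varied between kernel versions). With gc_elasticity bounding the average hash-chain length, the cache holds about hash_size × gc_elasticity entries:]

```shell
#!/bin/sh
# Back-of-the-envelope check of Eric's "2 million entries, 768 MB" figures.
hash_size=524288     # route cache hash table entries (gc_thresh above)
gc_elasticity=4      # average chain length the GC aims for
entry_bytes=384      # ASSUMED size of one cache entry (struct rtable) on x86_64

entries=$((hash_size * gc_elasticity))
mem_mb=$((entries * entry_bytes / 1048576))
echo "max cached entries: $entries"   # 2097152, i.e. ~2 million
echo "approx memory use:  ${mem_mb} MB"  # 768 MB, matching Eric's estimate
```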
* Re: Problem with route cache
  2010-02-08 13:51 ` Eric Dumazet
@ 2010-02-08 13:59   ` Paweł Staszewski
  2010-02-08 14:06     ` Eric Dumazet
  0 siblings, 1 reply; 22+ messages in thread

From: Paweł Staszewski @ 2010-02-08 13:59 UTC (permalink / raw)
To: Eric Dumazet, Linux Network Development list

On 2010-02-08 14:51, Eric Dumazet wrote:
> On Monday, 8 February 2010 at 14:33 +0100, Paweł Staszewski wrote:
>> [...]
>> /proc/sys/net/ipv4/route/secret_interval:2
>>
>> This does not happen all the time. I see these messages only during
>> "internet rush hours" [...]
>
> I don't understand your settings, they are very very small for your
> setup. You want to flush the cache every 2 seconds...
>
> With 12GB of ram, you could have
>
> /proc/sys/net/ipv4/route/gc_thresh:524288
> /proc/sys/net/ipv4/route/max_size:8388608
> /proc/sys/net/ipv4/route/secret_interval:3600
> /proc/sys/net/ipv4/route/gc_elasticity:4
> /proc/sys/net/ipv4/route/gc_interval:1
>
> That would allow about 2 million entries in your route cache, using 768
> Mbytes of ram, and a good cache hit ratio.

Yes, as I wrote, I only changed secret_interval (from 3600 to 2) after I saw the first "Adjust your secret_interval!" message, to check whether that would resolve the problem.

My normal settings are:

/proc/sys/net/ipv4/route/gc_thresh:256000
/proc/sys/net/ipv4/route/max_size:1048576
/proc/sys/net/ipv4/route/secret_interval:3600
/proc/sys/net/ipv4/route/gc_interval:2
/proc/sys/net/ipv4/route/gc_elasticity:2

And with those settings I was still getting:
Route hash chain too long!
Adjust your secret_interval!

I have now applied the settings you suggest... we will see, but I don't know whether it will help, because I have already tried many different settings.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Problem with route cache
  2010-02-08 13:59 ` Paweł Staszewski
@ 2010-02-08 14:06   ` Eric Dumazet
  2010-02-08 14:16     ` Paweł Staszewski
  0 siblings, 1 reply; 22+ messages in thread

From: Eric Dumazet @ 2010-02-08 14:06 UTC (permalink / raw)
To: Paweł Staszewski; +Cc: Linux Network Development list

On Monday, 8 February 2010 at 14:59 +0100, Paweł Staszewski wrote:
> [...]
> My normal settings are:
>
> /proc/sys/net/ipv4/route/gc_thresh:256000
> /proc/sys/net/ipv4/route/max_size:1048576
> /proc/sys/net/ipv4/route/secret_interval:3600
> /proc/sys/net/ipv4/route/gc_interval:2
> /proc/sys/net/ipv4/route/gc_elasticity:2
>
> And with those settings I was still getting:
> Route hash chain too long!
> Adjust your secret_interval!
>
> I have now applied the settings you suggest... we will see, but I
> don't know whether it will help, because I have already tried many
> different settings.

One important point is the size of the hash table, you want something big
for your router.

# dmesg | grep 'IP route'
... IP route cache hash table entries: 524288 (order: 10, 4194304 bytes)

Then, if it is correctly sized, don't change gc_thresh or max_size, as the
defaults are good.

I would only change gc_interval to 1, to perform a smooth gc.

And eventually gc_elasticity to 4, 5 or 6 if I had less ram than your
machine.

^ permalink raw reply	[flat|nested] 22+ messages in thread
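[Editor's note: Eric's check can be scripted; the boot line below is copied from the thread, and on a live router one would pipe real `dmesg` output instead. A sketch, not a tuning recommendation:]

```shell
#!/bin/sh
# Extract the IP route cache hash table size from a dmesg boot line.
# Sample line taken from the thread; on a live 2.6.x router use:
#   dmesg | grep 'IP route'
line="IP route cache hash table entries: 524288 (order: 10, 4194304 bytes)"
entries=$(printf '%s\n' "$line" | sed -n 's/.*entries: \([0-9]*\).*/\1/p')
echo "hash table entries: $entries"

# If the table is correctly sized, Eric suggests leaving gc_thresh/max_size
# at their defaults and only smoothing the GC, e.g. (requires root):
#   sysctl -w net.ipv4.route.gc_interval=1
```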
* Re: Problem with route cache
  2010-02-08 14:06 ` Eric Dumazet
@ 2010-02-08 14:16   ` Paweł Staszewski
  2010-02-08 14:32     ` Eric Dumazet
  2010-02-08 14:32     ` Problem with route cache Paweł Staszewski
  0 siblings, 2 replies; 22+ messages in thread

From: Paweł Staszewski @ 2010-02-08 14:16 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Linux Network Development list

On 2010-02-08 15:06, Eric Dumazet wrote:
> [...]
> One important point is the size of the hash table, you want something big
> for your router.
>
> # dmesg | grep 'IP route'
> ... IP route cache hash table entries: 524288 (order: 10, 4194304 bytes)

On my machine it is also the same:

dmesg | grep 'IP route'
IP route cache hash table entries: 524288 (order: 10, 4194304 bytes)

> Then, if it is correctly sized, don't change gc_thresh or max_size, as the
> defaults are good.
>
> I would only change gc_interval to 1, to perform a smooth gc.
>
> And eventually gc_elasticity to 4, 5 or 6 if I had less ram than your
> machine.
Some days ago, after the route cache messages, I also got this:

Feb 4 13:12:40 TM_01_C1 ------------[ cut here ]------------
Feb 4 13:12:40 TM_01_C1 WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x130/0x1d6()
Feb 4 13:12:40 TM_01_C1 Hardware name: X7DCT
Feb 4 13:12:40 TM_01_C1 NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Feb 4 13:12:40 TM_01_C1 Modules linked in: oprofile
Feb 4 13:12:40 TM_01_C1 Pid: 0, comm: swapper Not tainted 2.6.32 #1
Feb 4 13:12:40 TM_01_C1 Call Trace:
Feb 4 13:12:40 TM_01_C1 <IRQ>  [<ffffffff812fcaf7>] ? dev_watchdog+0x130/0x1d6
Feb 4 13:12:40 TM_01_C1 [<ffffffff812fcaf7>] ? dev_watchdog+0x130/0x1d6
Feb 4 13:12:40 TM_01_C1 [<ffffffff81038811>] ? warn_slowpath_common+0x77/0xa3
Feb 4 13:12:40 TM_01_C1 [<ffffffff81038899>] ? warn_slowpath_fmt+0x51/0x59
Feb 4 13:12:40 TM_01_C1 [<ffffffff8102897e>] ? activate_task+0x3f/0x4e
Feb 4 13:12:40 TM_01_C1 [<ffffffff81034fe5>] ? try_to_wake_up+0x1eb/0x1f8
Feb 4 13:12:40 TM_01_C1 [<ffffffff812eb768>] ? netdev_drivername+0x3b/0x40
Feb 4 13:12:40 TM_01_C1 [<ffffffff812fcaf7>] ? dev_watchdog+0x130/0x1d6
Feb 4 13:12:40 TM_01_C1 [<ffffffff8102d1e3>] ? __wake_up+0x30/0x44
Feb 4 13:12:40 TM_01_C1 [<ffffffff812fc9c7>] ? dev_watchdog+0x0/0x1d6
Feb 4 13:12:40 TM_01_C1 [<ffffffff810448c4>] ? run_timer_softirq+0x1ff/0x29d
Feb 4 13:12:40 TM_01_C1 [<ffffffff810556ab>] ? ktime_get+0x5f/0xb7
Feb 4 13:12:40 TM_01_C1 [<ffffffff8103e0fd>] ? __do_softirq+0xd7/0x196
Feb 4 13:12:40 TM_01_C1 [<ffffffff8100be7c>] ? call_softirq+0x1c/0x28
Feb 4 13:12:40 TM_01_C1 [<ffffffff8100d645>] ? do_softirq+0x31/0x66
Feb 4 13:12:40 TM_01_C1 [<ffffffff8101b148>] ? smp_apic_timer_interrupt+0x87/0x95
Feb 4 13:12:40 TM_01_C1 [<ffffffff8100b873>] ? apic_timer_interrupt+0x13/0x20
Feb 4 13:12:40 TM_01_C1 <EOI>  [<ffffffff810111f5>] ? mwait_idle+0x9b/0xa0
Feb 4 13:12:40 TM_01_C1 [<ffffffff8100a236>] ? cpu_idle+0x49/0x7c
Feb 4 13:12:40 TM_01_C1 ---[ end trace c670a6a17be040e5 ]---

And after changing the kernel to 2.6.33-rc6, another, different one:

BUG: soft lockup - CPU#1 stuck for 61s! [events/1:28]
Modules linked in:
CPU 1
Pid: 28, comm: events/1 Not tainted 2.6.33-rc6-git5 #1 X7DCT/X7DCT
RIP: 0010:[<ffffffff810a3d89>]  [<ffffffff810a3d89>] kmem_cache_free+0x11b/0x11c
RSP: 0018:ffff880028243e50  EFLAGS: 00000292
RAX: 0000000000000032 RBX: 000000000000007d RCX: ffff8803190683c0
RDX: 0000000000000031 RSI: ffff8803190683c0 RDI: ffff88031f83e680
RBP: ffffffff81002893 R08: 0000000000000000 R09: 000000000000007c
R10: ffff88030d776800 R11: ffff88030d7768a0 R12: ffff880028243dd0
R13: ffffc900008b2f80 R14: ffff88031fa7c800 R15: ffffffff81012da7
FS:  0000000000000000(0000) GS:ffff880028240000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fd61d5bd000 CR3: 000000031e55c000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process events/1 (pid: 28, threadinfo ffff88031f9c8000, task ffff88031f9a4f80)
Stack:
 ffffffff8126826f ffff88031faa4600 ffffffff8126834a 000096ba00000023
<0> 01ffc90000000024 ffff88031fbb4000 ffff88031faa4600 0000000000000040
<0> 0000000000000040 ffff88031faa4788 ffff88031faa4600 0000000000000740
Call Trace:
 <IRQ>  [<ffffffff8126826f>] ? e1000_put_txbuf+0x62/0x74
 [<ffffffff8126834a>] ? e1000_clean_tx_irq+0xc9/0x235
 [<ffffffff8126b71b>] ? e1000_clean+0x5c/0x21c
 [<ffffffff812f29a3>] ? net_rx_action+0x71/0x15d
 [<ffffffff81035311>] ? __do_softirq+0xd7/0x196
 [<ffffffff81002dac>] ? call_softirq+0x1c/0x28
 [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7
 [<ffffffff81002dac>] ? call_softirq+0x1c/0x28
 <EOI>  [<ffffffff81004599>] ? do_softirq+0x31/0x63
 [<ffffffff81034ec1>] ? local_bh_enable_ip+0x75/0x86
 [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7
 [<ffffffff812f775d>] ? dst_gc_task+0xce/0x1a7
 [<ffffffff8136b08c>] ? schedule+0x82c/0x906
 [<ffffffff8103c44f>] ? lock_timer_base+0x26/0x4b
 [<ffffffff810a41d6>] ? cache_reap+0x0/0x11d
 [<ffffffff81044c38>] ? worker_thread+0x14c/0x1dc
 [<ffffffff81047dcd>] ? autoremove_wake_function+0x0/0x2e
 [<ffffffff81044aec>] ? worker_thread+0x0/0x1dc
 [<ffffffff810479bd>] ? kthread+0x79/0x81
 [<ffffffff81002cb4>] ? kernel_thread_helper+0x4/0x10
 [<ffffffff81047944>] ? kthread+0x0/0x81
 [<ffffffff81002cb0>] ? kernel_thread_helper+0x0/0x10
Code: fe 79 4c 00 48 85 db 74 14 48 8b 74 24 10 48 89 ef ff 13 48 83 c3 08 48 83 3b 00 eb ea 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f <c3> 55 48 89 f5 53 48 89 fb 48 83 ec 08 48 8b 76 18 48 2b 75 10
Call Trace:
 <IRQ>  [<ffffffff8126826f>] ? e1000_put_txbuf+0x62/0x74
 [<ffffffff8126834a>] ? e1000_clean_tx_irq+0xc9/0x235
 [<ffffffff8126b71b>] ? e1000_clean+0x5c/0x21c
 [<ffffffff812f29a3>] ? net_rx_action+0x71/0x15d
 [<ffffffff81035311>] ? __do_softirq+0xd7/0x196
 [<ffffffff81002dac>] ? call_softirq+0x1c/0x28
 [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7
 [<ffffffff81002dac>] ? call_softirq+0x1c/0x28
 <EOI>  [<ffffffff81004599>] ? do_softirq+0x31/0x63
 [<ffffffff81034ec1>] ? local_bh_enable_ip+0x75/0x86
 [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7
 [<ffffffff812f775d>] ? dst_gc_task+0xce/0x1a7
 [<ffffffff8136b08c>] ? schedule+0x82c/0x906
 [<ffffffff8103c44f>] ? lock_timer_base+0x26/0x4b
 [<ffffffff810a41d6>] ? cache_reap+0x0/0x11d
 [<ffffffff81044c38>] ? worker_thread+0x14c/0x1dc
 [<ffffffff81047dcd>] ? autoremove_wake_function+0x0/0x2e
 [<ffffffff81044aec>] ? worker_thread+0x0/0x1dc
 [<ffffffff810479bd>] ? kthread+0x79/0x81
 [<ffffffff81002cb4>] ? kernel_thread_helper+0x4/0x10
 [<ffffffff81047944>] ? kthread+0x0/0x81
 [<ffffffff81002cb0>] ? kernel_thread_helper+0x0/0x10

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Problem with route cache
  2010-02-08 14:16 ` Paweł Staszewski
@ 2010-02-08 14:32   ` Eric Dumazet
  2010-02-08 19:32     ` [PATCH] dst: call cond_resched() in dst_gc_task() Eric Dumazet
  2010-02-08 14:32   ` Problem with route cache Paweł Staszewski
  1 sibling, 1 reply; 22+ messages in thread

From: Eric Dumazet @ 2010-02-08 14:32 UTC (permalink / raw)
To: Paweł Staszewski; +Cc: Linux Network Development list

On Monday, 8 February 2010 at 15:16 +0100, Paweł Staszewski wrote:
> Some days ago, after the route cache messages, I also got this:
> [...]
> Call Trace:
>  <IRQ>  [<ffffffff8126826f>] ? e1000_put_txbuf+0x62/0x74
>  [<ffffffff8126834a>] ? e1000_clean_tx_irq+0xc9/0x235
>  [<ffffffff8126b71b>] ? e1000_clean+0x5c/0x21c
>  [<ffffffff812f29a3>] ? net_rx_action+0x71/0x15d
>  [<ffffffff81035311>] ? __do_softirq+0xd7/0x196
>  [<ffffffff81002dac>] ? call_softirq+0x1c/0x28
>  [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7
>  [<ffffffff81002dac>] ? call_softirq+0x1c/0x28
>  <EOI>  [<ffffffff81004599>] ? do_softirq+0x31/0x63
>  [<ffffffff81034ec1>] ? local_bh_enable_ip+0x75/0x86
>  [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7
>  [<ffffffff812f775d>] ? dst_gc_task+0xce/0x1a7
>  [<ffffffff8136b08c>] ? schedule+0x82c/0x906
> [...]

This trace is indeed very interesting, since dst_gc_task() runs from a
work queue, and there is no scheduling point in it.

We might need to add a scheduling point in dst_gc_task(), in case a huge
number of entries was flushed.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* [PATCH] dst: call cond_resched() in dst_gc_task() 2010-02-08 14:32 ` Eric Dumazet @ 2010-02-08 19:32 ` Eric Dumazet 2010-02-08 23:01 ` David Miller 2010-02-08 23:26 ` Andrew Morton 0 siblings, 2 replies; 22+ messages in thread From: Eric Dumazet @ 2010-02-08 19:32 UTC (permalink / raw) To: Paweł Staszewski, David Miller; +Cc: Linux Network Development list Le lundi 08 février 2010 à 15:32 +0100, Eric Dumazet a écrit : > Le lundi 08 février 2010 à 15:16 +0100, Paweł Staszewski a écrit : > > > > > > Some day ago after info about route cache i was have also this info: > > > Code: fe 79 4c 00 48 85 db 74 14 48 8b 74 24 10 48 89 ef ff 13 48 83 c3 08 48 > > 83 3b 00 eb ea 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f<c3> 55 48 89 f5 53 48 > > 89 fb 48 83 ec 08 48 8b 76 18 48 2b 75 10 > > Call Trace: > > <IRQ> [<ffffffff8126826f>] ? e1000_put_txbuf+0x62/0x74 > > [<ffffffff8126834a>] ? e1000_clean_tx_irq+0xc9/0x235 > > [<ffffffff8126b71b>] ? e1000_clean+0x5c/0x21c > > [<ffffffff812f29a3>] ? net_rx_action+0x71/0x15d > > [<ffffffff81035311>] ? __do_softirq+0xd7/0x196 > > [<ffffffff81002dac>] ? call_softirq+0x1c/0x28 > > [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7 > > [<ffffffff81002dac>] ? call_softirq+0x1c/0x28 > > <EOI> [<ffffffff81004599>] ? do_softirq+0x31/0x63 > > [<ffffffff81034ec1>] ? local_bh_enable_ip+0x75/0x86 > > [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7 > > [<ffffffff812f775d>] ? dst_gc_task+0xce/0x1a7 > > [<ffffffff8136b08c>] ? schedule+0x82c/0x906 > > [<ffffffff8103c44f>] ? lock_timer_base+0x26/0x4b > > [<ffffffff810a41d6>] ? cache_reap+0x0/0x11d > > [<ffffffff81044c38>] ? worker_thread+0x14c/0x1dc > > [<ffffffff81047dcd>] ? autoremove_wake_function+0x0/0x2e > > [<ffffffff81044aec>] ? worker_thread+0x0/0x1dc > > [<ffffffff810479bd>] ? kthread+0x79/0x81 > > [<ffffffff81002cb4>] ? kernel_thread_helper+0x4/0x10 > > [<ffffffff81047944>] ? kthread+0x0/0x81 > > > > > > [<ffffffff81002cb0>] ? 
kernel_thread_helper+0x0/0x10 > > > > > > This trace is indeed very interesting, since dst_gc_task() is run from a > work queue, and there is no scheduling point in it. > > We might need to add a scheduling point in dst_gc_task() in case a huge > number of entries were flushed. > David, here is the patch I sent to Pawel to solve this problem. This probably is a stable candidate. Thanks

[PATCH] dst: call cond_resched() in dst_gc_task()

On some workloads, it is quite possible to get a huge dst list to process in dst_gc_task(), and trigger soft lockup detection. Fix is to call cond_resched(), as we run in process context.

Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/core/dst.c b/net/core/dst.c
index 57bc4d5..cb1b348 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -17,6 +17,7 @@
 #include <linux/string.h>
 #include <linux/types.h>
 #include <net/net_namespace.h>
+#include <linux/sched.h>

 #include <net/dst.h>

@@ -79,6 +80,7 @@ loop:
	while ((dst = next) != NULL) {
		next = dst->next;
		prefetch(&next->next);
+		cond_resched();
		if (likely(atomic_read(&dst->__refcnt))) {
			last->next = dst;
			last = dst;
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH] dst: call cond_resched() in dst_gc_task() 2010-02-08 19:32 ` [PATCH] dst: call cond_resched() in dst_gc_task() Eric Dumazet @ 2010-02-08 23:01 ` David Miller 2010-02-09 6:07 ` Eric Dumazet 2010-02-08 23:26 ` Andrew Morton 1 sibling, 1 reply; 22+ messages in thread From: David Miller @ 2010-02-08 23:01 UTC (permalink / raw) To: eric.dumazet; +Cc: pstaszewski, netdev From: Eric Dumazet <eric.dumazet@gmail.com> Date: Mon, 08 Feb 2010 20:32:40 +0100 > [PATCH] dst: call cond_resched() in dst_gc_task() > > On some workloads, it is quite possible to get a huge dst list to > process in dst_gc_task(), and trigger soft lockup detection. > > Fix is to call cond_resched(), as we run in process context. > > Reported-by: Pawel Staszewski <pstaszewski@itcare.pl> > Tested-by: Pawel Staszewski <pstaszewski@itcare.pl> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Applied and queued up to -stable. When fixing bugs with kernel bugzilla entries, please mention them in the commit message. I fixed this up for you but please take care of it next time. Thanks! ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] dst: call cond_resched() in dst_gc_task() 2010-02-08 23:01 ` David Miller @ 2010-02-09 6:07 ` Eric Dumazet 0 siblings, 0 replies; 22+ messages in thread From: Eric Dumazet @ 2010-02-09 6:07 UTC (permalink / raw) To: David Miller; +Cc: pstaszewski, netdev Le lundi 08 février 2010 à 15:01 -0800, David Miller a écrit : > > When fixing bugs with kernel bugzilla entries, please > mention them in the commit message. I fixed this up for > you but please take care of it next time. > > Thanks! Sorry Dave, I was not aware of the bugzilla entry. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] dst: call cond_resched() in dst_gc_task() 2010-02-08 19:32 ` [PATCH] dst: call cond_resched() in dst_gc_task() Eric Dumazet 2010-02-08 23:01 ` David Miller @ 2010-02-08 23:26 ` Andrew Morton 2010-02-08 23:34 ` David Miller 1 sibling, 1 reply; 22+ messages in thread From: Andrew Morton @ 2010-02-08 23:26 UTC (permalink / raw) To: Eric Dumazet Cc: Paweł Staszewski, David Miller, Linux Network Development list On Mon, 08 Feb 2010 20:32:40 +0100 Eric Dumazet <eric.dumazet@gmail.com> wrote: > [PATCH] dst: call cond_resched() in dst_gc_task() > > On some workloads, it is quite possible to get a huge dst list to > process in dst_gc_task(), and trigger soft lockup detection. > > Fix is to call cond_resched(), as we run in process context. > > Reported-by: Pawel Staszewski <pstaszewski@itcare.pl> > Tested-by: Pawel Staszewski <pstaszewski@itcare.pl> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> > --- > > diff --git a/net/core/dst.c b/net/core/dst.c > index 57bc4d5..cb1b348 100644 > --- a/net/core/dst.c > +++ b/net/core/dst.c > @@ -17,6 +17,7 @@ > #include <linux/string.h> > #include <linux/types.h> > #include <net/net_namespace.h> > +#include <linux/sched.h> > > #include <net/dst.h> > > @@ -79,6 +80,7 @@ loop: > while ((dst = next) != NULL) { > next = dst->next; > prefetch(&next->next); > + cond_resched(); > if (likely(atomic_read(&dst->__refcnt))) { > last->next = dst; > last = dst; Gad. Am I understanding this right? The softlockup threshold is sixty seconds! I assume that this function spends most of its time walking over busy entries? Is a more powerful data structure needed? ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] dst: call cond_resched() in dst_gc_task() 2010-02-08 23:26 ` Andrew Morton @ 2010-02-08 23:34 ` David Miller 2010-02-08 23:37 ` Andrew Morton 0 siblings, 1 reply; 22+ messages in thread From: David Miller @ 2010-02-08 23:34 UTC (permalink / raw) To: akpm; +Cc: eric.dumazet, pstaszewski, netdev From: Andrew Morton <akpm@linux-foundation.org> Date: Mon, 8 Feb 2010 15:26:06 -0800 > I assume that this function spends most of its time walking over busy > entries? Is a more powerful data structure needed? When you're getting pounded with millions of packets per second, all mostly to different destinations (and thus resolving to different routing cache entries), this is what happens. For a busy router, really, this is normal behavior. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] dst: call cond_resched() in dst_gc_task() 2010-02-08 23:34 ` David Miller @ 2010-02-08 23:37 ` Andrew Morton 2010-02-08 23:50 ` David Miller 2010-02-08 23:50 ` Stephen Hemminger 0 siblings, 2 replies; 22+ messages in thread From: Andrew Morton @ 2010-02-08 23:37 UTC (permalink / raw) To: David Miller; +Cc: eric.dumazet, pstaszewski, netdev On Mon, 08 Feb 2010 15:34:06 -0800 (PST) David Miller <davem@davemloft.net> wrote: > From: Andrew Morton <akpm@linux-foundation.org> > Date: Mon, 8 Feb 2010 15:26:06 -0800 > > > I assume that this function spends most of its time walking over busy > > entries? Is a more powerful data structure needed? > > When you're getting pounded with millions of packets per second, > all mostly to different destinations (and thus resolving to > different routing cache entries), this is what happens. > > For a busy router, really, this is normal behavior. Is the cache a net win in that scenario? ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] dst: call cond_resched() in dst_gc_task() 2010-02-08 23:37 ` Andrew Morton @ 2010-02-08 23:50 ` David Miller 2010-02-08 23:50 ` Stephen Hemminger 1 sibling, 0 replies; 22+ messages in thread From: David Miller @ 2010-02-08 23:50 UTC (permalink / raw) To: akpm; +Cc: eric.dumazet, pstaszewski, netdev From: Andrew Morton <akpm@linux-foundation.org> Date: Mon, 8 Feb 2010 15:37:44 -0800 > On Mon, 08 Feb 2010 15:34:06 -0800 (PST) > David Miller <davem@davemloft.net> wrote: > >> For a busy router, really, this is normal behavior. > > Is the cache a net win in that scenario? Absolutely. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] dst: call cond_resched() in dst_gc_task() 2010-02-08 23:37 ` Andrew Morton 2010-02-08 23:50 ` David Miller @ 2010-02-08 23:50 ` Stephen Hemminger 2010-02-09 6:06 ` Eric Dumazet 1 sibling, 1 reply; 22+ messages in thread From: Stephen Hemminger @ 2010-02-08 23:50 UTC (permalink / raw) To: Andrew Morton; +Cc: David Miller, eric.dumazet, pstaszewski, netdev On Mon, 8 Feb 2010 15:37:44 -0800 Andrew Morton <akpm@linux-foundation.org> wrote: > On Mon, 08 Feb 2010 15:34:06 -0800 (PST) > David Miller <davem@davemloft.net> wrote: > > > From: Andrew Morton <akpm@linux-foundation.org> > > Date: Mon, 8 Feb 2010 15:26:06 -0800 > > > > > I assume that this function spends most of its time walking over busy > > > entries? Is a more powerful data structure needed? > > > > When you're getting pounded with millions of packets per second, > > all mostly to different destinations (and thus resolving to > > different routing cache entries), this is what happens. > > > > For a busy router, really, this is normal behavior. > > Is the cache a net win in that scenario? No, the cache doesn't help. Robert, who is the expert in this area, runs with FIB TRIE and no routing cache. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] dst: call cond_resched() in dst_gc_task() 2010-02-08 23:50 ` Stephen Hemminger @ 2010-02-09 6:06 ` Eric Dumazet 2010-02-09 6:35 ` Andrew Morton 0 siblings, 1 reply; 22+ messages in thread From: Eric Dumazet @ 2010-02-09 6:06 UTC (permalink / raw) To: Stephen Hemminger; +Cc: Andrew Morton, David Miller, pstaszewski, netdev Le lundi 08 février 2010 à 15:50 -0800, Stephen Hemminger a écrit : > No, cache doesn't help. > > Robert who is the expert in this area, runs with FIB TRIE and > no routing cache. Who knows, it probably depends on many factors. I always run with the cache enabled, because it saves cycles on moderate load. FIB_TRIE is unrelated here; if the routing table is very small, it fits either HASH or TRIE. Pawel hit the bug with tunables that basically enabled the cache but in a non-helpful way (filling the list of busy dst). User error combined with a lazy kernel function :) Please note that the conversion from softirq to workqueue, without a scheduling point, might/probably use the same cpu for handling network irqs and running dst_gc_task(): On big routers, admins usually use irq affinities, so we can have very little cpu time available to run other tasks on those cpus. After this patch, I believe that the scheduler is allowed to migrate dst_gc_task() to an idle cpu. Another point (for 2.6.34) to address is the dst_gc_mutex that can delay NETDEV_UNREGISTER/NETDEV_DOWN events for a long period. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] dst: call cond_resched() in dst_gc_task() 2010-02-09 6:06 ` Eric Dumazet @ 2010-02-09 6:35 ` Andrew Morton 2010-02-09 7:20 ` Eric Dumazet 0 siblings, 1 reply; 22+ messages in thread From: Andrew Morton @ 2010-02-09 6:35 UTC (permalink / raw) To: Eric Dumazet; +Cc: Stephen Hemminger, David Miller, pstaszewski, netdev On Tue, 09 Feb 2010 07:06:38 +0100 Eric Dumazet <eric.dumazet@gmail.com> wrote: > After this patch, I believe that scheduler is allowed to migrate > dst_gc_task() to an idle cpu. No, keventd threads are each pinned to a single CPU (kthread_bind() in start_workqueue_thread()), so dst_gc_task() gets run on the CPU which ran schedule_delayed_work() and no other. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] dst: call cond_resched() in dst_gc_task() 2010-02-09 6:35 ` Andrew Morton @ 2010-02-09 7:20 ` Eric Dumazet 2010-02-09 7:31 ` Andrew Morton 0 siblings, 1 reply; 22+ messages in thread From: Eric Dumazet @ 2010-02-09 7:20 UTC (permalink / raw) To: Andrew Morton; +Cc: Stephen Hemminger, David Miller, pstaszewski, netdev Le lundi 08 février 2010 à 22:35 -0800, Andrew Morton a écrit : > On Tue, 09 Feb 2010 07:06:38 +0100 Eric Dumazet <eric.dumazet@gmail.com> wrote: > > > After this patch, I believe that scheduler is allowed to migrate > > dst_gc_task() to an idle cpu. > > No, keventd threads are each pinned to a single CPU (kthread_bind() in > start_workqueue_thread()), so dst_gc_task() gets run on the CPU which > ran schedule_delayed_work() and no other. > Ah OK, thanks Andrew for this clarification. I suppose offlining a cpu migrates its works to another (online) cpu ? ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH] dst: call cond_resched() in dst_gc_task() 2010-02-09 7:20 ` Eric Dumazet @ 2010-02-09 7:31 ` Andrew Morton 0 siblings, 0 replies; 22+ messages in thread From: Andrew Morton @ 2010-02-09 7:31 UTC (permalink / raw) To: Eric Dumazet; +Cc: Stephen Hemminger, David Miller, pstaszewski, netdev On Tue, 09 Feb 2010 08:20:36 +0100 Eric Dumazet <eric.dumazet@gmail.com> wrote: > I suppose offlining a cpu migrates its works to another (online) cpu ? Sort of, effectively. The workqueue code runs all the pending works on the to-be-offlined CPU and then it's done. schedule_delayed_work() starts out with a timer, and the timer code _does_ perform migration off the going-away CPU. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Problem wit route cache 2010-02-08 14:16 ` Paweł Staszewski 2010-02-08 14:32 ` Eric Dumazet @ 2010-02-08 14:32 ` Paweł Staszewski 2010-02-08 14:45 ` Paweł Staszewski 1 sibling, 1 reply; 22+ messages in thread From: Paweł Staszewski @ 2010-02-08 14:32 UTC (permalink / raw) To: Eric Dumazet; +Cc: Linux Network Development list W dniu 2010-02-08 15:16, Paweł Staszewski pisze: > W dniu 2010-02-08 15:06, Eric Dumazet pisze: >> Le lundi 08 février 2010 à 14:59 +0100, Paweł Staszewski a écrit : >>> W dniu 2010-02-08 14:51, Eric Dumazet pisze: >>>> Le lundi 08 février 2010 à 14:33 +0100, Paweł Staszewski a écrit : >>>> >>>> >>>>>> >>>>> Yes this is x86_64 kernel >>>>> i kernels 2.6.32.2 / 2.6.32.7 and now 2.6.33-rc6-git5 and on all >>>>> kernels the same thing happens. >>>>> grep . /proc/sys/net/ipv4/route/* >>>>> /proc/sys/net/ipv4/route/error_burst:1250 >>>>> /proc/sys/net/ipv4/route/error_cost:250 >>>>> grep: /proc/sys/net/ipv4/route/flush: Permission denied >>>>> /proc/sys/net/ipv4/route/gc_elasticity:2 >>>>> /proc/sys/net/ipv4/route/gc_interval:2 >>>>> /proc/sys/net/ipv4/route/gc_min_interval:0 >>>>> /proc/sys/net/ipv4/route/gc_min_interval_ms:500 >>>>> /proc/sys/net/ipv4/route/gc_thresh:65535 >>>>> /proc/sys/net/ipv4/route/gc_timeout:300 >>>>> /proc/sys/net/ipv4/route/max_size:524288 >>>>> /proc/sys/net/ipv4/route/min_adv_mss:256 >>>>> /proc/sys/net/ipv4/route/min_pmtu:552 >>>>> /proc/sys/net/ipv4/route/mtu_expires:600 >>>>> /proc/sys/net/ipv4/route/redirect_load:5 >>>>> /proc/sys/net/ipv4/route/redirect_number:9 >>>>> /proc/sys/net/ipv4/route/redirect_silence:5120 >>>>> /proc/sys/net/ipv4/route/secret_interval:2 >>>>> >>>>> This happens not all the time. >>>>> I have this info only when there are "internet rush hours" - thn >>>>> there >>>>> is about 700Mbit/s TX + 700Mbit/s RX forwarded traffic >>>>> >>>>> >>>> I dont understand your settings, they are very very small for your >>>> setup. You want to flush cache every 2 seconds... 
>>>> >>>> With 12GB of ram, you could have >>>> >>>> /proc/sys/net/ipv4/route/gc_thresh:524288 >>>> /proc/sys/net/ipv4/route/max_size:8388608 >>>> /proc/sys/net/ipv4/route/secret_interval:3600 >>>> /proc/sys/net/ipv4/route/gc_elasticity:4 >>>> /proc/sys/net/ipv4/route/gc_interval:1 >>>> >>>> That would allow about 2 million entries in your route cache, using >>>> 768 >>>> Mbytes of ram, and a good cache hit ratio. >>>> >>>> >>>> >>> Yes as i write i change this settings after i see first info >>> "secret_interval" - from 3600 to 2 >>> To check if this resolve the problem. >>> Also my normal settings are: >>> >>> /proc/sys/net/ipv4/route/gc_thresh:256000 >>> /proc/sys/net/ipv4/route/max_size:1048576 >>> /proc/sys/net/ipv4/route/secret_interval:3600 >>> /proc/sys/net/ipv4/route/gc_interval:2 >>> /proc/sys/net/ipv4/route/gc_elasticity:2 >>> >>> And with this setting i was have this info: >>> Route hash chain too long! >>> Adjust your secret_interval! >>> >>> >>> >>> Now i put Your settings as You suggest ... and we will see but i >>> dont know it will help. >>> Because i try many of different settings. >>> >> One important point is the size of hash table, you want something big >> for your router. >> >> # dmesg | grep 'IP route' >> ... IP route cache hash table entries: 524288 (order: 10, 4194304 >> bytes) >> > > On my machine it is also the same: > dmesg | grep 'IP route' > IP route cache hash table entries: 524288 (order: 10, 4194304 bytes) > > >> Then if it is correctly sized, dont change gc_thresh or max_size, as >> defaults are good. >> >> I would only change gc_interval to 1, to perform a smooth gc >> >> And eventually gc_elasticity to 4, 5 or 6 if I had less ram than your >> machine. 
>> > Some day ago after info about route cache i was have also this info: > Feb 4 13:12:40 TM_01_C1 ------------[ cut here ]------------ > Feb 4 13:12:40 TM_01_C1 WARNING: at net/sched/sch_generic.c:261 > dev_watchdog+0x130/0x1d6() > Feb 4 13:12:40 TM_01_C1 Hardware name: X7DCT > Feb 4 13:12:40 TM_01_C1 NETDEV WATCHDOG: eth0 (e1000e): transmit > queue 0 timed out > Feb 4 13:12:40 TM_01_C1 Modules linked in: oprofile > Feb 4 13:12:40 TM_01_C1 Pid: 0, comm: swapper Not tainted 2.6.32 #1 > Feb 4 13:12:40 TM_01_C1 Call Trace: > Feb 4 13:12:40 TM_01_C1 <IRQ> [<ffffffff812fcaf7>] ? > dev_watchdog+0x130/0x1d6 > Feb 4 13:12:40 TM_01_C1 [<ffffffff812fcaf7>] ? dev_watchdog+0x130/0x1d6 > Feb 4 13:12:40 TM_01_C1 [<ffffffff81038811>] ? > warn_slowpath_common+0x77/0xa3 > Feb 4 13:12:40 TM_01_C1 [<ffffffff81038899>] ? > warn_slowpath_fmt+0x51/0x59 > Feb 4 13:12:40 TM_01_C1 [<ffffffff8102897e>] ? activate_task+0x3f/0x4e > Feb 4 13:12:40 TM_01_C1 [<ffffffff81034fe5>] ? > try_to_wake_up+0x1eb/0x1f8 > Feb 4 13:12:40 TM_01_C1 [<ffffffff812eb768>] ? > netdev_drivername+0x3b/0x40 > Feb 4 13:12:40 TM_01_C1 [<ffffffff812fcaf7>] ? dev_watchdog+0x130/0x1d6 > Feb 4 13:12:40 TM_01_C1 [<ffffffff8102d1e3>] ? __wake_up+0x30/0x44 > Feb 4 13:12:40 TM_01_C1 [<ffffffff812fc9c7>] ? dev_watchdog+0x0/0x1d6 > Feb 4 13:12:40 TM_01_C1 [<ffffffff810448c4>] ? > run_timer_softirq+0x1ff/0x29d > Feb 4 13:12:40 TM_01_C1 [<ffffffff810556ab>] ? ktime_get+0x5f/0xb7 > Feb 4 13:12:40 TM_01_C1 [<ffffffff8103e0fd>] ? __do_softirq+0xd7/0x196 > Feb 4 13:12:40 TM_01_C1 [<ffffffff8100be7c>] ? call_softirq+0x1c/0x28 > Feb 4 13:12:40 TM_01_C1 [<ffffffff8100d645>] ? do_softirq+0x31/0x66 > Feb 4 13:12:40 TM_01_C1 [<ffffffff8101b148>] ? > smp_apic_timer_interrupt+0x87/0x95 > Feb 4 13:12:40 TM_01_C1 [<ffffffff8100b873>] ? > apic_timer_interrupt+0x13/0x20 > Feb 4 13:12:40 TM_01_C1 <EOI> [<ffffffff810111f5>] ? > mwait_idle+0x9b/0xa0 > Feb 4 13:12:40 TM_01_C1 [<ffffffff8100a236>] ? 
cpu_idle+0x49/0x7c > Feb 4 13:12:40 TM_01_C1 ---[ end trace c670a6a17be040e5 ]--- > > And after change kernel to 2.6.33-rc6 another different inf: > > BUG: soft lockup - CPU#1 stuck for 61s! > [events/1:28] > Modules linked in: > CPU 1 > Pid: 28, comm: events/1 Not tainted 2.6.33-rc6-git5 #1 X7DCT/X7DCT > RIP: 0010:[<ffffffff810a3d89>] [<ffffffff810a3d89>] > kmem_cache_free+0x11b/0x11c > RSP: 0018:ffff880028243e50 EFLAGS: 00000292 > RAX: 0000000000000032 RBX: 000000000000007d RCX: ffff8803190683c0 > RDX: 0000000000000031 RSI: ffff8803190683c0 RDI: ffff88031f83e680 > RBP: ffffffff81002893 R08: 0000000000000000 R09: 000000000000007c > R10: ffff88030d776800 R11: ffff88030d7768a0 R12: ffff880028243dd0 > R13: ffffc900008b2f80 R14: ffff88031fa7c800 R15: ffffffff81012da7 > FS: 0000000000000000(0000) GS:ffff880028240000(0000) > knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > CR2: 00007fd61d5bd000 CR3: 000000031e55c000 CR4: 00000000000006a0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > Process events/1 (pid: 28, threadinfo ffff88031f9c8000, task > ffff88031f9a4f80) > Stack: > ffffffff8126826f ffff88031faa4600 ffffffff8126834a 000096ba00000023 > <0> 01ffc90000000024 ffff88031fbb4000 ffff88031faa4600 0000000000000040 > <0> 0000000000000040 ffff88031faa4788 ffff88031faa4600 0000000000000740 > Call Trace: > <IRQ> > [<ffffffff8126826f>] ? e1000_put_txbuf+0x62/0x74 > [<ffffffff8126834a>] ? e1000_clean_tx_irq+0xc9/0x235 > [<ffffffff8126b71b>] ? e1000_clean+0x5c/0x21c > [<ffffffff812f29a3>] ? net_rx_action+0x71/0x15d > [<ffffffff81035311>] ? __do_softirq+0xd7/0x196 > [<ffffffff81002dac>] ? call_softirq+0x1c/0x28 > [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7 > [<ffffffff81002dac>] ? call_softirq+0x1c/0x28 > <EOI> > [<ffffffff81004599>] ? do_softirq+0x31/0x63 > [<ffffffff81034ec1>] ? local_bh_enable_ip+0x75/0x86 > [<ffffffff812f768f>] ? 
dst_gc_task+0x0/0x1a7 > [<ffffffff812f775d>] ? dst_gc_task+0xce/0x1a7 > [<ffffffff8136b08c>] ? schedule+0x82c/0x906 > [<ffffffff8103c44f>] ? lock_timer_base+0x26/0x4b > [<ffffffff810a41d6>] ? cache_reap+0x0/0x11d > [<ffffffff81044c38>] ? worker_thread+0x14c/0x1dc > [<ffffffff81047dcd>] ? autoremove_wake_function+0x0/0x2e > [<ffffffff81044aec>] ? worker_thread+0x0/0x1dc > [<ffffffff810479bd>] ? kthread+0x79/0x81 > [<ffffffff81002cb4>] ? kernel_thread_helper+0x4/0x10 > [<ffffffff81047944>] ? kthread+0x0/0x81 > [<ffffffff81002cb0>] ? kernel_thread_helper+0x0/0x10 > Code: fe 79 4c 00 48 85 db 74 14 48 8b 74 24 10 48 89 ef ff 13 48 83 > c3 08 48 > 83 3b 00 eb ea 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f<c3> 55 48 89 > f5 53 48 > 89 fb 48 83 ec 08 48 8b 76 18 48 2b 75 10 > Call Trace: > <IRQ> [<ffffffff8126826f>] ? e1000_put_txbuf+0x62/0x74 > [<ffffffff8126834a>] ? e1000_clean_tx_irq+0xc9/0x235 > [<ffffffff8126b71b>] ? e1000_clean+0x5c/0x21c > [<ffffffff812f29a3>] ? net_rx_action+0x71/0x15d > [<ffffffff81035311>] ? __do_softirq+0xd7/0x196 > [<ffffffff81002dac>] ? call_softirq+0x1c/0x28 > [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7 > [<ffffffff81002dac>] ? call_softirq+0x1c/0x28 > <EOI> [<ffffffff81004599>] ? do_softirq+0x31/0x63 > [<ffffffff81034ec1>] ? local_bh_enable_ip+0x75/0x86 > [<ffffffff812f768f>] ? dst_gc_task+0x0/0x1a7 > [<ffffffff812f775d>] ? dst_gc_task+0xce/0x1a7 > [<ffffffff8136b08c>] ? schedule+0x82c/0x906 > [<ffffffff8103c44f>] ? lock_timer_base+0x26/0x4b > [<ffffffff810a41d6>] ? cache_reap+0x0/0x11d > [<ffffffff81044c38>] ? worker_thread+0x14c/0x1dc > [<ffffffff81047dcd>] ? autoremove_wake_function+0x0/0x2e > [<ffffffff81044aec>] ? worker_thread+0x0/0x1dc > [<ffffffff810479bd>] ? kthread+0x79/0x81 > [<ffffffff81002cb4>] ? kernel_thread_helper+0x4/0x10 > [<ffffffff81047944>] ? kthread+0x0/0x81 > > > [<ffffffff81002cb0>] ? 
kernel_thread_helper+0x0/0x10 > > > And other weird thing is that when i make affinity for nics and i bind eth0 to cpu0 and eth1 to cpu2 i think i have too much cpu load: mpstat -P ALL 1 10 Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle Average: all 0.00 0.00 0.00 0.10 1.63 16.71 0.00 0.00 81.56 Average: 0 0.00 0.00 0.00 0.00 5.10 72.80 0.00 0.00 22.10 Average: 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 Average: 2 0.00 0.00 0.00 0.00 8.00 61.00 0.00 0.00 31.00 Average: 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 Average: 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 Average: 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 Average: 6 0.00 0.00 0.00 0.70 0.00 0.00 0.00 0.00 99.30 Average: 7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 As You see there is only 22% and 31% cpu idle With forwarded traffic like here: bwm-ng v0.6 (probing every 3.000s), press 'h' for help input: /proc/net/dev type: rate - iface Rx Tx Total ============================================================================== lo: 0.00 b/s 0.00 b/s 0.00 b/s eth0: 346.64 Mb/s 487.24 Mb/s 833.88 Mb/s eth1: 487.48 Mb/s 344.14 Mb/s 831.61 Mb/s vlan0811: 273.29 Mb/s 381.71 Mb/s 655.01 Mb/s vlan0508: 64.62 Mb/s 105.54 Mb/s 170.15 Mb/s ------------------------------------------------------------------------------ total: 1.14 Gb/s 1.29 Gb/s 2.43 Gb/s > >> >> >> -- >> To unsubscribe from this list: send the line "unsubscribe netdev" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> > > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Problem wit route cache 2010-02-08 14:32 ` Problem wit route cache Paweł Staszewski @ 2010-02-08 14:45 ` Paweł Staszewski 0 siblings, 0 replies; 22+ messages in thread From: Paweł Staszewski @ 2010-02-08 14:45 UTC (permalink / raw) To: Eric Dumazet; +Cc: Linux Network Development list W dniu 2010-02-08 15:32, Paweł Staszewski pisze:
> [ quoted text trimmed -- identical to the 14:32 message above ]
Ok, I forgot to add that on this router there is traffic management:

tc -s -d filter show dev eth1 | grep flowid | wc -l
9096
tc -s -d filter show dev vlan0811 | grep flowid | wc -l
9096

Those are iproute hashing filters. Without the filters on the interfaces I have 50% idle - so the iproute traffic management takes about 30% more CPU.
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
end of thread, other threads:[~2010-02-09 7:31 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2010-02-08 13:16 Problem wit route cache Paweł Staszewski
2010-02-08 13:28 ` Eric Dumazet
2010-02-08 13:33 ` Paweł Staszewski
2010-02-08 13:51 ` Eric Dumazet
2010-02-08 13:59 ` Paweł Staszewski
2010-02-08 14:06 ` Eric Dumazet
2010-02-08 14:16 ` Paweł Staszewski
2010-02-08 14:32 ` Eric Dumazet
2010-02-08 19:32 ` [PATCH] dst: call cond_resched() in dst_gc_task() Eric Dumazet
2010-02-08 23:01 ` David Miller
2010-02-09 6:07 ` Eric Dumazet
2010-02-08 23:26 ` Andrew Morton
2010-02-08 23:34 ` David Miller
2010-02-08 23:37 ` Andrew Morton
2010-02-08 23:50 ` David Miller
2010-02-08 23:50 ` Stephen Hemminger
2010-02-09 6:06 ` Eric Dumazet
2010-02-09 6:35 ` Andrew Morton
2010-02-09 7:20 ` Eric Dumazet
2010-02-09 7:31 ` Andrew Morton
2010-02-08 14:32 ` Problem wit route cache Paweł Staszewski
2010-02-08 14:45 ` Paweł Staszewski