* rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
@ 2009-06-25 15:48 Paweł Staszewski
2009-06-25 21:19 ` Eric Dumazet
0 siblings, 1 reply; 99+ messages in thread
From: Paweł Staszewski @ 2009-06-25 15:48 UTC (permalink / raw)
To: Linux Network Development list
Hello ALL
Some time ago I reported this:
http://bugzilla.kernel.org/show_bug.cgi?id=6648
and now, with 2.6.29 / 2.6.29.1 / 2.6.29.3 and 2.6.30, it is back.
dmesg output:
oprofile: using NMI interrupt.
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
cat /proc/net/fib_triestat
Basic info: size of leaf: 40 bytes, size of tnode: 56 bytes.
Main:
Aver depth: 2.28
Max depth: 6
Leaves: 276539
Prefixes: 289922
Internal nodes: 66762
1: 35046 2: 13824 3: 9508 4: 4897 5: 2331 6: 1149 7: 5
9: 1 18: 1
Pointers: 691228
Null ptrs: 347928
Total size: 35709 kB
Counters:
---------
gets = 26276593
backtracks = 547306
semantic match passed = 26188746
semantic match miss = 1117
null node hit= 27285055
skipped node resize = 0
Local:
Aver depth: 3.33
Max depth: 4
Leaves: 9
Prefixes: 10
Internal nodes: 8
1: 8
Pointers: 16
Null ptrs: 0
Total size: 2 kB
Counters:
---------
gets = 26642350
backtracks = 1282818
semantic match passed = 18166
semantic match miss = 0
null node hit= 0
skipped node resize = 0
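
As a cross-check of the "Main:" table above, the node histogram can be summed by hand. The sketch below is only an illustration: it assumes each "bits: count" pair means that `count` internal nodes have 2**bits child slots, and the variable and function names are made up for this example. Under that assumption the histogram adds up to the reported "Internal nodes" and "Pointers" values, and the single 18-bit node (presumably the root) accounts for 262144 of the 691228 slots.

```python
# Values copied from the "Main:" section of /proc/net/fib_triestat above.
histogram = {1: 35046, 2: 13824, 3: 9508, 4: 4897, 5: 2331,
             6: 1149, 7: 5, 9: 1, 18: 1}
leaves = 276539
internal_nodes = 66762
pointers = 691228
null_ptrs = 347928


def total_pointers(hist):
    """Sum the child slots, assuming a node with `bits` bits has 2**bits slots."""
    return sum(count << bits for bits, count in hist.items())


def non_null_children(n_leaves, n_internal):
    """Assumption: every leaf and every internal node except the root
    fills exactly one child slot."""
    return n_leaves + n_internal - 1


# Slots in the largest node (the 18-bit one).
root_slots = 1 << max(histogram)

print("internal nodes from histogram:", sum(histogram.values()))
print("pointers from histogram:      ", total_pointers(histogram))
print("non-null slots (reported):    ", pointers - null_ptrs)
print("non-null slots (derived):     ", non_null_children(leaves, internal_nodes))
print("share of slots in 18-bit node:", round(root_slots / pointers, 3))
```

The derived and reported non-null counts agree for these numbers, which suggests the reading of the table is right, but that is an observation about this one output, not a statement about the kernel code.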
This machine is running bgpd with two BGP peers and a full route table.
cat /proc/meminfo
MemTotal: 12279032 kB
MemFree: 11521920 kB
Buffers: 80288 kB
Cached: 34416 kB
SwapCached: 0 kB
Active: 286816 kB
Inactive: 82024 kB
Active(anon): 254296 kB
Inactive(anon): 0 kB
Active(file): 32520 kB
Inactive(file): 82024 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 987988 kB
SwapFree: 987988 kB
Dirty: 1140 kB
Writeback: 0 kB
AnonPages: 254164 kB
Mapped: 5440 kB
Slab: 365084 kB
SReclaimable: 28784 kB
SUnreclaim: 336300 kB
PageTables: 2104 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 7127504 kB
Committed_AS: 267704 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 11824 kB
VmallocChunk: 34359707815 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 3392 kB
DirectMap2M: 12578816 kB
Interface MTU is 1500.
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-25 15:48 rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits Paweł Staszewski
@ 2009-06-25 21:19 ` Eric Dumazet
2009-06-25 21:52 ` Paweł Staszewski
2009-06-26 8:03 ` Jarek Poplawski
0 siblings, 2 replies; 99+ messages in thread
From: Eric Dumazet @ 2009-06-25 21:19 UTC (permalink / raw)
To: Paweł Staszewski; +Cc: Linux Network Development list

Paweł Staszewski wrote:
> Hello ALL
>
> Some time ago i report this:
> http://bugzilla.kernel.org/show_bug.cgi?id=6648
>
> and now with 2.6.29 / 2.6.29.1 / 2.6.29.3 and 2.6.30 it back
> dmesg output:
> oprofile: using NMI interrupt.
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits

Curious, you seem to hit an old alloc_pages() limit... (MAX_ORDER allocation)

Your root node has 2^18 = 262144 pointers of 8 bytes -> 2097152 bytes (+ header -> 4194304 bytes)

But since the following commit we should use vmalloc(), so this (PAGE_SIZE<<10) limit
should no longer apply.

Could you do a "cat /proc/vmallocinfo", just to check that your big tnodes are vmalloc()ed?

commit 15be75cdb5db442d0e33d37b20832b88f3ccd383
Author: Stephen Hemminger <shemminger@vyatta.com>
Date:   Thu Apr 10 02:56:38 2008 -0700

    IPV4: fib_trie use vmalloc for large tnodes

    Use vmalloc rather than alloc_pages to avoid wasting memory.
    The problem is that tnode structure has a power of 2 sized array,
    plus a header. So the current code wastes almost half the memory
    allocated because it always needs the next bigger size to hold
    that small header.

    This is similar to an earlier patch by Eric, but instead of a list
    and lock, I used a workqueue to handle the fact that vfree can't
    be done in interrupt context.

    Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
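
The arithmetic in Eric's reply and in the commit message can be reproduced with a few lines. This is a sketch under stated assumptions, not kernel code: it assumes 4 KiB pages, takes the 56-byte header from the "size of tnode" line in fib_triestat, and models alloc_pages() as rounding the page count up to a power of two. All function names are hypothetical.

```python
PAGE_SIZE = 4096                    # assumption: 4 KiB pages
TNODE_HEADER = 56                   # "size of tnode: 56 bytes" from fib_triestat
MAX_BUDDY_BYTES = PAGE_SIZE << 10   # the PAGE_SIZE<<10 limit mentioned above


def tnode_bytes(bits, ptr_size=8, header=TNODE_HEADER):
    """Header plus an array of 2**bits child pointers."""
    return header + (1 << bits) * ptr_size


def pages_needed(nbytes, page_size=PAGE_SIZE):
    """Round a byte count up to whole pages."""
    return -(-nbytes // page_size)


def buddy_bytes(nbytes, page_size=PAGE_SIZE):
    """Model of alloc_pages(): 2**order pages, so the page count is
    rounded up to the next power of two."""
    order = (pages_needed(nbytes, page_size) - 1).bit_length()
    return page_size << order


def vmalloc_bytes(nbytes, page_size=PAGE_SIZE):
    """Model of vmalloc(): whole pages, no power-of-two rounding
    (the guard page is not counted here)."""
    return pages_needed(nbytes, page_size) * page_size


root = tnode_bytes(18)
print("tnode bytes:      ", root)
print("alloc_pages bytes:", buddy_bytes(root), "limit:", MAX_BUDDY_BYTES)
print("vmalloc bytes:    ", vmalloc_bytes(root))
```

With these assumptions the 18-bit root needs 513 pages; the small header is what pushes the power-of-two allocation from 2 MiB up to 4 MiB, which is the waste the commit message describes.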
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-25 21:19 ` Eric Dumazet
@ 2009-06-25 21:52 ` Paweł Staszewski
2009-06-25 22:54 ` Eric Dumazet
2009-06-26 8:03 ` Jarek Poplawski
1 sibling, 1 reply; 99+ messages in thread
From: Paweł Staszewski @ 2009-06-25 21:52 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Linux Network Development list

cat /proc/vmallocinfo
0xf7ffe000-0xf8000000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfe6a000 ioremap
0xf8000000-0xf8007000 28672 acpi_tb_verify_table+0x1d/0x46 phys=dfef5000 ioremap
0xf8008000-0xf800a000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfef2000 ioremap
0xf800c000-0xf800e000 8192 acpi_ex_system_memory_space_handler+0xd6/0x208 phys=fed1f000 ioremap
0xf8010000-0xf8012000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfefb000 ioremap
0xf8014000-0xf8016000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfef4000 ioremap
0xf8018000-0xf801a000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfef3000 ioremap
0xf801c000-0xf801e000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfef1000 ioremap
0xf8020000-0xf8022000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfef0000 ioremap
0xf8024000-0xf8026000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfeef000 ioremap
0xf8028000-0xf802a000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfeee000 ioremap
0xf802c000-0xf802e000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfeed000 ioremap
0xf8030000-0xf8032000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfeec000 ioremap
0xf8038000-0xf803d000 20480 ich_force_enable_hpet+0x69/0x15a phys=fed1c000 ioremap
0xf803e000-0xf8040000 8192 hpet_enable+0x2a/0x21b phys=fed00000 ioremap
0xf8040000-0xf8046000 24576 alloc_iommu+0x18d/0x1d4 phys=feb00000 ioremap
0xf8048000-0xf804a000 8192 pcim_iomap+0x2f/0x3a phys=e1b21000 ioremap
0xf804c000-0xf804e000 8192 e1000_probe+0x229/0xa73 phys=e1b20000 ioremap
0xf804f000-0xf8051000 8192 reiserfs_init_bitmap_cache+0x32/0x65 pages=1 vmalloc
0xf8052000-0xf8064000 73728 journal_init+0x30/0x82a pages=17 vmalloc
0xf8065000-0xf8067000 8192 reiserfs_allocate_list_bitmaps+0x27/0x7e pages=1 vmalloc
0xf8068000-0xf806a000 8192 reiserfs_allocate_list_bitmaps+0x27/0x7e pages=1 vmalloc
0xf806b000-0xf806d000 8192 reiserfs_allocate_list_bitmaps+0x27/0x7e pages=1 vmalloc
0xf806e000-0xf8070000 8192 reiserfs_allocate_list_bitmaps+0x27/0x7e pages=1 vmalloc
0xf8071000-0xf8073000 8192 reiserfs_allocate_list_bitmaps+0x27/0x7e pages=1 vmalloc
0xf8080000-0xf80a1000 135168 e1000_probe+0x1ca/0xa73 phys=e1b00000 ioremap
0xf80a2000-0xf80a6000 16384 e1000e_setup_rx_resources+0x20/0xf7 pages=3 vmalloc
0xf80a7000-0xf80ab000 16384 e1000e_setup_tx_resources+0x17/0x96 pages=3 vmalloc
0xf80ac000-0xf80b0000 16384 e1000e_setup_rx_resources+0x20/0xf7 pages=3 vmalloc
0xf80b1000-0xf80b5000 16384 e1000e_setup_tx_resources+0x17/0x96 pages=3 vmalloc
0xf80c0000-0xf80e1000 135168 e1000_probe+0x1ca/0xa73 phys=e1a60000 ioremap
0xf8100000-0xf8121000 135168 e1000_probe+0x1ca/0xa73 phys=e1a20000 ioremap
0xf8122000-0xf81b3000 593920 journal_init+0x65b/0x82a pages=144 vmalloc
0xf81b4000-0xf822f000 503808 sys_swapon+0x392/0x8f3 pages=122 vmalloc
0xf846a000-0xf856c000 1056768 tnode_new+0x35/0x65 pages=257 vmalloc
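
To pick the tnode allocation out of a /proc/vmallocinfo listing like the one above, a small parser is enough. The sketch below assumes the line layout seen in this thread (address range, size in bytes, caller, then attributes such as `pages=N` and `vmalloc`) and 4 KiB pages; the regular expression and the helper names are made up for this example. It also checks an observation from the listing: each vmalloc entry's size equals (pages + 1) * 4096, i.e. one extra page beyond the data pages.

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages

# Three rows copied from the listing above.
SAMPLE = """\
0xf8052000-0xf8064000 73728 journal_init+0x30/0x82a pages=17 vmalloc
0xf8080000-0xf80a1000 135168 e1000_probe+0x1ca/0xa73 phys=e1b00000 ioremap
0xf846a000-0xf856c000 1056768 tnode_new+0x35/0x65 pages=257 vmalloc
"""

LINE_RE = re.compile(r"^0x([0-9a-f]+)-0x([0-9a-f]+)\s+(\d+)\s+(\S+)(.*)$")


def parse_vmallocinfo(text):
    """Turn vmallocinfo text into a list of dicts, skipping lines that
    do not match the assumed layout."""
    entries = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        start, end, size, caller, rest = m.groups()
        pages = re.search(r"pages=(\d+)", rest)
        entries.append({
            "start": int(start, 16),
            "end": int(end, 16),
            "size": int(size),
            "caller": caller.split("+")[0],
            "pages": int(pages.group(1)) if pages else None,
            "vmalloc": "vmalloc" in rest.split(),
        })
    return entries


def tnode_entries(entries):
    """Entries allocated by tnode_new() through vmalloc."""
    return [e for e in entries if e["caller"] == "tnode_new" and e["vmalloc"]]


def max_child_bits(pages, ptr_size, page_size=PAGE_SIZE):
    """Largest `bits` whose 2**bits pointer array fits in `pages` pages
    (the tnode header is ignored in this estimate)."""
    return (pages * page_size // ptr_size).bit_length() - 1


for e in tnode_entries(parse_vmallocinfo(SAMPLE)):
    print(e["size"], e["pages"], max_child_bits(e["pages"], ptr_size=4))
```

With 4-byte pointers, 257 pages fit an 18-bit pointer array plus a small header, which is consistent with this listing coming from a 32-bit kernel; with 8-byte pointers the same area would only fit 17 bits.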
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-25 21:52 ` Paweł Staszewski
@ 2009-06-25 22:54 ` Eric Dumazet
2009-06-26 10:06 ` Paweł Staszewski
0 siblings, 1 reply; 99+ messages in thread
From: Eric Dumazet @ 2009-06-25 22:54 UTC (permalink / raw)
To: Paweł Staszewski; +Cc: Linux Network Development list

Paweł Staszewski wrote:
>
> cat /proc/vmallocinfo
> 0xf7ffe000-0xf8000000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfe6a000 ioremap
> 0xf8000000-0xf8007000 28672 acpi_tb_verify_table+0x1d/0x46 phys=dfef5000 ioremap
> 0xf8008000-0xf800a000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfef2000 ioremap
> 0xf800c000-0xf800e000 8192 acpi_ex_system_memory_space_handler+0xd6/0x208 phys=fed1f000 ioremap
> 0xf8010000-0xf8012000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfefb000 ioremap
> 0xf8014000-0xf8016000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfef4000 ioremap
> 0xf8018000-0xf801a000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfef3000 ioremap
> 0xf801c000-0xf801e000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfef1000 ioremap
> 0xf8020000-0xf8022000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfef0000 ioremap
> 0xf8024000-0xf8026000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfeef000 ioremap
> 0xf8028000-0xf802a000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfeee000 ioremap
> 0xf802c000-0xf802e000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfeed000 ioremap
> 0xf8030000-0xf8032000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfeec000 ioremap
> 0xf8038000-0xf803d000 20480 ich_force_enable_hpet+0x69/0x15a phys=fed1c000 ioremap
> 0xf803e000-0xf8040000 8192 hpet_enable+0x2a/0x21b phys=fed00000 ioremap
> 0xf8040000-0xf8046000 24576 alloc_iommu+0x18d/0x1d4 phys=feb00000 ioremap
> 0xf8048000-0xf804a000 8192 pcim_iomap+0x2f/0x3a phys=e1b21000 ioremap
> 0xf804c000-0xf804e000 8192 e1000_probe+0x229/0xa73 phys=e1b20000 ioremap
> 0xf804f000-0xf8051000 8192 reiserfs_init_bitmap_cache+0x32/0x65 pages=1 vmalloc
> 0xf8052000-0xf8064000 73728 journal_init+0x30/0x82a pages=17 vmalloc
> 0xf8065000-0xf8067000 8192 reiserfs_allocate_list_bitmaps+0x27/0x7e pages=1 vmalloc
> 0xf8068000-0xf806a000 8192 reiserfs_allocate_list_bitmaps+0x27/0x7e pages=1 vmalloc
> 0xf806b000-0xf806d000 8192 reiserfs_allocate_list_bitmaps+0x27/0x7e pages=1 vmalloc
> 0xf806e000-0xf8070000 8192 reiserfs_allocate_list_bitmaps+0x27/0x7e pages=1 vmalloc
> 0xf8071000-0xf8073000 8192 reiserfs_allocate_list_bitmaps+0x27/0x7e pages=1 vmalloc
> 0xf8080000-0xf80a1000 135168 e1000_probe+0x1ca/0xa73 phys=e1b00000 ioremap
> 0xf80a2000-0xf80a6000 16384 e1000e_setup_rx_resources+0x20/0xf7 pages=3 vmalloc
> 0xf80a7000-0xf80ab000 16384 e1000e_setup_tx_resources+0x17/0x96 pages=3 vmalloc
> 0xf80ac000-0xf80b0000 16384 e1000e_setup_rx_resources+0x20/0xf7 pages=3 vmalloc
> 0xf80b1000-0xf80b5000 16384 e1000e_setup_tx_resources+0x17/0x96 pages=3 vmalloc
> 0xf80c0000-0xf80e1000 135168 e1000_probe+0x1ca/0xa73 phys=e1a60000 ioremap
> 0xf8100000-0xf8121000 135168 e1000_probe+0x1ca/0xa73 phys=e1a20000 ioremap
> 0xf8122000-0xf81b3000 593920 journal_init+0x65b/0x82a pages=144 vmalloc
> 0xf81b4000-0xf822f000 503808 sys_swapon+0x392/0x8f3 pages=122 vmalloc
> 0xf846a000-0xf856c000 1056768 tnode_new+0x35/0x65 pages=257 vmalloc

This is from a 32-bit kernel.

This doesn't match your previous /proc/meminfo (from a 64-bit kernel on a 12 GB machine).

Of course, I would like /proc/vmallocinfo from your loaded router, not from
a dev machine :)
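
The mismatch Eric points out can be spotted mechanically. The helpers below are a rough heuristic sketch, not an established method: they assume a 32-bit kernel prints at most 8 hex digits per vmallocinfo address, and that a VmallocTotal above 4 GiB can only come from a 64-bit kernel. Function names are hypothetical.

```python
def address_bits(vmallocinfo_line):
    """Guess 32 vs 64 from the hex digit count of the start address."""
    start = vmallocinfo_line.split("-", 1)[0].strip()
    digits = start[2:] if start.startswith("0x") else start
    return 32 if len(digits) <= 8 else 64


def meminfo_kb(text, key):
    """Return the kB value of one /proc/meminfo field, or None if absent."""
    for line in text.splitlines():
        name, _, value = line.partition(":")
        if name.strip() == key and value.split():
            return int(value.split()[0])
    return None


def looks_64bit(meminfo_text):
    """Heuristic: more than 4 GiB of vmalloc space does not fit a
    32-bit address space."""
    total = meminfo_kb(meminfo_text, "VmallocTotal")
    return total is not None and total > 4 * 1024 * 1024


# Lines copied from the two outputs posted in this thread.
first_meminfo = "MemTotal: 12279032 kB\nVmallocTotal: 34359738367 kB\n"
vmalloc_line = "0xf846a000-0xf856c000 1056768 tnode_new+0x35/0x65 pages=257 vmalloc"

print("meminfo looks 64-bit:", looks_64bit(first_meminfo))
print("vmallocinfo address bits:", address_bits(vmalloc_line))
```

For the data posted so far, the meminfo output points at a 64-bit kernel and the vmallocinfo output at a 32-bit one, so the two must come from different machines.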
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-25 22:54 ` Eric Dumazet
@ 2009-06-26 10:06 ` Paweł Staszewski
2009-06-26 10:34 ` Eric Dumazet
0 siblings, 1 reply; 99+ messages in thread
From: Paweł Staszewski @ 2009-06-26 10:06 UTC (permalink / raw)
Cc: Linux Network Development list

Eric Dumazet wrote:
> This is from a 32 bit kernel.
>
> This doesnt match your previous /proc/meminfo (from a 64bit kernel on a 12 GB machine)
>
> Of course, I would like /proc/vmallocinfo on your loaded router, not from
> a dev machine :)

Yes, sorry for not mentioning it.
I am testing the same kernel configuration on two machines, one 32-bit and one 64-bit.

Here is meminfo from the 32-bit machine, running kernel 2.6.30:
cat /proc/meminfo
MemTotal: 3625444 kB
MemFree: 3043648 kB
Buffers: 133968 kB
Cached: 36316 kB
SwapCached: 0 kB
Active: 256868 kB
Inactive: 76252 kB
Active(anon): 163064 kB
Inactive(anon): 0 kB
Active(file): 93804 kB
Inactive(file): 76252 kB
Unevictable: 0 kB
Mlocked: 0 kB
HighTotal: 2758160 kB
HighFree: 2556136 kB
LowTotal: 867284 kB
LowFree: 487512 kB
SwapTotal: 995896 kB
SwapFree: 995896 kB
Dirty: 3624 kB
Writeback: 0 kB
AnonPages: 162912 kB
Mapped: 3612 kB
Slab: 235888 kB
SReclaimable: 46408 kB
SUnreclaim: 189480 kB
PageTables: 384 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 2808616 kB
Committed_AS: 170648 kB
VmallocTotal: 122880 kB
VmallocUsed: 2876 kB
VmallocChunk: 109824 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 4096 kB
DirectMap4k: 8184 kB
DirectMap4M: 901120 kB

and vmallocinfo:
cat /proc/vmallocinfo
0xf7ffe000-0xf8000000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfe6a000 ioremap
0xf8000000-0xf8007000 28672 acpi_tb_verify_table+0x1d/0x46 phys=dfef5000 ioremap
0xf8008000-0xf800a000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfef2000 ioremap
0xf800c000-0xf800e000 8192 acpi_ex_system_memory_space_handler+0xd6/0x208 phys=fed1f000 ioremap
0xf8010000-0xf8012000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfefb000 ioremap
0xf8014000-0xf8016000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfef4000 ioremap
0xf8018000-0xf801a000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfef3000 ioremap
0xf801c000-0xf801e000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfef1000 ioremap
0xf8020000-0xf8022000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfef0000 ioremap
0xf8024000-0xf8026000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfeef000 ioremap
0xf8028000-0xf802a000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfeee000 ioremap
0xf802c000-0xf802e000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfeed000 ioremap
0xf8030000-0xf8032000 8192 acpi_tb_verify_table+0x1d/0x46 phys=dfeec000 ioremap
0xf8038000-0xf803d000 20480 ich_force_enable_hpet+0x69/0x15a phys=fed1c000 ioremap
0xf803e000-0xf8040000 8192 hpet_enable+0x2a/0x21b phys=fed00000 ioremap
0xf8040000-0xf8046000 24576 alloc_iommu+0x18d/0x1d4 phys=feb00000 ioremap
0xf8048000-0xf804a000 8192 pcim_iomap+0x2f/0x3a phys=e1b21000 ioremap
0xf804c000-0xf804e000 8192 e1000_probe+0x229/0xa73 phys=e1b20000 ioremap
0xf804f000-0xf8051000 8192 reiserfs_init_bitmap_cache+0x32/0x65 pages=1 vmalloc
0xf8052000-0xf8064000 73728 journal_init+0x30/0x82a pages=17 vmalloc
0xf8065000-0xf8067000 8192 reiserfs_allocate_list_bitmaps+0x27/0x7e pages=1 vmalloc
0xf8068000-0xf806a000 8192 reiserfs_allocate_list_bitmaps+0x27/0x7e pages=1 vmalloc
0xf806b000-0xf806d000 8192 reiserfs_allocate_list_bitmaps+0x27/0x7e pages=1 vmalloc
0xf806e000-0xf8070000 8192 reiserfs_allocate_list_bitmaps+0x27/0x7e pages=1 vmalloc
0xf8071000-0xf8073000 8192 reiserfs_allocate_list_bitmaps+0x27/0x7e pages=1 vmalloc
0xf8080000-0xf80a1000 135168 e1000_probe+0x1ca/0xa73 phys=e1b00000 ioremap
0xf80a2000-0xf80a6000 16384 e1000e_setup_rx_resources+0x20/0xf7 pages=3 vmalloc
0xf80a7000-0xf80ab000 16384 e1000e_setup_tx_resources+0x17/0x96 pages=3 vmalloc
0xf80ac000-0xf80b0000 16384 e1000e_setup_rx_resources+0x20/0xf7 pages=3 vmalloc
0xf80b1000-0xf80b5000 16384 e1000e_setup_tx_resources+0x17/0x96 pages=3 vmalloc
0xf80c0000-0xf80e1000 135168 e1000_probe+0x1ca/0xa73 phys=e1a60000 ioremap
0xf8100000-0xf8121000 135168 e1000_probe+0x1ca/0xa73 phys=e1a20000 ioremap
0xf8122000-0xf81b3000 593920 journal_init+0x65b/0x82a pages=144 vmalloc
0xf81b4000-0xf822f000 503808 sys_swapon+0x392/0x8f3 pages=122 vmalloc
0xf8bbc000-0xf8cbe000 1056768 tnode_new+0x35/0x65 pages=257 vmalloc

And the next machine, with kernel 2.6.29.3.
dmesg:
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits

cat /proc/meminfo
MemTotal: 2072652 kB
MemFree: 496960 kB
Buffers: 267620 kB
Cached: 895212 kB
SwapCached: 0 kB
Active: 675744 kB
Inactive: 703312 kB
Active(anon): 215848 kB
Inactive(anon): 0 kB
Active(file): 459896 kB
Inactive(file): 703312 kB
Unevictable: 0 kB
Mlocked: 0 kB
HighTotal: 1186696 kB
HighFree: 151156 kB
LowTotal: 885956 kB
LowFree: 345804 kB
SwapTotal: 1975984 kB
SwapFree: 1975984 kB
Dirty: 20 kB
Writeback: 0 kB
AnonPages: 215724 kB
Mapped: 6120 kB
Slab: 186652 kB
SReclaimable: 125832 kB
SUnreclaim: 60820 kB
PageTables: 416 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 3012308 kB
Committed_AS: 223692 kB
VmallocTotal: 122880 kB
VmallocUsed: 3192 kB
VmallocChunk: 108436 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 4096 kB
DirectMap4k: 8184 kB
DirectMap4M: 901120 kB

cat /proc/vmallocinfo
0xf7ffe000-0xf8000000 8192 acpi_tb_verify_table+0x1d/0x46 phys=7fee0000 ioremap
0xf8000000-0xf8005000 20480 acpi_tb_verify_table+0x1d/0x46 phys=7fee3000 ioremap
0xf8006000-0xf8008000 8192 acpi_tb_verify_table+0x1d/0x46 phys=7fee3000 ioremap
0xf800a000-0xf800c000 8192 acpi_tb_verify_table+0x1d/0x46 phys=7fee6000 ioremap
0xf800d000-0xf800f000 8192 reiserfs_init_bitmap_cache+0x3b/0x80 pages=1 vmalloc
0xf8010000-0xf8022000 73728 journal_init+0x30/0x8f0 pages=17 vmalloc
0xf8023000-0xf8025000 8192 reiserfs_allocate_list_bitmaps+0x2d/0x90 pages=1 vmalloc
0xf8026000-0xf8028000 8192 reiserfs_allocate_list_bitmaps+0x2d/0x90 pages=1 vmalloc
0xf8029000-0xf802b000 8192 reiserfs_allocate_list_bitmaps+0x2d/0x90 pages=1 vmalloc
0xf802c000-0xf802e000 8192 reiserfs_allocate_list_bitmaps+0x2d/0x90 pages=1 vmalloc
0xf802f000-0xf8031000 8192 reiserfs_allocate_list_bitmaps+0x2d/0x90 pages=1 vmalloc
0xf803e000-0xf8040000 8192 e1000_setup_all_tx_resources+0x57/0x660 pages=1 vmalloc
0xf8040000-0xf8061000 135168 e1000_probe+0x207/0xeb0 phys=f5000000 ioremap
0xf8062000-0xf8064000 8192 e1000_setup_all_rx_resources+0x57/0x6d0 pages=1 vmalloc
0xf8065000-0xf8067000 8192 e1000_setup_all_tx_resources+0x57/0x660 pages=1 vmalloc
0xf8068000-0xf806a000 8192 e1000_setup_all_rx_resources+0x57/0x6d0 pages=1 vmalloc
0xf806b000-0xf806d000 8192 e1000_setup_all_tx_resources+0x57/0x660 pages=1 vmalloc
0xf806e000-0xf8070000 8192 e1000_setup_all_rx_resources+0x57/0x6d0 pages=1 vmalloc
0xf8080000-0xf80a1000 135168 e1000_probe+0x207/0xeb0 phys=f1040000 ioremap
0xf80c0000-0xf80e1000 135168 e1000_probe+0x207/0xeb0 phys=f4000000 ioremap
0xf80e2000-0xf8173000 593920 journal_init+0x56e/0x8f0 pages=144 vmalloc
0xf8174000-0xf8267000 995328 sys_swapon+0x548/0xa30 pages=242 vmalloc
0xf8d17000-0xf8e19000 1056768 tnode_new+0x7f/0x90 pages=257 vmalloc

I have this message on 5 machines that work in an iBGP mesh, and only one 64-bit dev machine, which is one of the failover members, but I killed that machine after upgrading it to kernel 2.6.31-rc1.
>>>> Main: >>>> Aver depth: 2.28 >>>> Max depth: 6 >>>> Leaves: 276539 >>>> Prefixes: 289922 >>>> Internal nodes: 66762 >>>> 1: 35046 2: 13824 3: 9508 4: 4897 5: 2331 6: 1149 7: 5 >>>> 9: 1 18: 1 >>>> Pointers: 691228 >>>> Null ptrs: 347928 >>>> Total size: 35709 kB >>>> >>>> Counters: >>>> --------- >>>> gets = 26276593 >>>> backtracks = 547306 >>>> semantic match passed = 26188746 >>>> semantic match miss = 1117 >>>> null node hit= 27285055 >>>> skipped node resize = 0 >>>> >>>> Local: >>>> Aver depth: 3.33 >>>> Max depth: 4 >>>> Leaves: 9 >>>> Prefixes: 10 >>>> Internal nodes: 8 >>>> 1: 8 >>>> Pointers: 16 >>>> Null ptrs: 0 >>>> Total size: 2 kB >>>> >>>> Counters: >>>> --------- >>>> gets = 26642350 >>>> backtracks = 1282818 >>>> semantic match passed = 18166 >>>> semantic match miss = 0 >>>> null node hit= 0 >>>> skipped node resize = 0 >>>> >>>> >>>> >>>> This machine is running bgpd with two bgp peers / full route table >>>> >>>> cat /proc/meminfo >>>> MemTotal: 12279032 kB >>>> MemFree: 11521920 kB >>>> Buffers: 80288 kB >>>> Cached: 34416 kB >>>> SwapCached: 0 kB >>>> Active: 286816 kB >>>> Inactive: 82024 kB >>>> Active(anon): 254296 kB >>>> Inactive(anon): 0 kB >>>> Active(file): 32520 kB >>>> Inactive(file): 82024 kB >>>> Unevictable: 0 kB >>>> Mlocked: 0 kB >>>> SwapTotal: 987988 kB >>>> SwapFree: 987988 kB >>>> Dirty: 1140 kB >>>> Writeback: 0 kB >>>> AnonPages: 254164 kB >>>> Mapped: 5440 kB >>>> Slab: 365084 kB >>>> SReclaimable: 28784 kB >>>> SUnreclaim: 336300 kB >>>> PageTables: 2104 kB >>>> NFS_Unstable: 0 kB >>>> Bounce: 0 kB >>>> WritebackTmp: 0 kB >>>> CommitLimit: 7127504 kB >>>> Committed_AS: 267704 kB >>>> VmallocTotal: 34359738367 kB >>>> VmallocUsed: 11824 kB >>>> VmallocChunk: 34359707815 kB >>>> HugePages_Total: 0 >>>> HugePages_Free: 0 >>>> HugePages_Rsvd: 0 >>>> HugePages_Surp: 0 >>>> Hugepagesize: 2048 kB >>>> DirectMap4k: 3392 kB >>>> DirectMap2M: 12578816 kB >>>> >>>> >>>> Interfaces mtu is1500 >>>> >>>> >>> >>> >>> >> -- >> To 
unsubscribe from this list: send the line "unsubscribe netdev" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> >> > > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > ^ permalink raw reply [flat|nested] 99+ messages in thread
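The size arithmetic quoted above (2^18 pointers, "+ header -> 4194304 bytes", and the pages=257 reported for tnode_new on the 32-bit machines) can be checked with a short sketch. It is only an illustration under stated assumptions: 4 KiB pages, and the 56-byte tnode header reported by fib_triestat on the 64-bit machine reused for the 32-bit case, where the real header size is not given in the thread. The helper names are hypothetical and are not kernel functions.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages (x86)


def tnode_bytes(bits, pointer_bytes, header_bytes=56):
    """Header plus an array of 2**bits child pointers.

    header_bytes=56 is the value fib_triestat printed on the 64-bit
    machine; using it for 32-bit is an assumption of this sketch.
    """
    return header_bytes + (1 << bits) * pointer_bytes


def vmalloc_pages(nbytes):
    """Whole pages needed when the size is only rounded up to a page."""
    return -(-nbytes // PAGE_SIZE)  # ceiling division


def alloc_pages_bytes(nbytes):
    """Bytes consumed when the page count is rounded up to a power of two."""
    order = (vmalloc_pages(nbytes) - 1).bit_length()
    return PAGE_SIZE << order


root64 = tnode_bytes(18, 8)  # 64-bit kernel: 8-byte pointers
root32 = tnode_bytes(18, 4)  # 32-bit kernel: 4-byte pointers

print(root64, alloc_pages_bytes(root64))  # 2097208 4194304
print(vmalloc_pages(root32))              # 257
```

With these assumptions the 64-bit root node needs PAGE_SIZE<<10 bytes under power-of-two allocation, which matches the figure in the quoted reply, and the 32-bit root node needs 257 pages when only rounded up to a page, which matches the tnode_new rows in the vmallocinfo listings.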
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-26 10:06 ` Paweł Staszewski
@ 2009-06-26 10:34 ` Eric Dumazet
2009-06-26 10:47 ` Paweł Staszewski
0 siblings, 1 reply; 99+ messages in thread
From: Eric Dumazet @ 2009-06-26 10:34 UTC (permalink / raw)
To: Paweł Staszewski; +Cc: Linux Network Development list

Paweł Staszewski wrote:
> Eric Dumazet wrote:
>> This is from a 32 bit kernel.
>>
>> This doesnt match your previous /proc/meminfo (from a 64bit kernel on
>> a 12 GB machine)
>>
>> Of course, I would like /proc/vmallocinfo on your loaded router, not from
>> a dev machine :)
>
> Yes sorry for no info about it.
> I test the same kernel configurations on one 32bit machine and second 64bit
>
> [...]
>
> because i have this info on 5 machines that working in ibgp mesh
> And only one 64bit dev machine that is one of failover member - but i
> kill this machine after upgrade to kernel 2.6.31-rc1

Yes, I was a fool to ask you to try 2.6.31-rc1, sorry.

Even 2.6.30 is too young for a production machine.

2.6.29.5 contains the fixes. Pawel, did you try this version?
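Checking whether the big tnodes are vmalloc'ed amounts to looking for tnode_new rows in /proc/vmallocinfo. A minimal sketch of that check follows, assuming the row layout seen in this thread (address range, size in bytes, caller+offset/length, then pages=N or phys=ADDR, then the area type). The function name parse_vmallocinfo and the SAMPLE constant are hypothetical; the two sample rows are copied from the 2.6.30 listing posted in the thread.

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages

# start-end, size in bytes, caller symbol, remaining attributes
ENTRY_RE = re.compile(r"^0x([0-9a-f]+)-0x([0-9a-f]+)\s+(\d+)\s+(\S+)(.*)$")

SAMPLE = """\
0xf81b4000-0xf822f000 503808 sys_swapon+0x392/0x8f3 pages=122 vmalloc
0xf8bbc000-0xf8cbe000 1056768 tnode_new+0x35/0x65 pages=257 vmalloc
"""


def parse_vmallocinfo(text):
    """Turn vmallocinfo text into a list of dicts, skipping unparsable rows."""
    entries = []
    for line in text.splitlines():
        match = ENTRY_RE.match(line.strip())
        if match is None:
            continue
        start, end, size, caller, rest = match.groups()
        pages = re.search(r"pages=(\d+)", rest)
        entries.append({
            "start": int(start, 16),
            "end": int(end, 16),
            "size": int(size),
            "caller": caller.split("+")[0],  # drop +offset/length
            "pages": int(pages.group(1)) if pages else None,
        })
    return entries


entries = parse_vmallocinfo(SAMPLE)
tnodes = [e for e in entries if e["caller"] == "tnode_new"]
for e in tnodes:
    extra = e["size"] - e["pages"] * PAGE_SIZE
    print(hex(e["start"]), e["size"], e["pages"], extra)
```

In both sample rows the reported size equals end minus start and exceeds pages * 4096 by exactly one page. That would be consistent with one extra guard page being counted in the reported size, but treat it as an observation on these rows rather than a documented guarantee.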
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-26 10:34 ` Eric Dumazet
@ 2009-06-26 10:47 ` Paweł Staszewski
2009-06-26 10:52 ` Eric Dumazet
0 siblings, 1 reply; 99+ messages in thread
From: Paweł Staszewski @ 2009-06-26 10:47 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Linux Network Development list

Eric Dumazet wrote:
> Paweł Staszewski wrote:
>> [...]
>>
>> because i have this info on 5 machines that working in ibgp mesh
>> And only one 64bit dev machine that is one of failover member - but i
>> kill this machine after upgrade to kernel 2.6.31-rc1
>
> Yes, I was a fool to ask you to try 2.6.31-rc1, sorry.
>
No problem with this test. I lost only one test failover, and no traffic
was lost when the system switched to the primary routers. :)

> Even 2.6.30 is too young for a production machine.

I always do it like this: I have an iBGP mesh with the main access path on
machines running stable 2.6.28.9 kernels and a second failover path on
machines that use the newest kernel for testing, in this case 2.6.29, but
after some problems I also tried 2.6.30 yesterday.

> 2.6.29.5 contains the fixes. Pawel, did you try this version?

I will try 2.6.29.5 today.

Thanks
Paweł Staszewski
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-26 10:47 ` Paweł Staszewski
@ 2009-06-26 10:52 ` Eric Dumazet
2009-06-26 17:26 ` Paweł Staszewski
0 siblings, 1 reply; 99+ messages in thread
From: Eric Dumazet @ 2009-06-26 10:52 UTC (permalink / raw)
To: Paweł Staszewski; +Cc: Linux Network Development list

Paweł Staszewski wrote:
> Eric Dumazet wrote:
>>
>> Yes, I was a fool to ask you to try 2.6.31-rc1, sorry.
>>
> No problem with this test. I lost only one test failover, and no traffic
> was lost when the system switched to the primary routers. :)
>> Even 2.6.30 is too young for a production machine.
>>
> I always do it like this: I have an iBGP mesh with the main access path on
> machines running stable 2.6.28.9 kernels and a second failover path on
> machines that use the newest kernel for testing, in this case 2.6.29, but
> after some problems I also tried 2.6.30 yesterday.
>> 2.6.29.5 contains the fixes. Pawel, did you try this version?
>>
> I will try 2.6.29.5 today.
>

OK thanks

Please report (while machine has enough load) output of

rtstat -c20 -i1

(rtstat is a symbolic link to lnstat, if not provided by your distro)
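One way to read the counters requested here is to turn each rtstat row into named fields and compute the share of incoming packets served from the route cache. The sketch below is illustrative only. The column names are reconstructed by joining the truncated header rows that rtstat prints, the interpretation of in_hit as cache hits and in_slow_tot as lookups that missed the cache is an assumption of this sketch, and the helper names are hypothetical. The sample row is the second data row from the rtstat output posted in the reply that follows in this thread.

```python
COLUMNS = [
    "entries", "in_hit", "in_slow_tot", "in_slow_mc", "in_no_route",
    "in_brd", "in_martian_dst", "in_martian_src", "out_hit",
    "out_slow_tot", "out_slow_mc", "gc_total", "gc_ignored",
    "gc_goal_miss", "gc_dst_overflow", "in_hlist_search",
    "out_hlist_search",
]

ROW = ("92067| 101426| 5315| 0| 2| 0| 0| 0| 258| 2| "
       "0| 0| 0| 0| 0| 21893| 6|")


def parse_row(row):
    """Split one '|'-separated rtstat row into a dict keyed by column name."""
    values = [int(field) for field in row.split("|") if field.strip()]
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    return dict(zip(COLUMNS, values))


def input_hit_ratio(stats):
    """Fraction of input lookups answered by the cache (assumed meaning)."""
    lookups = stats["in_hit"] + stats["in_slow_tot"]
    return stats["in_hit"] / lookups if lookups else 0.0


stats = parse_row(ROW)
print(f"{input_hit_ratio(stats):.1%}")  # 95.0%
```

The first row of an rtstat run holds totals since boot, so per-interval figures such as this one should be taken from the second row onward.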
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-26 10:52 ` Eric Dumazet
@ 2009-06-26 17:26 ` Paweł Staszewski
0 siblings, 0 replies; 99+ messages in thread
From: Paweł Staszewski @ 2009-06-26 17:26 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Linux Network Development list

> OK thanks
>
> Please report (while machine has enough load) output of
>
> rtstat -c20 -i1
>
> (rtstat is a symbolic link to lnstat, if not provided by your distro)
>

Here you have rtstat -i 1 -c 20:

rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|
 entries|  in_hit|in_slow_|in_slow_|in_no_ro|  in_brd|in_marti|in_marti| out_hit|out_slow|out_slow|gc_total|gc_ignor|gc_goal_|gc_dst_o|in_hlist|out_hlis|
        |        |     tot|      mc|     ute|        |  an_dst|  an_src|        |    _tot|     _mc|        |      ed|    miss| verflow| _search|t_search|
   93362|22930850| 1671866|       0|    1369|       2|       0|       0|   53432|    1324|       0|       0|       0|       0|       0| 4783896|   11985|
   92067|  101426|    5315|       0|       2|       0|       0|       0|     258|       2|       0|       0|       0|       0|       0|   21893|       6|
   90561|  100094|    4666|       0|       6|       0|       0|       0|     267|       1|       0|       0|       0|       0|       0|   23433|      30|
   90101|   98672|    5630|       0|       2|       0|       0|       0|     253|       0|       0|       0|       0|       0|       0|   24386|      34|
   89994|   99962|    5654|       0|       6|       0|       0|       0|     266|       2|       0|       0|       0|       0|       0|   26251|      38|
   95209|   91974|   14860|       0|       9|       0|       0|       0|     236|      31|       0|       0|       0|       0|       0|   14238|      35|
   95323|  101714|   10126|       0|      14|       0|       0|       0|     255|       9|       0|       0|       0|       0|       0|    8532|      21|
   94814|   99918|    8539|       0|       5|       0|       0|       0|     258|       4|       0|       0|       0|       0|       0|   11069|      24|
   98510|   93929|   12672|       0|      13|       0|       0|       0|     238|      31|       0|       0|       0|       0|       0|   12704|      34|
   98983|   96131|   11128|       0|      12|       0|       0|       0|     252|      10|       0|       0|       0|       0|       0|    7142|      18|
   98824|   99036|    8995|       0|       5|       0|       0|       0|     256|       3|       0|       0|       0|       0|       0|    9343|      16|
   97868|  100032|    7544|       0|       5|       0|       0|       0|     254|       1|       0|       0|       0|       0|       0|   11902|      17|
   96929|  101942|    6722|       0|       7|       0|       0|       0|     263|       3|       0|       0|       0|       0|       0|   13778|      46|
   95932|  100725|    6217|       0|       4|       0|       0|       0|     259|       3|       0|       0|       0|       0|       0|   15175|      47|
   94432|  102074|    5549|       0|       7|       0|       0|       0|     268|       2|       0|       0|       0|       0|       0|   16996|      54|
   92986|  103602|    5187|       0|       3|       0|       0|       0|     260|       0|       0|       0|       0|       0|       0|   18333|      43|
   91387|  103934|    4666|       0|       6|       0|       0|       0|     261|       3|       0|       0|       0|       0|       0|   19316|      46|
   90615|  104916|    5333|       0|       5|       0|       0|       0|     260|       4|       0|       0|       0|       0|       0|   21376|      48|
   89941|  101375|    5189|       0|       8|       0|       0|       0|     270|       0|       0|       0|       0|       0|       0|   22249|      47|
   89744|  101425|    5529|       0|       6|       0|       0|       0|     263|       4|       0|       0|       0|       0|       0|   24089|      57|

At the same time, CPU load:

19:24:34     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
19:24:35     all    0.00    0.00    0.00    0.00    1.39   20.36    0.00    0.00   78.25
19:24:35       0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:35       1    0.00    0.00    0.00    0.00    5.00   74.00    0.00    0.00   21.00
19:24:35       2    0.00    0.00    0.00    0.00    5.00   73.00    0.00    0.00   22.00
19:24:35       3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:35       4    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:35       5    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:35       6    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:35       7    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

19:24:35     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
19:24:36     all    0.00    0.00    0.00    0.00    1.21   16.03    0.00    0.00   82.77
19:24:36       0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:36       1    0.00    0.00    0.00    0.00    5.05   75.76    0.00    0.00   19.19
19:24:36       2    0.00    0.00    0.99    0.00    5.94   69.31    0.00    0.00   23.76
19:24:36       3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:36       4    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:36       5    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:36       6    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:36       7    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

19:24:36     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
19:24:37     all    0.00    0.00    0.14    0.00    1.64   20.19    0.00    0.00   78.04
19:24:37       0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:37       1    0.00    0.00    0.00    0.00    5.94   73.27    0.00    0.00   20.79
19:24:37       2    0.00    0.00    0.00    0.00    7.00   75.00    0.00    0.00   18.00
19:24:37       3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:37       4    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:37       5    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:37       6    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:37       7    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

19:24:37     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
19:24:38     all    0.00    0.00    0.00    0.00    0.90   14.24    0.00    0.00   84.86
19:24:38       0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:38       1    0.00    0.00    0.00    0.00    4.00   73.00    0.00    0.00   23.00
19:24:38       2    0.00    0.00    0.00    0.00    5.05   69.70    0.00    0.00   25.25
19:24:38       3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:38       4    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:38       5    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:38       6    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:38       7    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

19:24:38     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
19:24:39     all    0.00    0.00    0.00    0.00    2.43   29.80    0.00    0.00   67.77
19:24:39       0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:39       1    0.00    0.00    0.00    0.00    5.00   67.00    0.00    0.00   28.00
19:24:39       2    0.00    0.00    0.00    0.00    5.94   67.33    0.00    0.00   26.73
19:24:39       3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:39       4    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:39       5    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:39       6    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:39       7    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

19:24:39     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
19:24:40     all    0.00    0.00    0.00    0.00    1.43   14.15    0.00    0.00   84.42
19:24:40       0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:40       1    0.00    0.00    0.00    0.00    5.94   68.32    0.00    0.00   25.74
19:24:40       2    0.00    0.00    0.00    0.00    7.07   71.72    0.00    0.00   21.21
19:24:40       3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:40       4    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:40       5    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:40       6    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:40       7    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

19:24:40     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
19:24:41     all    0.00    0.00    0.00    0.00
1.40 17.68 0.00 0.00 80.92 19:24:41 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 19:24:41 1 0.00 0.00 0.00 0.00 6.06 70.71 0.00 0.00 23.23 19:24:41 2 0.00 0.00 0.00 0.00 5.88 67.65 0.00 0.00 26.47 19:24:41 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 19:24:41 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 19:24:41 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 19:24:41 6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 19:24:41 7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 cat /proc/interrupts CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 0: 42 0 0 1 0 1 0 0 IO-APIC-edge timer 1: 0 0 0 1 0 0 0 1 IO-APIC-edge i8042 9: 0 0 0 0 0 0 0 0 IO-APIC-fasteoi acpi 14: 0 0 0 0 0 0 0 0 IO-APIC-edge ide0 15: 0 0 0 0 0 0 0 0 IO-APIC-edge ide1 29: 1139988 4692793 89662 3 0 1 0 3 PCI-MSI-edge eth0 30: 0 2 6207546 1 0 3 0 0 PCI-MSI-edge eth1 31: 0 1 1 0 0 0 0 0 PCI-MSI-edge 32: 0 0 0 0 0 0 2 0 PCI-MSI-edge 33: 1 1 0 0 0 0 0 0 PCI-MSI-edge 34: 0 0 0 1 0 1 0 0 PCI-MSI-edge 35: 0 0 0 1 0 0 0 1 PCI-MSI-edge 36: 0 0 0 0 1 0 0 1 PCI-MSI-edge 37: 1 0 0 0 0 1 0 0 PCI-MSI-edge 38: 0 0 1 0 1 0 0 0 PCI-MSI-edge 39: 0 0 2 0 0 0 0 0 PCI-MSI-edge 40: 0 0 0 0 0 0 2 0 PCI-MSI-edge 41: 0 2 0 0 0 0 0 0 PCI-MSI-edge 42: 0 0 0 0 0 2 0 0 PCI-MSI-edge 43: 0 0 0 2 0 0 0 0 PCI-MSI-edge 44: 0 0 0 0 0 0 0 2 PCI-MSI-edge 45: 2 0 0 0 0 0 0 0 PCI-MSI-edge 46: 0 0 0 0 2 0 0 0 PCI-MSI-edge 48: 191 200 185 213 219 219 227 214 PCI-MSI-edge ahci 49: 0 1 1 0 0 2 1 0 PCI-MSI-edge ioat-msi NMI: 0 0 0 0 0 0 0 0 Non-maskable interrupts LOC: 1083019 6233788 7735401 12394 15178 10718 21192 8515 Local timer interrupts RES: 921 44 33 20 13 8 10 12 Rescheduling interrupts CAL: 20 85 88 87 90 90 91 86 Function call interrupts TLB: 103 114 918 929 95 113 973 990 TLB shootdowns SPU: 0 0 0 0 0 0 0 0 Spurious interrupts ERR: 0 MIS: 0 i use smp_affinity eth0 is on cpu1 eth1 is on cpu2 all test on kernel 2.6.29.5 and only one info in dmesg about Fix inflate_threshold Fix inflate_threshold_root. 
Now=15 size=11 bits This info appear when bgpd process start to learn routes from peers ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-25 21:19 ` Eric Dumazet
2009-06-25 21:52 ` Paweł Staszewski
@ 2009-06-26 8:03 ` Jarek Poplawski
2009-06-26 9:19 ` Robert Olsson
1 sibling, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-26 8:03 UTC (permalink / raw)
To: Eric Dumazet
Cc: Paweł Staszewski, Robert Olsson, Linux Network Development list

On 25-06-2009 23:19, Eric Dumazet wrote:
> Paweł Staszewski wrote:
>> Hello ALL
>>
>> Some time ago i report this:
>> http://bugzilla.kernel.org/show_bug.cgi?id=6648
>>
>> and now with 2.6.29 / 2.6.29.1 / 2.6.29.3 and 2.6.30 it back
>> dmesg output:
>> oprofile: using NMI interrupt.
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>
> Curious, you seem to hit an old alloc_pages limit()... (MAX_ORDER allocation)
>
> Your root node has 2^18 = 262144 pointers of 8 bytes -> 2097152 bytes (+ header -> 4194304 bytes)
>
> But since following commit, we should use vmalloc() so this PAGE_SIZE<<10) limit
> should not anymore be applied.
>

On the other hand, even if there is no problem with memory, it seems
that, because max_resize is being hit, the threshold should be changed,
e.g. by reverting the patch below.

Jarek P.

commit 965ffea43d4ebe8cd7b9fee78d651268dd7d23c5
Author: Robert Olsson <robert.olsson@its.uu.se>
Date:   Mon Mar 19 16:29:58 2007 -0700

    [IPV4]: fib_trie root node settings

    The threshold for root node can be more aggressive set to get
    better tree compression. The new setting mekes the root grow from
    16 to 19 bits and substansial improvemnt in Aver depth this with
    the current table of 214393 prefixes

    But really the dynamic resize should need more investigation both
    in terms convergence and performance and maybe it should be
    possible to change...

    Maybe just for the brave to start with or we may have to back
    this out.

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 5d2b43d..9be7da7 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -292,8 +292,8 @@ static inline void check_tnode(const struct tnode *tn)

 static int halve_threshold = 25;
 static int inflate_threshold = 50;
-static int halve_threshold_root = 15;
-static int inflate_threshold_root = 25;
+static int halve_threshold_root = 8;
+static int inflate_threshold_root = 15;

 static void __alias_free_mem(struct rcu_head *head)
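The sizes and thresholds discussed in this message can be checked with a small userspace sketch. It is not kernel code: tnode_bytes(), round_up_pow2() and inflate_wanted() are hypothetical helper names, the 56-byte header is the "size of tnode" value from the /proc/net/fib_triestat output at the start of the thread, and the inflate test is a simplified rendering, from memory, of the check in resize() in net/ipv4/fib_trie.c of that era, so its exact form should be treated as an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes needed for a tnode holding 2^bits child
 * pointers, i.e. a fixed header followed by the pointer array. */
static size_t tnode_bytes(unsigned bits, size_t header, size_t ptr_size)
{
	assert(bits < 32);
	return header + ((size_t)1 << bits) * ptr_size;
}

/* Round up to the next power of two, the way a page-order allocator
 * such as alloc_pages() rounds a request. */
static size_t round_up_pow2(size_t n)
{
	size_t p = 1;

	assert(n > 0);
	while (p < n)
		p <<= 1;
	return p;
}

/* Assumed form of the inflate test: a node with 2^bits slots is doubled
 * when 50 * (full + slots - empty) >= threshold * slots, and only if it
 * has at least one full child. */
static int inflate_wanted(unsigned long full, unsigned long empty,
			  unsigned bits, int threshold)
{
	unsigned long slots = 1UL << bits;

	assert(empty <= slots && full <= slots - empty);
	if (full == 0)
		return 0;
	return 50 * (full + slots - empty) >= (unsigned long)threshold * slots;
}
```

With 18 bits, a 56-byte header and 8-byte pointers, tnode_bytes() gives 2097208 bytes, which round_up_pow2() takes to 4194304; these match the two figures Eric quotes. For an 11-bit node with 100 full and 1300 empty children, inflate_wanted() returns 1 at threshold 15 and 0 at threshold 25, which is the sense in which reverting the patch makes the root less eager to grow.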
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-26 8:03 ` Jarek Poplawski
@ 2009-06-26 9:19 ` Robert Olsson
2009-06-26 9:37 ` Jarek Poplawski
2009-06-27 19:20 ` Jarek Poplawski
0 siblings, 2 replies; 99+ messages in thread
From: Robert Olsson @ 2009-06-26 9:19 UTC (permalink / raw)
To: Jarek Poplawski
Cc: Eric Dumazet, Paweł Staszewski, Robert Olsson, Linux Network Development list

Jarek Poplawski writes:

> >> oprofile: using NMI interrupt.
> >> Fix inflate_threshold_root. Now=15 size=11 bits
> >> Fix inflate_threshold_root. Now=15 size=11 bits
> >> Fix inflate_threshold_root. Now=15 size=11 bits
> >> Fix inflate_threshold_root. Now=15 size=11 bits
> >> Fix inflate_threshold_root. Now=15 size=11 bits
> >> Fix inflate_threshold_root. Now=15 size=11 bits

> On the other hand, even if there is no problem with memory, it seems
> because of hitting max_resize the threshold should be changed, e.g.
> by reverting the patch below.

You seem to have some temporary memory problem. So the printout might be
a bit misleading in this case. We really like to keep the root node as big
as we can to keep the tree as flat as possible for performance reasons.
(We're even more motivated now when we can disable the route cache)

So I'll guess the next insert/delete inflates the root node to be within
the interval. So I'll assume this is just a temporary failure?

It would be nice to have *thresholds* settable by /proc or /sys. I would
use this in the other direction to trade memory for even faster lookups.

But maybe memory allocation experts have some good suggestions.

Cheers.
--ro
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-26 9:19 ` Robert Olsson
@ 2009-06-26 9:37 ` Jarek Poplawski
2009-06-26 10:26 ` Jorge Boncompte [DTI2]
2009-06-26 12:42 ` Robert Olsson
2009-06-27 19:20 ` Jarek Poplawski
1 sibling, 2 replies; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-26 9:37 UTC (permalink / raw)
To: Robert Olsson
Cc: Eric Dumazet, Paweł Staszewski, Robert Olsson, Linux Network Development list

On Fri, Jun 26, 2009 at 11:19:07AM +0200, Robert Olsson wrote:
>
> Jarek Poplawski writes:
>
> > >> oprofile: using NMI interrupt.
> > >> Fix inflate_threshold_root. Now=15 size=11 bits
> > >> Fix inflate_threshold_root. Now=15 size=11 bits
> > >> Fix inflate_threshold_root. Now=15 size=11 bits
> > >> Fix inflate_threshold_root. Now=15 size=11 bits
> > >> Fix inflate_threshold_root. Now=15 size=11 bits
> > >> Fix inflate_threshold_root. Now=15 size=11 bits
>
> > On the other hand, even if there is no problem with memory, it seems
> > because of hitting max_resize the threshold should be changed, e.g.
> > by reverting the patch below.
>
> You seem to have some temporary memory problem. So the printout might be
> a bit misleading in this case. We really like to keep the root node as big
> as we can to keep the tree as flat as possible for performance reasons.
> (We're even more motivated now when we can disable the route cache)
>
> So I'll guess the next insert/delete inflates the root node to be within
> the interval. So I'll assume this is just a temporary failure?
>
> It would be nice to have *thresholds* settable by /proc or /sys. I would
> use this in the other direction to trade memory for even faster lookups.
>
> But maybe memory allocation experts have some good suggestions.
>

Pawel has reported these problems for a long time:
http://bugzilla.kernel.org/show_bug.cgi?id=6648

So, until it's fully investigated, it seems some 'fast' fix is needed
here.

Cheers,
Jarek P.
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-26 9:37 ` Jarek Poplawski
@ 2009-06-26 10:26 ` Jorge Boncompte [DTI2]
2009-06-26 12:42 ` Robert Olsson
1 sibling, 0 replies; 99+ messages in thread
From: Jorge Boncompte [DTI2] @ 2009-06-26 10:26 UTC (permalink / raw)
To: jarkao2
Cc: Robert Olsson, Eric Dumazet, pstaszewski, Robert Olsson, Linux Network Development list

Jarek Poplawski wrote:
> Pawel has reported these problems for a long time:
> http://bugzilla.kernel.org/show_bug.cgi?id=6648
>
> So, until it's fully investigated, it seems some 'fast' fix is needed
> here.

I have never reported these problems, but I am definitely seeing the same
message on kernel 2.6.29.5, usually when one of my BGP peers goes down.
So, just a "me too".

Regards,

Jorge

-----------------
[ 1198.333854] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1198.437028] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1198.460848] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1199.240223] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1199.279723] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1199.383081] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1200.154893] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1200.191711] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1200.223242] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1200.270299] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1200.355795] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1206.239254] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1206.271995] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1206.349351] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1206.384676] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1206.428801] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1206.457315] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1206.485710] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1206.513691] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1209.039681] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1209.069224] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1209.108840] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1209.141450] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1209.172317] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1209.197824] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1209.224711] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1209.251566] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1209.289603] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1211.561178] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1211.598062] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1211.633238] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1211.684420] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1216.507853] Fix inflate_threshold_root. Now=15 size=11 bits
-----------------
cat /proc/meminfo
MemTotal:         515732 kB
MemFree:          139544 kB
Buffers:            4992 kB
Cached:             8488 kB
SwapCached:            0 kB
Active:           295904 kB
Inactive:           8132 kB
Active(anon):     291716 kB
Inactive(anon):        0 kB
Active(file):       4188 kB
Inactive(file):     8132 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:        290556 kB
Mapped:             2320 kB
Slab:              42392 kB
SReclaimable:       1096 kB
SUnreclaim:        41296 kB
PageTables:          512 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:      257864 kB
Committed_AS:     294496 kB
VmallocTotal:     515448 kB
VmallocUsed:        3140 kB
VmallocChunk:     501096 kB
DirectMap4k:        8128 kB
DirectMap4M:      516096 kB
-----------------
cat /proc/vmallocinfo
0xe07f0000-0xe07f5000   20480 acpi_tb_verify_table+0x20/0x4a phys=1fff0000 ioremap
0xe07f6000-0xe07f8000    8192 acpi_tb_verify_table+0x20/0x4a phys=1ffff000 ioremap
0xe07fa000-0xe07fc000    8192 acpi_tb_verify_table+0x20/0x4a phys=1fff0000 ioremap
0xe07fe000-0xe0800000    8192 acpi_tb_verify_table+0x20/0x4a phys=1fff0000 ioremap
0xe0801000-0xe080d000   49152 cramfs_uncompress_init+0x18/0x57 pages=11 vmalloc
0xe080e000-0xe0810000    8192 e100_probe+0x1db/0x471 phys=fdde0000 ioremap
0xe0812000-0xe0814000    8192 e100_probe+0x1db/0x471 phys=fdd80000 ioremap
0xe0816000-0xe0818000    8192 e100_probe+0x1db/0x471 phys=fbbf0000 ioremap
0xe081a000-0xe081c000    8192 e100_probe+0x1db/0x471 phys=fbbe0000 ioremap
0xe081e000-0xe0820000    8192 ahc_linux_pci_reserve_mem_region+0x49/0x72 [aic7xxx] phys=fe9f0000 ioremap
0xe0820000-0xe0822000    8192 module_alloc_update_bounds+0x8/0x2c pages=1 vmalloc
0xe0822000-0xe0825000   12288 module_alloc_update_bounds+0x8/0x2c pages=2 vmalloc
0xe0826000-0xe0828000    8192 module_alloc_update_bounds+0x8/0x2c pages=1 vmalloc
0xe082d000-0xe0832000   20480 module_alloc_update_bounds+0x8/0x2c pages=4 vmalloc
0xe0833000-0xe0836000   12288 module_alloc_update_bounds+0x8/0x2c pages=2 vmalloc
0xe0838000-0xe083c000   16384 module_alloc_update_bounds+0x8/0x2c pages=3 vmalloc
0xe083c000-0xe0840000   16384 module_alloc_update_bounds+0x8/0x2c pages=3 vmalloc
0xe0840000-0xe0861000  135168 e1000_probe+0x18a/0x83b phys=febc0000 ioremap
0xe0862000-0xe0871000   61440 module_alloc_update_bounds+0x8/0x2c pages=14 vmalloc
0xe0875000-0xe087f000   40960 module_alloc_update_bounds+0x8/0x2c pages=9 vmalloc
0xe0880000-0xe0889000   36864 module_alloc_update_bounds+0x8/0x2c pages=8 vmalloc
0xe088d000-0xe0897000   40960 module_alloc_update_bounds+0x8/0x2c pages=9 vmalloc
0xe0897000-0xe08b5000  122880 module_alloc_update_bounds+0x8/0x2c pages=29 vmalloc
0xe08bd000-0xe08c1000   16384 module_alloc_update_bounds+0x8/0x2c pages=3 vmalloc
0xe08c7000-0xe08c9000    8192 module_alloc_update_bounds+0x8/0x2c pages=1 vmalloc
0xe08c9000-0xe08cc000   12288 module_alloc_update_bounds+0x8/0x2c pages=2 vmalloc
0xe08d3000-0xe08d7000   16384 module_alloc_update_bounds+0x8/0x2c pages=3 vmalloc
0xe08d8000-0xe08dd000   20480 module_alloc_update_bounds+0x8/0x2c pages=4 vmalloc
0xe08de000-0xe08e5000   28672 module_alloc_update_bounds+0x8/0x2c pages=6 vmalloc
0xe08e6000-0xe08e8000    8192 module_alloc_update_bounds+0x8/0x2c pages=1 vmalloc
0xe08e8000-0xe08ed000   20480 module_alloc_update_bounds+0x8/0x2c pages=4 vmalloc
0xe08f3000-0xe08f6000   12288 module_alloc_update_bounds+0x8/0x2c pages=2 vmalloc
0xe08ff000-0xe0902000   12288 module_alloc_update_bounds+0x8/0x2c pages=2 vmalloc
0xe090e000-0xe0914000   24576 module_alloc_update_bounds+0x8/0x2c pages=5 vmalloc
0xe091e000-0xe0922000   16384 module_alloc_update_bounds+0x8/0x2c pages=3 vmalloc
0xe092e000-0xe0934000   24576 module_alloc_update_bounds+0x8/0x2c pages=5 vmalloc
0xe0935000-0xe0942000   53248 module_alloc_update_bounds+0x8/0x2c pages=12 vmalloc
0xe095d000-0xe0979000  114688 module_alloc_update_bounds+0x8/0x2c pages=27 vmalloc
0xe097e000-0xe0980000    8192 ahc_linux_pci_reserve_mem_region+0x49/0x72 [aic7xxx] phys=fe9e0000 ioremap
0xe0990000-0xe0992000    8192 module_alloc_update_bounds+0x8/0x2c pages=1 vmalloc
0xe099a000-0xe099c000    8192 module_alloc_update_bounds+0x8/0x2c pages=1 vmalloc
0xe0a00000-0xe0b01000 1052672 he_start+0x204/0x1126 [he] phys=fe800000 ioremap
0xe14bf000-0xe15c1000 1056768 tnode_new+0x18/0x48 pages=257 vmalloc
0xe15ed000-0xe15f0000   12288 xt_alloc_table_info+0x68/0x97 [x_tables] pages=2 vmalloc
0xe15f1000-0xe15f4000   12288 xt_alloc_table_info+0x68/0x97 [x_tables] pages=2 vmalloc
-----------------
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-26 9:37 ` Jarek Poplawski
2009-06-26 10:26 ` Jorge Boncompte [DTI2]
@ 2009-06-26 12:42 ` Robert Olsson
2009-06-26 12:54 ` Jarek Poplawski
1 sibling, 1 reply; 99+ messages in thread
From: Robert Olsson @ 2009-06-26 12:42 UTC (permalink / raw)
To: Jarek Poplawski
Cc: Robert Olsson, Eric Dumazet, Paweł Staszewski, Robert Olsson, Linux Network Development list

Jarek Poplawski writes:

> > But maybe memory allocation experts have some good suggestions.
>
> Pawel has reported these problems for a long time:
> http://bugzilla.kernel.org/show_bug.cgi?id=6648
>
> So, until it's fully investigated, it seems some 'fast' fix is needed
> here.

We talked about having a fixed pre-allocated root node long ago, but it's only
an optimisation for routers with full BGP. Best if the memory problems got solved.

Cheers
--ro
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-26 12:42 ` Robert Olsson
@ 2009-06-26 12:54 ` Jarek Poplawski
2009-06-26 13:28 ` Jarek Poplawski
0 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-26 12:54 UTC (permalink / raw)
To: Robert Olsson
Cc: Robert Olsson, Eric Dumazet, Paweł Staszewski, Robert Olsson, Linux Network Development list

On Fri, Jun 26, 2009 at 02:42:12PM +0200, Robert Olsson wrote:
>
> Jarek Poplawski writes:
>
> > > But maybe memory allocation experts have some good suggestions.
> >
> > Pawel has reported these problems for a long time:
> > http://bugzilla.kernel.org/show_bug.cgi?id=6648
> >
> > So, until it's fully investigated, it seems some 'fast' fix is needed
> > here.
>
> We talked about having a fixed pre-allocated root node long ago, but it's only
> an optimisation for routers with full BGP. Best if the memory problems got solved.
>

I think the current process of rebalancing can allocate a lot of 'temp'
memory and hold it unnecessarily long, so probably something like the
patch below could be useful. It should be applied to 2.6.30 after the
two patches below (from 2.6.31-rc). (Alas I can't even compile-test it
now).

Cheers,
Jarek P.
--- (for testing)

 net/ipv4/fib_trie.c |   24 ++++++++++++++++++------
 1 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 012cf5a..c2fc862 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -366,6 +366,14 @@ static void __tnode_vfree(struct work_struct *arg)
 	vfree(tn);
 }

+static void __tnode_free(struct tnode *tn)
+{
+	if (size <= PAGE_SIZE)
+		kfree(tn);
+	else
+		vfree(tn);
+}
+
 static void __tnode_free_rcu(struct rcu_head *head)
 {
 	struct tnode *tn = container_of(head, struct tnode, rcu);
@@ -402,7 +410,7 @@ static void tnode_free_flush(void)
 	while ((tn = tnode_free_head)) {
 		tnode_free_head = tn->tnode_free;
 		tn->tnode_free = NULL;
-		tnode_free(tn);
+		__tnode_free(tn);
 	}
 }

@@ -1020,19 +1028,23 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
 		tnode_put_child_reorg((struct tnode *)tp, cindex,
 				      (struct node *)tn, wasfull);

-		tp = node_parent((struct node *) tn);
+		synchronize_rcu();
 		tnode_free_flush();
+		tp = node_parent((struct node *) tn);
 		if (!tp)
 			break;
 		tn = tp;
 	}

 	/* Handle last (top) tnode */
-	if (IS_TNODE(tn))
+	if (IS_TNODE(tn)) {
 		tn = (struct tnode *)resize(t, (struct tnode *)tn);
-
-	rcu_assign_pointer(t->trie, (struct node *)tn);
-	tnode_free_flush();
+		rcu_assign_pointer(t->trie, (struct node *)tn);
+		synchronize_rcu();
+		tnode_free_flush();
+	} else {
+		rcu_assign_pointer(t->trie, (struct node *)tn);
+	}

 	return;
 }
---
commit e0f7cb8c8cc6cccce28d2ce39ad8c60d23c3799f
Author: Jarek Poplawski <jarkao2@gmail.com>
Date:   Mon Jun 15 02:31:29 2009 -0700

    ipv4: Fix fib_trie rebalancing

    While doing trie_rebalance(): resize(), inflate(), halve() RCU free
    tnodes before updating their parents. It depends on RCU delaying the
    real destruction, but if RCU readers start after call_rcu() and before
    parent update they could access freed memory.

    It is currently prevented with preempt_disable() on the update side,
    but it's not safe, except maybe classic RCU, plus it conflicts with
    memory allocations with GFP_KERNEL flag used from these functions.

    This patch explicitly delays freeing of tnodes by adding them to the
    list, which is flushed after the update is finished.

    Reported-by: Yan Zheng <zheng.yan@oracle.com>
    Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 538d2a9..d1a39b1 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -123,6 +123,7 @@ struct tnode {
 	union {
 		struct rcu_head rcu;
 		struct work_struct work;
+		struct tnode *tnode_free;
 	};
 	struct node *child[0];
 };
@@ -161,6 +162,8 @@ static void tnode_put_child_reorg(struct tnode *tn, int i, struct node *n,
 static struct node *resize(struct trie *t, struct tnode *tn);
 static struct tnode *inflate(struct trie *t, struct tnode *tn);
 static struct tnode *halve(struct trie *t, struct tnode *tn);
+/* tnodes to free after resize(); protected by RTNL */
+static struct tnode *tnode_free_head;

 static struct kmem_cache *fn_alias_kmem __read_mostly;
 static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -385,6 +388,29 @@ static inline void tnode_free(struct tnode *tn)
 		call_rcu(&tn->rcu, __tnode_free_rcu);
 }

+static void tnode_free_safe(struct tnode *tn)
+{
+	BUG_ON(IS_LEAF(tn));
+
+	if (node_parent((struct node *) tn)) {
+		tn->tnode_free = tnode_free_head;
+		tnode_free_head = tn;
+	} else {
+		tnode_free(tn);
+	}
+}
+
+static void tnode_free_flush(void)
+{
+	struct tnode *tn;
+
+	while ((tn = tnode_free_head)) {
+		tnode_free_head = tn->tnode_free;
+		tn->tnode_free = NULL;
+		tnode_free(tn);
+	}
+}
+
 static struct leaf *leaf_new(void)
 {
 	struct leaf *l = kmem_cache_alloc(trie_leaf_kmem, GFP_KERNEL);
@@ -495,7 +521,7 @@ static struct node *resize(struct trie *t, struct tnode *tn)

 	/* No children */
 	if (tn->empty_children == tnode_child_length(tn)) {
-		tnode_free(tn);
+		tnode_free_safe(tn);
 		return NULL;
 	}
 	/* One child */
@@ -509,7 +535,7 @@ static struct node *resize(struct trie *t, struct tnode *tn)

 			/* compress one level */
 			node_set_parent(n, NULL);
-			tnode_free(tn);
+			tnode_free_safe(tn);
 			return n;
 		}
 	/*
@@ -670,7 +696,7 @@ static struct node *resize(struct trie *t, struct tnode *tn)

 			/* compress one level */
 			node_set_parent(n, NULL);
-			tnode_free(tn);
+			tnode_free_safe(tn);
 			return n;
 		}

@@ -756,7 +782,7 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn)
 			put_child(t, tn, 2*i, inode->child[0]);
 			put_child(t, tn, 2*i+1, inode->child[1]);

-			tnode_free(inode);
+			tnode_free_safe(inode);
 			continue;
 		}

@@ -801,9 +827,9 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn)
 		put_child(t, tn, 2*i, resize(t, left));
 		put_child(t, tn, 2*i+1, resize(t, right));

-		tnode_free(inode);
+		tnode_free_safe(inode);
 	}
-	tnode_free(oldtnode);
+	tnode_free_safe(oldtnode);
 	return tn;
 nomem:
 	{
@@ -885,7 +911,7 @@ static struct tnode *halve(struct trie *t, struct tnode *tn)
 		put_child(t, newBinNode, 1, right);
 		put_child(t, tn, i/2, resize(t, newBinNode));
 	}
-	tnode_free(oldtnode);
+	tnode_free_safe(oldtnode);
 	return tn;
 nomem:
 	{
@@ -989,7 +1015,6 @@ static struct node *trie_rebalance(struct trie *t, struct tnode *tn)
 	t_key cindex, key;
 	struct tnode *tp;

-	preempt_disable();
 	key = tn->key;

 	while (tn != NULL && (tp = node_parent((struct node *)tn)) != NULL) {
@@ -1001,16 +1026,18 @@ static struct node *trie_rebalance(struct trie *t, struct tnode *tn)
 				      (struct node *)tn, wasfull);

 		tp = node_parent((struct node *) tn);
+		tnode_free_flush();
 		if (!tp)
 			break;
 		tn = tp;
 	}

 	/* Handle last (top) tnode */
-	if (IS_TNODE(tn))
+	if (IS_TNODE(tn)) {
 		tn = (struct tnode *)resize(t, (struct tnode *)tn);
+		tnode_free_flush();
+	}

-	preempt_enable();
 	return (struct node *)tn;
 }

---
commit 7b85576d15bf2574b0a451108f59f9ad4170dd3f
Author: Jarek Poplawski <jarkao2@gmail.com>
Date:   Thu Jun 18 00:28:51 2009 -0700

    ipv4: Fix fib_trie rebalancing, part 2

    My previous patch, which explicitly delays freeing of tnodes by adding
    them to the list to flush them after the update is finished, isn't
    strict enough. It treats exceptionally tnodes without parent, assuming
    they are newly created, so "invisible" for the read side yet.

    But the top tnode doesn't have parent as well, so we have to exclude
    all exceptions (at least until a better way is found). Additionally we
    need to move rcu assignment of this node before flushing, so the
    return type of the trie_rebalance() function is changed.

    Reported-by: Yan Zheng <zheng.yan@oracle.com>
    Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index d1a39b1..012cf5a 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -391,13 +391,8 @@ static inline void tnode_free(struct tnode *tn)
 static void tnode_free_safe(struct tnode *tn)
 {
 	BUG_ON(IS_LEAF(tn));
-
-	if (node_parent((struct node *) tn)) {
-		tn->tnode_free = tnode_free_head;
-		tnode_free_head = tn;
-	} else {
-		tnode_free(tn);
-	}
+	tn->tnode_free = tnode_free_head;
+	tnode_free_head = tn;
 }

 static void tnode_free_flush(void)
@@ -1009,7 +1004,7 @@ fib_find_node(struct trie *t, u32 key)
 	return NULL;
 }

-static struct node *trie_rebalance(struct trie *t, struct tnode *tn)
+static void trie_rebalance(struct trie *t, struct tnode *tn)
 {
 	int wasfull;
 	t_key cindex, key;
@@ -1033,12 +1028,13 @@ static struct node *trie_rebalance(struct trie *t, struct tnode *tn)
 	}

 	/* Handle last (top) tnode */
-	if (IS_TNODE(tn)) {
+	if (IS_TNODE(tn))
 		tn = (struct tnode *)resize(t, (struct tnode *)tn);
-		tnode_free_flush();
-	}

-	return (struct node *)tn;
+	rcu_assign_pointer(t->trie, (struct node *)tn);
+	tnode_free_flush();
+
+	return;
 }

 /* only used from updater-side */
@@ -1186,7 +1182,7 @@ static struct list_head *fib_insert_node(struct trie *t, u32 key, int plen)

 	/* Rebalance the trie */

-	rcu_assign_pointer(t->trie, trie_rebalance(t, tp));
+	trie_rebalance(t, tp);
 done:
 	return fa_head;
 }
@@ -1605,7 +1601,7 @@ static void trie_leaf_remove(struct trie *t, struct leaf *l)
 	if (tp) {
 		t_key cindex = tkey_extract_bits(l->key, tp->pos, tp->bits);
 		put_child(t, (struct tnode *)tp, cindex, NULL);
-		rcu_assign_pointer(t->trie, trie_rebalance(t, tp));
+		trie_rebalance(t, tp);
 	} else
 		rcu_assign_pointer(t->trie, NULL);

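The deferred-free list that the two quoted commits introduce can be sketched in userspace. This is an illustration only, not the kernel implementation: struct tnode is cut down to the link field, free() stands in for the kernel's tnode_free() (which hands the node to RCU), and freed_count is a hypothetical counter added so the behaviour can be observed. The function names follow the code as it stands after "part 2".

```c
#include <assert.h>
#include <stdlib.h>

/* Cut-down node: only the field used to chain nodes awaiting release. */
struct tnode {
	struct tnode *tnode_free;
};

/* Head of the pending-free list; in the kernel this is protected by RTNL. */
static struct tnode *tnode_free_head;

/* Hypothetical counter, for demonstration only. */
static int freed_count;

/* Queue a node instead of freeing it right away, so that it stays valid
 * until the parent pointers have been updated. */
static void tnode_free_safe(struct tnode *tn)
{
	assert(tn != NULL);
	tn->tnode_free = tnode_free_head;
	tnode_free_head = tn;
}

/* Release everything queued so far; called once the update is visible. */
static void tnode_free_flush(void)
{
	struct tnode *tn;

	while ((tn = tnode_free_head) != NULL) {
		tnode_free_head = tn->tnode_free;
		tn->tnode_free = NULL;
		free(tn);	/* stand-in for the kernel's tnode_free() */
		freed_count++;
	}
}
```

Calling tnode_free_safe() on two nodes leaves both allocated and linked from tnode_free_head; nothing is released until tnode_free_flush() runs, after which the list is empty. The testing patch in this message changes what happens inside the flush (direct kfree()/vfree() after synchronize_rcu() rather than call_rcu()), not the list mechanics shown here.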
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-26 12:54 ` Jarek Poplawski
@ 2009-06-26 13:28 ` Jarek Poplawski
2009-06-26 13:52 ` Robert Olsson
0 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-26 13:28 UTC (permalink / raw)
To: Robert Olsson
Cc: Robert Olsson, Eric Dumazet, Paweł Staszewski, Robert Olsson, Linux Network Development list

On Fri, Jun 26, 2009 at 12:54:49PM +0000, Jarek Poplawski wrote:
> On Fri, Jun 26, 2009 at 02:42:12PM +0200, Robert Olsson wrote:
> >
> > Jarek Poplawski writes:
> >
> > > > But maybe memory allocation experts have some good suggestions.
> > >
> > > Pawel has reported these problems for a long time:
> > > http://bugzilla.kernel.org/show_bug.cgi?id=6648
> > >
> > > So, until it's fully investigated, it seems some 'fast' fix is needed
> > > here.
> >
> > We talked about having a fixed pre-allocated root node long ago, but it's only
> > an optimisation for routers with full BGP. Best if the memory problems got solved.
> >
>
> I think the current process of rebalancing can allocate a lot of 'temp'
> memory and hold it unnecessarily long, so probably something like the
> patch below could be useful. It should be applied to 2.6.30 after the
> two patches below (from 2.6.31-rc). (Alas I can't even compile-test it
> now).
>

Alternatively, here is a faster version with fewer synchronize_rcu() calls.

Jarek P.
--- (take 2 - for testing) net/ipv4/fib_trie.c | 27 +++++++++++++++++++++------ 1 files changed, 21 insertions(+), 6 deletions(-) diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c index 012cf5a..2936b2e 100644 --- a/net/ipv4/fib_trie.c +++ b/net/ipv4/fib_trie.c @@ -366,6 +366,14 @@ static void __tnode_vfree(struct work_struct *arg) vfree(tn); } +static void __tnode_free(struct tnode *tn) +{ + if (size <= PAGE_SIZE) + kfree(tn); + else + vfree(tn); +} + static void __tnode_free_rcu(struct rcu_head *head) { struct tnode *tn = container_of(head, struct tnode, rcu); @@ -402,7 +410,7 @@ static void tnode_free_flush(void) while ((tn = tnode_free_head)) { tnode_free_head = tn->tnode_free; tn->tnode_free = NULL; - tnode_free(tn); + __tnode_free(tn); } } @@ -1021,18 +1029,25 @@ static void trie_rebalance(struct trie *t, struct tnode *tn) (struct node *)tn, wasfull); tp = node_parent((struct node *) tn); - tnode_free_flush(); if (!tp) break; tn = tp; } + if (tnode_free_head) { + synchronize_rcu(); + tnode_free_flush(); + } + /* Handle last (top) tnode */ - if (IS_TNODE(tn)) + if (IS_TNODE(tn)) { tn = (struct tnode *)resize(t, (struct tnode *)tn); - - rcu_assign_pointer(t->trie, (struct node *)tn); - tnode_free_flush(); + rcu_assign_pointer(t->trie, (struct node *)tn); + synchronize_rcu(); + tnode_free_flush(); + } else { + rcu_assign_pointer(t->trie, (struct node *)tn); + } return; } --- commit e0f7cb8c8cc6cccce28d2ce39ad8c60d23c3799f Author: Jarek Poplawski <jarkao2@gmail.com> Date: Mon Jun 15 02:31:29 2009 -0700 ipv4: Fix fib_trie rebalancing While doing trie_rebalance(): resize(), inflate(), halve() RCU free tnodes before updating their parents. It depends on RCU delaying the real destruction, but if RCU readers start after call_rcu() and before parent update they could access freed memory. 
It is currently prevented with preempt_disable() on the update side, but it's not safe, except maybe classic RCU, plus it conflicts with memory allocations with GFP_KERNEL flag used from these functions. This patch explicitly delays freeing of tnodes by adding them to the list, which is flushed after the update is finished. Reported-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c index 538d2a9..d1a39b1 100644 --- a/net/ipv4/fib_trie.c +++ b/net/ipv4/fib_trie.c @@ -123,6 +123,7 @@ struct tnode { union { struct rcu_head rcu; struct work_struct work; + struct tnode *tnode_free; }; struct node *child[0]; }; @@ -161,6 +162,8 @@ static void tnode_put_child_reorg(struct tnode *tn, int i, struct node *n, static struct node *resize(struct trie *t, struct tnode *tn); static struct tnode *inflate(struct trie *t, struct tnode *tn); static struct tnode *halve(struct trie *t, struct tnode *tn); +/* tnodes to free after resize(); protected by RTNL */ +static struct tnode *tnode_free_head; static struct kmem_cache *fn_alias_kmem __read_mostly; static struct kmem_cache *trie_leaf_kmem __read_mostly; @@ -385,6 +388,29 @@ static inline void tnode_free(struct tnode *tn) call_rcu(&tn->rcu, __tnode_free_rcu); } +static void tnode_free_safe(struct tnode *tn) +{ + BUG_ON(IS_LEAF(tn)); + + if (node_parent((struct node *) tn)) { + tn->tnode_free = tnode_free_head; + tnode_free_head = tn; + } else { + tnode_free(tn); + } +} + +static void tnode_free_flush(void) +{ + struct tnode *tn; + + while ((tn = tnode_free_head)) { + tnode_free_head = tn->tnode_free; + tn->tnode_free = NULL; + tnode_free(tn); + } +} + static struct leaf *leaf_new(void) { struct leaf *l = kmem_cache_alloc(trie_leaf_kmem, GFP_KERNEL); @@ -495,7 +521,7 @@ static struct node *resize(struct trie *t, struct tnode *tn) /* No children */ if (tn->empty_children == 
tnode_child_length(tn)) { - tnode_free(tn); + tnode_free_safe(tn); return NULL; } /* One child */ @@ -509,7 +535,7 @@ static struct node *resize(struct trie *t, struct tnode *tn) /* compress one level */ node_set_parent(n, NULL); - tnode_free(tn); + tnode_free_safe(tn); return n; } /* @@ -670,7 +696,7 @@ static struct node *resize(struct trie *t, struct tnode *tn) /* compress one level */ node_set_parent(n, NULL); - tnode_free(tn); + tnode_free_safe(tn); return n; } @@ -756,7 +782,7 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn) put_child(t, tn, 2*i, inode->child[0]); put_child(t, tn, 2*i+1, inode->child[1]); - tnode_free(inode); + tnode_free_safe(inode); continue; } @@ -801,9 +827,9 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn) put_child(t, tn, 2*i, resize(t, left)); put_child(t, tn, 2*i+1, resize(t, right)); - tnode_free(inode); + tnode_free_safe(inode); } - tnode_free(oldtnode); + tnode_free_safe(oldtnode); return tn; nomem: { @@ -885,7 +911,7 @@ static struct tnode *halve(struct trie *t, struct tnode *tn) put_child(t, newBinNode, 1, right); put_child(t, tn, i/2, resize(t, newBinNode)); } - tnode_free(oldtnode); + tnode_free_safe(oldtnode); return tn; nomem: { @@ -989,7 +1015,6 @@ static struct node *trie_rebalance(struct trie *t, struct tnode *tn) t_key cindex, key; struct tnode *tp; - preempt_disable(); key = tn->key; while (tn != NULL && (tp = node_parent((struct node *)tn)) != NULL) { @@ -1001,16 +1026,18 @@ static struct node *trie_rebalance(struct trie *t, struct tnode *tn) (struct node *)tn, wasfull); tp = node_parent((struct node *) tn); + tnode_free_flush(); if (!tp) break; tn = tp; } /* Handle last (top) tnode */ - if (IS_TNODE(tn)) + if (IS_TNODE(tn)) { tn = (struct tnode *)resize(t, (struct tnode *)tn); + tnode_free_flush(); + } - preempt_enable(); return (struct node *)tn; } --- commit 7b85576d15bf2574b0a451108f59f9ad4170dd3f Author: Jarek Poplawski <jarkao2@gmail.com> Date: Thu Jun 18 00:28:51 2009 -0700 ipv4: 
Fix fib_trie rebalancing, part 2 My previous patch, which explicitly delays freeing of tnodes by adding them to the list to flush them after the update is finished, isn't strict enough. It treats exceptionally tnodes without parent, assuming they are newly created, so "invisible" for the read side yet. But the top tnode doesn't have parent as well, so we have to exclude all exceptions (at least until a better way is found). Additionally we need to move rcu assignment of this node before flushing, so the return type of the trie_rebalance() function is changed. Reported-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c index d1a39b1..012cf5a 100644 --- a/net/ipv4/fib_trie.c +++ b/net/ipv4/fib_trie.c @@ -391,13 +391,8 @@ static inline void tnode_free(struct tnode *tn) static void tnode_free_safe(struct tnode *tn) { BUG_ON(IS_LEAF(tn)); - - if (node_parent((struct node *) tn)) { - tn->tnode_free = tnode_free_head; - tnode_free_head = tn; - } else { - tnode_free(tn); - } + tn->tnode_free = tnode_free_head; + tnode_free_head = tn; } static void tnode_free_flush(void) @@ -1009,7 +1004,7 @@ fib_find_node(struct trie *t, u32 key) return NULL; } -static struct node *trie_rebalance(struct trie *t, struct tnode *tn) +static void trie_rebalance(struct trie *t, struct tnode *tn) { int wasfull; t_key cindex, key; @@ -1033,12 +1028,13 @@ static struct node *trie_rebalance(struct trie *t, struct tnode *tn) } /* Handle last (top) tnode */ - if (IS_TNODE(tn)) { + if (IS_TNODE(tn)) tn = (struct tnode *)resize(t, (struct tnode *)tn); - tnode_free_flush(); - } - return (struct node *)tn; + rcu_assign_pointer(t->trie, (struct node *)tn); + tnode_free_flush(); + + return; } /* only used from updater-side */ @@ -1186,7 +1182,7 @@ static struct list_head *fib_insert_node(struct trie *t, u32 key, int plen) /* Rebalance the trie */ - 
rcu_assign_pointer(t->trie, trie_rebalance(t, tp)); + trie_rebalance(t, tp); done: return fa_head; } @@ -1605,7 +1601,7 @@ static void trie_leaf_remove(struct trie *t, struct leaf *l) if (tp) { t_key cindex = tkey_extract_bits(l->key, tp->pos, tp->bits); put_child(t, (struct tnode *)tp, cindex, NULL); - rcu_assign_pointer(t->trie, trie_rebalance(t, tp)); + trie_rebalance(t, tp); } else rcu_assign_pointer(t->trie, NULL); ^ permalink raw reply related [flat|nested] 99+ messages in thread
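The deferred-free idea described in the two commits quoted above can be modelled in user space roughly as follows. This is only a minimal sketch, not the kernel code: `struct fake_tnode`, `defer_free()`, `flush_deferred()`, `freed_count` and `demo_deferred_free()` are hypothetical names made up for illustration, and plain `free()` stands in for the RCU-deferred release.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct tnode: only the free-list link matters. */
struct fake_tnode {
	int id;
	struct fake_tnode *next_free;	/* plays the role of tn->tnode_free */
};

static struct fake_tnode *free_head;	/* plays the role of tnode_free_head */
static int freed_count;			/* bookkeeping for the demo only */

/* Queue a node instead of releasing it at once (cf. tnode_free_safe()). */
static void defer_free(struct fake_tnode *tn)
{
	tn->next_free = free_head;
	free_head = tn;
}

/* Release everything queued so far (cf. tnode_free_flush()). */
static void flush_deferred(void)
{
	struct fake_tnode *tn;

	while ((tn = free_head) != NULL) {
		free_head = tn->next_free;
		tn->next_free = NULL;
		free(tn);	/* assumption: the kernel defers this via RCU */
		freed_count++;
	}
}

/* Queue n nodes, check none is freed early, then flush; returns count freed. */
static int demo_deferred_free(int n)
{
	int i;

	freed_count = 0;
	for (i = 0; i < n; i++) {
		struct fake_tnode *tn = malloc(sizeof(*tn));

		assert(tn != NULL);
		tn->id = i;
		defer_free(tn);
	}
	assert(freed_count == 0);	/* still queued: update not finished */
	flush_deferred();
	return freed_count;
}
```

The point of the list is ordering: nothing is released while the update is still in progress, and the flush only runs once the caller decides the new structure is in place.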
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits 2009-06-26 13:28 ` Jarek Poplawski @ 2009-06-26 13:52 ` Robert Olsson 2009-06-26 15:10 ` Jarek Poplawski 0 siblings, 1 reply; 99+ messages in thread From: Robert Olsson @ 2009-06-26 13:52 UTC (permalink / raw) To: Jarek Poplawski Cc: Robert Olsson, Eric Dumazet, =?ISO-8859-2?Q?Pawe=B3_Staszewski?=, Robert Olsson, Linux Network Development list Jarek Poplawski writes: Thanks, Should be worth testing so we synchronize_rcu instead of doing call_rcu's Cheers --ro > Alternatively here is a faster version with less synchronize_rcu(). > > Jarek P. > > --- (take 2 - for testing) > > net/ipv4/fib_trie.c | 27 +++++++++++++++++++++------ > 1 files changed, 21 insertions(+), 6 deletions(-) > > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c > index 012cf5a..2936b2e 100644 > --- a/net/ipv4/fib_trie.c > +++ b/net/ipv4/fib_trie.c > @@ -366,6 +366,14 @@ static void __tnode_vfree(struct work_struct *arg) > vfree(tn); > } > > +static void __tnode_free(struct tnode *tn) > +{ > + if (size <= PAGE_SIZE) > + kfree(tn); > + else > + vfree(tn); > +} > + > static void __tnode_free_rcu(struct rcu_head *head) > { > struct tnode *tn = container_of(head, struct tnode, rcu); > @@ -402,7 +410,7 @@ static void tnode_free_flush(void) > while ((tn = tnode_free_head)) { > tnode_free_head = tn->tnode_free; > tn->tnode_free = NULL; > - tnode_free(tn); > + __tnode_free(tn); > } > } > > @@ -1021,18 +1029,25 @@ static void trie_rebalance(struct trie *t, struct tnode *tn) > (struct node *)tn, wasfull); > > tp = node_parent((struct node *) tn); > - tnode_free_flush(); > if (!tp) > break; > tn = tp; > } > > + if (tnode_free_head) { > + synchronize_rcu(); > + tnode_free_flush(); > + } > + > /* Handle last (top) tnode */ > - if (IS_TNODE(tn)) > + if (IS_TNODE(tn)) { > tn = (struct tnode *)resize(t, (struct tnode *)tn); > - > - rcu_assign_pointer(t->trie, (struct node *)tn); > - tnode_free_flush(); > + rcu_assign_pointer(t->trie, (struct 
node *)tn); > + synchronize_rcu(); > + tnode_free_flush(); > + } else { > + rcu_assign_pointer(t->trie, (struct node *)tn); > + } > > return; > } > > > > --- > commit e0f7cb8c8cc6cccce28d2ce39ad8c60d23c3799f > Author: Jarek Poplawski <jarkao2@gmail.com> > Date: Mon Jun 15 02:31:29 2009 -0700 > > ipv4: Fix fib_trie rebalancing > > While doing trie_rebalance(): resize(), inflate(), halve() RCU free > tnodes before updating their parents. It depends on RCU delaying the > real destruction, but if RCU readers start after call_rcu() and before > parent update they could access freed memory. > > It is currently prevented with preempt_disable() on the update side, > but it's not safe, except maybe classic RCU, plus it conflicts with > memory allocations with GFP_KERNEL flag used from these functions. > > This patch explicitly delays freeing of tnodes by adding them to the > list, which is flushed after the update is finished. > > Reported-by: Yan Zheng <zheng.yan@oracle.com> > Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> > Signed-off-by: David S. 
Miller <davem@davemloft.net> > > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c > index 538d2a9..d1a39b1 100644 > --- a/net/ipv4/fib_trie.c > +++ b/net/ipv4/fib_trie.c > @@ -123,6 +123,7 @@ struct tnode { > union { > struct rcu_head rcu; > struct work_struct work; > + struct tnode *tnode_free; > }; > struct node *child[0]; > }; > @@ -161,6 +162,8 @@ static void tnode_put_child_reorg(struct tnode *tn, int i, struct node *n, > static struct node *resize(struct trie *t, struct tnode *tn); > static struct tnode *inflate(struct trie *t, struct tnode *tn); > static struct tnode *halve(struct trie *t, struct tnode *tn); > +/* tnodes to free after resize(); protected by RTNL */ > +static struct tnode *tnode_free_head; > > static struct kmem_cache *fn_alias_kmem __read_mostly; > static struct kmem_cache *trie_leaf_kmem __read_mostly; > @@ -385,6 +388,29 @@ static inline void tnode_free(struct tnode *tn) > call_rcu(&tn->rcu, __tnode_free_rcu); > } > > +static void tnode_free_safe(struct tnode *tn) > +{ > + BUG_ON(IS_LEAF(tn)); > + > + if (node_parent((struct node *) tn)) { > + tn->tnode_free = tnode_free_head; > + tnode_free_head = tn; > + } else { > + tnode_free(tn); > + } > +} > + > +static void tnode_free_flush(void) > +{ > + struct tnode *tn; > + > + while ((tn = tnode_free_head)) { > + tnode_free_head = tn->tnode_free; > + tn->tnode_free = NULL; > + tnode_free(tn); > + } > +} > + > static struct leaf *leaf_new(void) > { > struct leaf *l = kmem_cache_alloc(trie_leaf_kmem, GFP_KERNEL); > @@ -495,7 +521,7 @@ static struct node *resize(struct trie *t, struct tnode *tn) > > /* No children */ > if (tn->empty_children == tnode_child_length(tn)) { > - tnode_free(tn); > + tnode_free_safe(tn); > return NULL; > } > /* One child */ > @@ -509,7 +535,7 @@ static struct node *resize(struct trie *t, struct tnode *tn) > > /* compress one level */ > node_set_parent(n, NULL); > - tnode_free(tn); > + tnode_free_safe(tn); > return n; > } > /* > @@ -670,7 +696,7 @@ static struct node 
*resize(struct trie *t, struct tnode *tn) > /* compress one level */ > > node_set_parent(n, NULL); > - tnode_free(tn); > + tnode_free_safe(tn); > return n; > } > > @@ -756,7 +782,7 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn) > put_child(t, tn, 2*i, inode->child[0]); > put_child(t, tn, 2*i+1, inode->child[1]); > > - tnode_free(inode); > + tnode_free_safe(inode); > continue; > } > > @@ -801,9 +827,9 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn) > put_child(t, tn, 2*i, resize(t, left)); > put_child(t, tn, 2*i+1, resize(t, right)); > > - tnode_free(inode); > + tnode_free_safe(inode); > } > - tnode_free(oldtnode); > + tnode_free_safe(oldtnode); > return tn; > nomem: > { > @@ -885,7 +911,7 @@ static struct tnode *halve(struct trie *t, struct tnode *tn) > put_child(t, newBinNode, 1, right); > put_child(t, tn, i/2, resize(t, newBinNode)); > } > - tnode_free(oldtnode); > + tnode_free_safe(oldtnode); > return tn; > nomem: > { > @@ -989,7 +1015,6 @@ static struct node *trie_rebalance(struct trie *t, struct tnode *tn) > t_key cindex, key; > struct tnode *tp; > > - preempt_disable(); > key = tn->key; > > while (tn != NULL && (tp = node_parent((struct node *)tn)) != NULL) { > @@ -1001,16 +1026,18 @@ static struct node *trie_rebalance(struct trie *t, struct tnode *tn) > (struct node *)tn, wasfull); > > tp = node_parent((struct node *) tn); > + tnode_free_flush(); > if (!tp) > break; > tn = tp; > } > > /* Handle last (top) tnode */ > - if (IS_TNODE(tn)) > + if (IS_TNODE(tn)) { > tn = (struct tnode *)resize(t, (struct tnode *)tn); > + tnode_free_flush(); > + } > > - preempt_enable(); > return (struct node *)tn; > } > > --- > commit 7b85576d15bf2574b0a451108f59f9ad4170dd3f > Author: Jarek Poplawski <jarkao2@gmail.com> > Date: Thu Jun 18 00:28:51 2009 -0700 > > ipv4: Fix fib_trie rebalancing, part 2 > > My previous patch, which explicitly delays freeing of tnodes by adding > them to the list to flush them after the update is finished, isn't > 
strict enough. It treats exceptionally tnodes without parent, assuming > they are newly created, so "invisible" for the read side yet. > > But the top tnode doesn't have parent as well, so we have to exclude > all exceptions (at least until a better way is found). Additionally we > need to move rcu assignment of this node before flushing, so the > return type of the trie_rebalance() function is changed. > > Reported-by: Yan Zheng <zheng.yan@oracle.com> > Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> > Signed-off-by: David S. Miller <davem@davemloft.net> > > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c > index d1a39b1..012cf5a 100644 > --- a/net/ipv4/fib_trie.c > +++ b/net/ipv4/fib_trie.c > @@ -391,13 +391,8 @@ static inline void tnode_free(struct tnode *tn) > static void tnode_free_safe(struct tnode *tn) > { > BUG_ON(IS_LEAF(tn)); > - > - if (node_parent((struct node *) tn)) { > - tn->tnode_free = tnode_free_head; > - tnode_free_head = tn; > - } else { > - tnode_free(tn); > - } > + tn->tnode_free = tnode_free_head; > + tnode_free_head = tn; > } > > static void tnode_free_flush(void) > @@ -1009,7 +1004,7 @@ fib_find_node(struct trie *t, u32 key) > return NULL; > } > > -static struct node *trie_rebalance(struct trie *t, struct tnode *tn) > +static void trie_rebalance(struct trie *t, struct tnode *tn) > { > int wasfull; > t_key cindex, key; > @@ -1033,12 +1028,13 @@ static struct node *trie_rebalance(struct trie *t, struct tnode *tn) > } > > /* Handle last (top) tnode */ > - if (IS_TNODE(tn)) { > + if (IS_TNODE(tn)) > tn = (struct tnode *)resize(t, (struct tnode *)tn); > - tnode_free_flush(); > - } > > - return (struct node *)tn; > + rcu_assign_pointer(t->trie, (struct node *)tn); > + tnode_free_flush(); > + > + return; > } > > /* only used from updater-side */ > @@ -1186,7 +1182,7 @@ static struct list_head *fib_insert_node(struct trie *t, u32 key, int plen) > > /* Rebalance the trie */ > > - rcu_assign_pointer(t->trie, trie_rebalance(t, tp)); > + 
trie_rebalance(t, tp); > done: > return fa_head; > } > @@ -1605,7 +1601,7 @@ static void trie_leaf_remove(struct trie *t, struct leaf *l) > if (tp) { > t_key cindex = tkey_extract_bits(l->key, tp->pos, tp->bits); > put_child(t, (struct tnode *)tp, cindex, NULL); > - rcu_assign_pointer(t->trie, trie_rebalance(t, tp)); > + trie_rebalance(t, tp); > } else > rcu_assign_pointer(t->trie, NULL); > > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-26 13:52 ` Robert Olsson
@ 2009-06-26 15:10 ` Jarek Poplawski
2009-06-26 15:30 ` Paul E. McKenney
0 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-26 15:10 UTC (permalink / raw)
To: Robert Olsson
Cc: Robert Olsson, Eric Dumazet, Paweł Staszewski, Robert Olsson, Linux Network Development list

On Fri, Jun 26, 2009 at 03:52:55PM +0200, Robert Olsson wrote:
>
> Jarek Poplawski writes:
>
> Thanks,
>
> Should be worth testing so we synchronize_rcu instead of doing call_rcu's
>

Alas take 2 (nor 1) doesn't compile, so here it is again.

Thanks,
Jarek P.
--- (take 3 - for testing)

 net/ipv4/fib_trie.c |   30 ++++++++++++++++++++++++------
 1 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 012cf5a..1a4c4b7 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -366,6 +366,17 @@ static void __tnode_vfree(struct work_struct *arg)
 	vfree(tn);
 }
 
+static void __tnode_free(struct tnode *tn)
+{
+	size_t size = sizeof(struct tnode) +
+		      (sizeof(struct node *) << tn->bits);
+
+	if (size <= PAGE_SIZE)
+		kfree(tn);
+	else
+		vfree(tn);
+}
+
 static void __tnode_free_rcu(struct rcu_head *head)
 {
 	struct tnode *tn = container_of(head, struct tnode, rcu);
@@ -402,7 +413,7 @@ static void tnode_free_flush(void)
 	while ((tn = tnode_free_head)) {
 		tnode_free_head = tn->tnode_free;
 		tn->tnode_free = NULL;
-		tnode_free(tn);
+		__tnode_free(tn);
 	}
 }
 
@@ -1021,18 +1032,25 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
 				      (struct node *)tn, wasfull);
 
 		tp = node_parent((struct node *) tn);
-		tnode_free_flush();
 		if (!tp)
 			break;
 		tn = tp;
 	}
 
+	if (tnode_free_head) {
+		synchronize_rcu();
+		tnode_free_flush();
+	}
+
 	/* Handle last (top) tnode */
-	if (IS_TNODE(tn))
+	if (IS_TNODE(tn)) {
 		tn = (struct tnode *)resize(t, (struct tnode *)tn);
-
-	rcu_assign_pointer(t->trie, (struct node *)tn);
-	tnode_free_flush();
+		rcu_assign_pointer(t->trie, (struct node *)tn);
+		synchronize_rcu();
+		tnode_free_flush();
+	} else {
+		rcu_assign_pointer(t->trie, (struct node *)tn);
+	}
 
 	return;
 }
^ permalink raw reply related [flat|nested] 99+ messages in thread
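The size test in `__tnode_free()` above can be worked through with concrete numbers. The following is a sketch under stated assumptions rather than kernel code: the 56-byte header is taken from the "size of tnode" line in the fib_triestat output at the start of the thread, while the 8-byte pointer and the 4096-byte page are assumed values for a 64-bit machine. All `EX_`/`ex_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration; the real ones depend on arch and config. */
#define EX_TNODE_HDR	56u	/* "size of tnode: 56 bytes" from fib_triestat */
#define EX_PTR_SIZE	8u	/* assumed sizeof(struct node *) on 64-bit */
#define EX_PAGE_SIZE	4096u	/* assumed PAGE_SIZE */

/* Same arithmetic as the patch: header plus 2^bits child pointers. */
static size_t ex_tnode_size(unsigned int bits)
{
	assert(bits < 32);	/* keep the shift well inside size_t range */
	return (size_t)EX_TNODE_HDR + ((size_t)EX_PTR_SIZE << bits);
}

/* Returns 1 if the node would be released with vfree(), 0 for kfree(). */
static int ex_uses_vfree(unsigned int bits)
{
	return ex_tnode_size(bits) > EX_PAGE_SIZE;
}
```

With these assumptions an 8-bit tnode is 2104 bytes and stays on the `kfree()` side, a 9-bit tnode is 4152 bytes and just crosses the page boundary, and an 18-bit tnode (the largest one listed in the triestat output) is 2097208 bytes, roughly 2 MB for a single node.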
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-26 15:10 ` Jarek Poplawski
@ 2009-06-26 15:30 ` Paul E. McKenney
2009-06-26 15:54 ` Jarek Poplawski
0 siblings, 1 reply; 99+ messages in thread
From: Paul E. McKenney @ 2009-06-26 15:30 UTC (permalink / raw)
To: Jarek Poplawski
Cc: Robert Olsson, Robert Olsson, Eric Dumazet, Paweł Staszewski, Robert Olsson, Linux Network Development list

On Fri, Jun 26, 2009 at 05:10:52PM +0200, Jarek Poplawski wrote:
> On Fri, Jun 26, 2009 at 03:52:55PM +0200, Robert Olsson wrote:
> >
> > Jarek Poplawski writes:
> >
> > Thanks,
> >
> > Should be worth testing so we synchronize_rcu instead of doing call_rcu's
> >
>
> Alas take 2 (nor 1) doesn't compile, so here it is again.

So the idea is to balance memory and latency, so that large changes
(those affecting the root node) get at least one synchronize_rcu(),
while smaller changes just use call_rcu(), correct? This means that
the amount of memory awaiting an RCU grace period is limited, but
the algorithm avoids per-node synchronize_rcu() overhead.

If I understand the goal correctly, looks good! (Give or take my
limited understanding of fib_trie and its usage, of course.)

Thanx, Paul

> Thanks,
> Jarek P.
> --- (take 3 - for testing) > > net/ipv4/fib_trie.c | 30 ++++++++++++++++++++++++------ > 1 files changed, 24 insertions(+), 6 deletions(-) > > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c > index 012cf5a..1a4c4b7 100644 > --- a/net/ipv4/fib_trie.c > +++ b/net/ipv4/fib_trie.c > @@ -366,6 +366,17 @@ static void __tnode_vfree(struct work_struct *arg) > vfree(tn); > } > > +static void __tnode_free(struct tnode *tn) > +{ > + size_t size = sizeof(struct tnode) + > + (sizeof(struct node *) << tn->bits); > + > + if (size <= PAGE_SIZE) > + kfree(tn); > + else > + vfree(tn); > +} > + > static void __tnode_free_rcu(struct rcu_head *head) > { > struct tnode *tn = container_of(head, struct tnode, rcu); > @@ -402,7 +413,7 @@ static void tnode_free_flush(void) > while ((tn = tnode_free_head)) { > tnode_free_head = tn->tnode_free; > tn->tnode_free = NULL; > - tnode_free(tn); > + __tnode_free(tn); > } > } > > @@ -1021,18 +1032,25 @@ static void trie_rebalance(struct trie *t, struct tnode *tn) > (struct node *)tn, wasfull); > > tp = node_parent((struct node *) tn); > - tnode_free_flush(); > if (!tp) > break; > tn = tp; > } > > + if (tnode_free_head) { > + synchronize_rcu(); > + tnode_free_flush(); > + } > + > /* Handle last (top) tnode */ > - if (IS_TNODE(tn)) > + if (IS_TNODE(tn)) { > tn = (struct tnode *)resize(t, (struct tnode *)tn); > - > - rcu_assign_pointer(t->trie, (struct node *)tn); > - tnode_free_flush(); > + rcu_assign_pointer(t->trie, (struct node *)tn); > + synchronize_rcu(); > + tnode_free_flush(); > + } else { > + rcu_assign_pointer(t->trie, (struct node *)tn); > + } > > return; > } > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 99+ messages in thread
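The trade-off discussed above (bounded memory awaiting a grace period, without a wait per node) can be illustrated by counting grace-period waits in a toy model of the take 3 control flow. This is a sketch only, assuming the top node is a tnode; the stubs do not implement RCU, and every name here (`stub_synchronize_rcu()`, `stub_queue()`, `stub_flush()`, `model_rebalance()`) is hypothetical.

```c
#include <assert.h>

static int pending;	/* nodes queued on the modelled deferred-free list */
static int gp_waits;	/* how many times the stub grace-period wait ran */
static int max_pending;	/* high-water mark of queued nodes */

/* Stub: only counts calls, it does not wait for any readers. */
static void stub_synchronize_rcu(void)
{
	gp_waits++;
}

/* Stub: n replaced tnodes are put on the deferred-free list. */
static void stub_queue(int n)
{
	pending += n;
	if (pending > max_pending)
		max_pending = pending;
}

/* Stub: everything queued is released. */
static void stub_flush(void)
{
	pending = 0;
}

/*
 * Model of take 3: walk `levels` non-root levels, each queuing `per_level`
 * replaced nodes; wait once and flush; then resize the top node (queuing
 * `top` nodes), wait once more and flush. Returns the number of waits.
 */
static int model_rebalance(int levels, int per_level, int top)
{
	int i;

	pending = gp_waits = max_pending = 0;

	for (i = 0; i < levels; i++)
		stub_queue(per_level);	/* resize() of a non-root tnode */

	if (pending) {
		stub_synchronize_rcu();
		stub_flush();
	}

	stub_queue(top);		/* resize() of the top tnode */
	/* the real code publishes the new root here, before waiting */
	stub_synchronize_rcu();
	stub_flush();

	return gp_waits;
}
```

In this model the number of waits is at most two per rebalance regardless of how many levels were walked, while the memory held on the list peaks just before the first flush.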
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-26 15:30 ` Paul E. McKenney
@ 2009-06-26 15:54 ` Jarek Poplawski
2009-06-26 16:15 ` Jarek Poplawski
0 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-26 15:54 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Robert Olsson, Robert Olsson, Eric Dumazet, Paweł Staszewski, Robert Olsson, Linux Network Development list

On Fri, Jun 26, 2009 at 08:30:10AM -0700, Paul E. McKenney wrote:
> On Fri, Jun 26, 2009 at 05:10:52PM +0200, Jarek Poplawski wrote:
> > On Fri, Jun 26, 2009 at 03:52:55PM +0200, Robert Olsson wrote:
> > >
> > > Jarek Poplawski writes:
> > >
> > > Thanks,
> > >
> > > Should be worth testing so we synchronize_rcu instead of doing call_rcu's
> > >
> >
> > Alas take 2 (nor 1) doesn't compile, so here it is again.
>
> So the idea is to balance memory and latency, so that large changes
> (those affecting the root node) get at least one synchronize_rcu(),
> while smaller changes just use call_rcu(), correct? This means that
> the amount of memory awaiting an RCU grace period is limited, but
> the algorithm avoids per-node synchronize_rcu() overhead.
>
> If I understand the goal correctly, looks good! (Give or take my
> limited understanding of fib_trie and its usage, of course.)

The goal is practically to replace all call_rcu() during
trie_rebalance() with synchronize_rcu() (except some freeing after
ENOMEM). I guess currently (<= 2.6.30) call_rcu() can free this
memory after trie_rebalance() has finished, that's why there were
problems with enabled preemption. So this patch tries to do/force
this a bit earlier - at least before the top/largest node is
rebalanced.

Thanks,
Jarek P.

>
> Thanx, Paul
>
> > Thanks,
> > Jarek P.
> > --- (take 3 - for testing) > > > > net/ipv4/fib_trie.c | 30 ++++++++++++++++++++++++------ > > 1 files changed, 24 insertions(+), 6 deletions(-) > > > > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c > > index 012cf5a..1a4c4b7 100644 > > --- a/net/ipv4/fib_trie.c > > +++ b/net/ipv4/fib_trie.c > > @@ -366,6 +366,17 @@ static void __tnode_vfree(struct work_struct *arg) > > vfree(tn); > > } > > > > +static void __tnode_free(struct tnode *tn) > > +{ > > + size_t size = sizeof(struct tnode) + > > + (sizeof(struct node *) << tn->bits); > > + > > + if (size <= PAGE_SIZE) > > + kfree(tn); > > + else > > + vfree(tn); > > +} > > + > > static void __tnode_free_rcu(struct rcu_head *head) > > { > > struct tnode *tn = container_of(head, struct tnode, rcu); > > @@ -402,7 +413,7 @@ static void tnode_free_flush(void) > > while ((tn = tnode_free_head)) { > > tnode_free_head = tn->tnode_free; > > tn->tnode_free = NULL; > > - tnode_free(tn); > > + __tnode_free(tn); > > } > > } > > > > @@ -1021,18 +1032,25 @@ static void trie_rebalance(struct trie *t, struct tnode *tn) > > (struct node *)tn, wasfull); > > > > tp = node_parent((struct node *) tn); > > - tnode_free_flush(); > > if (!tp) > > break; > > tn = tp; > > } > > > > + if (tnode_free_head) { > > + synchronize_rcu(); > > + tnode_free_flush(); > > + } > > + > > /* Handle last (top) tnode */ > > - if (IS_TNODE(tn)) > > + if (IS_TNODE(tn)) { > > tn = (struct tnode *)resize(t, (struct tnode *)tn); > > - > > - rcu_assign_pointer(t->trie, (struct node *)tn); > > - tnode_free_flush(); > > + rcu_assign_pointer(t->trie, (struct node *)tn); > > + synchronize_rcu(); > > + tnode_free_flush(); > > + } else { > > + rcu_assign_pointer(t->trie, (struct node *)tn); > > + } > > > > return; > > } > > -- > > To unsubscribe from this list: send the line "unsubscribe netdev" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 99+ 
messages in thread
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-26 15:54 ` Jarek Poplawski
@ 2009-06-26 16:15 ` Jarek Poplawski
2009-06-26 16:23 ` Paul E. McKenney
0 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-26 16:15 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Robert Olsson, Robert Olsson, Eric Dumazet, Paweł Staszewski, Robert Olsson, Linux Network Development list

On Fri, Jun 26, 2009 at 05:54:10PM +0200, Jarek Poplawski wrote:
> On Fri, Jun 26, 2009 at 08:30:10AM -0700, Paul E. McKenney wrote:
> > On Fri, Jun 26, 2009 at 05:10:52PM +0200, Jarek Poplawski wrote:
> > > On Fri, Jun 26, 2009 at 03:52:55PM +0200, Robert Olsson wrote:
> > > >
> > > > Jarek Poplawski writes:
> > > >
> > > > Thanks,
> > > >
> > > > Should be worth testing so we synchronize_rcu instead of doing call_rcu's
> > > >
> > >
> > > Alas take 2 (nor 1) doesn't compile, so here it is again.
> >
> > So the idea is to balance memory and latency, so that large changes
> > (those affecting the root node) get at least one synchronize_rcu(),
> > while smaller changes just use call_rcu(), correct? This means that
> > the amount of memory awaiting an RCU grace period is limited, but
> > the algorithm avoids per-node synchronize_rcu() overhead.
> >
> > If I understand the goal correctly, looks good! (Give or take my
> > limited understanding of fib_trie and its usage, of course.)
>
> The goal is practically to replace all call_rcu() during
> trie_rebalance() with synchronize_rcu() (except some freeing after
> ENOMEM). I guess currently (<= 2.6.30) call_rcu() can free this
> memory after trie_rebalance() has finished, that's why there were
> problems with enabled preemption. So this patch tries to do/force
> this a bit earlier - at least before the top/largest node is
> rebalanced.

On the other hand, we could probably stay with call_rcu() plus only
one synchronize_rcu() before the top node's resize() if you think it's
enough here?

Thanks, Jarek P. > > > > > Thanx, Paul > > > > > Thanks, > > > Jarek P. > > > --- (take 3 - for testing) > > > > > > net/ipv4/fib_trie.c | 30 ++++++++++++++++++++++++------ > > > 1 files changed, 24 insertions(+), 6 deletions(-) > > > > > > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c > > > index 012cf5a..1a4c4b7 100644 > > > --- a/net/ipv4/fib_trie.c > > > +++ b/net/ipv4/fib_trie.c > > > @@ -366,6 +366,17 @@ static void __tnode_vfree(struct work_struct *arg) > > > vfree(tn); > > > } > > > > > > +static void __tnode_free(struct tnode *tn) > > > +{ > > > + size_t size = sizeof(struct tnode) + > > > + (sizeof(struct node *) << tn->bits); > > > + > > > + if (size <= PAGE_SIZE) > > > + kfree(tn); > > > + else > > > + vfree(tn); > > > +} > > > + > > > static void __tnode_free_rcu(struct rcu_head *head) > > > { > > > struct tnode *tn = container_of(head, struct tnode, rcu); > > > @@ -402,7 +413,7 @@ static void tnode_free_flush(void) > > > while ((tn = tnode_free_head)) { > > > tnode_free_head = tn->tnode_free; > > > tn->tnode_free = NULL; > > > - tnode_free(tn); > > > + __tnode_free(tn); > > > } > > > } > > > > > > @@ -1021,18 +1032,25 @@ static void trie_rebalance(struct trie *t, struct tnode *tn) > > > (struct node *)tn, wasfull); > > > > > > tp = node_parent((struct node *) tn); > > > - tnode_free_flush(); > > > if (!tp) > > > break; > > > tn = tp; > > > } > > > > > > + if (tnode_free_head) { > > > + synchronize_rcu(); > > > + tnode_free_flush(); > > > + } > > > + > > > /* Handle last (top) tnode */ > > > - if (IS_TNODE(tn)) > > > + if (IS_TNODE(tn)) { > > > tn = (struct tnode *)resize(t, (struct tnode *)tn); > > > - > > > - rcu_assign_pointer(t->trie, (struct node *)tn); > > > - tnode_free_flush(); > > > + rcu_assign_pointer(t->trie, (struct node *)tn); > > > + synchronize_rcu(); > > > + tnode_free_flush(); > > > + } else { > > > + rcu_assign_pointer(t->trie, (struct node *)tn); > > > + } > > > > > > return; > > > } > > > -- > > > To unsubscribe from this 
list: send the line "unsubscribe netdev" in > > > the body of a message to majordomo@vger.kernel.org > > > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits 2009-06-26 16:15 ` Jarek Poplawski @ 2009-06-26 16:23 ` Paul E. McKenney 2009-06-26 16:45 ` Jarek Poplawski 0 siblings, 1 reply; 99+ messages in thread From: Paul E. McKenney @ 2009-06-26 16:23 UTC (permalink / raw) To: Jarek Poplawski Cc: Robert Olsson, Robert Olsson, Eric Dumazet, =?ISO-8859-2?Q?Pawe=B3_Staszewski?=, Robert Olsson, Linux Network Development list On Fri, Jun 26, 2009 at 06:15:00PM +0200, Jarek Poplawski wrote: > On Fri, Jun 26, 2009 at 05:54:10PM +0200, Jarek Poplawski wrote: > > On Fri, Jun 26, 2009 at 08:30:10AM -0700, Paul E. McKenney wrote: > > > On Fri, Jun 26, 2009 at 05:10:52PM +0200, Jarek Poplawski wrote: > > > > On Fri, Jun 26, 2009 at 03:52:55PM +0200, Robert Olsson wrote: > > > > > > > > > > Jarek Poplawski writes: > > > > > > > > > > Thanks, > > > > > > > > > > Should be worth testing so we synchronize_rcu instead of doing call_rcu's > > > > > > > > > > > > > Alas take 2 (nor 1) doesn't compile, so here it is again. > > > > > > So the idea is to balance memory and latency, so that large changes > > > (those affecting the root node) get at least one synchronize_rcu(), > > > while smaller changes just use call_rcu(), correct? This means that > > > the amount of memory awaiting an RCU grace period is limited, but > > > the algorithm avoids per-node synchronize_rcu() overhead. > > > > > > If I understand the goal correctly, looks good! (Give or take my > > > limited understanding of fib_trie and is usage, of course.) > > > > The goal is practically to replace all call_rcu() during > > trie_rebalance() with synchronize_rcu() (except some freeing after > > ENOMEM). I guess currently (<= 2.6.30) call_rcu() can free this > > memory after trie_rebalance() has finished, that's why there were > > problems with enabled preemption. So this patch tries to do/force > > this a bit earlier - at least before the top/largest node is > > rebalanced. 
> > On the other hand, we could probably stay with call_rcu() plus only > one synchronize_rcu() before the top node's resize() if you think it's > enough here? Well, my first task is to understand the problem/goal. ;-) My guess from what you said above is that use of call_rcu(), when combined with changes to the trie in rapid succession, is resulting in excessive memory awaiting a grace period. Is this the case, or am I confused? Thanx, Paul > Thanks, > Jarek P. > > > > > > > > > Thanx, Paul > > > > > > > Thanks, > > > > Jarek P. > > > > --- (take 3 - for testing) > > > > > > > > net/ipv4/fib_trie.c | 30 ++++++++++++++++++++++++------ > > > > 1 files changed, 24 insertions(+), 6 deletions(-) > > > > > > > > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c > > > > index 012cf5a..1a4c4b7 100644 > > > > --- a/net/ipv4/fib_trie.c > > > > +++ b/net/ipv4/fib_trie.c > > > > @@ -366,6 +366,17 @@ static void __tnode_vfree(struct work_struct *arg) > > > > vfree(tn); > > > > } > > > > > > > > +static void __tnode_free(struct tnode *tn) > > > > +{ > > > > + size_t size = sizeof(struct tnode) + > > > > + (sizeof(struct node *) << tn->bits); > > > > + > > > > + if (size <= PAGE_SIZE) > > > > + kfree(tn); > > > > + else > > > > + vfree(tn); > > > > +} > > > > + > > > > static void __tnode_free_rcu(struct rcu_head *head) > > > > { > > > > struct tnode *tn = container_of(head, struct tnode, rcu); > > > > @@ -402,7 +413,7 @@ static void tnode_free_flush(void) > > > > while ((tn = tnode_free_head)) { > > > > tnode_free_head = tn->tnode_free; > > > > tn->tnode_free = NULL; > > > > - tnode_free(tn); > > > > + __tnode_free(tn); > > > > } > > > > } > > > > > > > > @@ -1021,18 +1032,25 @@ static void trie_rebalance(struct trie *t, struct tnode *tn) > > > > (struct node *)tn, wasfull); > > > > > > > > tp = node_parent((struct node *) tn); > > > > - tnode_free_flush(); > > > > if (!tp) > > > > break; > > > > tn = tp; > > > > } > > > > > > > > + if (tnode_free_head) { > > > > + 
synchronize_rcu(); > > > > + tnode_free_flush(); > > > > + } > > > > + > > > > /* Handle last (top) tnode */ > > > > - if (IS_TNODE(tn)) > > > > + if (IS_TNODE(tn)) { > > > > tn = (struct tnode *)resize(t, (struct tnode *)tn); > > > > - > > > > - rcu_assign_pointer(t->trie, (struct node *)tn); > > > > - tnode_free_flush(); > > > > + rcu_assign_pointer(t->trie, (struct node *)tn); > > > > + synchronize_rcu(); > > > > + tnode_free_flush(); > > > > + } else { > > > > + rcu_assign_pointer(t->trie, (struct node *)tn); > > > > + } > > > > > > > > return; > > > > } > > > > -- > > > > To unsubscribe from this list: send the line "unsubscribe netdev" in > > > > the body of a message to majordomo@vger.kernel.org > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits 2009-06-26 16:23 ` Paul E. McKenney @ 2009-06-26 16:45 ` Jarek Poplawski 2009-06-26 17:05 ` Paul E. McKenney 0 siblings, 1 reply; 99+ messages in thread From: Jarek Poplawski @ 2009-06-26 16:45 UTC (permalink / raw) To: Paul E. McKenney Cc: Robert Olsson, Robert Olsson, Eric Dumazet, =?ISO-8859-2?Q?Pawe=B3_Staszewski?=, Robert Olsson, Linux Network Development list On Fri, Jun 26, 2009 at 09:23:40AM -0700, Paul E. McKenney wrote: > On Fri, Jun 26, 2009 at 06:15:00PM +0200, Jarek Poplawski wrote: > > On Fri, Jun 26, 2009 at 05:54:10PM +0200, Jarek Poplawski wrote: > > > On Fri, Jun 26, 2009 at 08:30:10AM -0700, Paul E. McKenney wrote: > > > > On Fri, Jun 26, 2009 at 05:10:52PM +0200, Jarek Poplawski wrote: > > > > > On Fri, Jun 26, 2009 at 03:52:55PM +0200, Robert Olsson wrote: > > > > > > > > > > > > Jarek Poplawski writes: > > > > > > > > > > > > Thanks, > > > > > > > > > > > > Should be worth testing so we synchronize_rcu instead of doing call_rcu's > > > > > > > > > > > > > > > > Alas take 2 (nor 1) doesn't compile, so here it is again. > > > > > > > > So the idea is to balance memory and latency, so that large changes > > > > (those affecting the root node) get at least one synchronize_rcu(), > > > > while smaller changes just use call_rcu(), correct? This means that > > > > the amount of memory awaiting an RCU grace period is limited, but > > > > the algorithm avoids per-node synchronize_rcu() overhead. > > > > > > > > If I understand the goal correctly, looks good! (Give or take my > > > > limited understanding of fib_trie and is usage, of course.) > > > > > > The goal is practically to replace all call_rcu() during > > > trie_rebalance() with synchronize_rcu() (except some freeing after > > > ENOMEM). I guess currently (<= 2.6.30) call_rcu() can free this > > > memory after trie_rebalance() has finished, that's why there were > > > problems with enabled preemption. 
So this patch tries to do/force > > > this a bit earlier - at least before the top/largest node is > > > rebalanced. > > > > On the other hand, we could probably stay with call_rcu() plus only > > one synchronize_rcu() before the top node's resize() if you think it's > > enough here? > > Well, my first task is to understand the problem/goal. ;-) > > My guess from what you said above is that use of call_rcu(), when > combined with changes to the trie in rapid succession, is resulting > in excessive memory awaiting a grace period. Is this the case, or am I > confused? Exactly! (I guess... ;-) Thanks, Jarek P. > > > > > > > > > > > > > Thanx, Paul > > > > > > > > > Thanks, > > > > > Jarek P. > > > > > --- (take 3 - for testing) > > > > > > > > > > net/ipv4/fib_trie.c | 30 ++++++++++++++++++++++++------ > > > > > 1 files changed, 24 insertions(+), 6 deletions(-) > > > > > > > > > > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c > > > > > index 012cf5a..1a4c4b7 100644 > > > > > --- a/net/ipv4/fib_trie.c > > > > > +++ b/net/ipv4/fib_trie.c > > > > > @@ -366,6 +366,17 @@ static void __tnode_vfree(struct work_struct *arg) > > > > > vfree(tn); > > > > > } > > > > > > > > > > +static void __tnode_free(struct tnode *tn) > > > > > +{ > > > > > + size_t size = sizeof(struct tnode) + > > > > > + (sizeof(struct node *) << tn->bits); > > > > > + > > > > > + if (size <= PAGE_SIZE) > > > > > + kfree(tn); > > > > > + else > > > > > + vfree(tn); > > > > > +} > > > > > + > > > > > static void __tnode_free_rcu(struct rcu_head *head) > > > > > { > > > > > struct tnode *tn = container_of(head, struct tnode, rcu); > > > > > @@ -402,7 +413,7 @@ static void tnode_free_flush(void) > > > > > while ((tn = tnode_free_head)) { > > > > > tnode_free_head = tn->tnode_free; > > > > > tn->tnode_free = NULL; > > > > > - tnode_free(tn); > > > > > + __tnode_free(tn); > > > > > } > > > > > } > > > > > > > > > > @@ -1021,18 +1032,25 @@ static void trie_rebalance(struct trie *t, struct tnode *tn) > 
> > > > (struct node *)tn, wasfull); > > > > > > > > > > tp = node_parent((struct node *) tn); > > > > > - tnode_free_flush(); > > > > > if (!tp) > > > > > break; > > > > > tn = tp; > > > > > } > > > > > > > > > > + if (tnode_free_head) { > > > > > + synchronize_rcu(); > > > > > + tnode_free_flush(); > > > > > + } > > > > > + > > > > > /* Handle last (top) tnode */ > > > > > - if (IS_TNODE(tn)) > > > > > + if (IS_TNODE(tn)) { > > > > > tn = (struct tnode *)resize(t, (struct tnode *)tn); > > > > > - > > > > > - rcu_assign_pointer(t->trie, (struct node *)tn); > > > > > - tnode_free_flush(); > > > > > + rcu_assign_pointer(t->trie, (struct node *)tn); > > > > > + synchronize_rcu(); > > > > > + tnode_free_flush(); > > > > > + } else { > > > > > + rcu_assign_pointer(t->trie, (struct node *)tn); > > > > > + } > > > > > > > > > > return; > > > > > } > > > > > -- > > > > > To unsubscribe from this list: send the line "unsubscribe netdev" in > > > > > the body of a message to majordomo@vger.kernel.org > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-26 16:45 ` Jarek Poplawski
@ 2009-06-26 17:05 ` Paul E. McKenney
2009-06-26 18:05 ` Jarek Poplawski
0 siblings, 1 reply; 99+ messages in thread
From: Paul E. McKenney @ 2009-06-26 17:05 UTC (permalink / raw)
To: Jarek Poplawski
Cc: Robert Olsson, Robert Olsson, Eric Dumazet, Paweł Staszewski, Robert Olsson, Linux Network Development list

On Fri, Jun 26, 2009 at 06:45:57PM +0200, Jarek Poplawski wrote:
> On Fri, Jun 26, 2009 at 09:23:40AM -0700, Paul E. McKenney wrote:
> > On Fri, Jun 26, 2009 at 06:15:00PM +0200, Jarek Poplawski wrote:
> > > On Fri, Jun 26, 2009 at 05:54:10PM +0200, Jarek Poplawski wrote:
> > > > On Fri, Jun 26, 2009 at 08:30:10AM -0700, Paul E. McKenney wrote:
> > > > > On Fri, Jun 26, 2009 at 05:10:52PM +0200, Jarek Poplawski wrote:
> > > > > > On Fri, Jun 26, 2009 at 03:52:55PM +0200, Robert Olsson wrote:
> > > > > > >
> > > > > > > Jarek Poplawski writes:
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Should be worth testing so we synchronize_rcu instead of doing call_rcu's
> > > > > > >
> > > > > >
> > > > > > Alas take 2 (nor 1) doesn't compile, so here it is again.
> > > > >
> > > > > So the idea is to balance memory and latency, so that large changes
> > > > > (those affecting the root node) get at least one synchronize_rcu(),
> > > > > while smaller changes just use call_rcu(), correct? This means that
> > > > > the amount of memory awaiting an RCU grace period is limited, but
> > > > > the algorithm avoids per-node synchronize_rcu() overhead.
> > > > >
> > > > > If I understand the goal correctly, looks good! (Give or take my
> > > > > limited understanding of fib_trie and is usage, of course.)
> > > >
> > > > The goal is practically to replace all call_rcu() during
> > > > trie_rebalance() with synchronize_rcu() (except some freeing after
> > > > ENOMEM). I guess currently (<= 2.6.30) call_rcu() can free this
> > > > memory after trie_rebalance() has finished, that's why there were
> > > > problems with enabled preemption. So this patch tries to do/force
> > > > this a bit earlier - at least before the top/largest node is
> > > > rebalanced.
> > >
> > > On the other hand, we could probably stay with call_rcu() plus only
> > > one synchronize_rcu() before the top node's resize() if you think it's
> > > enough here?
> >
> > Well, my first task is to understand the problem/goal. ;-)
> >
> > My guess from what you said above is that use of call_rcu(), when
> > combined with changes to the trie in rapid succession, is resulting
> > in excessive memory awaiting a grace period. Is this the case, or am I
> > confused?
>
> Exactly! (I guess... ;-)

;-)

In that case, simply invoking synchronize_rcu() every once in a while
should take care of things. This could be at the end of every large
trie operation, or you could even count the call_rcu() invocations and
do a synchronize_rcu() every 100th, 1,000th, or whatever, based on
the amount of memory available.

Thanx, Paul

^ permalink raw reply [flat|nested] 99+ messages in thread
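Paul's suggestion of counting call_rcu() invocations and synchronizing on every Nth one can be sketched in userspace C. This is only a model under stated assumptions: defer_free(), model_synchronize(), flush_pending(), queue_many() and SYNC_EVERY are hypothetical names that do not exist in fib_trie.c, and the grace period is simulated by a counter because the model has no concurrent readers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace model; none of these names exist in fib_trie.c. */
#define SYNC_EVERY 100u	/* assumed batch size, to be tuned to available memory */

struct deferred {
	struct deferred *next;
	void *mem;
};

static struct deferred *pending_head;	/* objects awaiting a "grace period" */
static unsigned int pending_count;	/* length of the pending list */
static unsigned int sync_calls;		/* how often we "synchronized" */

/* Stand-in for synchronize_rcu(): the model has no readers, so the
 * grace period is over immediately and we only count the call. */
void model_synchronize(void)
{
	sync_calls++;
}

/* Free everything queued so far; only valid after a grace period. */
void flush_pending(void)
{
	struct deferred *d;
	unsigned int freed = 0;

	while ((d = pending_head)) {
		pending_head = d->next;
		free(d->mem);
		free(d);
		freed++;
	}
	assert(freed == pending_count);
	pending_count = 0;
}

/* Stand-in for call_rcu(): queue the object and, on every
 * SYNC_EVERY-th queued object, wait and release the whole batch.
 * On failure the caller keeps ownership of mem. */
int defer_free(void *mem)
{
	struct deferred *d = malloc(sizeof(*d));

	if (!d)
		return -1;
	d->mem = mem;
	d->next = pending_head;
	pending_head = d;
	if (++pending_count >= SYNC_EVERY) {
		model_synchronize();
		flush_pending();
	}
	return 0;
}

/* Queue n small dummy objects; returns the number of failed calls. */
unsigned int queue_many(unsigned int n)
{
	unsigned int i, failed = 0;

	for (i = 0; i < n; i++) {
		void *mem = malloc(16);

		if (defer_free(mem) != 0) {
			free(mem);
			failed++;
		}
	}
	return failed;
}
```

In this model, queue_many(250) with SYNC_EVERY at 100 triggers two simulated grace periods and leaves 50 objects pending, so the memory awaiting a grace period never exceeds one batch.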
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-26 17:05 ` Paul E. McKenney
@ 2009-06-26 18:05 ` Jarek Poplawski
2009-06-26 18:21 ` Paul E. McKenney
2009-06-26 20:26 ` Robert Olsson
0 siblings, 2 replies; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-26 18:05 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Robert Olsson, Robert Olsson, Eric Dumazet, Paweł Staszewski, Robert Olsson, Linux Network Development list

On Fri, Jun 26, 2009 at 10:05:38AM -0700, Paul E. McKenney wrote:
...
> In that case, simply invoking synchronize_rcu() every once and awhile
> should take care of things. This could be at the end of every large
> trie operation, or you could even count the call_rcu() invocations and
> do a synchronize_rcu() every 100th, 1,000th, or whatever, based on
> the amount of memory available.

OK, for now the minimal change for testing (2.6.30 needs previously
mentioned two commits from 2.6.31-rc). (I guess I'll send it with a
changelog after net-next is opened.)

Thanks,
Jarek P.
--- (take 4 - for testing)

 net/ipv4/fib_trie.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 012cf5a..98b31a1 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1008,7 +1008,7 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
 {
 	int wasfull;
 	t_key cindex, key;
-	struct tnode *tp;
+	struct tnode *tp, *oldtnode = tn;
 
 	key = tn->key;
 
@@ -1028,8 +1028,12 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
 	}
 
 	/* Handle last (top) tnode */
-	if (IS_TNODE(tn))
+	if (IS_TNODE(tn)) {
+		/* force memory freeing after last changes */
+		if (oldtnode != tn)
+			synchronize_rcu();
 		tn = (struct tnode *)resize(t, (struct tnode *)tn);
+	}
 
 	rcu_assign_pointer(t->trie, (struct node *)tn);
 	tnode_free_flush();

^ permalink raw reply related [flat|nested] 99+ messages in thread
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits 2009-06-26 18:05 ` Jarek Poplawski @ 2009-06-26 18:21 ` Paul E. McKenney 2009-06-26 20:19 ` Jarek Poplawski 2009-06-26 20:26 ` Robert Olsson 1 sibling, 1 reply; 99+ messages in thread From: Paul E. McKenney @ 2009-06-26 18:21 UTC (permalink / raw) To: Jarek Poplawski Cc: Robert Olsson, Robert Olsson, Eric Dumazet, =?ISO-8859-2?Q?Pawe=B3_Staszewski?=, Robert Olsson, Linux Network Development list On Fri, Jun 26, 2009 at 08:05:45PM +0200, Jarek Poplawski wrote: > On Fri, Jun 26, 2009 at 10:05:38AM -0700, Paul E. McKenney wrote: > ... > > In that case, simply invoking synchronize_rcu() every once and awhile > > should take care of things. This could be at the end of every large > > trie operation, or you could even count the call_rcu() invocations and > > do a synchronize_rcu() every 100th, 1,000th, or whatever, based on > > the amount of memory available. > > OK, for now the minimal change for testing (2.6.30 needs previously > mentioned two commits from 2.6.31-rc). (I guess I'll send it with a > changelog after net-next is opened.) Looks promising to me!!! Thanx, Paul > Thanks, > Jarek P. 
> --- (take 4 - for testing) > > net/ipv4/fib_trie.c | 8 ++++++-- > 1 files changed, 6 insertions(+), 2 deletions(-) > > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c > index 012cf5a..98b31a1 100644 > --- a/net/ipv4/fib_trie.c > +++ b/net/ipv4/fib_trie.c > @@ -1008,7 +1008,7 @@ static void trie_rebalance(struct trie *t, struct tnode *tn) > { > int wasfull; > t_key cindex, key; > - struct tnode *tp; > + struct tnode *tp, *oldtnode = tn; > > key = tn->key; > > @@ -1028,8 +1028,12 @@ static void trie_rebalance(struct trie *t, struct tnode *tn) > } > > /* Handle last (top) tnode */ > - if (IS_TNODE(tn)) > + if (IS_TNODE(tn)) { > + /* force memory freeing after last changes */ > + if (oldtnode != tn) > + synchronize_rcu(); > tn = (struct tnode *)resize(t, (struct tnode *)tn); > + } > > rcu_assign_pointer(t->trie, (struct node *)tn); > tnode_free_flush(); ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-26 18:21 ` Paul E. McKenney
@ 2009-06-26 20:19 ` Jarek Poplawski
0 siblings, 0 replies; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-26 20:19 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Robert Olsson, Robert Olsson, Eric Dumazet, Paweł Staszewski, Robert Olsson, Linux Network Development list

On Fri, Jun 26, 2009 at 11:21:43AM -0700, Paul E. McKenney wrote:
> On Fri, Jun 26, 2009 at 08:05:45PM +0200, Jarek Poplawski wrote:
> > On Fri, Jun 26, 2009 at 10:05:38AM -0700, Paul E. McKenney wrote:
> > ...
> > > In that case, simply invoking synchronize_rcu() every once and awhile
> > > should take care of things. This could be at the end of every large
> > > trie operation, or you could even count the call_rcu() invocations and
> > > do a synchronize_rcu() every 100th, 1,000th, or whatever, based on
> > > the amount of memory available.
> >
> > OK, for now the minimal change for testing (2.6.30 needs previously
> > mentioned two commits from 2.6.31-rc). (I guess I'll send it with a
> > changelog after net-next is opened.)
>
> Looks promising to me!!!
>

Alas, after rethinking, there is one detail which bothers me. The
largest allocs here are done with vmalloc and freed with RCU by
schedule_work(). So I wonder whether, just because of this, the
previous version, which does the freeing directly, isn't more reliable
anyway. Of course, it's my bad that I didn't point this out while
describing the problem earlier. (I knew I missed something...;-)

Thanks,
Jarek P.

^ permalink raw reply [flat|nested] 99+ messages in thread
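To make the vmalloc concern concrete, here is a rough sketch of the size rule used in the take 3 patch (header plus one pointer per child slot, compared against PAGE_SIZE). The constants are assumptions for illustration only: a 4 KiB page, 8-byte pointers, and the 56-byte tnode size reported by /proc/net/fib_triestat at the start of the thread. tnode_bytes() and tnode_uses_vmalloc() are hypothetical helper names, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration only (not taken from kernel headers). */
#define MODEL_PAGE_SIZE	4096u	/* 4 KiB pages */
#define MODEL_TNODE_HDR	56u	/* "size of tnode: 56 bytes" from fib_triestat */
#define MODEL_PTR_SIZE	8u	/* 64-bit child pointers */

/* Bytes needed for a tnode with 2^bits child pointers, mirroring
 * sizeof(struct tnode) + (sizeof(struct node *) << tn->bits). */
size_t tnode_bytes(unsigned int bits)
{
	assert(bits < 31);	/* keep the shift well defined in this model */
	return MODEL_TNODE_HDR + ((size_t)MODEL_PTR_SIZE << bits);
}

/* 1 if a node of this size exceeds one page and would therefore take
 * the vmalloc()/vfree() path, 0 if kmalloc()/kfree() is enough. */
int tnode_uses_vmalloc(unsigned int bits)
{
	return tnode_bytes(bits) > MODEL_PAGE_SIZE;
}
```

Under these assumptions, the 11-bit root from the warning needs 16440 bytes and the 18-bit node listed in fib_triestat needs about 2 MB, so both fall on the vmalloc side, while nodes of up to 8 bits stay within one page.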
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-26 18:05 ` Jarek Poplawski
2009-06-26 18:21 ` Paul E. McKenney
@ 2009-06-26 20:26 ` Robert Olsson
2009-06-26 20:37 ` Jarek Poplawski
1 sibling, 1 reply; 99+ messages in thread
From: Robert Olsson @ 2009-06-26 20:26 UTC (permalink / raw)
To: Jarek Poplawski
Cc: Paul E. McKenney, Robert Olsson, Eric Dumazet, Paweł Staszewski, Robert Olsson, Linux Network Development list

Yes, looks like a good solution, but maybe it is safest to synchronize
unconditionally?

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 012cf5a..9cb8623 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1028,8 +1028,11 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
 	}
 
 	/* Handle last (top) tnode */
-	if (IS_TNODE(tn))
+	if (IS_TNODE(tn)) {
+		/* force memory freeing after last changes */
+		synchronize_rcu();
 		tn = (struct tnode *)resize(t, (struct tnode *)tn);
+	}
 
 	rcu_assign_pointer(t->trie, (struct node *)tn);
 	tnode_free_flush();

Cheers
--ro

Jarek Poplawski writes:
> net/ipv4/fib_trie.c | 8 ++++++--
> 1 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index 012cf5a..98b31a1 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -1008,7 +1008,7 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
>  {
>  	int wasfull;
>  	t_key cindex, key;
> -	struct tnode *tp;
> +	struct tnode *tp, *oldtnode = tn;
>
>  	key = tn->key;
>
> @@ -1028,8 +1028,12 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
>  	}
>
>  	/* Handle last (top) tnode */
> -	if (IS_TNODE(tn))
> +	if (IS_TNODE(tn)) {
> +		/* force memory freeing after last changes */
> +		if (oldtnode != tn)
> +			synchronize_rcu();
>  		tn = (struct tnode *)resize(t, (struct tnode *)tn);
> +	}
>
>  	rcu_assign_pointer(t->trie, (struct node *)tn);
>  	tnode_free_flush();
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related [flat|nested] 99+ messages in thread
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits 2009-06-26 20:26 ` Robert Olsson @ 2009-06-26 20:37 ` Jarek Poplawski 2009-06-26 21:20 ` Jarek Poplawski 0 siblings, 1 reply; 99+ messages in thread From: Jarek Poplawski @ 2009-06-26 20:37 UTC (permalink / raw) To: Robert Olsson Cc: Paul E. McKenney, Robert Olsson, Eric Dumazet, =?ISO-8859-2?Q?Pawe=B3_Staszewski?=, Robert Olsson, Linux Network Development list On Fri, Jun 26, 2009 at 10:26:53PM +0200, Robert Olsson wrote: > > > Yes looks like a good solution but maybe it safest to synchronize unconditionally? Hmm... I lost around half an hour for this doubt... Sure! (Unless there are some strange cases which very often create and destroy very small tables?) Thanks, Jarek P. > > > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c > index 012cf5a..9cb8623 100644 > --- a/net/ipv4/fib_trie.c > +++ b/net/ipv4/fib_trie.c > @@ -1028,8 +1028,11 @@ static void trie_rebalance(struct trie *t, struct tnode *tn) > } > > /* Handle last (top) tnode */ > - if (IS_TNODE(tn)) > + if (IS_TNODE(tn)) { > + /* force memory freeing after last changes */ > + synchronize_rcu(); > tn = (struct tnode *)resize(t, (struct tnode *)tn); > + } > > rcu_assign_pointer(t->trie, (struct node *)tn); > tnode_free_flush(); > > Cheers > --ro > > Jarek Poplawski writes: > > > net/ipv4/fib_trie.c | 8 ++++++-- > > 1 files changed, 6 insertions(+), 2 deletions(-) > > > > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c > > index 012cf5a..98b31a1 100644 > > --- a/net/ipv4/fib_trie.c > > +++ b/net/ipv4/fib_trie.c > > @@ -1008,7 +1008,7 @@ static void trie_rebalance(struct trie *t, struct tnode *tn) > > { > > int wasfull; > > t_key cindex, key; > > - struct tnode *tp; > > + struct tnode *tp, *oldtnode = tn; > > > > key = tn->key; > > > > @@ -1028,8 +1028,12 @@ static void trie_rebalance(struct trie *t, struct tnode *tn) > > } > > > > /* Handle last (top) tnode */ > > - if (IS_TNODE(tn)) > > + if (IS_TNODE(tn)) { > > + /* force memory 
freeing after last changes */ > > + if (oldtnode != tn) > > + synchronize_rcu(); > > tn = (struct tnode *)resize(t, (struct tnode *)tn); > > + } > > > > rcu_assign_pointer(t->trie, (struct node *)tn); > > tnode_free_flush(); > > -- > > To unsubscribe from this list: send the line "unsubscribe netdev" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-26 20:37 ` Jarek Poplawski
@ 2009-06-26 21:20 ` Jarek Poplawski
0 siblings, 0 replies; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-26 21:20 UTC (permalink / raw)
To: Robert Olsson
Cc: Paul E. McKenney, Robert Olsson, Eric Dumazet, Paweł Staszewski, Robert Olsson, Linux Network Development list

On Fri, Jun 26, 2009 at 10:37:13PM +0200, Jarek Poplawski wrote:
> On Fri, Jun 26, 2009 at 10:26:53PM +0200, Robert Olsson wrote:
> >
> >
> > Yes looks like a good solution but maybe it safest to synchronize unconditionally?
>
> Hmm... I lost around half an hour for this doubt... Sure! (Unless
> there are some strange cases which very often create and destroy very
> small tables?)

...or maybe even only updating such small tables very often?

Btw., Robert, I wondered about some design details lately, especially
about the pointer to a parent. I didn't see it in the basic docs, so it
seems it could be avoided. It seems to be a problem with RCU, unless I
miss something: if there were no going back from children to parents,
it seems we could free those "temporary" nodes (created by inflate()
and halve() and destroyed before resize() has finished) earlier.
Another problem with this, it seems, is possibly false lookups (if we
go back to the new parent which doesn't have its parent or other nodes
updated). So, was there so much performance gain to introduce this?

Thanks,
Jarek P.

^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-26 9:19 ` Robert Olsson
2009-06-26 9:37 ` Jarek Poplawski
@ 2009-06-27 19:20 ` Jarek Poplawski
2009-06-27 20:51 ` Jarek Poplawski
2009-06-28 11:04 ` Robert Olsson
1 sibling, 2 replies; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-27 19:20 UTC (permalink / raw)
To: Robert Olsson
Cc: Paweł Staszewski, Jorge Boncompte [DTI2], Eric Dumazet, Robert Olsson, Linux Network Development list

Robert Olsson wrote, On 06/26/2009 11:19 AM:
> Jarek Poplawski writes:
>
> > >> oprofile: using NMI interrupt.
> > >> Fix inflate_threshold_root. Now=15 size=11 bits
> > >> Fix inflate_threshold_root. Now=15 size=11 bits
> > >> Fix inflate_threshold_root. Now=15 size=11 bits
> > >> Fix inflate_threshold_root. Now=15 size=11 bits
> > >> Fix inflate_threshold_root. Now=15 size=11 bits
> > >> Fix inflate_threshold_root. Now=15 size=11 bits
>
> > On the other hand, even if there is no problem with memory, it seems
> > because of hitting max_resize the threshold should be changed, e.g.
> > by reverting the patch below.
>
> You seem to have some temporary memory problem. So the printout might be
> a bit misleading in this case. We really like to keep the root node as big
> as we can to keep the tree as flat as possible for performance reasons.
> (We're even more motivated now when we can disable the route cache)
>
> So I'll guess the next insert/delete inflates the root node to be within
> the interval. So I'll assume this just a temporary failure?
>
> I would be nice to have *threshholds* settable by /proc or /sys. I would
> use this in the other direction to trade memory for even faster lookups.
>
> But maybe experts memory allocation has some good suggestions.

Robert, you and Eric pointed at memory problems, so I thought I missed
something. But after a second look I see that "skipped node resize"
should show this, and it's always zero in these reports. So, isn't it
possible that the current inflate_threshold_root is simply unreachable
under some conditions, at least within 10 loops?

Then these settable thresholds might be more useful here than memory
fixes, but here is an idea to try to handle this automatically within
some limits. The patch below increases inflate_threshold_root (only)
by up to ~50% of its initial value if needed, and should be able to go
back sometimes.

Pawel and Jorge, could you try this? (It applies to 2.6.29 too, with
some offsets.)

Thanks,
Jarek P.
---

 net/ipv4/fib_trie.c |   23 ++++++++++++++++-------
 1 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 012cf5a..1dc1bb4 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -318,6 +318,7 @@ static const int halve_threshold = 25;
 static const int inflate_threshold = 50;
 static const int halve_threshold_root = 8;
 static const int inflate_threshold_root = 15;
+static int inflate_threshold_root_fix;
 
 
 static void __alias_free_mem(struct rcu_head *head)
@@ -602,7 +603,8 @@ static struct node *resize(struct trie *t, struct tnode *tn)
 	/* Keep root node larger */
 
 	if (!tn->parent)
-		inflate_threshold_use = inflate_threshold_root;
+		inflate_threshold_use = inflate_threshold_root +
+					inflate_threshold_root_fix;
 	else
 		inflate_threshold_use = inflate_threshold;
 
@@ -626,15 +628,22 @@ static struct node *resize(struct trie *t, struct tnode *tn)
 	}
 
 	if (max_resize < 0) {
-		if (!tn->parent)
-			pr_warning("Fix inflate_threshold_root."
-				   " Now=%d size=%d bits\n",
-				   inflate_threshold_root, tn->bits);
-		else
+		if (!tn->parent) {
+			if (inflate_threshold_root_fix * 2 <
+			    inflate_threshold_root)
+				inflate_threshold_root_fix++;
+			else
+				pr_warning("Fix inflate_threshold_root."
+					   " Now=%d size=%d bits fix=%d\n",
+					   inflate_threshold_root, tn->bits,
+					   inflate_threshold_root_fix);
+		} else {
 			pr_warning("Fix inflate_threshold."
 				   " Now=%d size=%d bits\n",
 				   inflate_threshold, tn->bits);
-	}
+		}
+	} else if (max_resize < 5 && !tn->parent && inflate_threshold_root_fix)
+		inflate_threshold_root_fix--;
 
 	check_tnode(tn);
 

^ permalink raw reply related [flat|nested] 99+ messages in thread
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-27 19:20 ` Jarek Poplawski
@ 2009-06-27 20:51 ` Jarek Poplawski
2009-06-28 0:28 ` Paweł Staszewski
2009-06-28 11:11 ` Robert Olsson
2009-06-28 11:04 ` Robert Olsson
1 sibling, 2 replies; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-27 20:51 UTC (permalink / raw)
To: Robert Olsson
Cc: Paweł Staszewski, Jorge Boncompte [DTI2], Eric Dumazet, Robert Olsson, Linux Network Development list

On Sat, Jun 27, 2009 at 09:20:57PM +0200, Jarek Poplawski wrote:
...
> Then these settable thresholds might be more useful here than memory
> fixes, but here is some idea to try handle this automatically within
> some limits. The patch below increases inflate_threshold_root (only)
> up to ~50% of its initial value if needed, and should be able to go
> back sometimes.
>
> Pawel and Jorge, could you try this? (It applies to 2.6.29 too, with
> some offsets.)

A tiny adjustment in the last if...

Jarek P.
--- (take 2)

 net/ipv4/fib_trie.c |   23 ++++++++++++++++-------
 1 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 012cf5a..1dc1bb4 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -318,6 +318,7 @@ static const int halve_threshold = 25;
 static const int inflate_threshold = 50;
 static const int halve_threshold_root = 8;
 static const int inflate_threshold_root = 15;
+static int inflate_threshold_root_fix;
 
 
 static void __alias_free_mem(struct rcu_head *head)
@@ -602,7 +603,8 @@ static struct node *resize(struct trie *t, struct tnode *tn)
 	/* Keep root node larger */
 
 	if (!tn->parent)
-		inflate_threshold_use = inflate_threshold_root;
+		inflate_threshold_use = inflate_threshold_root +
+					inflate_threshold_root_fix;
 	else
 		inflate_threshold_use = inflate_threshold;
 
@@ -626,15 +628,22 @@ static struct node *resize(struct trie *t, struct tnode *tn)
 	}
 
 	if (max_resize < 0) {
-		if (!tn->parent)
-			pr_warning("Fix inflate_threshold_root."
-				   " Now=%d size=%d bits\n",
-				   inflate_threshold_root, tn->bits);
-		else
+		if (!tn->parent) {
+			if (inflate_threshold_root_fix * 2 <
+			    inflate_threshold_root)
+				inflate_threshold_root_fix++;
+			else
+				pr_warning("Fix inflate_threshold_root."
+					   " Now=%d size=%d bits fix=%d\n",
+					   inflate_threshold_root, tn->bits,
+					   inflate_threshold_root_fix);
+		} else {
 			pr_warning("Fix inflate_threshold."
 				   " Now=%d size=%d bits\n",
 				   inflate_threshold, tn->bits);
-	}
+		}
+	} else if (max_resize > 4 && !tn->parent && inflate_threshold_root_fix)
+		inflate_threshold_root_fix--;
 
 	check_tnode(tn);
 

^ permalink raw reply related [flat|nested] 99+ messages in thread
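The feedback rule in take 2 can be tried out in isolation with a small userspace model. This is a sketch, not the kernel code: it covers only the root-node branch (the !tn->parent case), replaces pr_warning() with a return value, and the helpers effective_threshold(), root_resize_feedback() and feed() are hypothetical names introduced here. The max_resize argument stands for the value of the loop budget when resize() leaves its inflate loop, negative meaning the budget ran out.

```c
#include <assert.h>

static const int inflate_threshold_root = 15;	/* value from the patch context */
static int inflate_threshold_root_fix;		/* adaptive offset added by the patch */

/* Threshold that resize() would use for the root node. */
int effective_threshold(void)
{
	return inflate_threshold_root + inflate_threshold_root_fix;
}

/* One round of feedback for the root node, following take 2.
 * Returns 1 where the patch would still print the warning, else 0. */
int root_resize_feedback(int max_resize)
{
	if (max_resize < 0) {
		/* budget exhausted: raise the offset while it stays
		 * below half of the base threshold, else warn */
		if (inflate_threshold_root_fix * 2 < inflate_threshold_root)
			inflate_threshold_root_fix++;
		else
			return 1;
	} else if (max_resize > 4 && inflate_threshold_root_fix) {
		/* finished with budget to spare: relax the offset again */
		inflate_threshold_root_fix--;
	}
	assert(inflate_threshold_root_fix >= 0);
	return 0;
}

/* Apply the same outcome several times; returns the number of warnings. */
int feed(int max_resize, int times)
{
	int warnings = 0;

	while (times-- > 0)
		warnings += root_resize_feedback(max_resize);
	return warnings;
}
```

In this model the offset tops out at 8, giving an effective threshold of 23 (a little over 50% above the base of 15); only after that does the warning reappear, and a later resize that ends with more than 4 loops left lowers the offset by one.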
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-27 20:51 ` Jarek Poplawski
@ 2009-06-28 0:28 ` Paweł Staszewski
2009-06-28 11:11 ` Robert Olsson
1 sibling, 0 replies; 99+ messages in thread
From: Paweł Staszewski @ 2009-06-28 0:28 UTC (permalink / raw)
To: Jarek Poplawski
Cc: Robert Olsson, Jorge Boncompte [DTI2], Eric Dumazet, Robert Olsson, Linux Network Development list

Thanks Jarek

I applied this patch to 2.6.29.5.

For results we must wait until "rush hours", when there will be more
traffic / routes :)

Regards
Paweł Staszewski

Jarek Poplawski wrote:
> On Sat, Jun 27, 2009 at 09:20:57PM +0200, Jarek Poplawski wrote:
> ...
>
>> Then these settable thresholds might be more useful here than memory
>> fixes, but here is some idea to try handle this automatically within
>> some limits. The patch below increases inflate_threshold_root (only)
>> up to ~50% of its initial value if needed, and should be able to go
>> back sometimes.
>>
>> Pawel and Jorge, could you try this? (It applies to 2.6.29 too, with
>> some offsets.)
>>
>
> A tiny adjustment in the last if...
>
> Jarek P.
> --- (take 2)
>
>  net/ipv4/fib_trie.c |   23 ++++++++++++++++-------
>  1 files changed, 16 insertions(+), 7 deletions(-)
>
> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index 012cf5a..1dc1bb4 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -318,6 +318,7 @@ static const int halve_threshold = 25;
>  static const int inflate_threshold = 50;
>  static const int halve_threshold_root = 8;
>  static const int inflate_threshold_root = 15;
> +static int inflate_threshold_root_fix;
>
>
>  static void __alias_free_mem(struct rcu_head *head)
> @@ -602,7 +603,8 @@ static struct node *resize(struct trie *t, struct tnode *tn)
>  	/* Keep root node larger */
>
>  	if (!tn->parent)
> -		inflate_threshold_use = inflate_threshold_root;
> +		inflate_threshold_use = inflate_threshold_root +
> +					inflate_threshold_root_fix;
>  	else
>  		inflate_threshold_use = inflate_threshold;
>
> @@ -626,15 +628,22 @@ static struct node *resize(struct trie *t, struct tnode *tn)
>  	}
>
>  	if (max_resize < 0) {
> -		if (!tn->parent)
> -			pr_warning("Fix inflate_threshold_root."
> -				   " Now=%d size=%d bits\n",
> -				   inflate_threshold_root, tn->bits);
> -		else
> +		if (!tn->parent) {
> +			if (inflate_threshold_root_fix * 2 <
> +			    inflate_threshold_root)
> +				inflate_threshold_root_fix++;
> +			else
> +				pr_warning("Fix inflate_threshold_root."
> +					   " Now=%d size=%d bits fix=%d\n",
> +					   inflate_threshold_root, tn->bits,
> +					   inflate_threshold_root_fix);
> +		} else {
>  			pr_warning("Fix inflate_threshold."
>  				   " Now=%d size=%d bits\n",
>  				   inflate_threshold, tn->bits);
> -	}
> +		}
> +	} else if (max_resize > 4 && !tn->parent && inflate_threshold_root_fix)
> +		inflate_threshold_root_fix--;
>
>  	check_tnode(tn);
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
>

^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-27 20:51 ` Jarek Poplawski
2009-06-28 0:28 ` Paweł Staszewski
@ 2009-06-28 11:11 ` Robert Olsson
2009-06-29 7:57 ` Paweł Staszewski
1 sibling, 1 reply; 99+ messages in thread
From: Robert Olsson @ 2009-06-28 11:11 UTC (permalink / raw)
To: Jarek Poplawski
Cc: Robert Olsson, Paweł Staszewski, Jorge Boncompte [DTI2], Eric Dumazet, Robert Olsson, Linux Network Development list
When testing, please monitor the size of the root node and the average depth.
Cheers
--ro
Jarek Poplawski writes:
> On Sat, Jun 27, 2009 at 09:20:57PM +0200, Jarek Poplawski wrote:
> ...
> > Then these settable thresholds might be more useful here than memory
> > fixes, but here is some idea to try handle this automatically within
> > some limits. The patch below increases inflate_threshold_root (only)
> > up to ~50% of its initial value if needed, and should be able to go
> > back sometimes.
> >
> > Pawel and Jorge, could you try this? (It applies to 2.6.29 too, with
> > some offsets.)
>
> A tiny adjustment in the last if...
>
> Jarek P.
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-28 11:11 ` Robert Olsson
@ 2009-06-29 7:57 ` Paweł Staszewski
0 siblings, 0 replies; 99+ messages in thread
From: Paweł Staszewski @ 2009-06-29 7:57 UTC (permalink / raw)
To: Robert Olsson
Cc: Jarek Poplawski, Robert Olsson, Jorge Boncompte [DTI2], Eric Dumazet, Robert Olsson, Linux Network Development list
Robert Olsson wrote:
> When testing please monitor size of root node and and aver depth
>
> Cheers
> --ro
>
Some fib_triestat output from kernel 2.6.29.5 with Jarek's first patch.
Dumped every 10 seconds:
Mon Jun 29 11:54:31 2009
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
Aver depth: 2.28
Max depth: 7
Leaves: 276978
Prefixes: 290448
Internal nodes: 66813
1: 34703 2: 13944 3: 9921 4: 4807 5: 2273 6: 1158 7: 5 9: 1 18: 1
Pointers: 691606
Null ptrs: 347816
Total size: 18403 kB
Counters:
---------
gets = 390981859
backtracks = 5332465
semantic match passed = 390452936
semantic match miss = 30198
null node hit= 375522207
skipped node resize = 0
Local:
Aver depth: 3.75
Max depth: 5
Leaves: 12
Prefixes: 13
Internal nodes: 10
1: 9 2: 1
Pointers: 22
Null ptrs: 1
Total size: 2 kB
Counters:
---------
gets = 391017445
backtracks = 121012874
semantic match passed = 37565
semantic match miss = 0
null node hit= 261583
skipped node resize = 0
Mon Jun 29 11:54:41 2009
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
Aver depth: 2.28
Max depth: 7
Leaves: 276976
Prefixes: 290446
Internal nodes: 66813
1: 34703 2: 13944 3: 9921 4: 4807 5: 2273 6: 1158 7: 5 9: 1 18: 1
Pointers: 691606
Null ptrs: 347818
Total size: 18403 kB
Counters:
---------
gets = 391061852
backtracks = 5334173
semantic match passed = 390532664
semantic match miss = 30199
null node hit= 375595706
skipped node resize = 0
Local:
Aver depth: 3.75
Max depth: 5
Leaves: 12
Prefixes: 13
Internal nodes: 10
1: 9 2: 1
Pointers: 22
Null ptrs: 1
Total size: 2 kB
Counters:
---------
gets = 391097445
backtracks = 121039213
semantic match passed = 37570
semantic match miss = 0
null node hit= 261589
skipped node resize = 0
Mon Jun 29 11:54:51 2009
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
Aver depth: 2.28
Max depth: 7
Leaves: 276978
Prefixes: 290448
Internal nodes: 66813
1: 34703 2: 13944 3: 9921 4: 4807 5: 2273 6: 1158 7: 5 9: 1 18: 1
Pointers: 691606
Null ptrs: 347816
Total size: 18403 kB
Counters:
---------
gets = 391177325
backtracks = 5336127
semantic match passed = 390647917
semantic match miss = 30208
null node hit= 375699713
skipped node resize = 0
Local:
Aver depth: 3.75
Max depth: 5
Leaves: 12
Prefixes: 13
Internal nodes: 10
1: 9 2: 1
Pointers: 22
Null ptrs: 1
Total size: 2 kB
Counters:
---------
gets = 391212932
backtracks = 121075919
semantic match passed = 37586
semantic match miss = 0
null node hit= 261701
skipped node resize = 0
Mon Jun 29 11:55:01 2009
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
Aver depth: 2.28
Max depth: 7
Leaves: 276978
Prefixes: 290448
Internal nodes: 66813
1: 34703 2: 13944 3: 9921 4: 4807 5: 2273 6: 1158 7: 5 9: 1 18: 1
Pointers: 691606
Null ptrs: 347816
Total size: 18403 kB
Counters:
---------
gets = 391254016
backtracks = 5337816
semantic match passed = 390724361
semantic match miss = 30214
null node hit= 375768712
skipped node resize = 0
Local:
Aver depth: 3.75
Max depth: 5
Leaves: 12
Prefixes: 13
Internal nodes: 10
1: 9 2: 1
Pointers: 22
Null ptrs: 1
Total size: 2 kB
Counters:
---------
gets = 391289631
backtracks = 121101285
semantic match passed = 37598
semantic match miss = 0
null node hit= 261707
skipped node resize = 0
Mon Jun 29 11:55:11 2009
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
Aver depth: 2.28
Max depth: 7
Leaves: 276976
Prefixes: 290445
Internal nodes: 66812
1: 34702 2: 13944 3: 9921 4: 4807 5: 2273 6: 1158 7: 5 9: 1 18: 1
Pointers: 691604
Null ptrs: 347817
Total size: 18402 kB
Counters:
---------
gets = 391317389
backtracks = 5339175
semantic match passed = 390787523
semantic match miss = 30215
null node hit= 375827612
skipped node resize = 0
Local:
Aver depth: 3.75
Max depth: 5
Leaves: 12
Prefixes: 13
Internal nodes: 10
1: 9 2: 1
Pointers: 22
Null ptrs: 1
Total size: 2 kB
Counters:
---------
gets = 391353001
backtracks = 121122087
semantic match passed = 37599
semantic match miss = 0
null node hit= 261709
skipped node resize = 0
Mon Jun 29 11:55:21 2009
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
Aver depth: 2.28
Max depth: 7
Leaves: 276981
Prefixes: 290451
Internal nodes: 66813
1: 34704 2: 13943 3: 9921 4: 4807 5: 2273 6: 1158 7: 5 9: 1 18: 1
Pointers: 691604
Null ptrs: 347811
Total size: 18403 kB
Counters:
---------
gets = 391434307
backtracks = 5340855
semantic match passed = 390904256
semantic match miss = 30225
null node hit= 375931780
skipped node resize = 0
Local:
Aver depth: 3.75
Max depth: 5
Leaves: 12
Prefixes: 13
Internal nodes: 10
1: 9 2: 1
Pointers: 22
Null ptrs: 1
Total size: 2 kB
Counters:
---------
gets = 391469942
backtracks = 121157220
semantic match passed = 37619
semantic match miss = 0
null node hit= 261753
skipped node resize = 0
Mon Jun 29 11:55:31 2009
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
Aver depth: 2.28
Max depth: 7
Leaves: 276981
Prefixes: 290451
Internal nodes: 66813
1: 34704 2: 13943 3: 9921 4: 4807 5: 2273 6: 1158 7: 5 9: 1 18: 1
Pointers: 691604
Null ptrs: 347811
Total size: 18403 kB
Counters:
---------
gets = 391519852
backtracks = 5342208
semantic match passed = 390989658
semantic match miss = 30234
null node hit= 376010537
skipped node resize = 0
Local:
Aver depth: 3.75
Max depth: 5
Leaves: 12
Prefixes: 13
Internal nodes: 10
1: 9 2: 1
Pointers: 22
Null ptrs: 1
Total size: 2 kB
Counters:
---------
gets = 391555492
backtracks = 121181992
semantic match passed = 37625
semantic match miss = 0
null node hit= 261762
skipped node resize = 0
Mon Jun 29 11:55:42 2009
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
Aver depth: 2.28
Max depth: 7
Leaves: 276978
Prefixes: 290447
Internal nodes: 66812
1: 34703 2: 13943 3: 9921 4: 4807 5: 2273 6: 1158 7: 5 9: 1 18: 1
Pointers: 691602
Null ptrs: 347813
Total size: 18403 kB
Counters:
---------
gets = 391589032
backtracks = 5343757
semantic match passed = 391058601
semantic match miss = 30237
null node hit= 376075389
skipped node resize = 0
Local:
Aver depth: 3.75
Max depth: 5
Leaves: 12
Prefixes: 13
Internal nodes: 10
1: 9 2: 1
Pointers: 22
Null ptrs: 1
Total size: 2 kB
Counters:
---------
gets = 391624673
backtracks = 121202115
semantic match passed = 37628
semantic match miss = 0
null node hit= 261763
skipped node resize = 0
Mon Jun 29 11:55:52 2009
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
Aver depth: 2.28
Max depth: 7
Leaves: 276985
Prefixes: 290455
Internal nodes: 66814
1: 34704 2: 13944 3: 9921 4: 4807 5: 2273 6: 1158 7: 5 9: 1 18: 1
Pointers: 691608
Null ptrs: 347810
Total size: 18403 kB
Counters:
---------
gets = 391723925
backtracks = 5345934
semantic match passed = 391193292
semantic match miss = 30242
null node hit= 376195655
skipped node resize = 0
Local:
Aver depth: 3.75
Max depth: 5
Leaves: 12
Prefixes: 13
Internal nodes: 10
1: 9 2: 1
Pointers: 22
Null ptrs: 1
Total size: 2 kB
Counters:
---------
gets = 391759578
backtracks = 121241080
semantic match passed = 37640
semantic match miss = 0
null node hit= 261804
skipped node resize = 0
Mon Jun 29 11:56:02 2009
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
Aver depth: 2.28
Max depth: 7
Leaves: 276985
Prefixes: 290455
Internal nodes: 66814
1: 34704 2: 13944 3: 9921 4: 4807 5: 2273 6: 1158 7: 5 9: 1 18: 1
Pointers: 691608
Null ptrs: 347810
Total size: 18403 kB
Counters:
---------
gets = 391811219
backtracks = 5347635
semantic match passed = 391280357
semantic match miss = 30250
null node hit= 376276182
skipped node resize = 0
Local:
Aver depth: 3.75
Max depth: 5
Leaves: 12
Prefixes: 13
Internal nodes: 10
1: 9 2: 1
Pointers: 22
Null ptrs: 1
Total size: 2 kB
Counters:
---------
gets = 391846880
backtracks = 121265316
semantic match passed = 37648
semantic match miss = 0
null node hit= 261813
skipped node resize = 0
>
>
> Jarek Poplawski writes:
> > On Sat, Jun 27, 2009 at 09:20:57PM +0200, Jarek Poplawski wrote:
> > ...
> > > Then these settable thresholds might be more useful here than memory
> > > fixes, but here is some idea to try handle this automatically within
> > > some limits. The patch below increases inflate_threshold_root (only)
> > > up to ~50% of its initial value if needed, and should be able to go
> > > back sometimes.
> > >
> > > Pawel and Jorge, could you try this? (It applies to 2.6.29 too, with
> > > some offsets.)
> >
> > A tiny adjustment in the last if...
> >
> > Jarek P.
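For the monitoring requested in this sub-thread, the fields of interest can be pulled out of a /proc/net/fib_triestat dump programmatically. The following is a minimal sketch in userspace C, not a kernel interface: `struct triestat`, `parse_triestat()` and `null_ratio()` are hypothetical names, and the parser only assumes the field labels visible in the dumps above. It reads the first table in the text (Main in these dumps) and does not parse the tnode size histogram, so the root size still has to be read off that line by hand.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for a few fields of one fib_triestat table. */
struct triestat {
	double aver_depth;
	int max_depth;
	long pointers;
	long null_ptrs;
};

/* Return a pointer just past the first occurrence of label, or NULL. */
static const char *after(const char *text, const char *label)
{
	const char *p = strstr(text, label);

	return p ? p + strlen(label) : NULL;
}

/*
 * Parse the first table found in text.
 * Returns 1 on success, 0 if any expected label is missing.
 */
static int parse_triestat(const char *text, struct triestat *st)
{
	const char *p;

	p = after(text, "Aver depth:");
	if (!p || sscanf(p, "%lf", &st->aver_depth) != 1)
		return 0;
	p = after(text, "Max depth:");
	if (!p || sscanf(p, "%d", &st->max_depth) != 1)
		return 0;
	p = after(text, "Pointers:");
	if (!p || sscanf(p, "%ld", &st->pointers) != 1)
		return 0;
	p = after(text, "Null ptrs:");
	if (!p || sscanf(p, "%ld", &st->null_ptrs) != 1)
		return 0;
	return 1;
}

/* Fraction of child slots that are empty; 0 if there are no pointers. */
static double null_ratio(const struct triestat *st)
{
	if (st->pointers <= 0)
		return 0.0;
	return (double)st->null_ptrs / (double)st->pointers;
}
```

Fed the first Main table above (691606 pointers, 347816 null), `null_ratio()` comes out at roughly 0.50, meaning about half of the child slots are empty. Tracking that ratio next to the average depth across dumps would show whether a threshold change trades memory for depth.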
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-27 19:20 ` Jarek Poplawski
2009-06-27 20:51 ` Jarek Poplawski
@ 2009-06-28 11:04 ` Robert Olsson
2009-06-28 12:03 ` Jarek Poplawski
2009-06-28 14:35 ` Jarek Poplawski
1 sibling, 2 replies; 99+ messages in thread
From: Robert Olsson @ 2009-06-28 11:04 UTC (permalink / raw)
To: Jarek Poplawski
Cc: Robert Olsson, Paweł Staszewski, Jorge Boncompte [DTI2], Eric Dumazet, Robert Olsson, Linux Network Development list
Jarek Poplawski writes:
> Robert, you and Eric pointed at memory problems, so I thought I missed
> something. But after the second look I see "skipped node resize" should
> show this, but it's always zero on these reports. So, isn't it possible
> the current inflate_threshold_root is simply unreachable with some
> conditions, at least within 10 loops?
>
> Then these settable thresholds might be more useful here than memory
> fixes, but here is some idea to try handle this automatically within
> some limits. The patch below increases inflate_threshold_root (only)
> up to ~50% of its initial value if needed, and should be able to go
> back sometimes.
Yes, we keep the old tnode size, and the convergence interval was one
of the concerns. That's why these checks were added. Still, we want to
inflate the root node to the very max.
This approach of halving or doubling tnodes towards the root node was
the one suggested by the Dyntree paper. I asked Stefan (one of the
authors) if we could get safe and very offensive settings, but from
what I understood there was no easy way to calculate this. So any
bright ideas in this area are very welcome. But we should also monitor
the size of the root and the average tree depth, so we don't take a
too defensive approach just to solve this case.
The memory patches and "manual RCU" are also interesting for addressing
the case with PREEMPT's. Inserts and deletes are also very fast due to
the flat tree, so I think we can "slow this down" a bit if needed to be
safe with all PREEMPT's.
Thanks for giving this area energy.
Cheers
--ro
> Pawel and Jorge, could you try this? (It applies to 2.6.29 too, with
> some offsets.)
>
> Thanks,
> Jarek P.
> ---
>
> net/ipv4/fib_trie.c | 23 ++++++++++++++++-------
> 1 files changed, 16 insertions(+), 7 deletions(-)
>
> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index 012cf5a..1dc1bb4 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -318,6 +318,7 @@ static const int halve_threshold = 25;
>  static const int inflate_threshold = 50;
>  static const int halve_threshold_root = 8;
>  static const int inflate_threshold_root = 15;
> +static int inflate_threshold_root_fix;
>
>
>  static void __alias_free_mem(struct rcu_head *head)
> @@ -602,7 +603,8 @@ static struct node *resize(struct trie *t, struct tnode *tn)
>  	/* Keep root node larger */
>
>  	if (!tn->parent)
> -		inflate_threshold_use = inflate_threshold_root;
> +		inflate_threshold_use = inflate_threshold_root +
> +					inflate_threshold_root_fix;
>  	else
>  		inflate_threshold_use = inflate_threshold;
>
> @@ -626,15 +628,22 @@ static struct node *resize(struct trie *t, struct tnode *tn)
>  	}
>
>  	if (max_resize < 0) {
> -		if (!tn->parent)
> -			pr_warning("Fix inflate_threshold_root."
> -				   " Now=%d size=%d bits\n",
> -				   inflate_threshold_root, tn->bits);
> -		else
> +		if (!tn->parent) {
> +			if (inflate_threshold_root_fix * 2 <
> +			    inflate_threshold_root)
> +				inflate_threshold_root_fix++;
> +			else
> +				pr_warning("Fix inflate_threshold_root."
> +					   " Now=%d size=%d bits fix=%d\n",
> +					   inflate_threshold_root, tn->bits,
> +					   inflate_threshold_root_fix);
> +		} else {
>  			pr_warning("Fix inflate_threshold."
>  				   " Now=%d size=%d bits\n",
>  				   inflate_threshold, tn->bits);
> -	}
> +		}
> +	} else if (max_resize < 5 && !tn->parent && inflate_threshold_root_fix)
> +		inflate_threshold_root_fix--;
>
>  	check_tnode(tn);
>
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-28 11:04 ` Robert Olsson
@ 2009-06-28 12:03 ` Jarek Poplawski
2009-06-28 14:35 ` Jarek Poplawski
1 sibling, 0 replies; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-28 12:03 UTC (permalink / raw)
To: Robert Olsson
Cc: Robert Olsson, Paweł Staszewski, Jorge Boncompte [DTI2], Eric Dumazet, Robert Olsson, Linux Network Development list
On Sun, Jun 28, 2009 at 01:04:51PM +0200, Robert Olsson wrote:
...
> Yes we keep the old tnode size and the convergence interval was some
> of the concerns. That why this checks was added. Still we want to
> inflate the root node to a very max.
>
> So this approach with halving or doubling tnodes towards the root
> node was the suggest by Dyntree paper. I asked Stefan (one of the
> authors) if we could get safe and very offensive settings. But
> from what I understood there was no easy way to calculate this.
> So any bright ideas in this area are very welcome. But we should
> also monitor size of root and average tree depth so we don't
> take an to defensive approach just to solve this case.
Yes, but with this offensive approach it seems the current level of
warnings could be too alarming. Btw., because of a design flaw in my
current patch, this _fix variable, which should logically be per
trie/table, can now be reset by changes to other tables, but I think
it could all be fine-tuned more in the future. Of course, only if there
are people interested in testing/reporting this more.
Thanks,
Jarek P.
^ permalink raw reply [flat|nested] 99+ messages in thread
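The adjustment rule from the take 2 patch, together with the point made above that the correction logically belongs to each trie/table, can be sketched in userspace C. This is only a model under stated assumptions, not the kernel code: `struct trie_model`, `root_threshold()` and `adjust_root_fix()` are hypothetical names, the posted patch keeps a single static `inflate_threshold_root_fix`, and `max_resize` here stands for the loop counter that `resize()` is left with (negative when the loop ran out of iterations).

```c
#include <stdbool.h>

/* Root threshold value quoted from net/ipv4/fib_trie.c in this thread. */
#define INFLATE_THRESHOLD_ROOT 15

/* Hypothetical per-trie state; the posted patch uses one global instead. */
struct trie_model {
	int root_fix;	/* extra amount added to the root inflate threshold */
};

/* Effective threshold the root would use on its next resize attempt. */
static int root_threshold(const struct trie_model *t)
{
	return INFLATE_THRESHOLD_ROOT + t->root_fix;
}

/*
 * Model of the take 2 rule for the root node.
 * max_resize < 0: the resize loop gave up, so raise the correction while
 * it is still below ~50% of the base threshold; past that, warn.
 * max_resize > 4: resizing converged quickly, so relax the correction.
 * Returns true when the kernel would still print the warning.
 */
static bool adjust_root_fix(struct trie_model *t, int max_resize)
{
	if (max_resize < 0) {
		if (t->root_fix * 2 < INFLATE_THRESHOLD_ROOT) {
			t->root_fix++;
			return false;
		}
		return true;
	}
	if (max_resize > 4 && t->root_fix)
		t->root_fix--;
	return false;
}
```

With a base of 15, the correction can grow to 8 before the warning reappears, so the effective threshold tops out at 23, which matches the "up to ~50%" description. Keeping `root_fix` in a per-trie structure is the design choice hinted at in the message; whether the kernel should do that is left open in the thread.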
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-28 11:04 ` Robert Olsson
2009-06-28 12:03 ` Jarek Poplawski
@ 2009-06-28 14:35 ` Jarek Poplawski
2009-06-28 15:32 ` Paweł Staszewski
1 sibling, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-28 14:35 UTC (permalink / raw)
To: Robert Olsson
Cc: Robert Olsson, Paweł Staszewski, Jorge Boncompte [DTI2], Eric Dumazet, Robert Olsson, Linux Network Development list
On Sun, Jun 28, 2009 at 01:04:51PM +0200, Robert Olsson wrote:
...
> The memory patches and "manual RCU" are also interesting to address
> the case with PREEMTP's.
Since 2.6.29 looks like it is preferred here, and there were a lot of
takes in this thread, I attach below a combined all-in-one patch including:
- 2.6.29 -> 2.6.30 preemption disable patch
- 2 RCU vs. preemption fixes from 2.6.31-rc
- "manual RCU" patch to force vfree/kfree before root's resize (take 3)
- "automatic" inflate_threshold_root fix (take 2)
Thanks,
Jarek P.
--- (for 2.6.29.x or even .28 or .27; any testing appreciated)
diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c	2009-06-27 20:25:06.000000000 +0200
+++ b/net/ipv4/fib_trie.c	2009-06-28 15:45:02.000000000 +0200
@@ -123,6 +123,7 @@ struct tnode {
 	union {
 		struct rcu_head rcu;
 		struct work_struct work;
+		struct tnode *tnode_free;
 	};
 	struct node *child[0];
 };
@@ -161,6 +162,8 @@ static void tnode_put_child_reorg(struct
 static struct node *resize(struct trie *t, struct tnode *tn);
 static struct tnode *inflate(struct trie *t, struct tnode *tn);
 static struct tnode *halve(struct trie *t, struct tnode *tn);
+/* tnodes to free after resize(); protected by RTNL */
+static struct tnode *tnode_free_head;

 static struct kmem_cache *fn_alias_kmem __read_mostly;
 static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -315,6 +318,7 @@ static const int halve_threshold = 25;
 static const int inflate_threshold = 50;
 static const int halve_threshold_root = 8;
 static const int inflate_threshold_root = 15;
+static int inflate_threshold_root_fix;


 static void __alias_free_mem(struct rcu_head *head)
@@ -363,6 +367,17 @@ static void __tnode_vfree(struct work_st
 	vfree(tn);
 }

+static void __tnode_free(struct tnode *tn)
+{
+	size_t size = sizeof(struct tnode) +
+		      (sizeof(struct node *) << tn->bits);
+
+	if (size <= PAGE_SIZE)
+		kfree(tn);
+	else
+		vfree(tn);
+}
+
 static void __tnode_free_rcu(struct rcu_head *head)
 {
 	struct tnode *tn = container_of(head, struct tnode, rcu);
@@ -385,6 +400,24 @@ static inline void tnode_free(struct tno
 	call_rcu(&tn->rcu, __tnode_free_rcu);
 }

+static void tnode_free_safe(struct tnode *tn)
+{
+	BUG_ON(IS_LEAF(tn));
+	tn->tnode_free = tnode_free_head;
+	tnode_free_head = tn;
+}
+
+static void tnode_free_flush(void)
+{
+	struct tnode *tn;
+
+	while ((tn = tnode_free_head)) {
+		tnode_free_head = tn->tnode_free;
+		tn->tnode_free = NULL;
+		__tnode_free(tn);
+	}
+}
+
 static struct leaf *leaf_new(void)
 {
 	struct leaf *l = kmem_cache_alloc(trie_leaf_kmem, GFP_KERNEL);
@@ -495,7 +528,7 @@ static struct node *resize(struct trie *

 	/* No children */
 	if (tn->empty_children == tnode_child_length(tn)) {
-		tnode_free(tn);
+		tnode_free_safe(tn);
 		return NULL;
 	}
 	/* One child */
@@ -509,7 +542,7 @@ static struct node *resize(struct trie *

 			/* compress one level */
 			node_set_parent(n, NULL);
-			tnode_free(tn);
+			tnode_free_safe(tn);
 			return n;
 		}
 	/*
@@ -581,7 +614,8 @@ static struct node *resize(struct trie *
 	/* Keep root node larger */

 	if (!tn->parent)
-		inflate_threshold_use = inflate_threshold_root;
+		inflate_threshold_use = inflate_threshold_root +
+					inflate_threshold_root_fix;
 	else
 		inflate_threshold_use = inflate_threshold;

@@ -605,15 +639,22 @@ static struct node *resize(struct trie *
 	}

 	if (max_resize < 0) {
-		if (!tn->parent)
-			pr_warning("Fix inflate_threshold_root."
-				   " Now=%d size=%d bits\n",
-				   inflate_threshold_root, tn->bits);
-		else
+		if (!tn->parent) {
+			if (inflate_threshold_root_fix * 2 <
+			    inflate_threshold_root)
+				inflate_threshold_root_fix++;
+			else
+				pr_warning("Fix inflate_threshold_root."
+					   " Now=%d size=%d bits fix=%d\n",
+					   inflate_threshold_root, tn->bits,
+					   inflate_threshold_root_fix);
+		} else {
 			pr_warning("Fix inflate_threshold."
 				   " Now=%d size=%d bits\n",
 				   inflate_threshold, tn->bits);
-	}
+		}
+	} else if (max_resize > 4 && !tn->parent && inflate_threshold_root_fix)
+		inflate_threshold_root_fix--;

 	check_tnode(tn);

@@ -670,7 +711,7 @@ static struct node *resize(struct trie *
 			/* compress one level */

 			node_set_parent(n, NULL);
-			tnode_free(tn);
+			tnode_free_safe(tn);
 			return n;
 		}

@@ -756,7 +797,7 @@ static struct tnode *inflate(struct trie
 			put_child(t, tn, 2*i, inode->child[0]);
 			put_child(t, tn, 2*i+1, inode->child[1]);

-			tnode_free(inode);
+			tnode_free_safe(inode);
 			continue;
 		}

@@ -801,9 +842,9 @@ static struct tnode *inflate(struct trie
 		put_child(t, tn, 2*i, resize(t, left));
 		put_child(t, tn, 2*i+1, resize(t, right));

-		tnode_free(inode);
+		tnode_free_safe(inode);
 	}
-	tnode_free(oldtnode);
+	tnode_free_safe(oldtnode);
 	return tn;
 nomem:
 	{
@@ -885,7 +926,7 @@ static struct tnode *halve(struct trie *
 		put_child(t, newBinNode, 1, right);
 		put_child(t, tn, i/2, resize(t, newBinNode));
 	}
-	tnode_free(oldtnode);
+	tnode_free_safe(oldtnode);
 	return tn;
 nomem:
 	{
@@ -983,12 +1024,14 @@ fib_find_node(struct trie *t, u32 key)
 	return NULL;
 }

-static struct node *trie_rebalance(struct trie *t, struct tnode *tn)
+static void trie_rebalance(struct trie *t, struct tnode *tn)
 {
 	int wasfull;
-	t_key cindex, key = tn->key;
+	t_key cindex, key;
 	struct tnode *tp;

+	key = tn->key;
+
 	while (tn != NULL && (tp = node_parent((struct node *)tn)) != NULL) {
 		cindex = tkey_extract_bits(key, tp->pos, tp->bits);
 		wasfull = tnode_full(tp, tnode_get_child(tp, cindex));
@@ -1003,11 +1046,22 @@ static struct node *trie_rebalance(struc
 		tn = tp;
 	}

+	if (tnode_free_head) {
+		synchronize_rcu();
+		tnode_free_flush();
+	}
+
 	/* Handle last (top) tnode */
-	if (IS_TNODE(tn))
+	if (IS_TNODE(tn)) {
 		tn = (struct tnode *)resize(t, (struct tnode *)tn);
+		rcu_assign_pointer(t->trie, (struct node *)tn);
+		synchronize_rcu();
+		tnode_free_flush();
+	} else {
+		rcu_assign_pointer(t->trie, (struct node *)tn);
+	}

-	return (struct node *)tn;
+	return;
 }

 /* only used from updater-side */
@@ -1155,7 +1209,7 @@ static struct list_head *fib_insert_node

 	/* Rebalance the trie */

-	rcu_assign_pointer(t->trie, trie_rebalance(t, tp));
+	trie_rebalance(t, tp);
 done:
 	return fa_head;
 }
@@ -1575,7 +1629,7 @@ static void trie_leaf_remove(struct trie
 	if (tp) {
 		t_key cindex = tkey_extract_bits(l->key, tp->pos, tp->bits);
 		put_child(t, (struct tnode *)tp, cindex, NULL);
-		rcu_assign_pointer(t->trie, trie_rebalance(t, tp));
+		trie_rebalance(t, tp);
 	} else
 		rcu_assign_pointer(t->trie, NULL);

^ permalink raw reply [flat|nested] 99+ messages in thread
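The deferred-free part of the combined patch (tnode_free_safe() queues a node, tnode_free_flush() frees the queue after a grace period) can be modelled in plain userspace C. This is a minimal sketch under loud assumptions: `struct tnode_model`, `free_safe()` and `free_flush()` are hypothetical stand-ins, there is no RCU here at all, and the point where the kernel code calls synchronize_rcu() is marked only by a comment. Like the patch, it relies on a single writer (RTNL in the kernel) and is not thread-safe.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct tnode; only the free-list link matters. */
struct tnode_model {
	int bits;
	struct tnode_model *tnode_free;	/* link used while queued for freeing */
};

/* Head of the pending-free list; single writer assumed. */
static struct tnode_model *free_head;

/* Queue a node instead of freeing it right away (cf. tnode_free_safe()). */
static void free_safe(struct tnode_model *tn)
{
	tn->tnode_free = free_head;
	free_head = tn;
}

/*
 * Free everything queued so far (cf. tnode_free_flush()).
 * In the kernel patch the caller first waits with synchronize_rcu(), so
 * that no reader can still hold a pointer to a queued node; this model
 * has no readers and simply frees.
 * Returns the number of nodes released.
 */
static int free_flush(void)
{
	struct tnode_model *tn;
	int freed = 0;

	while ((tn = free_head) != NULL) {
		free_head = tn->tnode_free;
		free(tn);
		freed++;
	}
	return freed;
}
```

The design trade-off visible in the patch is that memory is released synchronously, before the root is resized, rather than whenever call_rcu() callbacks happen to run; the cost is a blocking wait on the update path.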
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-28 14:35 ` Jarek Poplawski
@ 2009-06-28 15:32 ` Paweł Staszewski
2009-06-28 15:48 ` Paweł Staszewski
0 siblings, 1 reply; 99+ messages in thread
From: Paweł Staszewski @ 2009-06-28 15:32 UTC (permalink / raw)
To: Jarek Poplawski
Cc: Robert Olsson, Robert Olsson, Jorge Boncompte [DTI2], Eric Dumazet, Robert Olsson, Linux Network Development list
18 hours after applying Jarek's first patch I have seen no
"Fix inflate_threshold_root" messages, even when I run
"clear ip bgp *" on the router.
So I am switching from the previous patch to this new one for testing,
and we will see ...
Regards
Pawel Staszewski
Jarek Poplawski wrote:
> On Sun, Jun 28, 2009 at 01:04:51PM +0200, Robert Olsson wrote:
> ...
>
>> The memory patches and "manual RCU" are also interesting to address
>> the case with PREEMTP's.
>>
>
> Since 2.6.29 looks like prefered here, and there were a lot of takes
> in this thread, I attach below a combined all-in-one patch including:
> - 2.6.29 -> 2.6.30 preemption disable patch
> - 2 RCU vs. preemption fixes from 2.6.31-rc
> - "manual RCU" patch to force vfree/kfree before root's resize (take 3)
> - "automatic" inflate_threshold_root fix (take 2)
>
> Thanks,
> Jarek P.
> > --- (for 2.6.29.x or even .28 or .27; any testing appreciated) > > diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c > --- a/net/ipv4/fib_trie.c 2009-06-27 20:25:06.000000000 +0200 > +++ b/net/ipv4/fib_trie.c 2009-06-28 15:45:02.000000000 +0200 > @@ -123,6 +123,7 @@ struct tnode { > union { > struct rcu_head rcu; > struct work_struct work; > + struct tnode *tnode_free; > }; > struct node *child[0]; > }; > @@ -161,6 +162,8 @@ static void tnode_put_child_reorg(struct > static struct node *resize(struct trie *t, struct tnode *tn); > static struct tnode *inflate(struct trie *t, struct tnode *tn); > static struct tnode *halve(struct trie *t, struct tnode *tn); > +/* tnodes to free after resize(); protected by RTNL */ > +static struct tnode *tnode_free_head; > > static struct kmem_cache *fn_alias_kmem __read_mostly; > static struct kmem_cache *trie_leaf_kmem __read_mostly; > @@ -315,6 +318,7 @@ static const int halve_threshold = 25; > static const int inflate_threshold = 50; > static const int halve_threshold_root = 8; > static const int inflate_threshold_root = 15; > +static int inflate_threshold_root_fix; > > > static void __alias_free_mem(struct rcu_head *head) > @@ -363,6 +367,17 @@ static void __tnode_vfree(struct work_st > vfree(tn); > } > > +static void __tnode_free(struct tnode *tn) > +{ > + size_t size = sizeof(struct tnode) + > + (sizeof(struct node *) << tn->bits); > + > + if (size <= PAGE_SIZE) > + kfree(tn); > + else > + vfree(tn); > +} > + > static void __tnode_free_rcu(struct rcu_head *head) > { > struct tnode *tn = container_of(head, struct tnode, rcu); > @@ -385,6 +400,24 @@ static inline void tnode_free(struct tno > call_rcu(&tn->rcu, __tnode_free_rcu); > } > > +static void tnode_free_safe(struct tnode *tn) > +{ > + BUG_ON(IS_LEAF(tn)); > + tn->tnode_free = tnode_free_head; > + tnode_free_head = tn; > +} > + > +static void tnode_free_flush(void) > +{ > + struct tnode *tn; > + > + while ((tn = tnode_free_head)) { > + tnode_free_head = 
tn->tnode_free; > + tn->tnode_free = NULL; > + __tnode_free(tn); > + } > +} > + > static struct leaf *leaf_new(void) > { > struct leaf *l = kmem_cache_alloc(trie_leaf_kmem, GFP_KERNEL); > @@ -495,7 +528,7 @@ static struct node *resize(struct trie * > > /* No children */ > if (tn->empty_children == tnode_child_length(tn)) { > - tnode_free(tn); > + tnode_free_safe(tn); > return NULL; > } > /* One child */ > @@ -509,7 +542,7 @@ static struct node *resize(struct trie * > > /* compress one level */ > node_set_parent(n, NULL); > - tnode_free(tn); > + tnode_free_safe(tn); > return n; > } > /* > @@ -581,7 +614,8 @@ static struct node *resize(struct trie * > /* Keep root node larger */ > > if (!tn->parent) > - inflate_threshold_use = inflate_threshold_root; > + inflate_threshold_use = inflate_threshold_root + > + inflate_threshold_root_fix; > else > inflate_threshold_use = inflate_threshold; > > @@ -605,15 +639,22 @@ static struct node *resize(struct trie * > } > > if (max_resize < 0) { > - if (!tn->parent) > - pr_warning("Fix inflate_threshold_root." > - " Now=%d size=%d bits\n", > - inflate_threshold_root, tn->bits); > - else > + if (!tn->parent) { > + if (inflate_threshold_root_fix * 2 < > + inflate_threshold_root) > + inflate_threshold_root_fix++; > + else > + pr_warning("Fix inflate_threshold_root." > + " Now=%d size=%d bits fix=%d\n", > + inflate_threshold_root, tn->bits, > + inflate_threshold_root_fix); > + } else { > pr_warning("Fix inflate_threshold." 
> " Now=%d size=%d bits\n", > inflate_threshold, tn->bits); > - } > + } > + } else if (max_resize > 4 && !tn->parent && inflate_threshold_root_fix) > + inflate_threshold_root_fix--; > > check_tnode(tn); > > @@ -670,7 +711,7 @@ static struct node *resize(struct trie * > /* compress one level */ > > node_set_parent(n, NULL); > - tnode_free(tn); > + tnode_free_safe(tn); > return n; > } > > @@ -756,7 +797,7 @@ static struct tnode *inflate(struct trie > put_child(t, tn, 2*i, inode->child[0]); > put_child(t, tn, 2*i+1, inode->child[1]); > > - tnode_free(inode); > + tnode_free_safe(inode); > continue; > } > > @@ -801,9 +842,9 @@ static struct tnode *inflate(struct trie > put_child(t, tn, 2*i, resize(t, left)); > put_child(t, tn, 2*i+1, resize(t, right)); > > - tnode_free(inode); > + tnode_free_safe(inode); > } > - tnode_free(oldtnode); > + tnode_free_safe(oldtnode); > return tn; > nomem: > { > @@ -885,7 +926,7 @@ static struct tnode *halve(struct trie * > put_child(t, newBinNode, 1, right); > put_child(t, tn, i/2, resize(t, newBinNode)); > } > - tnode_free(oldtnode); > + tnode_free_safe(oldtnode); > return tn; > nomem: > { > @@ -983,12 +1024,14 @@ fib_find_node(struct trie *t, u32 key) > return NULL; > } > > -static struct node *trie_rebalance(struct trie *t, struct tnode *tn) > +static void trie_rebalance(struct trie *t, struct tnode *tn) > { > int wasfull; > - t_key cindex, key = tn->key; > + t_key cindex, key; > struct tnode *tp; > > + key = tn->key; > + > while (tn != NULL && (tp = node_parent((struct node *)tn)) != NULL) { > cindex = tkey_extract_bits(key, tp->pos, tp->bits); > wasfull = tnode_full(tp, tnode_get_child(tp, cindex)); > @@ -1003,11 +1046,22 @@ static struct node *trie_rebalance(struc > tn = tp; > } > > + if (tnode_free_head) { > + synchronize_rcu(); > + tnode_free_flush(); > + } > + > /* Handle last (top) tnode */ > - if (IS_TNODE(tn)) > + if (IS_TNODE(tn)) { > tn = (struct tnode *)resize(t, (struct tnode *)tn); > + rcu_assign_pointer(t->trie, (struct 
node *)tn); > + synchronize_rcu(); > + tnode_free_flush(); > + } else { > + rcu_assign_pointer(t->trie, (struct node *)tn); > + } > > - return (struct node *)tn; > + return; > } > > /* only used from updater-side */ > @@ -1155,7 +1209,7 @@ static struct list_head *fib_insert_node > > /* Rebalance the trie */ > > - rcu_assign_pointer(t->trie, trie_rebalance(t, tp)); > + trie_rebalance(t, tp); > done: > return fa_head; > } > @@ -1575,7 +1629,7 @@ static void trie_leaf_remove(struct trie > if (tp) { > t_key cindex = tkey_extract_bits(l->key, tp->pos, tp->bits); > put_child(t, (struct tnode *)tp, cindex, NULL); > - rcu_assign_pointer(t->trie, trie_rebalance(t, tp)); > + trie_rebalance(t, tp); > } else > rcu_assign_pointer(t->trie, NULL); > > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > ^ permalink raw reply [flat|nested] 99+ messages in thread
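The `inflate_threshold_root_fix` hunk in the patch quoted above can be read as a small feedback loop on the root threshold. What follows is only a userspace sketch of that reading, not kernel code. `root_resize_feedback()`, `effective_root_threshold()` and `demo_saturation()` are hypothetical helper names, and `max_resize` is treated as an opaque "resize budget left" value supplied by the caller. Only the two variables and the comparisons come from the patch, and only the root-node branch is modelled.

```c
#include <assert.h>

/* Same names and initial values as in the quoted patch. */
static const int inflate_threshold_root = 15;
static int inflate_threshold_root_fix;

/*
 * Hypothetical helper: feed back the outcome of one root resize.
 * Returns 1 where the patch would print the warning, 0 otherwise.
 */
static int root_resize_feedback(int max_resize)
{
	if (max_resize < 0) {
		/* Budget exhausted: raise the correction while it is small. */
		if (inflate_threshold_root_fix * 2 < inflate_threshold_root) {
			inflate_threshold_root_fix++;
			return 0;
		}
		return 1;	/* correction saturated: warn instead */
	}
	/* Plenty of budget left: relax the correction again. */
	if (max_resize > 4 && inflate_threshold_root_fix)
		inflate_threshold_root_fix--;
	return 0;
}

/* Threshold the root would use: the base value plus the correction. */
static int effective_root_threshold(void)
{
	return inflate_threshold_root + inflate_threshold_root_fix;
}

/* Illustration only: count the silent corrections before the first warning. */
static int demo_saturation(void)
{
	int silent = 0;

	inflate_threshold_root_fix = 0;
	while (!root_resize_feedback(-1))
		silent++;
	return silent;
}
```

With the patch's base value of 15, the correction in this sketch stops growing at 8, so the effective root threshold never exceeds 23. Only after that point would the "Fix inflate_threshold_root" warning be printed again.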
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-28 15:32 ` Paweł Staszewski
@ 2009-06-28 15:48 ` Paweł Staszewski
2009-06-28 19:56 ` Jarek Poplawski
2009-06-28 21:36 ` Jarek Poplawski
0 siblings, 2 replies; 99+ messages in thread
From: Paweł Staszewski @ 2009-06-28 15:48 UTC (permalink / raw)
To: Jarek Poplawski
Cc: Robert Olsson, Robert Olsson, Jorge Boncompte [DTI2], Eric Dumazet, Robert Olsson, Linux Network Development list

After applying this patch something is wrong:

Traffic is not forwarded,
no info in dmesg / no info from bgp,
and I also can't connect to the bgpd process.

I reverted the kernel to the previous version with Jarek's first patch.

Paweł Staszewski wrote:
>
>
> After 18 hours with Jarek's first patch applied I have seen no
> "Fix inflate_threshold_root" messages,
> even if I run "clear ip bgp *" on the router.
> So I replaced Jarek's previous patch with this new one for testing, and we
> will see ...
>
> Regards
> Pawel Staszewski
>
>
> Jarek Poplawski wrote:
>> On Sun, Jun 28, 2009 at 01:04:51PM +0200, Robert Olsson wrote:
>> ...
>>
>>> The memory patches and "manual RCU" are also interesting to address
>>> the case with PREEMPT's.
>>>
>>
>> Since 2.6.29 looks like preferred here, and there were a lot of takes
>> in this thread, I attach below a combined all-in-one patch including:
>> - 2.6.29 -> 2.6.30 preemption disable patch
>> - 2 RCU vs. preemption fixes from 2.6.31-rc
>> - "manual RCU" patch to force vfree/kfree before root's resize (take 3)
>> - "automatic" inflate_threshold_root fix (take 2)
>>
>> Thanks,
>> Jarek P.
>> >> --- (for 2.6.29.x or even .28 or .27; any testing appreciated) >> >> diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c >> --- a/net/ipv4/fib_trie.c 2009-06-27 20:25:06.000000000 +0200 >> +++ b/net/ipv4/fib_trie.c 2009-06-28 15:45:02.000000000 +0200 >> @@ -123,6 +123,7 @@ struct tnode { >> union { >> struct rcu_head rcu; >> struct work_struct work; >> + struct tnode *tnode_free; >> }; >> struct node *child[0]; >> }; >> @@ -161,6 +162,8 @@ static void tnode_put_child_reorg(struct >> static struct node *resize(struct trie *t, struct tnode *tn); >> static struct tnode *inflate(struct trie *t, struct tnode *tn); >> static struct tnode *halve(struct trie *t, struct tnode *tn); >> +/* tnodes to free after resize(); protected by RTNL */ >> +static struct tnode *tnode_free_head; >> >> static struct kmem_cache *fn_alias_kmem __read_mostly; >> static struct kmem_cache *trie_leaf_kmem __read_mostly; >> @@ -315,6 +318,7 @@ static const int halve_threshold = 25; >> static const int inflate_threshold = 50; >> static const int halve_threshold_root = 8; >> static const int inflate_threshold_root = 15; >> +static int inflate_threshold_root_fix; >> >> >> static void __alias_free_mem(struct rcu_head *head) >> @@ -363,6 +367,17 @@ static void __tnode_vfree(struct work_st >> vfree(tn); >> } >> >> +static void __tnode_free(struct tnode *tn) >> +{ >> + size_t size = sizeof(struct tnode) + >> + (sizeof(struct node *) << tn->bits); >> + >> + if (size <= PAGE_SIZE) >> + kfree(tn); >> + else >> + vfree(tn); >> +} >> + >> static void __tnode_free_rcu(struct rcu_head *head) >> { >> struct tnode *tn = container_of(head, struct tnode, rcu); >> @@ -385,6 +400,24 @@ static inline void tnode_free(struct tno >> call_rcu(&tn->rcu, __tnode_free_rcu); >> } >> >> +static void tnode_free_safe(struct tnode *tn) >> +{ >> + BUG_ON(IS_LEAF(tn)); >> + tn->tnode_free = tnode_free_head; >> + tnode_free_head = tn; >> +} >> + >> +static void tnode_free_flush(void) >> +{ >> + struct tnode *tn; >> + >> + 
while ((tn = tnode_free_head)) { >> + tnode_free_head = tn->tnode_free; >> + tn->tnode_free = NULL; >> + __tnode_free(tn); >> + } >> +} >> + >> static struct leaf *leaf_new(void) >> { >> struct leaf *l = kmem_cache_alloc(trie_leaf_kmem, GFP_KERNEL); >> @@ -495,7 +528,7 @@ static struct node *resize(struct trie * >> >> /* No children */ >> if (tn->empty_children == tnode_child_length(tn)) { >> - tnode_free(tn); >> + tnode_free_safe(tn); >> return NULL; >> } >> /* One child */ >> @@ -509,7 +542,7 @@ static struct node *resize(struct trie * >> >> /* compress one level */ >> node_set_parent(n, NULL); >> - tnode_free(tn); >> + tnode_free_safe(tn); >> return n; >> } >> /* >> @@ -581,7 +614,8 @@ static struct node *resize(struct trie * >> /* Keep root node larger */ >> >> if (!tn->parent) >> - inflate_threshold_use = inflate_threshold_root; >> + inflate_threshold_use = inflate_threshold_root + >> + inflate_threshold_root_fix; >> else >> inflate_threshold_use = inflate_threshold; >> >> @@ -605,15 +639,22 @@ static struct node *resize(struct trie * >> } >> >> if (max_resize < 0) { >> - if (!tn->parent) >> - pr_warning("Fix inflate_threshold_root." >> - " Now=%d size=%d bits\n", >> - inflate_threshold_root, tn->bits); >> - else >> + if (!tn->parent) { >> + if (inflate_threshold_root_fix * 2 < >> + inflate_threshold_root) >> + inflate_threshold_root_fix++; >> + else >> + pr_warning("Fix inflate_threshold_root." >> + " Now=%d size=%d bits fix=%d\n", >> + inflate_threshold_root, tn->bits, >> + inflate_threshold_root_fix); >> + } else { >> pr_warning("Fix inflate_threshold." 
>> " Now=%d size=%d bits\n", >> inflate_threshold, tn->bits); >> - } >> + } >> + } else if (max_resize > 4 && !tn->parent && >> inflate_threshold_root_fix) >> + inflate_threshold_root_fix--; >> >> check_tnode(tn); >> >> @@ -670,7 +711,7 @@ static struct node *resize(struct trie * >> /* compress one level */ >> >> node_set_parent(n, NULL); >> - tnode_free(tn); >> + tnode_free_safe(tn); >> return n; >> } >> >> @@ -756,7 +797,7 @@ static struct tnode *inflate(struct trie >> put_child(t, tn, 2*i, inode->child[0]); >> put_child(t, tn, 2*i+1, inode->child[1]); >> >> - tnode_free(inode); >> + tnode_free_safe(inode); >> continue; >> } >> >> @@ -801,9 +842,9 @@ static struct tnode *inflate(struct trie >> put_child(t, tn, 2*i, resize(t, left)); >> put_child(t, tn, 2*i+1, resize(t, right)); >> >> - tnode_free(inode); >> + tnode_free_safe(inode); >> } >> - tnode_free(oldtnode); >> + tnode_free_safe(oldtnode); >> return tn; >> nomem: >> { >> @@ -885,7 +926,7 @@ static struct tnode *halve(struct trie * >> put_child(t, newBinNode, 1, right); >> put_child(t, tn, i/2, resize(t, newBinNode)); >> } >> - tnode_free(oldtnode); >> + tnode_free_safe(oldtnode); >> return tn; >> nomem: >> { >> @@ -983,12 +1024,14 @@ fib_find_node(struct trie *t, u32 key) >> return NULL; >> } >> >> -static struct node *trie_rebalance(struct trie *t, struct tnode *tn) >> +static void trie_rebalance(struct trie *t, struct tnode *tn) >> { >> int wasfull; >> - t_key cindex, key = tn->key; >> + t_key cindex, key; >> struct tnode *tp; >> >> + key = tn->key; >> + >> while (tn != NULL && (tp = node_parent((struct node *)tn)) != >> NULL) { >> cindex = tkey_extract_bits(key, tp->pos, tp->bits); >> wasfull = tnode_full(tp, tnode_get_child(tp, cindex)); >> @@ -1003,11 +1046,22 @@ static struct node *trie_rebalance(struc >> tn = tp; >> } >> >> + if (tnode_free_head) { >> + synchronize_rcu(); >> + tnode_free_flush(); >> + } >> + >> /* Handle last (top) tnode */ >> - if (IS_TNODE(tn)) >> + if (IS_TNODE(tn)) { >> tn = 
(struct tnode *)resize(t, (struct tnode *)tn); >> + rcu_assign_pointer(t->trie, (struct node *)tn); >> + synchronize_rcu(); >> + tnode_free_flush(); >> + } else { >> + rcu_assign_pointer(t->trie, (struct node *)tn); >> + } >> >> - return (struct node *)tn; >> + return; >> } >> >> /* only used from updater-side */ >> @@ -1155,7 +1209,7 @@ static struct list_head *fib_insert_node >> >> /* Rebalance the trie */ >> >> - rcu_assign_pointer(t->trie, trie_rebalance(t, tp)); >> + trie_rebalance(t, tp); >> done: >> return fa_head; >> } >> @@ -1575,7 +1629,7 @@ static void trie_leaf_remove(struct trie >> if (tp) { >> t_key cindex = tkey_extract_bits(l->key, tp->pos, tp->bits); >> put_child(t, (struct tnode *)tp, cindex, NULL); >> - rcu_assign_pointer(t->trie, trie_rebalance(t, tp)); >> + trie_rebalance(t, tp); >> } else >> rcu_assign_pointer(t->trie, NULL); >> >> -- >> To unsubscribe from this list: send the line "unsubscribe netdev" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> >> > > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-28 15:48 ` Paweł Staszewski
@ 2009-06-28 19:56 ` Jarek Poplawski
2009-06-28 21:36 ` Jarek Poplawski
1 sibling, 0 replies; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-28 19:56 UTC (permalink / raw)
To: Paweł Staszewski
Cc: Robert Olsson, Robert Olsson, Jorge Boncompte [DTI2], Eric Dumazet, Robert Olsson, Linux Network Development list

On Sun, Jun 28, 2009 at 05:48:19PM +0200, Paweł Staszewski wrote:
>
> After applying this patch something is wrong:
>
> Traffic is not forwarded,
> no info in dmesg / no info from bgp,
> and I also can't connect to the bgpd process.
>
> I reverted the kernel to the previous version with Jarek's first patch.
>

Thank you very much, Pawel, for trying this. I'm starting to look for
the reason. In the meantime try to get some fib_trie stats for Robert.

Jarek P.

>
>
> Paweł Staszewski wrote:
>>
>>
>> After 18 hours with Jarek's first patch applied I have seen no
>> "Fix inflate_threshold_root" messages,
>> even if I run "clear ip bgp *" on the router.
>> So I replaced Jarek's previous patch with this new one for testing, and we
>> will see ...
>>
>> Regards
>> Pawel Staszewski
>>
>>
>> Jarek Poplawski wrote:
>>> On Sun, Jun 28, 2009 at 01:04:51PM +0200, Robert Olsson wrote:
>>> ...
>>>
>>>> The memory patches and "manual RCU" are also interesting to address
>>>> the case with PREEMPT's.
>>>>
>>>
>>> Since 2.6.29 looks like preferred here, and there were a lot of takes
>>> in this thread, I attach below a combined all-in-one patch including:
>>> - 2.6.29 -> 2.6.30 preemption disable patch
>>> - 2 RCU vs. preemption fixes from 2.6.31-rc
>>> - "manual RCU" patch to force vfree/kfree before root's resize (take 3)
>>> - "automatic" inflate_threshold_root fix (take 2)
>>>
>>> Thanks,
>>> Jarek P.
>>>
>>> --- (for 2.6.29.x or even .28 or .27; any testing appreciated)
>>>
>>> diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
>>> --- a/net/ipv4/fib_trie.c	2009-06-27 20:25:06.000000000 +0200
>>> +++ b/net/ipv4/fib_trie.c	2009-06-28 15:45:02.000000000 +0200
...

^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-28 15:48 ` Paweł Staszewski
2009-06-28 19:56 ` Jarek Poplawski
@ 2009-06-28 21:36 ` Jarek Poplawski
2009-06-29 8:08 ` Paweł Staszewski
2009-06-29 8:33 ` [PATCH net-2.6] " Jarek Poplawski
1 sibling, 2 replies; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-28 21:36 UTC (permalink / raw)
To: Paweł Staszewski, David Miller
Cc: Robert Olsson, Robert Olsson, Jorge Boncompte [DTI2], Eric Dumazet, Robert Olsson, Linux Network Development list

To David Miller:
since among patches tested negatively by Pawel are current 2 fixes
from 2.6.31-rc, I hope they weren't sent to -stable yet. Otherwise,
please withdraw them until they are tested alone. Thanks.

To Pawel:
On Sun, Jun 28, 2009 at 05:48:19PM +0200, Paweł Staszewski wrote:
>
> After applying this patch something is wrong:
>
> Traffic is not forwarded,
> no info in dmesg / no info from bgp,
> and I also can't connect to the bgpd process.
>
> I reverted the kernel to the previous version with Jarek's first patch.
>

Since checking this can take time I attach here a patch with only
changes which are currently in 2.6.31-rc. Of course, this part can be
broken as well, so it's up to you: if you could try it with caution
somewhere it would be very helpful; otherwise don't bother.

It could be applied to 2.6.29 with or without this currently working
patch.

Thanks,
Jarek P.
--- (for 2.6.29.x, .28 or .27)

diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c	2009-06-27 20:25:06.000000000 +0200
+++ b/net/ipv4/fib_trie.c	2009-06-28 23:06:02.000000000 +0200
@@ -123,6 +123,7 @@ struct tnode {
 	union {
 		struct rcu_head rcu;
 		struct work_struct work;
+		struct tnode *tnode_free;
 	};
 	struct node *child[0];
 };
@@ -161,6 +162,8 @@ static void tnode_put_child_reorg(struct
 static struct node *resize(struct trie *t, struct tnode *tn);
 static struct tnode *inflate(struct trie *t, struct tnode *tn);
 static struct tnode *halve(struct trie *t, struct tnode *tn);
+/* tnodes to free after resize(); protected by RTNL */
+static struct tnode *tnode_free_head;
 
 static struct kmem_cache *fn_alias_kmem __read_mostly;
 static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -385,6 +388,24 @@ static inline void tnode_free(struct tno
 		call_rcu(&tn->rcu, __tnode_free_rcu);
 }
 
+static void tnode_free_safe(struct tnode *tn)
+{
+	BUG_ON(IS_LEAF(tn));
+	tn->tnode_free = tnode_free_head;
+	tnode_free_head = tn;
+}
+
+static void tnode_free_flush(void)
+{
+	struct tnode *tn;
+
+	while ((tn = tnode_free_head)) {
+		tnode_free_head = tn->tnode_free;
+		tn->tnode_free = NULL;
+		tnode_free(tn);
+	}
+}
+
 static struct leaf *leaf_new(void)
 {
 	struct leaf *l = kmem_cache_alloc(trie_leaf_kmem, GFP_KERNEL);
@@ -495,7 +516,7 @@ static struct node *resize(struct trie *
 
 	/* No children */
 	if (tn->empty_children == tnode_child_length(tn)) {
-		tnode_free(tn);
+		tnode_free_safe(tn);
 		return NULL;
 	}
 	/* One child */
@@ -509,7 +530,7 @@ static struct node *resize(struct trie *
 
 			/* compress one level */
 			node_set_parent(n, NULL);
-			tnode_free(tn);
+			tnode_free_safe(tn);
 			return n;
 		}
 	/*
@@ -670,7 +691,7 @@ static struct node *resize(struct trie *
 			/* compress one level */
 
 			node_set_parent(n, NULL);
-			tnode_free(tn);
+			tnode_free_safe(tn);
 			return n;
 		}
 
@@ -756,7 +777,7 @@ static struct tnode *inflate(struct trie
 			put_child(t, tn, 2*i, inode->child[0]);
 			put_child(t, tn, 2*i+1, inode->child[1]);
 
-			tnode_free(inode);
+			tnode_free_safe(inode);
 			continue;
 		}
 
@@ -801,9 +822,9 @@ static struct tnode *inflate(struct trie
 		put_child(t, tn, 2*i, resize(t, left));
 		put_child(t, tn, 2*i+1, resize(t, right));
 
-		tnode_free(inode);
+		tnode_free_safe(inode);
 	}
-	tnode_free(oldtnode);
+	tnode_free_safe(oldtnode);
 	return tn;
 nomem:
 	{
@@ -885,7 +906,7 @@ static struct tnode *halve(struct trie *
 		put_child(t, newBinNode, 1, right);
 		put_child(t, tn, i/2, resize(t, newBinNode));
 	}
-	tnode_free(oldtnode);
+	tnode_free_safe(oldtnode);
 	return tn;
 nomem:
 	{
@@ -983,12 +1004,14 @@ fib_find_node(struct trie *t, u32 key)
 	return NULL;
 }
 
-static struct node *trie_rebalance(struct trie *t, struct tnode *tn)
+static void trie_rebalance(struct trie *t, struct tnode *tn)
 {
 	int wasfull;
-	t_key cindex, key = tn->key;
+	t_key cindex, key;
 	struct tnode *tp;
 
+	key = tn->key;
+
 	while (tn != NULL && (tp = node_parent((struct node *)tn)) != NULL) {
 		cindex = tkey_extract_bits(key, tp->pos, tp->bits);
 		wasfull = tnode_full(tp, tnode_get_child(tp, cindex));
@@ -998,6 +1021,7 @@ static struct node *trie_rebalance(struc
 				      (struct node *)tn, wasfull);
 
 		tp = node_parent((struct node *) tn);
+		tnode_free_flush();
 		if (!tp)
 			break;
 		tn = tp;
@@ -1007,7 +1031,10 @@ static struct node *trie_rebalance(struc
 	if (IS_TNODE(tn))
 		tn = (struct tnode *)resize(t, (struct tnode *)tn);
 
-	return (struct node *)tn;
+	rcu_assign_pointer(t->trie, (struct node *)tn);
+	tnode_free_flush();
+
+	return;
 }
 
 /* only used from updater-side */
@@ -1155,7 +1182,7 @@ static struct list_head *fib_insert_node
 
 	/* Rebalance the trie */
 
-	rcu_assign_pointer(t->trie, trie_rebalance(t, tp));
+	trie_rebalance(t, tp);
 done:
 	return fa_head;
 }
@@ -1575,7 +1602,7 @@ static void trie_leaf_remove(struct trie
 	if (tp) {
 		t_key cindex = tkey_extract_bits(l->key, tp->pos, tp->bits);
 		put_child(t, (struct tnode *)tp, cindex, NULL);
-		rcu_assign_pointer(t->trie, trie_rebalance(t, tp));
+		trie_rebalance(t, tp);
 	} else
 		rcu_assign_pointer(t->trie, NULL);
 

^ permalink raw reply [flat|nested] 99+ messages in thread
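The `tnode_free_safe()` / `tnode_free_flush()` pair in the patch above amounts to a singly linked "pending free" list: resize(), inflate() and halve() only queue nodes, and the caller releases them in one batch once the new subtree has been linked in. A minimal userspace sketch of that pattern follows. It assumes a single updater (the patch comment says the list is protected by RTNL) and calls plain `free()` where the patch calls `tnode_free()`. The cut-down `struct tnode`, the `freed_count` counter and `demo()` are illustrative additions, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Cut-down stand-in for the kernel's struct tnode; only the link matters. */
struct tnode {
	int bits;
	struct tnode *tnode_free;	/* next node on the pending-free list */
};

/* Nodes queued for release; a single updater is assumed (no locking). */
static struct tnode *tnode_free_head;
static int freed_count;			/* illustration only */

/* Queue a node instead of freeing it right away. */
static void tnode_free_safe(struct tnode *tn)
{
	/* The patch has BUG_ON(IS_LEAF(tn)); this sketch only rejects NULL. */
	assert(tn != NULL);
	tn->tnode_free = tnode_free_head;
	tnode_free_head = tn;
}

/* Release everything queued so far. */
static void tnode_free_flush(void)
{
	struct tnode *tn;

	while ((tn = tnode_free_head) != NULL) {
		tnode_free_head = tn->tnode_free;
		tn->tnode_free = NULL;
		free(tn);		/* the patch calls tnode_free(tn) here */
		freed_count++;
	}
}

/* Queue three nodes, check none was freed early, then flush the batch. */
static int demo(void)
{
	int start = freed_count;

	for (int i = 0; i < 3; i++) {
		struct tnode *tn = calloc(1, sizeof(*tn));

		if (!tn) {
			tnode_free_flush();
			return -1;
		}
		tnode_free_safe(tn);
	}
	if (freed_count != start)	/* queuing alone must not free */
		return -1;

	tnode_free_flush();
	return freed_count - start;
}
```

Calling `demo()` returns the number of nodes released by the single flush. The point of the split is that the moment of queuing and the moment of freeing are decoupled, so the caller decides when the old nodes may go away.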
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-28 21:36 ` Jarek Poplawski
@ 2009-06-29 8:08 ` Paweł Staszewski
2009-06-29 8:47 ` Paweł Staszewski
2009-06-29 8:33 ` [PATCH net-2.6] " Jarek Poplawski
1 sibling, 1 reply; 99+ messages in thread
From: Paweł Staszewski @ 2009-06-29 8:08 UTC (permalink / raw)
To: Jarek Poplawski
Cc: David Miller, Robert Olsson, Robert Olsson, Jorge Boncompte [DTI2], Eric Dumazet, Robert Olsson, Linux Network Development list

Jarek Poplawski wrote:
> To David Miller:
> since among patches tested negatively by Pawel are current 2 fixes
> from 2.6.31-rc, I hope they weren't sent to -stable yet. Otherwise,
> please withdraw them until they are tested alone. Thanks.
>
> To Pawel:
> On Sun, Jun 28, 2009 at 05:48:19PM +0200, Paweł Staszewski wrote:
>
>> After applying this patch something is wrong:
>>
>> Traffic is not forwarded,
>> no info in dmesg / no info from bgp,
>> and I also can't connect to the bgpd process.
>>
>> I reverted the kernel to the previous version with Jarek's first patch.
>>
>
> Since checking this can take time I attach here a patch with only
> changes which are currently in 2.6.31-rc. Of course, this part can be
> broken as well, so it's up to you: if you could try it with caution
> somewhere it would be very helpful; otherwise don't bother.
>
> It could be applied to 2.6.29 with or without this currently working
> patch.
>

OK.
I applied this patch 15 minutes ago to 2.6.29.5 and it is working now:
traffic is forwarded.

Some fib_triestat output:
cat /proc/net/fib_triestat
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
Aver depth: 2.29
Max depth: 6
Leaves: 277015
Prefixes: 290493
Internal nodes: 67115
1: 35733 2: 13635 3: 9544 4: 4832 5: 2239 6: 1125 7: 5
9: 1 18: 1
Pointers: 686614
Null ptrs: 342485
Total size: 18396 kB

Counters:
---------
gets = 3956301
backtracks = 192497
semantic match passed = 3895955
semantic match miss = 133
null node hit= 4306948
skipped node resize = 0

Local:
Aver depth: 3.75
Max depth: 5
Leaves: 12
Prefixes: 13
Internal nodes: 10
1: 9 2: 1
Pointers: 22
Null ptrs: 1
Total size: 2 kB

Counters:
---------
gets = 3960981
backtracks = 2152441
semantic match passed = 4757
semantic match miss = 0
null node hit= 194997
skipped node resize = 0

> Thanks,
> Jarek P.
> --- (for 2.6.29.x, .28 or .27)
>
> diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> --- a/net/ipv4/fib_trie.c	2009-06-27 20:25:06.000000000 +0200
> +++ b/net/ipv4/fib_trie.c	2009-06-28 23:06:02.000000000 +0200
...

^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits 2009-06-29 8:08 ` Paweł Staszewski @ 2009-06-29 8:47 ` Paweł Staszewski 2009-06-29 9:27 ` Jarek Poplawski 0 siblings, 1 reply; 99+ messages in thread From: Paweł Staszewski @ 2009-06-29 8:47 UTC (permalink / raw) To: Jarek Poplawski Cc: David Miller, Robert Olsson, Robert Olsson, Jorge Boncompte [DTI2], Eric Dumazet, Robert Olsson, Linux Network Development list But With all this patches i have the same problem with CPU load Every time when route cache entries are purged cpu load is increasing from 1% to 40 / 80% it depends I see that on 64bit machine when route cache entries are going down i have almost 80% load on each cpu where ethernet card is binded by smp_affinity But on 32bit machine cpu load reported by mpstat is half that on 64bit machine here is example from 32bit machine ( mpstat + rtstat -k entries ) Linux 2.6.29.5 (TM_02_C1) 06/29/09 _i686_ (2 CPU) 12:36:54 CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle RT CACHE ENTRIES (from rtstat) 12:36:57 all 0.00 0.00 0.00 0.00 1.51 15.08 0.00 0.00 83.42 83346 12:36:58 all 0.00 0.00 0.00 0.00 1.01 7.58 0.00 0.00 91.41 85988 12:36:59 all 0.00 0.00 0.00 0.00 0.50 1.01 0.00 0.00 98.49 89979 12:37:00 all 0.00 0.00 0.50 0.00 0.00 1.51 0.00 0.00 97.99 93652 12:37:01 all 0.00 0.00 0.00 0.00 0.00 2.01 0.00 0.00 97.99 96533 12:37:02 all 0.00 0.00 0.00 0.00 0.51 1.01 0.00 0.00 98.48 99451 12:37:03 all 0.00 0.00 0.00 0.00 0.00 2.49 0.00 0.00 97.51 102018 12:37:04 all 0.00 0.00 0.00 0.00 0.00 1.52 0.00 0.00 98.48 104153 12:37:05 all 0.00 0.00 0.00 0.00 0.00 1.01 0.00 0.00 98.99 105979 12:37:06 all 0.00 0.00 0.00 0.00 0.00 1.01 0.00 0.00 98.99 107684 12:37:07 all 0.00 0.00 0.00 0.00 0.00 1.53 0.00 0.00 98.47 109070 12:37:08 all 0.00 0.00 0.00 0.00 0.00 1.51 0.00 0.00 98.49 110462 12:37:09 all 0.00 0.00 0.00 0.00 0.00 1.52 0.00 0.00 98.48 112301 12:37:10 all 0.00 0.00 0.00 0.00 2.00 20.00 0.00 0.00 78.00 111535 12:37:11 all 0.00 0.00 0.00 0.00 2.49 
34.33 0.00 0.00 63.18 108659 12:37:12 all 0.00 0.00 0.00 0.00 3.03 28.28 0.00 0.00 68.69 105534 12:37:13 all 0.00 0.00 0.00 0.00 3.98 30.85 0.00 0.00 65.17 103341 12:37:14 all 0.00 0.00 0.00 0.00 4.50 30.50 0.00 0.00 65.00 101307 12:37:15 all 5.56 0.00 0.00 0.00 1.52 28.79 0.00 0.00 64.14 97435 12:37:16 all 11.39 0.00 0.50 0.00 4.95 30.69 0.00 0.00 52.48 93908 12:37:17 all 1.51 0.00 0.00 0.00 1.01 27.64 0.00 0.00 69.85 90229 12:37:18 all 0.00 0.00 0.00 0.00 2.99 27.36 0.00 0.00 69.65 87030 12:37:19 all 0.00 0.00 0.00 0.00 3.02 29.65 0.00 0.00 67.34 84324 12:37:20 all 0.00 0.00 0.00 0.00 2.99 30.35 0.00 0.00 66.67 82167 12:37:21 all 0.00 0.00 0.00 0.00 1.98 31.68 0.00 0.00 66.34 80121 12:37:22 all 0.00 0.00 0.00 0.00 1.51 30.65 0.00 0.00 67.84 77850 12:37:23 all 0.00 0.00 0.00 0.00 2.50 28.50 0.00 0.00 69.00 76005 12:37:24 all 0.00 0.00 0.00 0.00 1.98 23.27 0.00 0.00 74.75 74538 12:37:25 all 0.00 0.00 0.49 0.00 2.93 22.44 0.00 0.00 74.15 76923 12:37:26 all 0.00 0.00 0.00 0.00 1.51 15.58 0.00 0.00 82.91 79396 12:37:27 all 0.00 0.00 0.00 0.00 0.50 7.96 0.00 0.00 91.54 81835 12:37:28 all 0.00 0.00 0.00 0.00 0.50 3.52 0.00 0.00 95.98 84169 12:37:29 all 0.00 0.00 0.00 0.00 0.00 2.02 0.00 0.00 97.98 87740 12:37:30 all 0.00 0.00 0.00 0.00 0.51 1.52 0.00 0.00 97.98 91152 12:37:31 all 0.00 0.00 0.00 0.00 0.00 1.99 0.00 0.00 98.01 94102 12:37:32 all 0.00 0.00 0.00 0.00 0.00 1.52 0.00 0.00 98.48 97032 12:37:33 all 0.00 0.00 0.00 0.00 0.00 0.50 0.00 0.00 99.50 99685 12:37:34 all 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 99.00 101970 12:37:35 all 0.00 0.00 0.00 0.00 0.50 1.00 0.00 0.00 98.50 103814 12:37:36 all 0.00 0.00 0.00 0.00 0.00 1.52 0.00 0.00 98.48 104793 12:37:37 all 0.00 0.00 0.00 0.00 0.00 1.01 0.00 0.00 98.99 106214 12:37:38 all 0.00 0.00 0.00 0.00 0.50 1.01 0.00 0.00 98.49 107300 12:37:39 all 0.00 0.00 0.00 0.00 0.00 13.00 0.00 0.00 87.00 111951 12:37:40 all 0.00 0.00 0.00 0.00 2.50 29.50 0.00 0.00 68.00 111215 12:37:41 all 0.00 0.00 0.00 0.00 2.01 30.65 0.00 0.00 
67.34 108023 12:37:42 all 0.00 0.00 0.00 0.00 2.99 29.85 0.00 0.00 67.16 104751 12:37:43 all 0.00 0.00 0.00 0.00 2.00 31.00 0.00 0.00 67.00 100827 12:37:44 all 0.00 0.00 0.00 0.00 3.00 27.00 0.00 0.00 70.00 97184 12:37:45 all 0.00 0.00 0.00 0.00 2.50 29.00 0.00 0.00 68.50 93904 12:37:46 all 0.00 0.00 0.00 0.00 3.02 30.15 0.00 0.00 66.83 90979 12:37:47 all 0.00 0.00 0.00 0.00 2.49 27.86 0.00 0.00 69.65 88315 12:37:48 all 0.00 0.00 0.00 0.00 2.48 31.19 0.00 0.00 66.34 87777 12:37:49 all 0.00 0.00 0.00 0.00 2.94 32.35 0.00 0.00 64.71 89218 12:37:50 all 0.00 0.00 0.00 0.00 3.00 32.50 0.00 0.00 64.50 85896 12:37:51 all 0.00 0.00 0.00 0.00 2.50 30.00 0.00 0.00 67.50 82712 12:37:52 all 0.50 0.00 0.00 0.00 2.49 30.85 0.00 0.00 66.17 79137 12:37:53 all 0.00 0.00 0.50 0.00 2.00 28.50 0.00 0.00 69.00 75644 12:37:54 all 0.00 0.00 0.00 0.00 2.51 30.65 0.00 0.00 66.83 72843 12:37:55 all 0.00 0.00 0.50 0.00 3.48 28.36 0.00 0.00 67.66 73460 Paweł Staszewski pisze: > Jarek Poplawski pisze: >> To David Miller: >> since among patches tested negatively by Pawel are current 2 fixes >> from 2.6.31-rc, I hope they weren't sent to -stable yet. Otherwise, >> please withdraw them until they are tested alone. Thanks. >> >> To Pawel: >> On Sun, Jun 28, 2009 at 05:48:19PM +0200, Paweł Staszewski wrote: >> >>> After apply this patch something is wrong >>> >>> Traffic is not forwarded >>> no info in dmesg / no info from bgp >>> and also i can't connect to bgpd process >>> >>> I revert kernel to past version with first Jarek patch >>> >>> >> >> Since checking this can take time I attach here a patch with only >> changes which are currently in 2.6.31-rc. Of course, this part can be >> broken as well, so it's up to you: if you could try it with caution >> somewhere it would be very helpful; otherwise don't bother. >> >> It could be applied to 2.6.29 with or without this currently working >> patch. >> >> > > Ok. 
> I applied this patch 15mins ago to 2.6.29.5 and now it's working - > traffic is forwarded. > > Some fib_triestats > cat /proc/net/fib_triestat > Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes. > Main: > Aver depth: 2.29 > Max depth: 6 > Leaves: 277015 > Prefixes: 290493 > Internal nodes: 67115 > 1: 35733 2: 13635 3: 9544 4: 4832 5: 2239 6: 1125 7: 5 > 9: 1 18: 1 > Pointers: 686614 > Null ptrs: 342485 > Total size: 18396 kB > > Counters: > --------- > gets = 3956301 > backtracks = 192497 > semantic match passed = 3895955 > semantic match miss = 133 > null node hit= 4306948 > skipped node resize = 0 > > Local: > Aver depth: 3.75 > Max depth: 5 > Leaves: 12 > Prefixes: 13 > Internal nodes: 10 > 1: 9 2: 1 > Pointers: 22 > Null ptrs: 1 > Total size: 2 kB > > Counters: > --------- > gets = 3960981 > backtracks = 2152441 > semantic match passed = 4757 > semantic match miss = 0 > null node hit= 194997 > skipped node resize = 0 > > > >> Thanks, >> Jarek P. >> --- (for 2.6.29.x, .28 or .27) >> >> diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c >> --- a/net/ipv4/fib_trie.c 2009-06-27 20:25:06.000000000 +0200 >> +++ b/net/ipv4/fib_trie.c 2009-06-28 23:06:02.000000000 +0200 >> @@ -123,6 +123,7 @@ struct tnode { >> union { >> struct rcu_head rcu; >> struct work_struct work; >> + struct tnode *tnode_free; >> }; >> struct node *child[0]; >> }; >> @@ -161,6 +162,8 @@ static void tnode_put_child_reorg(struct >> static struct node *resize(struct trie *t, struct tnode *tn); >> static struct tnode *inflate(struct trie *t, struct tnode *tn); >> static struct tnode *halve(struct trie *t, struct tnode *tn); >> +/* tnodes to free after resize(); protected by RTNL */ >> +static struct tnode *tnode_free_head; >> >> static struct kmem_cache *fn_alias_kmem __read_mostly; >> static struct kmem_cache *trie_leaf_kmem __read_mostly; >> @@ -385,6 +388,24 @@ static inline void tnode_free(struct tno >> call_rcu(&tn->rcu, __tnode_free_rcu); >> } >> >> +static void 
tnode_free_safe(struct tnode *tn) >> +{ >> + BUG_ON(IS_LEAF(tn)); >> + tn->tnode_free = tnode_free_head; >> + tnode_free_head = tn; >> +} >> + >> +static void tnode_free_flush(void) >> +{ >> + struct tnode *tn; >> + >> + while ((tn = tnode_free_head)) { >> + tnode_free_head = tn->tnode_free; >> + tn->tnode_free = NULL; >> + tnode_free(tn); >> + } >> +} >> + >> static struct leaf *leaf_new(void) >> { >> struct leaf *l = kmem_cache_alloc(trie_leaf_kmem, GFP_KERNEL); >> @@ -495,7 +516,7 @@ static struct node *resize(struct trie * >> >> /* No children */ >> if (tn->empty_children == tnode_child_length(tn)) { >> - tnode_free(tn); >> + tnode_free_safe(tn); >> return NULL; >> } >> /* One child */ >> @@ -509,7 +530,7 @@ static struct node *resize(struct trie * >> >> /* compress one level */ >> node_set_parent(n, NULL); >> - tnode_free(tn); >> + tnode_free_safe(tn); >> return n; >> } >> /* >> @@ -670,7 +691,7 @@ static struct node *resize(struct trie * >> /* compress one level */ >> >> node_set_parent(n, NULL); >> - tnode_free(tn); >> + tnode_free_safe(tn); >> return n; >> } >> >> @@ -756,7 +777,7 @@ static struct tnode *inflate(struct trie >> put_child(t, tn, 2*i, inode->child[0]); >> put_child(t, tn, 2*i+1, inode->child[1]); >> >> - tnode_free(inode); >> + tnode_free_safe(inode); >> continue; >> } >> >> @@ -801,9 +822,9 @@ static struct tnode *inflate(struct trie >> put_child(t, tn, 2*i, resize(t, left)); >> put_child(t, tn, 2*i+1, resize(t, right)); >> >> - tnode_free(inode); >> + tnode_free_safe(inode); >> } >> - tnode_free(oldtnode); >> + tnode_free_safe(oldtnode); >> return tn; >> nomem: >> { >> @@ -885,7 +906,7 @@ static struct tnode *halve(struct trie * >> put_child(t, newBinNode, 1, right); >> put_child(t, tn, i/2, resize(t, newBinNode)); >> } >> - tnode_free(oldtnode); >> + tnode_free_safe(oldtnode); >> return tn; >> nomem: >> { >> @@ -983,12 +1004,14 @@ fib_find_node(struct trie *t, u32 key) >> return NULL; >> } >> >> -static struct node *trie_rebalance(struct 
trie *t, struct tnode *tn) >> +static void trie_rebalance(struct trie *t, struct tnode *tn) >> { >> int wasfull; >> - t_key cindex, key = tn->key; >> + t_key cindex, key; >> struct tnode *tp; >> >> + key = tn->key; >> + >> while (tn != NULL && (tp = node_parent((struct node *)tn)) != >> NULL) { >> cindex = tkey_extract_bits(key, tp->pos, tp->bits); >> wasfull = tnode_full(tp, tnode_get_child(tp, cindex)); >> @@ -998,6 +1021,7 @@ static struct node *trie_rebalance(struc >> (struct node *)tn, wasfull); >> >> tp = node_parent((struct node *) tn); >> + tnode_free_flush(); >> if (!tp) >> break; >> tn = tp; >> @@ -1007,7 +1031,10 @@ static struct node *trie_rebalance(struc >> if (IS_TNODE(tn)) >> tn = (struct tnode *)resize(t, (struct tnode *)tn); >> >> - return (struct node *)tn; >> + rcu_assign_pointer(t->trie, (struct node *)tn); >> + tnode_free_flush(); >> + >> + return; >> } >> >> /* only used from updater-side */ >> @@ -1155,7 +1182,7 @@ static struct list_head *fib_insert_node >> >> /* Rebalance the trie */ >> >> - rcu_assign_pointer(t->trie, trie_rebalance(t, tp)); >> + trie_rebalance(t, tp); >> done: >> return fa_head; >> } >> @@ -1575,7 +1602,7 @@ static void trie_leaf_remove(struct trie >> if (tp) { >> t_key cindex = tkey_extract_bits(l->key, tp->pos, tp->bits); >> put_child(t, (struct tnode *)tp, cindex, NULL); >> - rcu_assign_pointer(t->trie, trie_rebalance(t, tp)); >> + trie_rebalance(t, tp); >> } else >> rcu_assign_pointer(t->trie, NULL); >> >> >> >> >> > > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > ^ permalink raw reply [flat|nested] 99+ messages in thread
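For readers following the patch quoted above: it stops resize(), inflate() and halve() from freeing tnodes on the spot. tnode_free_safe() only queues a node on a singly linked list, and tnode_free_flush() releases the whole list later. What follows is a minimal userspace sketch of that idea, not the kernel code. It assumes plain malloc()/free() in place of the kernel allocator and call_rcu(), and every demo_* name is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct tnode; only the link field matters. */
struct demo_node {
	int id;
	struct demo_node *free_next;	/* plays the role of tn->tnode_free */
};

/* Head of the pending-free list (the patch relies on RTNL to protect its list). */
static struct demo_node *demo_free_head;
static int demo_freed;

/* Queue a node instead of releasing it immediately (cf. tnode_free_safe). */
static void demo_free_safe(struct demo_node *n)
{
	assert(n != NULL);
	n->free_next = demo_free_head;
	demo_free_head = n;
}

/* Release everything queued so far (cf. tnode_free_flush). */
static void demo_free_flush(void)
{
	struct demo_node *n;

	while ((n = demo_free_head)) {
		demo_free_head = n->free_next;
		n->free_next = NULL;
		free(n);	/* the kernel patch calls tnode_free() here */
		demo_freed++;
	}
}

/*
 * Queue `count` nodes, check that nothing was released early, then flush.
 * Returns the number of nodes released by the flush, or -1 on error.
 */
static int demo_run(int count)
{
	int i;

	demo_freed = 0;
	for (i = 0; i < count; i++) {
		struct demo_node *n = malloc(sizeof(*n));

		if (!n) {
			demo_free_flush();
			return -1;
		}
		n->id = i;
		demo_free_safe(n);
	}
	if (demo_freed != 0)
		return -1;
	demo_free_flush();
	return demo_freed;
}
```

Calling demo_run(3) queues three nodes and reports three releases only after the flush. The point of the deferral is that the caller decides when the old nodes go away, instead of each helper deciding for itself.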
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-29 8:47 ` Paweł Staszewski
@ 2009-06-29 9:27 ` Jarek Poplawski
2009-06-29 9:43 ` Paweł Staszewski
0 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-29 9:27 UTC (permalink / raw)
To: Paweł Staszewski
Cc: David Miller, Robert Olsson, Robert Olsson, Jorge Boncompte [DTI2], Eric Dumazet, Robert Olsson, Linux Network Development list
On Mon, Jun 29, 2009 at 10:47:44AM +0200, Paweł Staszewski wrote:
> But
> With all this patches i have the same problem with CPU load
> Every time when route cache entries are purged cpu load is increasing
> from 1% to 40 / 80% it depends
>
> I see that on 64bit machine when route cache entries are going down i
> have almost 80% load on each cpu where ethernet card is binded by
> smp_affinity
> But on 32bit machine cpu load reported by mpstat is half that on 64bit
> machine
> here is example from 32bit machine ( mpstat + rtstat -k entries )
>
> Linux 2.6.29.5 (TM_02_C1) 06/29/09 _i686_ (2 CPU)
>
> 12:36:54 CPU %usr %nice %sys %iowait %irq %soft %steal
> %guest %idle RT CACHE ENTRIES (from rtstat)
> 12:36:57 all 0.00 0.00 0.00 0.00 1.51 15.08 0.00
> 0.00 83.42 83346
I guess Eric is thinking about this. Btw., two little suggestions:
it should be easier to track if these route cache reports stay in their
starting thread ("weird problem"?), and if you could send these
stats/logs as attachments or turn off line wrapping, please? ;-)
Thanks,
Jarek P.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-29 9:27 ` Jarek Poplawski
@ 2009-06-29 9:43 ` Paweł Staszewski
0 siblings, 0 replies; 99+ messages in thread
From: Paweł Staszewski @ 2009-06-29 9:43 UTC (permalink / raw)
To: Jarek Poplawski
Cc: David Miller, Robert Olsson, Robert Olsson, Jorge Boncompte [DTI2], Eric Dumazet, Robert Olsson, Linux Network Development list
Jarek Poplawski wrote:
> On Mon, Jun 29, 2009 at 10:47:44AM +0200, Paweł Staszewski wrote:
>
>> But
>> With all this patches i have the same problem with CPU load
>> Every time when route cache entries are purged cpu load is increasing
>> from 1% to 40 / 80% it depends
>>
>> I see that on 64bit machine when route cache entries are going down i
>> have almost 80% load on each cpu where ethernet card is binded by
>> smp_affinity
>> But on 32bit machine cpu load reported by mpstat is half that on 64bit
>> machine
>> here is example from 32bit machine ( mpstat + rtstat -k entries )
>>
>> Linux 2.6.29.5 (TM_02_C1) 06/29/09 _i686_ (2 CPU)
>>
>> 12:36:54 CPU %usr %nice %sys %iowait %irq %soft %steal
>> %guest %idle RT CACHE ENTRIES (from rtstat)
>> 12:36:57 all 0.00 0.00 0.00 0.00 1.51 15.08 0.00
>> 0.00 83.42 83346
>>
>
> I guess Eric is thinking about this. Btw., two little suggestions:
> it should be easier to track if these route cache reports stay in its
> starting thread ("weird problem"?), and if you could send these
> stats/logs as attachements or turn off line wrapping, please? ;-)
>
> Thanks,
> Jarek P.
Sorry Jarek for combining problems :)
And yes, I will send the next stats as attachments :)
^ permalink raw reply [flat|nested] 99+ messages in thread
* [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits 2009-06-28 21:36 ` Jarek Poplawski 2009-06-29 8:08 ` Paweł Staszewski @ 2009-06-29 8:33 ` Jarek Poplawski 2009-06-29 9:51 ` Paweł Staszewski 1 sibling, 1 reply; 99+ messages in thread From: Jarek Poplawski @ 2009-06-29 8:33 UTC (permalink / raw) To: David Miller Cc: =?UTF-8?B?UGF3ZcWCIFN0YXN6ZXdza2k=?=, Robert Olsson, Robert Olsson, Jorge Boncompte [DTI2], Eric Dumazet, Robert Olsson, Linux Network Development list On 28-06-2009 23:36, Jarek Poplawski wrote: > To David Miller: > since among patches tested negatively by Pawel are current 2 fixes > from 2.6.31-rc, I hope they weren't sent to -stable yet. Otherwise, > please withdraw them until they are tested alone. Thanks. David, IMHO this fix is needed in net-2.6 even if it doesn't fix the problem reported by Pawel (there could be still something more). Pawel, I see you decided to test my previous patch, but try to add this one on top. Thanks, Jarek P. -------------------> ipv4: Fix fib_trie rebalancing, part 3 Alas current delaying of freeing old tnodes by RCU in trie_rebalance is still not enough because we can free a top tnode before updating a t->trie pointer. Reported-by: Pawel Staszewski <pstaszewski@itcare.pl> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> --- net/ipv4/fib_trie.c | 3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c index 012cf5a..00a54b2 100644 --- a/net/ipv4/fib_trie.c +++ b/net/ipv4/fib_trie.c @@ -1021,6 +1021,9 @@ static void trie_rebalance(struct trie *t, struct tnode *tn) (struct node *)tn, wasfull); tp = node_parent((struct node *) tn); + if (!tp) + rcu_assign_pointer(t->trie, (struct node *)tn); + tnode_free_flush(); if (!tp) break; ^ permalink raw reply related [flat|nested] 99+ messages in thread
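The "part 3" description above comes down to an ordering rule: the new top node has to be published through t->trie before the old one is handed to the free path. The sketch below models only that ordering in userspace. It does not model RCU grace periods, it uses C11 atomics as a rough stand-in for rcu_assign_pointer(), and it marks a node as released instead of actually freeing it. All demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the top tnode; `live` is cleared on release. */
struct demo_top {
	bool live;
};

/* Plays the role of t->trie. */
static _Atomic(struct demo_top *) demo_trie_root;

/* What a lookup starting right now would find behind the root pointer. */
static bool demo_root_is_live(void)
{
	struct demo_top *r = atomic_load_explicit(&demo_trie_root,
						  memory_order_acquire);

	return r == NULL || r->live;
}

/* Stand-in for handing a node to the free path. */
static void demo_release(struct demo_top *n)
{
	assert(n != NULL);
	n->live = false;
}

/*
 * Swap the root from `old` to `fresh`.  Returns true if the published
 * root pointed at a live node at every step of the swap.
 */
static bool demo_swap_root(struct demo_top *old, struct demo_top *fresh,
			   bool publish_first)
{
	bool ok = true;

	if (publish_first) {
		/* Order used by the fix: publish, then release. */
		atomic_store_explicit(&demo_trie_root, fresh,
				      memory_order_release);
		ok = ok && demo_root_is_live();
		demo_release(old);
	} else {
		/* Problem order: release while still published. */
		demo_release(old);
		ok = ok && demo_root_is_live();
		atomic_store_explicit(&demo_trie_root, fresh,
				      memory_order_release);
	}
	return ok && demo_root_is_live();
}

/* Run one swap on two stack-allocated nodes and report the result. */
static bool demo_try_order(bool publish_first)
{
	struct demo_top old = { .live = true };
	struct demo_top fresh = { .live = true };
	bool ok;

	atomic_store(&demo_trie_root, &old);
	ok = demo_swap_root(&old, &fresh, publish_first);
	atomic_store(&demo_trie_root, (struct demo_top *)NULL);	/* do not keep a stack address */
	return ok;
}
```

demo_try_order(true) reports that the root stayed valid throughout, while demo_try_order(false) reports a window in which the published root already points at a released node. In the kernel the window is narrower because the release goes through call_rcu(), so this is an illustration of the reasoning rather than a reproduction of the bug.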
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits 2009-06-29 8:33 ` [PATCH net-2.6] " Jarek Poplawski @ 2009-06-29 9:51 ` Paweł Staszewski 2009-06-29 10:47 ` Jarek Poplawski 2009-06-29 10:58 ` [PATCH net-2.6] " Jarek Poplawski 0 siblings, 2 replies; 99+ messages in thread From: Paweł Staszewski @ 2009-06-29 9:51 UTC (permalink / raw) To: Jarek Poplawski Cc: David Miller, Robert Olsson, Robert Olsson, Jorge Boncompte [DTI2], Eric Dumazet, Robert Olsson, Linux Network Development list [-- Attachment #1: Type: text/plain, Size: 1699 bytes --] I apply this patch fib_triestats in attached file :) Jarek Poplawski pisze: > On 28-06-2009 23:36, Jarek Poplawski wrote: > >> To David Miller: >> since among patches tested negatively by Pawel are current 2 fixes >> from 2.6.31-rc, I hope they weren't sent to -stable yet. Otherwise, >> please withdraw them until they are tested alone. Thanks. >> > > David, IMHO this fix is needed in net-2.6 even if it doesn't fix the > problem reported by Pawel (there could be still something more). > > Pawel, I see you decided to test my previous patch, but try to add > this one on top. > > Thanks, > Jarek P. > -------------------> > ipv4: Fix fib_trie rebalancing, part 3 > > Alas current delaying of freeing old tnodes by RCU in trie_rebalance > is still not enough because we can free a top tnode before updating a > t->trie pointer. 
> > Reported-by: Pawel Staszewski <pstaszewski@itcare.pl> > Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> > --- > > net/ipv4/fib_trie.c | 3 +++ > 1 files changed, 3 insertions(+), 0 deletions(-) > > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c > index 012cf5a..00a54b2 100644 > --- a/net/ipv4/fib_trie.c > +++ b/net/ipv4/fib_trie.c > @@ -1021,6 +1021,9 @@ static void trie_rebalance(struct trie *t, struct tnode *tn) > (struct node *)tn, wasfull); > > tp = node_parent((struct node *) tn); > + if (!tp) > + rcu_assign_pointer(t->trie, (struct node *)tn); > + > tnode_free_flush(); > if (!tp) > break; > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > [-- Attachment #2: fib_triestats.txt --] [-- Type: text/plain, Size: 3032 bytes --] cat /proc/net/fib_triestat Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes. Main: Aver depth: 2.29 Max depth: 7 Leaves: 276909 Prefixes: 290383 Internal nodes: 66893 1: 34715 2: 14024 3: 9889 4: 4833 5: 2275 6: 1150 7: 5 9: 1 18: 1 Pointers: 691662 Null ptrs: 347861 Total size: 18403 kB Counters: --------- gets = 2297579 backtracks = 131491 semantic match passed = 2233070 semantic match miss = 42 null node hit= 2016883 skipped node resize = 0 Local: Aver depth: 3.75 Max depth: 5 Leaves: 12 Prefixes: 13 Internal nodes: 10 1: 9 2: 1 Pointers: 22 Null ptrs: 1 Total size: 2 kB Counters: --------- gets = 2302102 backtracks = 1545197 semantic match passed = 4536 semantic match miss = 0 null node hit= 192664 skipped node resize = 0 ---------------------------------------------------------------- cat /proc/net/fib_triestat Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes. 
Main: Aver depth: 2.29 Max depth: 7 Leaves: 276904 Prefixes: 290378 Internal nodes: 66889 1: 34711 2: 14023 3: 9890 4: 4833 5: 2275 6: 1150 7: 5 9: 1 18: 1 Pointers: 691658 Null ptrs: 347866 Total size: 18402 kB Counters: --------- gets = 3006945 backtracks = 138787 semantic match passed = 2942047 semantic match miss = 85 null node hit= 2826377 skipped node resize = 0 Local: Aver depth: 3.75 Max depth: 5 Leaves: 12 Prefixes: 13 Internal nodes: 10 1: 9 2: 1 Pointers: 22 Null ptrs: 1 Total size: 2 kB Counters: --------- gets = 3011504 backtracks = 1796587 semantic match passed = 4577 semantic match miss = 0 null node hit= 192747 skipped node resize = 0 -------------------------------------------------------------- cat /proc/net/fib_triestat Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes. Main: Aver depth: 2.29 Max depth: 7 Leaves: 276904 Prefixes: 290378 Internal nodes: 66891 1: 34710 2: 14025 3: 9892 4: 4832 5: 2275 6: 1150 7: 5 9: 1 18: 1 Pointers: 691664 Null ptrs: 347870 Total size: 18402 kB Counters: --------- gets = 3320633 backtracks = 141904 semantic match passed = 3255585 semantic match miss = 99 null node hit= 3177543 skipped node resize = 0 Local: Aver depth: 3.75 Max depth: 5 Leaves: 12 Prefixes: 13 Internal nodes: 10 1: 9 2: 1 Pointers: 22 Null ptrs: 1 Total size: 2 kB Counters: --------- gets = 3325226 backtracks = 1904022 semantic match passed = 4601 semantic match miss = 0 null node hit= 192782 skipped node resize = 0 ^ permalink raw reply [flat|nested] 99+ messages in thread
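The histogram line in these fib_triestat dumps ("1: 34715 2: 14024 ...") can be cross-checked against the "Internal nodes" and "Pointers" fields: each entry reads as "bits: count", and a tnode with a given number of index bits has 2^bits child slots. The sketch below redoes that arithmetic for the first "Main" dump in the attachment above. The numbers are copied from the dump; the interpretation of the fields and the demo_* names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One "bits: count" pair from the tnode histogram line. */
struct demo_bucket {
	unsigned int bits;
	unsigned long count;
};

/* Histogram of the first "Main" dump in the attachment above. */
static const struct demo_bucket demo_main_dump[] = {
	{ 1, 34715 }, { 2, 14024 }, { 3, 9889 }, { 4, 4833 }, { 5, 2275 },
	{ 6, 1150 }, { 7, 5 }, { 9, 1 }, { 18, 1 },
};

#define DEMO_MAIN_LEN (sizeof(demo_main_dump) / sizeof(demo_main_dump[0]))

/* Total number of internal nodes: the sum of all counts. */
static unsigned long demo_internal_nodes(const struct demo_bucket *b, size_t n)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += b[i].count;
	return total;
}

/* Total child slots: each node with `bits` index bits has 2^bits slots. */
static unsigned long demo_pointers(const struct demo_bucket *b, size_t n)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		assert(b[i].bits < 32);
		total += b[i].count << b[i].bits;
	}
	return total;
}
```

For this dump the sums come out to 66893 internal nodes and 691662 pointers, matching the reported fields. The single 18-bit root accounts for 262144 of those slots. In the same dump, Pointers minus Null ptrs (343801) equals Leaves plus Internal nodes minus one, which is what one would expect if every node except the root occupies exactly one child slot.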
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits 2009-06-29 9:51 ` Paweł Staszewski @ 2009-06-29 10:47 ` Jarek Poplawski 2009-06-29 16:24 ` Paweł Staszewski 2009-06-30 7:09 ` Jarek Poplawski 2009-06-29 10:58 ` [PATCH net-2.6] " Jarek Poplawski 1 sibling, 2 replies; 99+ messages in thread From: Jarek Poplawski @ 2009-06-29 10:47 UTC (permalink / raw) To: Paweł Staszewski Cc: David Miller, Robert Olsson, Robert Olsson, Jorge Boncompte [DTI2], Eric Dumazet, Robert Olsson, Linux Network Development list On Mon, Jun 29, 2009 at 11:51:52AM +0200, Paweł Staszewski wrote: > I apply this patch > > fib_triestats in attached file :) Great! But it would be nice to check if this (accidentally ;-) might fix the previous problem, so I attach below the patch with "manual RCU", which btw. (or even more important) should verify RCU use here. It should be applied on top of this last "Fix..., part3". And again: it's quite probable it can fail, so with caution, no hurry (it can wait for quiet time)... Many thanks, Jarek P. 
--------------------> (synchronize_rcu take 4) diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c --- a/net/ipv4/fib_trie.c 2009-06-29 10:00:14.000000000 +0000 +++ b/net/ipv4/fib_trie.c 2009-06-29 10:04:22.000000000 +0000 @@ -366,6 +366,17 @@ static void __tnode_vfree(struct work_st vfree(tn); } +static void __tnode_free(struct tnode *tn) +{ + size_t size = sizeof(struct tnode) + + (sizeof(struct node *) << tn->bits); + + if (size <= PAGE_SIZE) + kfree(tn); + else + vfree(tn); +} + static void __tnode_free_rcu(struct rcu_head *head) { struct tnode *tn = container_of(head, struct tnode, rcu); @@ -402,7 +413,7 @@ static void tnode_free_flush(void) while ((tn = tnode_free_head)) { tnode_free_head = tn->tnode_free; tn->tnode_free = NULL; - tnode_free(tn); + __tnode_free(tn); } } @@ -1021,21 +1032,27 @@ static void trie_rebalance(struct trie * (struct node *)tn, wasfull); tp = node_parent((struct node *) tn); - if (!tp) + if (!tp) { rcu_assign_pointer(t->trie, (struct node *)tn); - - tnode_free_flush(); - if (!tp) break; + } tn = tp; } + if (tnode_free_head) { + synchronize_rcu(); + tnode_free_flush(); + } + /* Handle last (top) tnode */ - if (IS_TNODE(tn)) + if (IS_TNODE(tn)) { tn = (struct tnode *)resize(t, (struct tnode *)tn); - - rcu_assign_pointer(t->trie, (struct node *)tn); - tnode_free_flush(); + rcu_assign_pointer(t->trie, (struct node *)tn); + synchronize_rcu(); + tnode_free_flush(); + } else { + rcu_assign_pointer(t->trie, (struct node *)tn); + } return; } ^ permalink raw reply [flat|nested] 99+ messages in thread
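The __tnode_free() helper added in "take 4" picks kfree() or vfree() from the node's size, computed as sizeof(struct tnode) + (sizeof(struct node *) << tn->bits) and compared with PAGE_SIZE. A small worked example of that size rule follows. It assumes the 36-byte tnode reported by fib_triestat on the 32-bit box, 4-byte pointers, and a 4096-byte page; those values are taken as parameters, and the demo_* helpers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Bytes needed for a tnode header plus 2^bits child pointers. */
static size_t demo_tnode_size(size_t header, size_t ptr_size, unsigned int bits)
{
	assert(bits < 8 * sizeof(size_t));
	return header + (ptr_size << bits);
}

/* True when the size rule would select vfree() rather than kfree(). */
static bool demo_uses_vfree(size_t header, size_t ptr_size, unsigned int bits,
			    size_t page_size)
{
	return demo_tnode_size(header, ptr_size, bits) > page_size;
}
```

Under these assumptions a 9-bit tnode needs 36 + 2048 = 2084 bytes and stays on the kfree() side, a 10-bit tnode needs 36 + 4096 = 4132 bytes and crosses over to vfree(), and an 18-bit root like the one in the dumps needs 36 + 1048576 = 1048612 bytes.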
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits 2009-06-29 10:47 ` Jarek Poplawski @ 2009-06-29 16:24 ` Paweł Staszewski 2009-06-29 17:09 ` Jarek Poplawski 2009-06-30 7:09 ` Jarek Poplawski 1 sibling, 1 reply; 99+ messages in thread From: Paweł Staszewski @ 2009-06-29 16:24 UTC (permalink / raw) To: Jarek Poplawski Cc: David Miller, Robert Olsson, Robert Olsson, Jorge Boncompte [DTI2], Eric Dumazet, Robert Olsson, Linux Network Development list [-- Attachment #1: Type: text/plain, Size: 2975 bytes --] Jarek Poplawski pisze: > On Mon, Jun 29, 2009 at 11:51:52AM +0200, Paweł Staszewski wrote: > >> I apply this patch >> >> fib_triestats in attached file :) >> > > Great! But it would be nice to check if this (accidentally ;-) might > fix the previous problem, so I attach below the patch with "manual > RCU", which btw. (or even more important) should verify RCU use here. > > After this patches all is OK now i don't see Fix inflate_threshold_root. Even if i make "clear ip bgp * " Before this patches when i make clear ip bgp there was always info in dmesg about "Fix inflate_threshold_root" > It should be applied on top of this last "Fix..., part3". And > again: it's quite probable it can fail, so with caution, no hurry > (it can wait for quiet time)... > > After apply this last patch - traffic is not forwarded again :) i was fast and have only some fib_triestats in attached file before failover switch routers. This stats are from machine with this last patch that makes kernel to stop forwarding > Many thanks, > Jarek P. 
> --------------------> (synchronize_rcu take 4) > > diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c > --- a/net/ipv4/fib_trie.c 2009-06-29 10:00:14.000000000 +0000 > +++ b/net/ipv4/fib_trie.c 2009-06-29 10:04:22.000000000 +0000 > @@ -366,6 +366,17 @@ static void __tnode_vfree(struct work_st > vfree(tn); > } > > +static void __tnode_free(struct tnode *tn) > +{ > + size_t size = sizeof(struct tnode) + > + (sizeof(struct node *) << tn->bits); > + > + if (size <= PAGE_SIZE) > + kfree(tn); > + else > + vfree(tn); > +} > + > static void __tnode_free_rcu(struct rcu_head *head) > { > struct tnode *tn = container_of(head, struct tnode, rcu); > @@ -402,7 +413,7 @@ static void tnode_free_flush(void) > while ((tn = tnode_free_head)) { > tnode_free_head = tn->tnode_free; > tn->tnode_free = NULL; > - tnode_free(tn); > + __tnode_free(tn); > } > } > > @@ -1021,21 +1032,27 @@ static void trie_rebalance(struct trie * > (struct node *)tn, wasfull); > > tp = node_parent((struct node *) tn); > - if (!tp) > + if (!tp) { > rcu_assign_pointer(t->trie, (struct node *)tn); > - > - tnode_free_flush(); > - if (!tp) > break; > + } > tn = tp; > } > > + if (tnode_free_head) { > + synchronize_rcu(); > + tnode_free_flush(); > + } > + > /* Handle last (top) tnode */ > - if (IS_TNODE(tn)) > + if (IS_TNODE(tn)) { > tn = (struct tnode *)resize(t, (struct tnode *)tn); > - > - rcu_assign_pointer(t->trie, (struct node *)tn); > - tnode_free_flush(); > + rcu_assign_pointer(t->trie, (struct node *)tn); > + synchronize_rcu(); > + tnode_free_flush(); > + } else { > + rcu_assign_pointer(t->trie, (struct node *)tn); > + } > > return; > } > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > [-- Attachment #2: fib_triestats.txt --] [-- Type: text/plain, Size: 890 bytes --] cat /proc/net/fib_triestat Basic info: size of leaf: 20 bytes, size of tnode: 36 
bytes. Main: Aver depth: 3.51 Max depth: 8 Leaves: 3089 Prefixes: 3156 Internal nodes: 1167 1: 737 2: 202 3: 150 4: 40 5: 28 6: 8 7: 1 10: 1 Pointers: 6682 Null ptrs: 2427 Total size: 214 kB Counters: --------- gets = 1554240 backtracks = 916511 semantic match passed = 1127691 semantic match miss = 27 null node hit= 1439140 skipped node resize = 0 Local: Aver depth: 3.75 Max depth: 5 Leaves: 12 Prefixes: 13 Internal nodes: 10 1: 9 2: 1 Pointers: 22 Null ptrs: 1 Total size: 2 kB Counters: --------- gets = 1554767 backtracks = 572694 semantic match passed = 534 semantic match miss = 0 null node hit= 288 skipped node resize = 0 ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-29 16:24 ` Paweł Staszewski
@ 2009-06-29 17:09 ` Jarek Poplawski
0 siblings, 0 replies; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-29 17:09 UTC (permalink / raw)
To: Paweł Staszewski
Cc: David Miller, Robert Olsson, Robert Olsson, Jorge Boncompte [DTI2], Eric Dumazet, Robert Olsson, Linux Network Development list
On Mon, Jun 29, 2009 at 06:24:47PM +0200, Paweł Staszewski wrote:
> Jarek Poplawski wrote:
>> On Mon, Jun 29, 2009 at 11:51:52AM +0200, Paweł Staszewski wrote:
>>
>>> I apply this patch
>>>
>>> fib_triestats in attached file :)
>>>
>>
>> Great! But it would be nice to check if this (accidentally ;-) might
>> fix the previous problem, so I attach below the patch with "manual
>> RCU", which btw. (or even more important) should verify RCU use here.
>>
>>
> After this patches all is OK now i don't see Fix inflate_threshold_root.
> Even if i make "clear ip bgp * "
>
> Before this patches when i make clear ip bgp there was always info in
> dmesg about "Fix inflate_threshold_root"
>
>
>> It should be applied on top of this last "Fix..., part3". And
>> again: it's quite probable it can fail, so with caution, no hurry
>> (it can wait for quiet time)...
>>
>>
> After apply this last patch - traffic is not forwarded again :)
> i was fast and have only some fib_triestats in attached file before
> failover switch routers.
>
> This stats are from machine with this last patch that makes kernel to
> stop forwarding
OK, I'll look at it again. Thanks for testing!
Jarek P.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-29 10:47 ` Jarek Poplawski
2009-06-29 16:24 ` Paweł Staszewski
@ 2009-06-30 7:09 ` Jarek Poplawski
2009-06-30 20:16 ` Paweł Staszewski
1 sibling, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-30 7:09 UTC (permalink / raw)
To: Paweł Staszewski
Cc: David Miller, Robert Olsson, Robert Olsson, Jorge Boncompte [DTI2], Eric Dumazet, Robert Olsson, Linux Network Development list
On Mon, Jun 29, 2009 at 10:47:03AM +0000, Jarek Poplawski wrote:
> On Mon, Jun 29, 2009 at 11:51:52AM +0200, Paweł Staszewski wrote:
> > I apply this patch
> >
> > fib_triestats in attached file :)
>
> Great! But it would be nice to check if this (accidentally ;-) might
> fix the previous problem, so I attach below the patch with "manual
> RCU", which btw. (or even more important) should verify RCU use here.
>
> It should be applied on top of this last "Fix..., part3". And
> again: it's quite probable it can fail, so with caution, no hurry
> (it can wait for quiet time)...
Pawel, here is another try to check what's going on here, so just
like before, but this one on top of these 2 last working patches,
plus quiet time... (Stats aren't necessary; if there are any doubts
let me know.)
Thanks,
Jarek P.
--------------------> (synchronize_rcu take 5)
diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c	2009-06-29 10:00:14.000000000 +0000
+++ b/net/ipv4/fib_trie.c	2009-06-30 06:50:35.000000000 +0000
@@ -1036,6 +1036,7 @@ static void trie_rebalance(struct trie *
 
 	rcu_assign_pointer(t->trie, (struct node *)tn);
 	tnode_free_flush();
+	synchronize_rcu();
 
 	return;
 }
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits 2009-06-30 7:09 ` Jarek Poplawski @ 2009-06-30 20:16 ` Paweł Staszewski 2009-06-30 20:41 ` Jarek Poplawski 0 siblings, 1 reply; 99+ messages in thread From: Paweł Staszewski @ 2009-06-30 20:16 UTC (permalink / raw) To: Jarek Poplawski Cc: David Miller, Robert Olsson, Robert Olsson, Jorge Boncompte [DTI2], Eric Dumazet, Robert Olsson, Linux Network Development list Jarek Poplawski pisze: > On Mon, Jun 29, 2009 at 10:47:03AM +0000, Jarek Poplawski wrote: > >> On Mon, Jun 29, 2009 at 11:51:52AM +0200, Paweł Staszewski wrote: >> >>> I apply this patch >>> >>> fib_triestats in attached file :) >>> >> Great! But it would be nice to check if this (accidentally ;-) might >> fix the previous problem, so I attach below the patch with "manual >> RCU", which btw. (or even more important) should verify RCU use here. >> >> It should be applied on top of this last "Fix..., part3". And >> again: it's quite probable it can fail, so with caution, no hurry >> (it can wait for quiet time)... >> > > Pawel, here is another try to check what's going on here, so just > like before, but this one on top of these 2 last working patches, > plus quite time... (Stats aren't necessary; if these are some doubts > let me know.) > > Thanks, > Jarek P. 
> --------------------> (synchronize_rcu take 5) > > diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c > --- a/net/ipv4/fib_trie.c 2009-06-29 10:00:14.000000000 +0000 > +++ b/net/ipv4/fib_trie.c 2009-06-30 06:50:35.000000000 +0000 > @@ -1036,6 +1036,7 @@ static void trie_rebalance(struct trie * > > rcu_assign_pointer(t->trie, (struct node *)tn); > tnode_free_flush(); > + synchronize_rcu(); > > return; > } > Apply and tested Traffic is not forwarded after apply this patch.:) > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > > ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits 2009-06-30 20:16 ` Paweł Staszewski @ 2009-06-30 20:41 ` Jarek Poplawski 2009-06-30 23:31 ` Paweł Staszewski 0 siblings, 1 reply; 99+ messages in thread From: Jarek Poplawski @ 2009-06-30 20:41 UTC (permalink / raw) To: Paweł Staszewski Cc: David Miller, Robert Olsson, Robert Olsson, Jorge Boncompte [DTI2], Eric Dumazet, Robert Olsson, Linux Network Development list On Tue, Jun 30, 2009 at 10:16:57PM +0200, Paweł Staszewski wrote: > Jarek Poplawski pisze: >> On Mon, Jun 29, 2009 at 10:47:03AM +0000, Jarek Poplawski wrote: >> >>> On Mon, Jun 29, 2009 at 11:51:52AM +0200, Paweł Staszewski wrote: >>> >>>> I apply this patch >>>> >>>> fib_triestats in attached file :) >>>> >>> Great! But it would be nice to check if this (accidentally ;-) might >>> fix the previous problem, so I attach below the patch with "manual >>> RCU", which btw. (or even more important) should verify RCU use here. >>> >>> It should be applied on top of this last "Fix..., part3". And >>> again: it's quite probable it can fail, so with caution, no hurry >>> (it can wait for quiet time)... >>> >> >> Pawel, here is another try to check what's going on here, so just >> like before, but this one on top of these 2 last working patches, >> plus quite time... (Stats aren't necessary; if these are some doubts >> let me know.) >> >> Thanks, >> Jarek P. 
>> --------------------> (synchronize_rcu take 5) >> >> diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c >> --- a/net/ipv4/fib_trie.c 2009-06-29 10:00:14.000000000 +0000 >> +++ b/net/ipv4/fib_trie.c 2009-06-30 06:50:35.000000000 +0000 >> @@ -1036,6 +1036,7 @@ static void trie_rebalance(struct trie * >> rcu_assign_pointer(t->trie, (struct node *)tn); >> tnode_free_flush(); >> + synchronize_rcu(); >> return; >> } >> > > Apply and tested > > Traffic is not forwarded after apply this patch.:) A little comment: these last 2 patches weren't exactly to fix the problem you reported, which should be mostly fixed by the earlier patch. There is some other bug, which you omit with CONFIG_PREEMPT_NONE (but it's not for sure there is no by effects). So, I'd like to be sure you are willing and can (without too much risk) to do more such tests. Alas I've no way to generate similar conditions so it would simply have to wait for somebody else. Many thanks again, Jarek P. ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits 2009-06-30 20:41 ` Jarek Poplawski @ 2009-06-30 23:31 ` Paweł Staszewski 2009-07-01 6:36 ` Jarek Poplawski 0 siblings, 1 reply; 99+ messages in thread From: Paweł Staszewski @ 2009-06-30 23:31 UTC (permalink / raw) To: Jarek Poplawski, Linux Network Development list Jarek Poplawski pisze: > On Tue, Jun 30, 2009 at 10:16:57PM +0200, Paweł Staszewski wrote: > >> Jarek Poplawski pisze: >> >>> On Mon, Jun 29, 2009 at 10:47:03AM +0000, Jarek Poplawski wrote: >>> >>> >>>> On Mon, Jun 29, 2009 at 11:51:52AM +0200, Paweł Staszewski wrote: >>>> >>>> >>>>> I apply this patch >>>>> >>>>> fib_triestats in attached file :) >>>>> >>>>> >>>> Great! But it would be nice to check if this (accidentally ;-) might >>>> fix the previous problem, so I attach below the patch with "manual >>>> RCU", which btw. (or even more important) should verify RCU use here. >>>> >>>> It should be applied on top of this last "Fix..., part3". And >>>> again: it's quite probable it can fail, so with caution, no hurry >>>> (it can wait for quiet time)... >>>> >>>> >>> Pawel, here is another try to check what's going on here, so just >>> like before, but this one on top of these 2 last working patches, >>> plus quite time... (Stats aren't necessary; if these are some doubts >>> let me know.) >>> >>> Thanks, >>> Jarek P. 
>>> --------------------> (synchronize_rcu take 5) >>> >>> diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c >>> --- a/net/ipv4/fib_trie.c 2009-06-29 10:00:14.000000000 +0000 >>> +++ b/net/ipv4/fib_trie.c 2009-06-30 06:50:35.000000000 +0000 >>> @@ -1036,6 +1036,7 @@ static void trie_rebalance(struct trie * >>> rcu_assign_pointer(t->trie, (struct node *)tn); >>> tnode_free_flush(); >>> + synchronize_rcu(); >>> return; >>> } >>> >>> >> Apply and tested >> >> Traffic is not forwarded after apply this patch.:) >> > > A little comment: these last 2 patches weren't exactly to fix the > problem you reported, which should be mostly fixed by the earlier > patch. > > There is some other bug, which you omit with CONFIG_PREEMPT_NONE > (but it's not for sure there is no by effects). So, I'd like to be > sure you are willing and can (without too much risk) to do more such > tests. Alas I've no way to generate similar conditions so it would > simply have to wait for somebody else. > > Yes i can make tests like this. My network is splited to test clients and other normal clients so it's really no problem to make testing. - if testing clients working then traffic from normal clients is also switched to this router (but if traffic is not forwarded "like in this case" for testing clients then failover switching them to working router ) and other point to make this tests - is that - it is good to have all in linux kernel networking working well :) Regards Paweł Staszewski > Many thanks again, > Jarek P. > > > ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-30 23:31 ` Paweł Staszewski
@ 2009-07-01 6:36 ` Jarek Poplawski
[not found] ` <20090701072409.GA12592@ff.dom.local>
0 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-07-01 6:36 UTC (permalink / raw)
To: Paweł Staszewski; +Cc: Linux Network Development list
On Wed, Jul 01, 2009 at 01:31:09AM +0200, Paweł Staszewski wrote:
...
> Yes i can make tests like this.
> My network is splited to test clients and other normal clients
> so it's really no problem to make testing. - if testing clients working
> then traffic from normal clients is also switched to this router (but if
> traffic is not forwarded "like in this case" for testing clients then
> failover switching them to working router )
>
> and other point to make this tests - is that - it is good to have all in
> linux kernel networking working well :)
It's extremely nice of you! On the other hand, this type of change was
planned for net-next to fix possible memory problems, which might
have happened to you as well. So you'd probably experience this
problem in the future (2.6.32) anyway.
So here is the first of 2 patches (the second in a separate message),
which should be tested separately, each one applied on top of
2.6.29.x (vanilla - at least fib_trie.c), after reverting the previous
one. So, they are again all-in-one, to exclude any misunderstanding.
Btw., I assume there were no oopses, warnings or lockups after those
previous non-working patches - only no routing/forwarding.
Thanks,
Jarek P.
----------> (synchronize take 6 all-in-one for 2.6.29x, .28, or .27)

diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c	2009-06-29 05:30:50.000000000 +0000
+++ b/net/ipv4/fib_trie.c	2009-07-01 05:15:37.000000000 +0000
@@ -123,6 +123,7 @@ struct tnode {
 	union {
 		struct rcu_head rcu;
 		struct work_struct work;
+		struct tnode *tnode_free;
 	};
 	struct node *child[0];
 };
@@ -161,6 +162,8 @@ static void tnode_put_child_reorg(struct
 static struct node *resize(struct trie *t, struct tnode *tn);
 static struct tnode *inflate(struct trie *t, struct tnode *tn);
 static struct tnode *halve(struct trie *t, struct tnode *tn);
+/* tnodes to free after resize(); protected by RTNL */
+static struct tnode *tnode_free_head;
 
 static struct kmem_cache *fn_alias_kmem __read_mostly;
 static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -385,6 +388,24 @@ static inline void tnode_free(struct tno
 		call_rcu(&tn->rcu, __tnode_free_rcu);
 }
 
+static void tnode_free_safe(struct tnode *tn)
+{
+	BUG_ON(IS_LEAF(tn));
+	tn->tnode_free = tnode_free_head;
+	tnode_free_head = tn;
+}
+
+static void tnode_free_flush(void)
+{
+	struct tnode *tn;
+
+	while ((tn = tnode_free_head)) {
+		tnode_free_head = tn->tnode_free;
+		tn->tnode_free = NULL;
+		tnode_free(tn);
+	}
+}
+
 static struct leaf *leaf_new(void)
 {
 	struct leaf *l = kmem_cache_alloc(trie_leaf_kmem, GFP_KERNEL);
@@ -495,7 +516,7 @@ static struct node *resize(struct trie *
 
 	/* No children */
 	if (tn->empty_children == tnode_child_length(tn)) {
-		tnode_free(tn);
+		tnode_free_safe(tn);
 		return NULL;
 	}
 	/* One child */
@@ -509,7 +530,7 @@ static struct node *resize(struct trie *
 
 			/* compress one level */
 			node_set_parent(n, NULL);
-			tnode_free(tn);
+			tnode_free_safe(tn);
 			return n;
 		}
 	/*
@@ -670,7 +691,7 @@ static struct node *resize(struct trie *
 			/* compress one level */
 
 			node_set_parent(n, NULL);
-			tnode_free(tn);
+			tnode_free_safe(tn);
 			return n;
 		}
 
@@ -756,7 +777,7 @@ static struct tnode *inflate(struct trie
 			put_child(t, tn, 2*i, inode->child[0]);
 			put_child(t, tn, 2*i+1, inode->child[1]);
 
-			tnode_free(inode);
+			tnode_free_safe(inode);
 			continue;
 		}
 
@@ -801,9 +822,9 @@ static struct tnode *inflate(struct trie
 		put_child(t, tn, 2*i, resize(t, left));
 		put_child(t, tn, 2*i+1, resize(t, right));
 
-		tnode_free(inode);
+		tnode_free_safe(inode);
 	}
-	tnode_free(oldtnode);
+	tnode_free_safe(oldtnode);
 	return tn;
 nomem:
 	{
@@ -885,7 +906,7 @@ static struct tnode *halve(struct trie *
 		put_child(t, newBinNode, 1, right);
 		put_child(t, tn, i/2, resize(t, newBinNode));
 	}
-	tnode_free(oldtnode);
+	tnode_free_safe(oldtnode);
 	return tn;
 nomem:
 	{
@@ -983,12 +1004,14 @@ fib_find_node(struct trie *t, u32 key)
 	return NULL;
 }
 
-static struct node *trie_rebalance(struct trie *t, struct tnode *tn)
+static void trie_rebalance(struct trie *t, struct tnode *tn, bool sync)
 {
 	int wasfull;
-	t_key cindex, key = tn->key;
+	t_key cindex, key;
 	struct tnode *tp;
 
+	key = tn->key;
+
 	while (tn != NULL && (tp = node_parent((struct node *)tn)) != NULL) {
 		cindex = tkey_extract_bits(key, tp->pos, tp->bits);
 		wasfull = tnode_full(tp, tnode_get_child(tp, cindex));
@@ -999,6 +1022,10 @@ static struct node *trie_rebalance(struc
 
 		tp = node_parent((struct node *) tn);
 		if (!tp)
+			rcu_assign_pointer(t->trie, (struct node *)tn);
+
+		//tnode_free_flush();
+		if (!tp)
 			break;
 		tn = tp;
 	}
@@ -1007,7 +1034,12 @@ static struct node *trie_rebalance(struc
 	if (IS_TNODE(tn))
 		tn = (struct tnode *)resize(t, (struct tnode *)tn);
 
-	return (struct node *)tn;
+	rcu_assign_pointer(t->trie, (struct node *)tn);
+	if (sync)
+		synchronize_rcu();
+	tnode_free_flush();
+
+	return;
 }
 
 /* only used from updater-side */
@@ -1155,7 +1187,7 @@ static struct list_head *fib_insert_node
 
 	/* Rebalance the trie */
 
-	rcu_assign_pointer(t->trie, trie_rebalance(t, tp));
+	trie_rebalance(t, tp, true);
 done:
 	return fa_head;
 }
@@ -1575,7 +1607,7 @@ static void trie_leaf_remove(struct trie
 	if (tp) {
 		t_key cindex = tkey_extract_bits(l->key, tp->pos, tp->bits);
 		put_child(t, (struct tnode *)tp, cindex, NULL);
-		rcu_assign_pointer(t->trie, trie_rebalance(t, tp));
+		trie_rebalance(t, tp, false);
 	} else
 		rcu_assign_pointer(t->trie, NULL);
 
^ permalink raw reply [flat|nested] 99+ messages in thread
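The deferred-free idea in the patch above (queue tnodes during resize(), release them in one pass afterwards) can be sketched in plain userspace C. This is only an illustration under stated assumptions: the demo_* names are hypothetical, free() stands in for the kernel's tnode_free(), and neither RCU nor the RTNL lock is modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's struct tnode. */
struct demo_node {
	int id;
	struct demo_node *free_next;	/* plays the role of tn->tnode_free */
};

/* Pending-free list; the patch protects its counterpart with RTNL. */
static struct demo_node *demo_free_head;

/* Allocate a node; returns NULL if malloc() fails. */
static struct demo_node *demo_node_new(int id)
{
	struct demo_node *n = malloc(sizeof(*n));

	if (n) {
		n->id = id;
		n->free_next = NULL;
	}
	return n;
}

/* Queue a node instead of freeing it right away (cf. tnode_free_safe()). */
static void demo_free_safe(struct demo_node *n)
{
	assert(n != NULL);
	n->free_next = demo_free_head;
	demo_free_head = n;
}

/* Release everything queued so far (cf. tnode_free_flush()).
 * Returns the number of nodes released. */
static int demo_free_flush(void)
{
	struct demo_node *n;
	int freed = 0;

	while ((n = demo_free_head) != NULL) {
		demo_free_head = n->free_next;
		n->free_next = NULL;
		free(n);	/* the kernel version calls tnode_free() here */
		freed++;
	}
	return freed;
}
```

A caller would queue nodes with demo_free_safe() while restructuring and call demo_free_flush() once the new structure is published. In the patch above the flush happens at the end of trie_rebalance(), after rcu_assign_pointer() and, for inserts, after synchronize_rcu().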
[parent not found: <20090701072409.GA12592@ff.dom.local>]
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
[not found] ` <20090701072409.GA12592@ff.dom.local>
@ 2009-07-01 9:43 ` Paweł Staszewski
2009-07-01 9:50 ` Paweł Staszewski
2009-07-01 10:13 ` Jarek Poplawski
0 siblings, 2 replies; 99+ messages in thread
From: Paweł Staszewski @ 2009-07-01 9:43 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: Linux Network Development list, Robert Olsson
Jarek Poplawski wrote:
> On Wed, Jul 01, 2009 at 06:36:51AM +0000, Jarek Poplawski wrote:
>
>> On Wed, Jul 01, 2009 at 01:31:09AM +0200, Paweł Staszewski wrote:
>> ...
>
> It looks like Cc was shortened BTW, but I guess at least Robert is
> interested in this testing, so I add him back.
>
> Cheers,
> Jarek P.
>
>>> Yes i can make tests like this.
>>> My network is splited to test clients and other normal clients
>>> so it's really no problem to make testing. - if testing clients working
>>> then traffic from normal clients is also switched to this router (but if
>>> traffic is not forwarded "like in this case" for testing clients then
>>> failover switching them to working router )
>>>
>>> and other point to make this tests - is that - it is good to have all in
>>> linux kernel networking working well :)
>>>
>> It's extremely nice of you! On the other hand, this type of change
>> was planned to the net-next to fix possible memory problems, which
>> might have happened to you as well. So you'd probably experience this
>> problem in the future (2.6.32) anyway.
>>
>> So here is the first of 2 patches (the second in a separate message),
>> which should be tested separately, each one applied on top of the
>> 2.6.29.x (vanilla - at least fib_trie.c), after reverting the previous
>> one. So, they are again all-in-one, to eclude any misunderstanding.
>>
>> Btw., I assume there were no oopses, warnings or lockups after those
>> previous non-working patches - only no routing/forwarding.

Yes, with the previous patches there were no warnings, oopses or lockups.

But now I applied this patch and did more testing.
On the first boot, with bgpd started, traffic is not forwarded.
So I started to investigate and set up only some routes (static, without
bgpd) through this host.
Everything works for this host when all routes are static.

So I changed my BGP configuration a little, set up a default route to only
one of my iBGP peers, and started the bgpd process.
Everything works, but what is weird is the number of routes in the kernel
table. The kernel is learning routes from bgpd, but very slowly - really
very slowly.

The attached file has some fib_triestat output after 5 min of traffic.

Without this patch (normally), the total size reported by fib_triestat
in less than 1 sec is: "Total size: 35769 kB"

But with this patch the total size keeps growing, and after 5 min of
traffic it has grown to only: "Total size: 1005 kB"

Regards
Paweł Staszewski

>> Thanks,
>> Jarek P.
>> ----------> (synchronize take 6 all-in-one for 2.6.29x, .28, or .27)
>> ...

^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-07-01 9:43 ` Paweł Staszewski
@ 2009-07-01 9:50 ` Paweł Staszewski
2009-07-01 10:13 ` Jarek Poplawski
1 sibling, 0 replies; 99+ messages in thread
From: Paweł Staszewski @ 2009-07-01 9:50 UTC (permalink / raw)
To: Jarek Poplawski; +Cc: Linux Network Development list, Robert Olsson
[-- Attachment #1: Type: text/plain, Size: 8578 bytes --]
Paweł Staszewski wrote:
> ...
> Kernel is learning routes from bgpd but very slowly - really very slowly.
>
> In attached file there are some fib_triestats after 5min of traffic.
>
> Without this patch (normally)
> total size: reported by fib_triestats in less that 1sec is: "Total
> size: 35769 kB"
>
> But with this patch
> Total size is growing up and in 5 min of traffic it grow to only:
> "Total size: 1005 kB"

Sorry, the file was not attached.

> Regards
> Paweł Staszewski
> ...

[-- Attachment #2: fib_triestats.txt --]
[-- Type: text/plain, Size: 1883 bytes --]
cat /proc/net/fib_triestat
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
Aver depth: 3.79
Max depth: 9
Leaves: 15518
Prefixes: 15933
Internal nodes: 3973
1: 2260 2: 674 3: 518 4: 268 5: 164 6: 79 7: 5 8: 2 9: 1 10: 2
Pointers: 29664
Null ptrs: 10174
Total size: 995 kB
Counters:
---------
gets = 17863461
backtracks = 13345457
semantic match passed = 17305229
semantic match miss = 419
null node hit= 17602641
skipped node resize = 0
Local:
Aver depth: 3.75
Max depth: 5
Leaves: 12
Prefixes: 13
Internal nodes: 10
1: 9 2: 1
Pointers: 22
Null ptrs: 1
Total size: 2 kB
Counters:
---------
gets = 17865423
backtracks = 4964174
semantic match passed = 2126
semantic match miss = 0
null node hit= 1853
skipped node resize = 0
----------- After 30sec ----------------
cat /proc/net/fib_triestat
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
Aver depth: 3.79
Max depth: 9
Leaves: 15686
Prefixes: 16111
Internal nodes: 4002
1: 2259 2: 679 3: 536 4: 274 5: 165 6: 79 7: 5 8: 2 9: 1 10: 2
Pointers: 29954
Null ptrs: 10267
Total size: 1005 kB
Counters:
---------
gets = 18042821
backtracks = 13523292
semantic match passed = 17484572
semantic match miss = 419
null node hit= 17799334
skipped node resize = 0
Local:
Aver depth: 3.75
Max depth: 5
Leaves: 12
Prefixes: 13
Internal nodes: 10
1: 9 2: 1
Pointers: 22
Null ptrs: 1
Total size: 2 kB
Counters:
---------
gets = 18044798
backtracks = 5012942
semantic match passed = 2140
semantic match miss = 0
null node hit= 1865
skipped node resize = 0
^ permalink raw reply [flat|nested] 99+ messages in thread
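To put "very slowly" into numbers, the two samples in the attachment can be turned into an insert rate. The sketch below is a rough estimate only, with hypothetical helper names. It assumes the samples are exactly 30 seconds apart (per the "After 30sec" separator), that the rate stays constant, and that the full table is the 276539 leaves from the fib_triestat output at the start of this thread.

```c
#include <assert.h>

/* Leaves added per second between two fib_triestat samples. */
static double demo_leaf_rate(long leaves_before, long leaves_after,
			     double seconds)
{
	assert(seconds > 0.0);
	return (double)(leaves_after - leaves_before) / seconds;
}

/* Seconds needed to grow from leaves_now to target_leaves at a constant
 * rate. Returns 0.0 if the target is already reached and -1.0 if the
 * rate is not positive (no progress, so no finite estimate). */
static double demo_seconds_to_target(long leaves_now, long target_leaves,
				     double rate)
{
	if (target_leaves <= leaves_now)
		return 0.0;
	if (rate <= 0.0)
		return -1.0;
	return (double)(target_leaves - leaves_now) / rate;
}
```

With the attachment's values, demo_leaf_rate(15518, 15686, 30.0) gives about 5.6 leaves per second, and demo_seconds_to_target(15686, 276539, 5.6) comes to roughly 13 hours. The unpatched kernel reached its full size in under a second, according to the report above.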
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-07-01 9:43 ` Paweł Staszewski
2009-07-01 9:50 ` Paweł Staszewski
@ 2009-07-01 10:13 ` Jarek Poplawski
2009-07-01 11:04 ` Jarek Poplawski
1 sibling, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-07-01 10:13 UTC (permalink / raw)
To: Paweł Staszewski; +Cc: Linux Network Development list, Robert Olsson
On Wed, Jul 01, 2009 at 11:43:04AM +0200, Paweł Staszewski wrote:
...
> Yes on on previous patches there was / no warnings / no oopses or lockups
>
> But now i apply this patch and i make more testing.
> First boot with start of bgpd and - traffic is not forwarded
> So i start to search and make only some routes (static without bgpd)
> thru this host
> And all is working for this host when i make all by static routes.
>
> So i change a little my bgp configuration and make default route to only
> one of my iBGP peers and start bgpd process
> All is working and what is weird is number of routes in kernel table.
> Kernel is learning routes from bgpd but very slowly - really very slowly.

Pawel, this is really very helpful! So, this is (probably) only about
timing, not wrong memory freeing. On the other hand this test was only
for inserts. Btw., if you didn't start the second test, you can skip
it. I have to rethink this.

Many thanks,
Jarek P.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-07-01 10:13 ` Jarek Poplawski
@ 2009-07-01 11:04 ` Jarek Poplawski
2009-07-01 22:17 ` Paweł Staszewski
0 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-07-01 11:04 UTC (permalink / raw)
To: Paweł Staszewski; +Cc: Linux Network Development list, Robert Olsson
On Wed, Jul 01, 2009 at 10:13:33AM +0000, Jarek Poplawski wrote:
> On Wed, Jul 01, 2009 at 11:43:04AM +0200, Paweł Staszewski wrote:
> ...
> > Yes on on previous patches there was / no warnings / no oopses or lockups
> >
> > But now i apply this patch and i make more testing.
> > First boot with start of bgpd and - traffic is not forwarded
> > So i start to search and make only some routes (static without bgpd)
> > thru this host
> > And all is working for this host when i make all by static routes.
> >
> > So i change a little my bgp configuration and make default route to only
> > one of my iBGP peers and start bgpd process
> > All is working and what is weird is number of routes in kernel table.
> > Kernel is learning routes from bgpd but very slowly - really very slowly.
>
> Pawel, this is really very helpful! So, this is (probably) only about
> timing, not wrong memory freeing. On the other hand this test was only
> for inserts. Btw., if you didn't start the second test, you can skip
> it. I have to rethink this.

So, after your findings I'm about to recommend sending to -stable
3 patches from net-2.6, with additional lowering of threshold_root
settings, but it would be nice if you could give it a try with
CONFIG_PREEMPT instead of CONFIG_PREEMPT_NONE (if it doesn't break
your other apps!) It is expected to work this time...;-) Maybe a
bit slower.

Thanks,
Jarek P.
--------> (all-in-one preempt fixes to apply with vanilla 2.6.29.x)

diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c	2009-07-01 06:17:08.000000000 +0000
+++ b/net/ipv4/fib_trie.c	2009-07-01 10:43:44.000000000 +0000
@@ -123,6 +123,7 @@ struct tnode {
 	union {
 		struct rcu_head rcu;
 		struct work_struct work;
+		struct tnode *tnode_free;
 	};
 	struct node *child[0];
 };
@@ -161,6 +162,8 @@ static void tnode_put_child_reorg(struct
 static struct node *resize(struct trie *t, struct tnode *tn);
 static struct tnode *inflate(struct trie *t, struct tnode *tn);
 static struct tnode *halve(struct trie *t, struct tnode *tn);
+/* tnodes to free after resize(); protected by RTNL */
+static struct tnode *tnode_free_head;
 
 static struct kmem_cache *fn_alias_kmem __read_mostly;
 static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -313,8 +316,8 @@ static inline void check_tnode(const str
 
 static const int halve_threshold = 25;
 static const int inflate_threshold = 50;
-static const int halve_threshold_root = 8;
-static const int inflate_threshold_root = 15;
+static const int halve_threshold_root = 15;
+static const int inflate_threshold_root = 25;
 
 
 static void __alias_free_mem(struct rcu_head *head)
@@ -385,6 +388,24 @@ static inline void tnode_free(struct tno
 		call_rcu(&tn->rcu, __tnode_free_rcu);
 }
 
+static void tnode_free_safe(struct tnode *tn)
+{
+	BUG_ON(IS_LEAF(tn));
+	tn->tnode_free = tnode_free_head;
+	tnode_free_head = tn;
+}
+
+static void tnode_free_flush(void)
+{
+	struct tnode *tn;
+
+	while ((tn = tnode_free_head)) {
+		tnode_free_head = tn->tnode_free;
+		tn->tnode_free = NULL;
+		tnode_free(tn);
+	}
+}
+
 static struct leaf *leaf_new(void)
 {
 	struct leaf *l = kmem_cache_alloc(trie_leaf_kmem, GFP_KERNEL);
@@ -495,7 +516,7 @@ static struct node *resize(struct trie *
 
 	/* No children */
 	if (tn->empty_children == tnode_child_length(tn)) {
-		tnode_free(tn);
+		tnode_free_safe(tn);
 		return NULL;
 	}
 	/* One child */
@@ -509,7 +530,7 @@ static struct node *resize(struct trie *
 
 			/* compress one level */
 			node_set_parent(n, NULL);
-			tnode_free(tn);
+			tnode_free_safe(tn);
 			return n;
 		}
 	/*
@@ -670,7 +691,7 @@ static struct node *resize(struct trie *
 			/* compress one level */
 
 			node_set_parent(n, NULL);
-			tnode_free(tn);
+			tnode_free_safe(tn);
 			return n;
 		}
 
@@ -756,7 +777,7 @@ static struct tnode *inflate(struct trie
 			put_child(t, tn, 2*i, inode->child[0]);
 			put_child(t, tn, 2*i+1, inode->child[1]);
 
-			tnode_free(inode);
+			tnode_free_safe(inode);
 			continue;
 		}
 
@@ -801,9 +822,9 @@ static struct tnode *inflate(struct trie
 		put_child(t, tn, 2*i, resize(t, left));
 		put_child(t, tn, 2*i+1, resize(t, right));
 
-		tnode_free(inode);
+		tnode_free_safe(inode);
 	}
-	tnode_free(oldtnode);
+	tnode_free_safe(oldtnode);
 	return tn;
 nomem:
 	{
@@ -885,7 +906,7 @@ static struct tnode *halve(struct trie *
 		put_child(t, newBinNode, 1, right);
 		put_child(t, tn, i/2, resize(t, newBinNode));
 	}
-	tnode_free(oldtnode);
+	tnode_free_safe(oldtnode);
 	return tn;
 nomem:
 	{
@@ -983,12 +1004,14 @@ fib_find_node(struct trie *t, u32 key)
 	return NULL;
 }
 
-static struct node *trie_rebalance(struct trie *t, struct tnode *tn)
+static void trie_rebalance(struct trie *t, struct tnode *tn)
 {
 	int wasfull;
-	t_key cindex, key = tn->key;
+	t_key cindex, key;
 	struct tnode *tp;
 
+	key = tn->key;
+
 	while (tn != NULL && (tp = node_parent((struct node *)tn)) != NULL) {
 		cindex = tkey_extract_bits(key, tp->pos, tp->bits);
 		wasfull = tnode_full(tp, tnode_get_child(tp, cindex));
@@ -999,6 +1022,10 @@ static struct node *trie_rebalance(struc
 
 		tp = node_parent((struct node *) tn);
 		if (!tp)
+			rcu_assign_pointer(t->trie, (struct node *)tn);
+
+		tnode_free_flush();
+		if (!tp)
 			break;
 		tn = tp;
 	}
@@ -1007,7 +1034,10 @@ static struct node *trie_rebalance(struc
 	if (IS_TNODE(tn))
 		tn = (struct tnode *)resize(t, (struct tnode *)tn);
 
-	return (struct node *)tn;
+	rcu_assign_pointer(t->trie, (struct node *)tn);
+	tnode_free_flush();
+
+	return;
 }
 
 /* only used from updater-side */
@@ -1155,7 +1185,7 @@ static struct list_head *fib_insert_node
 
 	/* Rebalance the trie */
 
-	rcu_assign_pointer(t->trie, trie_rebalance(t, tp));
+	trie_rebalance(t, tp);
 done:
 	return fa_head;
 }
@@ -1575,7 +1605,7 @@ static void trie_leaf_remove(struct trie
 	if (tp) {
 		t_key cindex = tkey_extract_bits(l->key, tp->pos, tp->bits);
 		put_child(t, (struct tnode *)tp, cindex, NULL);
-		rcu_assign_pointer(t->trie, trie_rebalance(t, tp));
+		trie_rebalance(t, tp);
 	} else
 		rcu_assign_pointer(t->trie, NULL);
 
^ permalink raw reply [flat|nested] 99+ messages in thread
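The root threshold change in this patch (inflate_threshold_root from 15 to 25) can be illustrated with a small userspace check. This is a sketch under a clearly stated assumption: the inequality below is the form of the inflate test as recalled from fib_trie.c of that era, and it does not appear anywhere in this thread, so treat it as illustrative rather than authoritative. The demo_* name is hypothetical; the value 2048 corresponds to the "size=11 bits" root from the subject line.

```c
#include <assert.h>

/* Root inflate thresholds before and after the patch (values from the diff). */
enum {
	INFLATE_ROOT_OLD = 15,
	INFLATE_ROOT_NEW = 25
};

/*
 * Hypothetical helper: nonzero if a node with child_len slots would be
 * inflated (doubled). ASSUMPTION: the
 *   50 * (full + child_len - empty) >= threshold * child_len
 * form is recalled from the kernel source, not taken from this thread.
 */
static int demo_should_inflate(unsigned long child_len, unsigned long full,
			       unsigned long empty, unsigned long threshold)
{
	assert(empty <= child_len && full <= child_len - empty);
	return full > 0 &&
	       50 * (full + child_len - empty) >= threshold * child_len;
}
```

Under that assumption, an 11-bit root (2048 slots) with 700 occupied slots, 100 of them full tnodes, would be inflated with the old threshold but left alone with the new one, so a higher value makes the root less eager to grow.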
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits 2009-07-01 11:04 ` Jarek Poplawski @ 2009-07-01 22:17 ` Paweł Staszewski 2009-07-02 5:32 ` Jarek Poplawski 0 siblings, 1 reply; 99+ messages in thread From: Paweł Staszewski @ 2009-07-01 22:17 UTC (permalink / raw) To: Jarek Poplawski; +Cc: Linux Network Development list, Robert Olsson [-- Attachment #1: Type: text/plain, Size: 6762 bytes --] Jarek Poplawski pisze: > On Wed, Jul 01, 2009 at 10:13:33AM +0000, Jarek Poplawski wrote: > >> On Wed, Jul 01, 2009 at 11:43:04AM +0200, Paweł Staszewski wrote: >> ... >> >>> Yes on on previous patches there was / no warnings / no oopses or lockups >>> >>> But now i apply this patch and i make more testing. >>> First boot with start of bgpd and - traffic is not forwarded >>> So i start to search and make only some routes (static without bgpd) >>> thru this host >>> And all is working for this host when i make all by static routes. >>> >>> So i change a little my bgp configuration and make default route to only >>> one of my iBGP peers and start bgpd process >>> All is working and what is weird is number of routes in kernel table. >>> Kernel is learning routes from bgpd but very slowly - really very slowly. >>> >> Pawel, this is really very helpful! So, this is (probably) only about >> timing, not wrong memory freeing. On the other hand this test was only >> for inserts. Btw., if you didn't start the second test, you can skip >> it. I have to rethink this. >> > > So, after your findings I'm about to recommend sending to -stable > 3 patches from net-2.6, with additional lowering of threshold_root > settings, but it would be nice if you could give it a try with > CONFIG_PREEMPT instead of CONFIG_PREEMPT_NONE (if it doesn't break > your other apps!) It is expected to work this time...;-) Maybe a > bit slower. 
> > Patch applied to 2.6.29.5 with CONFIG_PREEMPT_NONE And working :) fib_triestats in attached file I think I can test it with PREEMPT enabled but first i must make some other tests of my apps that are on server. Regards Paweł Staszewski > Thanks, > Jarek P. > --------> (all-in-one preempt fixes to apply with vanilla 2.6.29.x) > > diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c > --- a/net/ipv4/fib_trie.c 2009-07-01 06:17:08.000000000 +0000 > +++ b/net/ipv4/fib_trie.c 2009-07-01 10:43:44.000000000 +0000 > @@ -123,6 +123,7 @@ struct tnode { > union { > struct rcu_head rcu; > struct work_struct work; > + struct tnode *tnode_free; > }; > struct node *child[0]; > }; > @@ -161,6 +162,8 @@ static void tnode_put_child_reorg(struct > static struct node *resize(struct trie *t, struct tnode *tn); > static struct tnode *inflate(struct trie *t, struct tnode *tn); > static struct tnode *halve(struct trie *t, struct tnode *tn); > +/* tnodes to free after resize(); protected by RTNL */ > +static struct tnode *tnode_free_head; > > static struct kmem_cache *fn_alias_kmem __read_mostly; > static struct kmem_cache *trie_leaf_kmem __read_mostly; > @@ -313,8 +316,8 @@ static inline void check_tnode(const str > > static const int halve_threshold = 25; > static const int inflate_threshold = 50; > -static const int halve_threshold_root = 8; > -static const int inflate_threshold_root = 15; > +static const int halve_threshold_root = 15; > +static const int inflate_threshold_root = 25; > > > static void __alias_free_mem(struct rcu_head *head) > @@ -385,6 +388,24 @@ static inline void tnode_free(struct tno > call_rcu(&tn->rcu, __tnode_free_rcu); > } > > +static void tnode_free_safe(struct tnode *tn) > +{ > + BUG_ON(IS_LEAF(tn)); > + tn->tnode_free = tnode_free_head; > + tnode_free_head = tn; > +} > + > +static void tnode_free_flush(void) > +{ > + struct tnode *tn; > + > + while ((tn = tnode_free_head)) { > + tnode_free_head = tn->tnode_free; > + tn->tnode_free = NULL; > + 
tnode_free(tn); > + } > +} > + > static struct leaf *leaf_new(void) > { > struct leaf *l = kmem_cache_alloc(trie_leaf_kmem, GFP_KERNEL); > @@ -495,7 +516,7 @@ static struct node *resize(struct trie * > > /* No children */ > if (tn->empty_children == tnode_child_length(tn)) { > - tnode_free(tn); > + tnode_free_safe(tn); > return NULL; > } > /* One child */ > @@ -509,7 +530,7 @@ static struct node *resize(struct trie * > > /* compress one level */ > node_set_parent(n, NULL); > - tnode_free(tn); > + tnode_free_safe(tn); > return n; > } > /* > @@ -670,7 +691,7 @@ static struct node *resize(struct trie * > /* compress one level */ > > node_set_parent(n, NULL); > - tnode_free(tn); > + tnode_free_safe(tn); > return n; > } > > @@ -756,7 +777,7 @@ static struct tnode *inflate(struct trie > put_child(t, tn, 2*i, inode->child[0]); > put_child(t, tn, 2*i+1, inode->child[1]); > > - tnode_free(inode); > + tnode_free_safe(inode); > continue; > } > > @@ -801,9 +822,9 @@ static struct tnode *inflate(struct trie > put_child(t, tn, 2*i, resize(t, left)); > put_child(t, tn, 2*i+1, resize(t, right)); > > - tnode_free(inode); > + tnode_free_safe(inode); > } > - tnode_free(oldtnode); > + tnode_free_safe(oldtnode); > return tn; > nomem: > { > @@ -885,7 +906,7 @@ static struct tnode *halve(struct trie * > put_child(t, newBinNode, 1, right); > put_child(t, tn, i/2, resize(t, newBinNode)); > } > - tnode_free(oldtnode); > + tnode_free_safe(oldtnode); > return tn; > nomem: > { > @@ -983,12 +1004,14 @@ fib_find_node(struct trie *t, u32 key) > return NULL; > } > > -static struct node *trie_rebalance(struct trie *t, struct tnode *tn) > +static void trie_rebalance(struct trie *t, struct tnode *tn) > { > int wasfull; > - t_key cindex, key = tn->key; > + t_key cindex, key; > struct tnode *tp; > > + key = tn->key; > + > while (tn != NULL && (tp = node_parent((struct node *)tn)) != NULL) { > cindex = tkey_extract_bits(key, tp->pos, tp->bits); > wasfull = tnode_full(tp, tnode_get_child(tp, cindex)); > 
@@ -999,6 +1022,10 @@ static struct node *trie_rebalance(struc > > tp = node_parent((struct node *) tn); > if (!tp) > + rcu_assign_pointer(t->trie, (struct node *)tn); > + > + tnode_free_flush(); > + if (!tp) > break; > tn = tp; > } > @@ -1007,7 +1034,10 @@ static struct node *trie_rebalance(struc > if (IS_TNODE(tn)) > tn = (struct tnode *)resize(t, (struct tnode *)tn); > > - return (struct node *)tn; > + rcu_assign_pointer(t->trie, (struct node *)tn); > + tnode_free_flush(); > + > + return; > } > > /* only used from updater-side */ > @@ -1155,7 +1185,7 @@ static struct list_head *fib_insert_node > > /* Rebalance the trie */ > > - rcu_assign_pointer(t->trie, trie_rebalance(t, tp)); > + trie_rebalance(t, tp); > done: > return fa_head; > } > @@ -1575,7 +1605,7 @@ static void trie_leaf_remove(struct trie > if (tp) { > t_key cindex = tkey_extract_bits(l->key, tp->pos, tp->bits); > put_child(t, (struct tnode *)tp, cindex, NULL); > - rcu_assign_pointer(t->trie, trie_rebalance(t, tp)); > + trie_rebalance(t, tp); > } else > rcu_assign_pointer(t->trie, NULL); > > > > [-- Attachment #2: fib_triestats.txt --] [-- Type: text/plain, Size: 925 bytes --] cat /proc/net/fib_triestat Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes. Main: Aver depth: 2.44 Max depth: 6 Leaves: 277395 Prefixes: 290874 Internal nodes: 66711 1: 32915 2: 14668 3: 10752 4: 4913 5: 2197 6: 895 7: 367 8: 3 17: 1 Pointers: 595526 Null ptrs: 251421 Total size: 18044 kB Counters: --------- gets = 2705388 backtracks = 137797 semantic match passed = 2658993 semantic match miss = 87 null node hit= 1980950 skipped node resize = 0 Local: Aver depth: 3.75 Max depth: 5 Leaves: 12 Prefixes: 13 Internal nodes: 10 1: 9 2: 1 Pointers: 22 Null ptrs: 1 Total size: 2 kB Counters: --------- gets = 2709741 backtracks = 1584810 semantic match passed = 4417 semantic match miss = 0 null node hit= 192688 skipped node resize = 0 ^ permalink raw reply [flat|nested] 99+ messages in thread
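The deferred-free scheme in the patch above (tnode_free_safe() queues a node, tnode_free_flush() releases the queue once the rebuilt subtree has been published) can be sketched in user space roughly as follows. This is only an illustration under stated assumptions: the sketch_* names are hypothetical, the node type is reduced to its list link, and plain free() stands in for the kernel's RCU-deferred release.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct tnode; only the free-list link matters here. */
struct sketch_tnode {
	struct sketch_tnode *next_free;	/* plays the role of tn->tnode_free */
};

static struct sketch_tnode *sketch_free_head;	/* plays the role of tnode_free_head */
static unsigned int sketch_freed;		/* number of nodes actually released */

/* Queue a node instead of freeing it while the trie is still being rebuilt. */
static void sketch_free_safe(struct sketch_tnode *tn)
{
	tn->next_free = sketch_free_head;
	sketch_free_head = tn;
}

/* Release everything queued so far; meant to run after the new subtree is visible. */
static void sketch_free_flush(void)
{
	struct sketch_tnode *tn;

	while ((tn = sketch_free_head) != NULL) {
		sketch_free_head = tn->next_free;
		tn->next_free = NULL;
		free(tn);	/* assumption: the kernel defers the real release via RCU instead */
		sketch_freed++;
	}
}
```

The point of the two-step design is ordering: nothing is released while resize() is still walking and replacing nodes, and the flush happens only after rcu_assign_pointer() has published the result.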
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits 2009-07-01 22:17 ` Paweł Staszewski @ 2009-07-02 5:32 ` Jarek Poplawski 2009-07-02 5:43 ` Paweł Staszewski 0 siblings, 1 reply; 99+ messages in thread From: Jarek Poplawski @ 2009-07-02 5:32 UTC (permalink / raw) To: Paweł Staszewski; +Cc: Linux Network Development list, Robert Olsson On Thu, Jul 02, 2009 at 12:17:19AM +0200, Paweł Staszewski wrote: > Jarek Poplawski pisze: ... >> So, after your findings I'm about to recommend sending to -stable >> 3 patches from net-2.6, with additional lowering of threshold_root >> settings, but it would be nice if you could give it a try with >> CONFIG_PREEMPT instead of CONFIG_PREEMPT_NONE (if it doesn't break >> your other apps!) It is expected to work this time...;-) Maybe a >> bit slower. >> >> > Patch applied to 2.6.29.5 with CONFIG_PREEMPT_NONE > And working :) Hmm... It should, because you tested very similar patch already;-) Sorry if I didn't make it clear. > > fib_triestats in attached file > > I think I can test it with PREEMPT enabled but first i must make some > other tests of my apps that are on server. It could probably matter only if you're using some broken out-of-tree patches. Otherwise the kernel is expected to work OK. Btw., it would be also interesting to check if there is any difference wrt. these route cache problems while PREEMPT is enabled. Thanks, Jarek P. ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits 2009-07-02 5:32 ` Jarek Poplawski @ 2009-07-02 5:43 ` Paweł Staszewski 2009-07-02 6:00 ` Jarek Poplawski 0 siblings, 1 reply; 99+ messages in thread From: Paweł Staszewski @ 2009-07-02 5:43 UTC (permalink / raw) To: Jarek Poplawski; +Cc: Linux Network Development list, Robert Olsson Jarek Poplawski pisze: > On Thu, Jul 02, 2009 at 12:17:19AM +0200, Paweł Staszewski wrote: > >> Jarek Poplawski pisze: >> > ... > >>> So, after your findings I'm about to recommend sending to -stable >>> 3 patches from net-2.6, with additional lowering of threshold_root >>> settings, but it would be nice if you could give it a try with >>> CONFIG_PREEMPT instead of CONFIG_PREEMPT_NONE (if it doesn't break >>> your other apps!) It is expected to work this time...;-) Maybe a >>> bit slower. >>> >>> >>> >> Patch applied to 2.6.29.5 with CONFIG_PREEMPT_NONE >> And working :) >> > > Hmm... It should, because you tested very similar patch already;-) > Sorry if I didn't make it clear. > > Yes i know there was almost identical one. And i see this was without sync rcu :) >> fib_triestats in attached file >> >> I think I can test it with PREEMPT enabled but first i must make some >> other tests of my apps that are on server. >> > > It could probably matter only if you're using some broken out-of-tree > patches. Otherwise the kernel is expected to work OK. > > Im a little confused about using of PREEMPT kernel because of past there was many oopses / lockups :) but yes that was a little long time ago. I will try to make this test today. > Btw., it would be also interesting to check if there is any difference > wrt. these route cache problems while PREEMPT is enabled. > > Thanks, > Jarek P. 
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits 2009-07-02 5:43 ` Paweł Staszewski @ 2009-07-02 6:00 ` Jarek Poplawski 2009-07-02 15:31 ` Robert Olsson 2009-07-05 0:26 ` Paweł Staszewski 0 siblings, 2 replies; 99+ messages in thread From: Jarek Poplawski @ 2009-07-02 6:00 UTC (permalink / raw) To: Paweł Staszewski; +Cc: Linux Network Development list, Robert Olsson On Thu, Jul 02, 2009 at 07:43:25AM +0200, Paweł Staszewski wrote: > Jarek Poplawski pisze: >> On Thu, Jul 02, 2009 at 12:17:19AM +0200, Paweł Staszewski wrote: >> >>> Jarek Poplawski pisze: >>> >> ... >> >>>> So, after your findings I'm about to recommend sending to -stable >>>> 3 patches from net-2.6, with additional lowering of threshold_root >>>> settings, but it would be nice if you could give it a try with >>>> CONFIG_PREEMPT instead of CONFIG_PREEMPT_NONE (if it doesn't break >>>> your other apps!) It is expected to work this time...;-) Maybe a >>>> bit slower. >>>> >>>> >>> Patch applied to 2.6.29.5 with CONFIG_PREEMPT_NONE >>> And working :) >>> >> >> Hmm... It should, because you tested very similar patch already;-) >> Sorry if I didn't make it clear. >> >> > Yes i know there was almost identical one. > And i see this was without sync rcu :) Yes, it looks like we can't free memory so simple because of such huge latencies. > >>> fib_triestats in attached file >>> >>> I think I can test it with PREEMPT enabled but first i must make some >>> other tests of my apps that are on server. >>> >> >> It could probably matter only if you're using some broken out-of-tree >> patches. Otherwise the kernel is expected to work OK. >> >> > Im a little confused about using of PREEMPT kernel because of past > there was many oopses / lockups :) but yes that was a little long time ago. > I will try to make this test today. > >> Btw., it would be also interesting to check if there is any difference >> wrt. these route cache problems while PREEMPT is enabled. And you're very right! 
The place we're fixing is the best example. On the other hand, I hope there is not many such places yet. But if we test/fix it there will be one less... Jarek P. ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-07-02 6:00 ` Jarek Poplawski
@ 2009-07-02 15:31 ` Robert Olsson
2009-07-02 19:06 ` Jarek Poplawski
2009-07-05 0:26 ` Paweł Staszewski
1 sibling, 1 reply; 99+ messages in thread
From: Robert Olsson @ 2009-07-02 15:31 UTC (permalink / raw)
To: Jarek Poplawski
Cc: Paweł Staszewski, Linux Network Development list, Robert Olsson

Jarek Poplawski writes:

> Yes, it looks like we can't free memory so simple because of such huge
> latencies.

Controlling RCU seems crucial. Insertion of the full BGP table increased
from 2 seconds to > 20 min with one synchronize_rcu patches.

And fib_trie "worst case" wrt memory is the root node. So maybe we should
monitor changes in root node and use this to control synchronize_rcu.

Didn't Paul suggest something like this?

And with don't find any decent solution we have to add an option for
a fixed and pre-allocated root-nod typically for BGP-routers.

Cheers
--ro

^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-07-02 15:31 ` Robert Olsson
@ 2009-07-02 19:06 ` Jarek Poplawski
2009-07-02 21:32 ` Robert Olsson
0 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-07-02 19:06 UTC (permalink / raw)
To: Robert Olsson
Cc: Paweł Staszewski, Linux Network Development list, Robert Olsson

On Thu, Jul 02, 2009 at 05:31:58PM +0200, Robert Olsson wrote:
>
> Jarek Poplawski writes:
>
> > Yes, it looks like we can't free memory so simple because of such huge
> > latencies.
>
> Controlling RCU seems crucial. Insertion of the full BGP table increased
> from 2 seconds to > 20 min with one synchronize_rcu patches.

I wish I knew this a few days before. I could imagine a slow down,
but it looked like it was stuck. Since these last changes weren't
tested on SMP + PREEMPT I thought there is still something broken.
(I was mainly interested in this synchronize_rcu at the moment as
a preemption test.)

> And fib_trie "worst case" wrt memory is the root node. So maybe we should
> monitor changes in root node and use this to control synchronize_rcu.
>
> Didn't Paul suggest something like this?

Sure, and it needs testing, but we should send some safe preemption
fix for -stable first, don't we?

> And with don't find any decent solution we have to add an option for
> a fixed and pre-allocated root-nod typically for BGP-routers.

Probably you're right; I'd prefer to see the test results showing
a difference vs. simply less aggressive root thresholds. But of
course, even if not convinced, I'll respect your choice as the author
and maintainer, so feel free to NAK my proposals - I won't get it
personally.;-)

Cheers,
Jarek P.

^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits 2009-07-02 19:06 ` Jarek Poplawski @ 2009-07-02 21:32 ` Robert Olsson 2009-07-02 22:13 ` Jarek Poplawski 0 siblings, 1 reply; 99+ messages in thread From: Robert Olsson @ 2009-07-02 21:32 UTC (permalink / raw) To: Jarek Poplawski Cc: Robert Olsson, Paweł Staszewski, Linux Network Development list, Robert Olsson Jarek Poplawski writes: > > Controlling RCU seems crucial. Insertion of the full BGP table increased > > from 2 seconds to > 20 min with one synchronize_rcu patches. > > I wish I knew this a few days before. I could imagine a slow down, > but it looked like it was stuck. Since these last changes weren't > tested on SMP + PREEMPT I thought there is still something broken. > (I was mainly interested in this synchronize_rcu at the moment as > a preemption test.) Honestly this huge slowdown was surprise for me too. I think I sent you a script so you could insert the full table yourself. > > And fib_trie "worst case" wrt memory is the root node. So maybe we should > > monitor changes in root node and use this to control synchronize_rcu. > > > > Didn't Paul suggest something like this? > > Sure, and it needs testing, but we should send some safe preemption > fix for -stable first, don't we? Yes my hope was that we could combine them... personally I'll need to understand who we can preeemted better in the different configs and most of that this can be handled by "standard" RCU. > > And with don't find any decent solution we have to add an option for > > a fixed and pre-allocated root-nod typically for BGP-routers. > > Probably you're right; I'd prefer to see the test results showing > a difference vs. simply less aggressive root thresholds. But of > course, even if not convinced, I'll respect your choice as the author > and maintainer, so feel free to NAK my proposals - I won't get it > personally.;-) Thresholds we can change no problem... 
but very soon I'll people will start routing without the route cache this at least in close to Internet core ,we will need all fib_look performance we can get. fib_trie was designed for classical RCU and no preempt you see the names i file... so this new and very challenging work to all of us. First week of vacation and have to fix the roof of the house... it's hot and dirty. Cheers. --ro ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits 2009-07-02 21:32 ` Robert Olsson @ 2009-07-02 22:13 ` Jarek Poplawski 0 siblings, 0 replies; 99+ messages in thread From: Jarek Poplawski @ 2009-07-02 22:13 UTC (permalink / raw) To: Robert Olsson Cc: Paweł Staszewski, Linux Network Development list, Robert Olsson On Thu, Jul 02, 2009 at 11:32:26PM +0200, Robert Olsson wrote: > > Jarek Poplawski writes: > > > > Controlling RCU seems crucial. Insertion of the full BGP table increased > > > from 2 seconds to > 20 min with one synchronize_rcu patches. > > > > I wish I knew this a few days before. I could imagine a slow down, > > but it looked like it was stuck. Since these last changes weren't > > tested on SMP + PREEMPT I thought there is still something broken. > > (I was mainly interested in this synchronize_rcu at the moment as > > a preemption test.) > > > Honestly this huge slowdown was surprise for me too. I think I sent > you a script so you could insert the full table yourself. I can't remember this script, but I guess my hardware should be suitable for reading it.;-) > > > > And fib_trie "worst case" wrt memory is the root node. So maybe we should > > > monitor changes in root node and use this to control synchronize_rcu. > > > > > > Didn't Paul suggest something like this? > > > > Sure, and it needs testing, but we should send some safe preemption > > fix for -stable first, don't we? > > Yes my hope was that we could combine them... personally I'll need > to understand who we can preeemted better in the different configs > and most of that this can be handled by "standard" RCU. > > > > And with don't find any decent solution we have to add an option for > > > a fixed and pre-allocated root-nod typically for BGP-routers. > > > > Probably you're right; I'd prefer to see the test results showing > > a difference vs. simply less aggressive root thresholds. 
But of > > course, even if not convinced, I'll respect your choice as the author > > and maintainer, so feel free to NAK my proposals - I won't get it > > personally.;-) > > Thresholds we can change no problem... but very soon I'll people > will start routing without the route cache this at least in close > to Internet core ,we will need all fib_look performance we can get. I mean changing thresholds as a temporary solution, until we can control memory freeing; and it seems to me, even excluding the root node, there could be a lot of temporary allocations during all those cycles repeated 10 times. > > fib_trie was designed for classical RCU and no preempt you see the > names i file... so this new and very challenging work to all of us. Then it should depend on CONFIG_PREEMPT_NONE, I guess. > > First week of vacation and have to fix the roof of the house... > it's hot and dirty. Have a nice time, Jarek P. ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits 2009-07-02 6:00 ` Jarek Poplawski 2009-07-02 15:31 ` Robert Olsson @ 2009-07-05 0:26 ` Paweł Staszewski 2009-07-05 0:30 ` Paweł Staszewski ` (3 more replies) 1 sibling, 4 replies; 99+ messages in thread From: Paweł Staszewski @ 2009-07-05 0:26 UTC (permalink / raw) To: Jarek Poplawski; +Cc: Linux Network Development list, Robert Olsson Jarek Poplawski pisze: > On Thu, Jul 02, 2009 at 07:43:25AM +0200, Paweł Staszewski wrote: > >> Jarek Poplawski pisze: >> >>> On Thu, Jul 02, 2009 at 12:17:19AM +0200, Paweł Staszewski wrote: >>> >>> >>>> Jarek Poplawski pisze: >>>> >>>> >>> ... >>> >>> >>>>> So, after your findings I'm about to recommend sending to -stable >>>>> 3 patches from net-2.6, with additional lowering of threshold_root >>>>> settings, but it would be nice if you could give it a try with >>>>> CONFIG_PREEMPT instead of CONFIG_PREEMPT_NONE (if it doesn't break >>>>> your other apps!) It is expected to work this time...;-) Maybe a >>>>> bit slower. >>>>> >>>>> Ok kernel configured with CONFIG_PREEMPT and all this day work without any problems (with Jarek last patch). So in attached file trere is fib_tirestats I dont see any big change of (cpu load or faster/slower routing/propagating routes from bgpd or something else) - in avg there is from 2% to 3% more of CPU load i dont know why but it is - i change from "preempt" to "no preempt" 3 times and check this my "mpstat -P ALL 1 30" always avg cpu load was from 2 to 3% more compared to "no preempt" Regards Paweł Staszewski >>>>> >>>>> >>>> Patch applied to 2.6.29.5 with CONFIG_PREEMPT_NONE >>>> And working :) >>>> >>>> >>> Hmm... It should, because you tested very similar patch already;-) >>> Sorry if I didn't make it clear. >>> >>> >>> >> Yes i know there was almost identical one. >> And i see this was without sync rcu :) >> > > Yes, it looks like we can't free memory so simple because of such huge > latencies. 
> > >>>> fib_triestats in attached file >>>> >>>> I think I can test it with PREEMPT enabled but first i must make some >>>> other tests of my apps that are on server. >>>> >>>> >>> It could probably matter only if you're using some broken out-of-tree >>> patches. Otherwise the kernel is expected to work OK. >>> >>> >>> >> Im a little confused about using of PREEMPT kernel because of past >> there was many oopses / lockups :) but yes that was a little long time ago. >> I will try to make this test today. >> >> >>> Btw., it would be also interesting to check if there is any difference >>> wrt. these route cache problems while PREEMPT is enabled. >>> > > And you're very right! The place we're fixing is the best example. On > the other hand, I hope there is not many such places yet. But if we > test/fix it there will be one less... > > Jarek P. > > > ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits 2009-07-05 0:26 ` Paweł Staszewski @ 2009-07-05 0:30 ` Paweł Staszewski 2009-07-05 16:20 ` Jarek Poplawski 2009-07-05 0:31 ` [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits Paweł Staszewski ` (2 subsequent siblings) 3 siblings, 1 reply; 99+ messages in thread From: Paweł Staszewski @ 2009-07-05 0:30 UTC (permalink / raw) To: Jarek Poplawski; +Cc: Linux Network Development list, Robert Olsson Oh I forgot - please Jarek give me patch with sync rcu and i will make test on preempt kernel Thanks Paweł Staszewski Paweł Staszewski pisze: > Jarek Poplawski pisze: >> On Thu, Jul 02, 2009 at 07:43:25AM +0200, Paweł Staszewski wrote: >> >>> Jarek Poplawski pisze: >>> >>>> On Thu, Jul 02, 2009 at 12:17:19AM +0200, Paweł Staszewski wrote: >>>> >>>>> Jarek Poplawski pisze: >>>>> >>>> ... >>>> >>>>>> So, after your findings I'm about to recommend sending to -stable >>>>>> 3 patches from net-2.6, with additional lowering of threshold_root >>>>>> settings, but it would be nice if you could give it a try with >>>>>> CONFIG_PREEMPT instead of CONFIG_PREEMPT_NONE (if it doesn't break >>>>>> your other apps!) It is expected to work this time...;-) Maybe a >>>>>> bit slower. >>>>>> >>>>>> > Ok kernel configured with CONFIG_PREEMPT > and all this day work without any problems (with Jarek last patch). > > > So in attached file trere is fib_tirestats > I dont see any big change of (cpu load or faster/slower > routing/propagating routes from bgpd or something else) - in avg there > is from 2% to 3% more of CPU load i dont know why but it is - i change > from "preempt" to "no preempt" 3 times and check this my "mpstat -P > ALL 1 30" > always avg cpu load was from 2 to 3% more compared to "no preempt" > > Regards > Paweł Staszewski > > >>>>>> >>>>> Patch applied to 2.6.29.5 with CONFIG_PREEMPT_NONE >>>>> And working :) >>>>> >>>> Hmm... 
It should, because you tested very similar patch already;-) >>>> Sorry if I didn't make it clear. >>>> >>>> >>> Yes i know there was almost identical one. >>> And i see this was without sync rcu :) >>> >> >> Yes, it looks like we can't free memory so simple because of such huge >> latencies. >> >>>>> fib_triestats in attached file >>>>> >>>>> I think I can test it with PREEMPT enabled but first i must make >>>>> some other tests of my apps that are on server. >>>>> >>>> It could probably matter only if you're using some broken out-of-tree >>>> patches. Otherwise the kernel is expected to work OK. >>>> >>>> >>> Im a little confused about using of PREEMPT kernel because of past >>> there was many oopses / lockups :) but yes that was a little long >>> time ago. >>> I will try to make this test today. >>> >>> >>>> Btw., it would be also interesting to check if there is any difference >>>> wrt. these route cache problems while PREEMPT is enabled. >>>> >> >> And you're very right! The place we're fixing is the best example. On >> the other hand, I hope there is not many such places yet. But if we >> test/fix it there will be one less... >> >> Jarek P. >> >> >> > > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-07-05 0:30 ` Paweł Staszewski
@ 2009-07-05 16:20 ` Jarek Poplawski
2009-07-05 17:32 ` Jarek Poplawski
0 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-07-05 16:20 UTC (permalink / raw)
To: Paweł Staszewski
Cc: Linux Network Development list, Robert Olsson, Paul E. McKenney

On Sun, Jul 05, 2009 at 02:30:03AM +0200, Paweł Staszewski wrote:
> Oh
>
> I forgot - please Jarek give me patch with sync rcu and i will make test
> on preempt kernel

Probably non-preempt kernel might need something like this more, but
comparing is always interesting. This patch is based on Paul's
suggestion (I hope).

Thanks,
Jarek P.
---> (synchronize take 7; apply on top of the 2.6.29.x with the last
all-in-one patch, or net-2.6)

 net/ipv4/fib_trie.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 00a54b2..fce8238 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -164,6 +164,7 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn);
 static struct tnode *halve(struct trie *t, struct tnode *tn);
 /* tnodes to free after resize(); protected by RTNL */
 static struct tnode *tnode_free_head;
+static size_t tnode_free_size;
 
 static struct kmem_cache *fn_alias_kmem __read_mostly;
 static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -393,6 +394,8 @@ static void tnode_free_safe(struct tnode *tn)
 	BUG_ON(IS_LEAF(tn));
 	tn->tnode_free = tnode_free_head;
 	tnode_free_head = tn;
+	tnode_free_size += sizeof(struct tnode) +
+			   (sizeof(struct node *) << tn->bits);
 }
 
 static void tnode_free_flush(void)
@@ -404,6 +407,11 @@ static void tnode_free_flush(void)
 		tn->tnode_free = NULL;
 		tnode_free(tn);
 	}
+
+	if (tnode_free_size >= PAGE_SIZE * 128) {
+		tnode_free_size = 0;
+		synchronize_rcu();
+	}
 }
 
 static struct leaf *leaf_new(void)

^ permalink raw reply related [flat|nested] 99+ messages in thread
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-07-05 16:20 ` Jarek Poplawski
@ 2009-07-05 17:32 ` Jarek Poplawski
2009-07-05 21:32 ` Paul E. McKenney
0 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-07-05 17:32 UTC (permalink / raw)
To: Paweł Staszewski
Cc: Linux Network Development list, Robert Olsson, Paul E. McKenney

On Sun, Jul 05, 2009 at 06:20:03PM +0200, Jarek Poplawski wrote:
> On Sun, Jul 05, 2009 at 02:30:03AM +0200, Paweł Staszewski wrote:
> > Oh
> >
> > I forgot - please Jarek give me patch with sync rcu and i will make test
> > on preempt kernel
>
> Probably non-preempt kernel might need something like this more, but
> comparing is always interesting. This patch is based on Paul's
> suggestion (I hope).

Hold on ;-) Here is something even better... Syncing after 128 pages
might be still too slow, so here is a higher initial value, 1000, plus
you can change this while testing in:

/sys/module/fib_trie/parameters/sync_pages

It would be interesting to find the lowest acceptable value.

Jarek P.
---> (synchronize take 8; apply on top of the 2.6.29.x with the last
all-in-one patch, or net-2.6)

 net/ipv4/fib_trie.c |   12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 00a54b2..decc8d0 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -71,6 +71,7 @@
 #include <linux/netlink.h>
 #include <linux/init.h>
 #include <linux/list.h>
+#include <linux/moduleparam.h>
 #include <net/net_namespace.h>
 #include <net/ip.h>
 #include <net/protocol.h>
@@ -164,6 +165,10 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn);
 static struct tnode *halve(struct trie *t, struct tnode *tn);
 /* tnodes to free after resize(); protected by RTNL */
 static struct tnode *tnode_free_head;
+static size_t tnode_free_size;
+
+static int sync_pages __read_mostly = 1000;
+module_param(sync_pages, int, 0640);
 
 static struct kmem_cache *fn_alias_kmem __read_mostly;
 static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -393,6 +398,8 @@ static void tnode_free_safe(struct tnode *tn)
 	BUG_ON(IS_LEAF(tn));
 	tn->tnode_free = tnode_free_head;
 	tnode_free_head = tn;
+	tnode_free_size += sizeof(struct tnode) +
+			   (sizeof(struct node *) << tn->bits);
 }
 
 static void tnode_free_flush(void)
@@ -404,6 +411,11 @@ static void tnode_free_flush(void)
 		tn->tnode_free = NULL;
 		tnode_free(tn);
 	}
+
+	if (tnode_free_size >= PAGE_SIZE * sync_pages) {
+		tnode_free_size = 0;
+		synchronize_rcu();
+	}
 }
 
 static struct leaf *leaf_new(void)

^ permalink raw reply related [flat|nested] 99+ messages in thread
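To get a feel for the threshold in the two "synchronize" patches above, the accounting can be modelled in user space. The sketch below is an illustration only: the sketch_* names are hypothetical, 4 KiB pages are assumed, and a counter stands in for synchronize_rcu(). The figures used in the note after it (a 36-byte tnode header and 4-byte child pointers) are an assumption based on the "size of tnode: 36 bytes" line in the attached fib_triestat from what appears to be a 32-bit box.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

static size_t sketch_free_size;		/* models tnode_free_size */
static unsigned int sketch_sync_calls;	/* counts simulated synchronize_rcu() calls */

/* Bytes accounted for one tnode: header plus 2^bits child pointers. */
static size_t sketch_tnode_bytes(size_t hdr, size_t ptr, unsigned int bits)
{
	return hdr + (ptr << bits);
}

/* Accounting half of tnode_free_safe(). */
static void sketch_account_free(size_t bytes)
{
	sketch_free_size += bytes;
}

/* Threshold half of tnode_free_flush(). */
static void sketch_flush(unsigned int sync_pages)
{
	if (sketch_free_size >= (size_t)SKETCH_PAGE_SIZE * sync_pages) {
		sketch_free_size = 0;
		sketch_sync_calls++;	/* stands in for synchronize_rcu() */
	}
}
```

Under those assumptions a 17-bit root node accounts for 36 + 4 * 2^17 = 524324 bytes, which is just above 128 pages (524288 bytes) but well below 1000 pages (4096000 bytes). So with the fixed 128-page limit, queuing a single old root of that size would already force a synchronize, while the 1000-page default would let several root replacements accumulate first.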
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits 2009-07-05 17:32 ` Jarek Poplawski @ 2009-07-05 21:32 ` Paul E. McKenney 2009-07-05 22:23 ` Jarek Poplawski ` (2 more replies) 0 siblings, 3 replies; 99+ messages in thread From: Paul E. McKenney @ 2009-07-05 21:32 UTC (permalink / raw) To: Jarek Poplawski Cc: Paweł Staszewski, Linux Network Development list, Robert Olsson On Sun, Jul 05, 2009 at 07:32:08PM +0200, Jarek Poplawski wrote: > On Sun, Jul 05, 2009 at 06:20:03PM +0200, Jarek Poplawski wrote: > > On Sun, Jul 05, 2009 at 02:30:03AM +0200, Paweł Staszewski wrote: > > > Oh > > > > > > I forgot - please Jarek give me patch with sync rcu and i will make test > > > on preempt kernel > > > > Probably non-preempt kernel might need something like this more, but > > comparing is always interesting. This patch is based on Paul's > > suggestion (I hope). > > Hold on ;-) Here is something even better... Syncing after 128 pages > might be still too slow, so here is a higher initial value, 1000, plus > you can change this while testing in: > > /sys/module/fib_trie/parameters/sync_pages > > It would be interesting to find the lowest acceptable value. Looks like a promising approach to me! Thanx, Paul > Jarek P. 
> ---> (synchronize take 8; apply on top of the 2.6.29.x with the last > all-in-one patch, or net-2.6) > > net/ipv4/fib_trie.c | 12 ++++++++++++ > 1 files changed, 12 insertions(+), 0 deletions(-) > > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c > index 00a54b2..decc8d0 100644 > --- a/net/ipv4/fib_trie.c > +++ b/net/ipv4/fib_trie.c > @@ -71,6 +71,7 @@ > #include <linux/netlink.h> > #include <linux/init.h> > #include <linux/list.h> > +#include <linux/moduleparam.h> > #include <net/net_namespace.h> > #include <net/ip.h> > #include <net/protocol.h> > @@ -164,6 +165,10 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn); > static struct tnode *halve(struct trie *t, struct tnode *tn); > /* tnodes to free after resize(); protected by RTNL */ > static struct tnode *tnode_free_head; > +static size_t tnode_free_size; > + > +static int sync_pages __read_mostly = 1000; > +module_param(sync_pages, int, 0640); > > static struct kmem_cache *fn_alias_kmem __read_mostly; > static struct kmem_cache *trie_leaf_kmem __read_mostly; > @@ -393,6 +398,8 @@ static void tnode_free_safe(struct tnode *tn) > BUG_ON(IS_LEAF(tn)); > tn->tnode_free = tnode_free_head; > tnode_free_head = tn; > + tnode_free_size += sizeof(struct tnode) + > + (sizeof(struct node *) << tn->bits); > } > > static void tnode_free_flush(void) > @@ -404,6 +411,11 @@ static void tnode_free_flush(void) > tn->tnode_free = NULL; > tnode_free(tn); > } > + > + if (tnode_free_size >= PAGE_SIZE * sync_pages) { > + tnode_free_size = 0; > + synchronize_rcu(); > + } > } > > static struct leaf *leaf_new(void) > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits 2009-07-05 21:32 ` Paul E. McKenney @ 2009-07-05 22:23 ` Jarek Poplawski 2009-07-05 23:53 ` Paweł Staszewski 2009-07-14 18:33 ` [PATCH net-next] " Jarek Poplawski 2009-07-14 21:20 ` [PATCH net-next] ipv4: fib_trie: Use tnode_get_child_rcu() and node_parent_rcu() in lookups Jarek Poplawski 2 siblings, 1 reply; 99+ messages in thread From: Jarek Poplawski @ 2009-07-05 22:23 UTC (permalink / raw) To: Paul E. McKenney Cc: Paweł Staszewski, Linux Network Development list, Robert Olsson On Sun, Jul 05, 2009 at 02:32:32PM -0700, Paul E. McKenney wrote: > On Sun, Jul 05, 2009 at 07:32:08PM +0200, Jarek Poplawski wrote: > > On Sun, Jul 05, 2009 at 06:20:03PM +0200, Jarek Poplawski wrote: > > > On Sun, Jul 05, 2009 at 02:30:03AM +0200, Paweł Staszewski wrote: > > > > Oh > > > > > > > > I forgot - please Jarek give me patch with sync rcu and i will make test > > > > on preempt kernel > > > > > > Probably non-preempt kernel might need something like this more, but > > > comparing is always interesting. This patch is based on Paul's > > > suggestion (I hope). > > > > Hold on ;-) Here is something even better... Syncing after 128 pages > > might be still too slow, so here is a higher initial value, 1000, plus > > you can change this while testing in: > > > > /sys/module/fib_trie/parameters/sync_pages > > > > It would be interesting to find the lowest acceptable value. > > Looks like a promising approach to me! > > Thanx, Paul Hmm... As a matter of fact, I'm a bit sceptical now: I'm worrying this synchronize_rcu done at the lowest acceptable rate could be actually mostly idle or on the contrary too late. Probably some more complex (per cpu?) accounting would be necessary to really matter here, but on the other hand these problems weren't reported often enough. Thanks, Jarek P. 
> > ---> (synchronize take 8; apply on top of the 2.6.29.x with the last > > all-in-one patch, or net-2.6) > > > > net/ipv4/fib_trie.c | 12 ++++++++++++ > > 1 files changed, 12 insertions(+), 0 deletions(-) > > > > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c > > index 00a54b2..decc8d0 100644 > > --- a/net/ipv4/fib_trie.c > > +++ b/net/ipv4/fib_trie.c > > @@ -71,6 +71,7 @@ > > #include <linux/netlink.h> > > #include <linux/init.h> > > #include <linux/list.h> > > +#include <linux/moduleparam.h> > > #include <net/net_namespace.h> > > #include <net/ip.h> > > #include <net/protocol.h> > > @@ -164,6 +165,10 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn); > > static struct tnode *halve(struct trie *t, struct tnode *tn); > > /* tnodes to free after resize(); protected by RTNL */ > > static struct tnode *tnode_free_head; > > +static size_t tnode_free_size; > > + > > +static int sync_pages __read_mostly = 1000; > > +module_param(sync_pages, int, 0640); > > > > static struct kmem_cache *fn_alias_kmem __read_mostly; > > static struct kmem_cache *trie_leaf_kmem __read_mostly; > > @@ -393,6 +398,8 @@ static void tnode_free_safe(struct tnode *tn) > > BUG_ON(IS_LEAF(tn)); > > tn->tnode_free = tnode_free_head; > > tnode_free_head = tn; > > + tnode_free_size += sizeof(struct tnode) + > > + (sizeof(struct node *) << tn->bits); > > } > > > > static void tnode_free_flush(void) > > @@ -404,6 +411,11 @@ static void tnode_free_flush(void) > > tn->tnode_free = NULL; > > tnode_free(tn); > > } > > + > > + if (tnode_free_size >= PAGE_SIZE * sync_pages) { > > + tnode_free_size = 0; > > + synchronize_rcu(); > > + } > > } > > > > static struct leaf *leaf_new(void) > > -- > > To unsubscribe from this list: send the line "unsubscribe netdev" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-07-05 22:23 ` Jarek Poplawski
@ 2009-07-05 23:53 ` Paweł Staszewski
2009-07-06 9:02 ` Jarek Poplawski
0 siblings, 1 reply; 99+ messages in thread
From: Paweł Staszewski @ 2009-07-05 23:53 UTC (permalink / raw)
To: Jarek Poplawski
Cc: Paul E. McKenney, Linux Network Development list, Robert Olsson

Kernel 2.6.29.5, preempt:

bgp starts normally and the kernel learns the routes normally, just as
without the patch.

Here are some fib_triestat outputs:

cat /proc/net/fib_triestat
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
Aver depth: 2.44
Max depth: 6
Leaves: 277888
Prefixes: 291399
Internal nodes: 66818
1: 33080 2: 14584 3: 10788 4: 4911 5: 2185 6: 900 7: 366 8: 3 17: 1
Pointers: 595584
Null ptrs: 250879
Total size: 18072 kB
Counters:
---------
gets = 1052940
backtracks = 55985
semantic match passed = 1034114
semantic match miss = 5
null node hit= 534415
skipped node resize = 0
Local:
Aver depth: 3.75
Max depth: 5
Leaves: 12
Prefixes: 13
Internal nodes: 10
1: 9 2: 1
Pointers: 22
Null ptrs: 1
Total size: 2 kB
Counters:
---------
gets = 1057636
backtracks = 1101307
semantic match passed = 4751
semantic match miss = 0
null node hit= 195605
skipped node resize = 0

Kernel 2.6.29.5, no-preempt:

All is OK, as with the preempt kernel (and everything, including route
propagation, completes in normal time).

cat /sys/module/fib_trie/parameters/sync_pages
1000

cat /proc/net/fib_triestat
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
Aver depth: 2.45
Max depth: 6
Leaves: 277905
Prefixes: 291416
Internal nodes: 66863
1: 33119 2: 14594 3: 10782 4: 4911 5: 2187 6: 901 7: 365 8: 3 17: 1
Pointers: 595654
Null ptrs: 250887
Total size: 18074 kB
Counters:
---------
gets = 1060650
backtracks = 53161
semantic match passed = 1041008
semantic match miss = 12
null node hit= 504478
skipped node resize = 0
Local:
Aver depth: 3.75
Max depth: 5
Leaves: 12
Prefixes: 13
Internal nodes: 10
1: 9 2: 1
Pointers: 22
Null ptrs: 1
Total size: 2 kB
Counters:
---------
gets = 1065517
backtracks = 1095422
semantic match passed = 4954
semantic match miss = 0
null node hit= 195584
skipped node resize = 0

So I ran tests with different sync_pages values:

####################################
sync_pages: 64
Total size reached its maximum in 17 sec
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
Aver depth: 2.43
Max depth: 6
Leaves: 271928
Prefixes: 285435
Internal nodes: 66185
1: 32904 2: 14554 3: 10740 4: 4677 5: 2047 6: 901 7: 361 17: 1
Pointers: 585224
Null ptrs: 247112
Total size: 17729 kB
Counters:
---------
gets = 5313544
backtracks = 230501
semantic match passed = 5233998
semantic match miss = 61
null node hit= 2757531
skipped node resize = 0
Local:
Aver depth: 3.75
Max depth: 5
Leaves: 12
Prefixes: 13
Internal nodes: 10
1: 9 2: 1
Pointers: 22
Null ptrs: 1
Total size: 2 kB
Counters:
---------
gets = 5332471
backtracks = 4708505
semantic match passed = 19264
semantic match miss = 0
null node hit= 782757
skipped node resize = 0

######################################
sync_pages: 128
Fib trie Total size reached max in 14 sec
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
Aver depth: 2.44
Max depth: 6
Leaves: 277915
Prefixes: 291427
Internal nodes: 66832
1: 33085 2: 14597 3: 10785 4: 4908 5: 2187 6: 900 7: 366 8: 3 17: 1
Pointers: 595638
Null ptrs: 250892
Total size: 18074 kB
Counters:
---------
gets = 6698058
backtracks = 307491
semantic match passed = 6593421
semantic match miss = 66
null node hit= 3498560
skipped node resize = 0
Local:
Aver depth: 3.75
Max depth: 5
Leaves: 12
Prefixes: 13
Internal nodes: 10
1: 9 2: 1
Pointers: 22
Null ptrs: 1
Total size: 2 kB
Counters:
---------
gets = 6721120
backtracks = 5934017
semantic match passed = 23440
semantic match miss = 0
null node hit= 978008
skipped node resize = 0

#########################################
sync_pages: 256
hmm, no difference, also in 10 sec
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
Aver depth: 2.44
Max depth: 6
Leaves: 277913
Prefixes: 291425
Internal nodes: 66829
1: 33082 2: 14596 3: 10786 4: 4909 5: 2186 6: 900 7: 366 8: 3 17: 1
Pointers: 595620
Null ptrs: 250879
Total size: 18073 kB
Counters:
---------
gets = 4637474
backtracks = 188624
semantic match passed = 4577266
semantic match miss = 61
null node hit= 2451890
skipped node resize = 0
Local:
Aver depth: 3.75
Max depth: 5
Leaves: 12
Prefixes: 13
Internal nodes: 10
1: 9 2: 1
Pointers: 22
Null ptrs: 1
Total size: 2 kB
Counters:
---------
gets = 4651791
backtracks = 3716400
semantic match passed = 14613
semantic match miss = 0
null node hit= 587208
skipped node resize = 0

And with sync_pages higher than 256 the time to fill the kernel routes is
the same, approx. 10 sec.

I ran this test using:

watch -n1 cat /proc/net/fib_triestat

The timer started when Total size was 1 kB and stopped when Total size
reached 18073 kB.

Regards
Paweł Staszewski

Jarek Poplawski wrote:
> On Sun, Jul 05, 2009 at 02:32:32PM -0700, Paul E. McKenney wrote:
>> On Sun, Jul 05, 2009 at 07:32:08PM +0200, Jarek Poplawski wrote:
>>> On Sun, Jul 05, 2009 at 06:20:03PM +0200, Jarek Poplawski wrote:
>>>> On Sun, Jul 05, 2009 at 02:30:03AM +0200, Paweł Staszewski wrote:
>>>>> Oh
>>>>>
>>>>> I forgot - please Jarek give me patch with sync rcu and i will make test
>>>>> on preempt kernel
>>>>
>>>> Probably non-preempt kernel might need something like this more, but
>>>> comparing is always interesting. This patch is based on Paul's
>>>> suggestion (I hope).
>>>
>>> Hold on ;-) Here is something even better... Syncing after 128 pages
>>> might be still too slow, so here is a higher initial value, 1000, plus
>>> you can change this while testing in:
>>>
>>> /sys/module/fib_trie/parameters/sync_pages
>>>
>>> It would be interesting to find the lowest acceptable value.
>>
>> Looks like a promising approach to me!
>>
>> Thanx, Paul
>
> Hmm... As a matter of fact, I'm a bit sceptical now: I'm worrying this
> synchronize_rcu done at the lowest acceptable rate could be actually
> mostly idle or on the contrary too late. Probably some more complex
> (per cpu?) accounting would be necessary to really matter here, but
> on the other hand these problems weren't reported often enough.
>
> Thanks,
> Jarek P.
>
>>> ---> (synchronize take 8; apply on top of the 2.6.29.x with the last
>>> all-in-one patch, or net-2.6)
>>>
>>> net/ipv4/fib_trie.c | 12 ++++++++++++
>>> 1 files changed, 12 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
>>> index 00a54b2..decc8d0 100644
>>> --- a/net/ipv4/fib_trie.c
>>> +++ b/net/ipv4/fib_trie.c
>>> @@ -71,6 +71,7 @@
>>>  #include <linux/netlink.h>
>>>  #include <linux/init.h>
>>>  #include <linux/list.h>
>>> +#include <linux/moduleparam.h>
>>>  #include <net/net_namespace.h>
>>>  #include <net/ip.h>
>>>  #include <net/protocol.h>
>>> @@ -164,6 +165,10 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn);
>>>  static struct tnode *halve(struct trie *t, struct tnode *tn);
>>>  /* tnodes to free after resize(); protected by RTNL */
>>>  static struct tnode *tnode_free_head;
>>> +static size_t tnode_free_size;
>>> +
>>> +static int sync_pages __read_mostly = 1000;
>>> +module_param(sync_pages, int, 0640);
>>>
>>>  static struct kmem_cache *fn_alias_kmem __read_mostly;
>>>  static struct kmem_cache *trie_leaf_kmem __read_mostly;
>>> @@ -393,6 +398,8 @@ static void tnode_free_safe(struct tnode *tn)
>>>  	BUG_ON(IS_LEAF(tn));
>>>  	tn->tnode_free = tnode_free_head;
>>>  	tnode_free_head = tn;
>>> +	tnode_free_size += sizeof(struct tnode) +
>>> +			   (sizeof(struct node *) << tn->bits);
>>>  }
>>>
>>>  static void tnode_free_flush(void)
>>> @@ -404,6 +411,11 @@ static void tnode_free_flush(void)
>>>  		tn->tnode_free = NULL;
>>>  		tnode_free(tn);
>>>  	}
>>> +
>>> +	if (tnode_free_size >= PAGE_SIZE * sync_pages) {
>>> +		tnode_free_size = 0;
>>> +		synchronize_rcu();
>>> +	}
>>>  }
>>>
>>>  static struct leaf *leaf_new(void)
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-07-05 23:53 ` Paweł Staszewski
@ 2009-07-06 9:02 ` Jarek Poplawski
2009-07-07 22:56 ` Paweł Staszewski
2009-07-07 23:23 ` [PATCH net-2.6] " Paweł Staszewski
0 siblings, 2 replies; 99+ messages in thread
From: Jarek Poplawski @ 2009-07-06 9:02 UTC (permalink / raw)
To: Paweł Staszewski
Cc: Paul E. McKenney, Linux Network Development list, Robert Olsson

On Mon, Jul 06, 2009 at 01:53:49AM +0200, Paweł Staszewski wrote:
...
> So i make tests with changing sync_pages
> And
>
> ####################################
> sync_pages: 64
> total size reach maximum in 17sec
...
> ######################################
> sync_pages: 128
> Fib trie Total size reach max in 14sec
...
> #########################################
> sync_pages: 256
> hmm no difference also in 10sec

14 == 10!? ;-)

...
> And with sync_pages higher that 256 time of filling kernel routes is the
> same approx 10sec.

Hmm... So, it's better than I expected; syncing after 128 or 256 pages
could be quite reasonable. But then it would be interesting to find
out if with such a safety we could go back to more aggressive values
for possibly better performance. So here is 'the same' patch (so the
previous, take 8, should be reverted), but with additional possibility
to change:
/sys/module/fib_trie/parameters/inflate_threshold_root

I guess, you could try e.g. if: sync_pages 256, inflate_threshold_root 15
can give faster lookups (or lower cpu loads); with this these inflate
warnings could be back btw.; or maybe you'll find something in between
like inflate_threshold_root 20 is optimal for you. (I think it should be
enough to try this only for PREEMPT_NONE unless you have spare time ;-)

Thanks,
Jarek P.
---> (synchronize take 9; apply on top of the 2.6.29.x with the last
all-in-one patch, or net-2.6)

net/ipv4/fib_trie.c | 18 ++++++++++++++++--
1 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 00a54b2..e8fca11 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -71,6 +71,7 @@
 #include <linux/netlink.h>
 #include <linux/init.h>
 #include <linux/list.h>
+#include <linux/moduleparam.h>
 #include <net/net_namespace.h>
 #include <net/ip.h>
 #include <net/protocol.h>
@@ -164,6 +165,10 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn);
 static struct tnode *halve(struct trie *t, struct tnode *tn);
 /* tnodes to free after resize(); protected by RTNL */
 static struct tnode *tnode_free_head;
+static size_t tnode_free_size;
+
+static int sync_pages __read_mostly = 1000;
+module_param(sync_pages, int, 0640);

 static struct kmem_cache *fn_alias_kmem __read_mostly;
 static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -316,9 +321,11 @@ static inline void check_tnode(const struct tnode *tn)

 static const int halve_threshold = 25;
 static const int inflate_threshold = 50;
-static const int halve_threshold_root = 15;
-static const int inflate_threshold_root = 25;

+static int inflate_threshold_root __read_mostly = 25;
+module_param(inflate_threshold_root, int, 0640);
+
+#define halve_threshold_root (inflate_threshold_root / 2 + 1)

 static void __alias_free_mem(struct rcu_head *head)
 {
@@ -393,6 +400,8 @@ static void tnode_free_safe(struct tnode *tn)
 	BUG_ON(IS_LEAF(tn));
 	tn->tnode_free = tnode_free_head;
 	tnode_free_head = tn;
+	tnode_free_size += sizeof(struct tnode) +
+			   (sizeof(struct node *) << tn->bits);
 }

 static void tnode_free_flush(void)
@@ -404,6 +413,11 @@ static void tnode_free_flush(void)
 		tn->tnode_free = NULL;
 		tnode_free(tn);
 	}
+
+	if (tnode_free_size >= PAGE_SIZE * sync_pages) {
+		tnode_free_size = 0;
+		synchronize_rcu();
+	}
 }

 static struct leaf *leaf_new(void)
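The accounting that this patch adds to tnode_free_safe() and tnode_free_flush() can be modelled in user space. The following is only a minimal sketch, not the kernel code: SKETCH_PAGE_SIZE, SKETCH_TNODE_HDR and SKETCH_PTR_SIZE are assumed constants (4096-byte pages, the 36-byte tnode size reported by fib_triestat in this thread, 4-byte child pointers), struct free_acct, tnode_bytes(), acct_free_safe() and acct_flush() are hypothetical names, and the real synchronize_rcu() call is replaced by a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values, for illustration only. */
#define SKETCH_PAGE_SIZE 4096u /* typical page size */
#define SKETCH_TNODE_HDR 36u   /* "size of tnode: 36 bytes" in fib_triestat */
#define SKETCH_PTR_SIZE  4u    /* 32-bit child pointers */

/* Hypothetical stand-in for tnode_free_size plus a sync counter. */
struct free_acct {
	size_t pending_bytes;
	unsigned long syncs;
};

/* Bytes charged for one tnode with 2^bits child slots. */
size_t tnode_bytes(unsigned int bits)
{
	assert(bits < 28); /* keep the shift within a 32-bit size_t */
	return SKETCH_TNODE_HDR + ((size_t)SKETCH_PTR_SIZE << bits);
}

/* Mirrors the line the patch adds to tnode_free_safe(). */
void acct_free_safe(struct free_acct *a, unsigned int bits)
{
	a->pending_bytes += tnode_bytes(bits);
}

/*
 * Mirrors the check the patch adds to tnode_free_flush().
 * Returns 1 when the kernel would call synchronize_rcu().
 */
int acct_flush(struct free_acct *a, unsigned int sync_pages)
{
	if (a->pending_bytes >= (size_t)SKETCH_PAGE_SIZE * sync_pages) {
		a->pending_bytes = 0;
		a->syncs++; /* placeholder for synchronize_rcu() */
		return 1;
	}
	return 0;
}
```

With these assumed sizes an 11-bit root (the "size=11 bits" of the warning) is charged 36 + 4 * 2048 = 8228 bytes, so freeing one such node already crosses a sync_pages of 2 (8192 bytes) but stays far below 256 pages (1 MiB).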
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-07-06 9:02 ` Jarek Poplawski
@ 2009-07-07 22:56 ` Paweł Staszewski
2009-07-07 23:50 ` Jarek Poplawski
2009-07-07 23:23 ` [PATCH net-2.6] " Paweł Staszewski
1 sibling, 1 reply; 99+ messages in thread
From: Paweł Staszewski @ 2009-07-07 22:56 UTC (permalink / raw)
To: Jarek Poplawski
Cc: Paul E. McKenney, Linux Network Development list, Robert Olsson

Jarek Poplawski wrote:
> On Mon, Jul 06, 2009 at 01:53:49AM +0200, Paweł Staszewski wrote:
> ...
>> So i make tests with changing sync_pages
>> And
>>
>> ####################################
>> sync_pages: 64
>> total size reach maximum in 17sec
> ...
>> ######################################
>> sync_pages: 128
>> Fib trie Total size reach max in 14sec
> ...
>> #########################################
>> sync_pages: 256
>> hmm no difference also in 10sec
>
> 14 == 10!? ;-)
> ...
>> And with sync_pages higher that 256 time of filling kernel routes is the
>> same approx 10sec.
>
> Hmm... So, it's better than I expected; syncing after 128 or 256 pages
> could be quite reasonable. But then it would be interesting to find
> out if with such a safety we could go back to more aggressive values
> for possibly better performance. So here is 'the same' patch (so the
> previous, take 8, should be reverted), but with additional possibility
> to change:
> /sys/module/fib_trie/parameters/inflate_threshold_root
>
> I guess, you could try e.g. if: sync_pages 256, inflate_threshold_root 15
> can give faster lookups (or lower cpu loads); with this these inflate
> warnings could be back btw.; or maybe you'll find something in between
> like inflate_threshold_root 20 is optimal for you. (I think it should be
> enough to try this only for PREEMPT_NONE unless you have spare time ;-)
>
> Thanks,
> Jarek P.
> ---> (synchronize take 9; apply on top of the 2.6.29.x with the last
> all-in-one patch, or net-2.6)
>

Applied to 2.6.29.5 preempt/no-preempt and tested. With preempt I ran
only one test, with sync_pages = 256, to check that it works :)

So here are some tests for different sync_pages values.

echo 1 > /sys/module/fib_trie/parameters/sync_pages
I stopped counting after 1 minute - Total size was still rising :)

echo 2 > /sys/module/fib_trie/parameters/sync_pages
Total size in fib_triestat reached maximum in 33 sec

echo 3 > /sys/module/fib_trie/parameters/sync_pages
Total size reached max in 31 sec

echo 4 > /sys/module/fib_trie/parameters/sync_pages
Total size reached max in 23 sec

echo 8 > /sys/module/fib_trie/parameters/sync_pages
Total size reached max in 17 sec

echo 16 > /sys/module/fib_trie/parameters/sync_pages
Total size reached max in 14 sec

echo 32 > /sys/module/fib_trie/parameters/sync_pages
Total size reached max in 14 sec

So I see that in the previous tests I did something wrong with the
timing. I modified the test script and ran the tests again:

echo 64 > /sys/module/fib_trie/parameters/sync_pages
Total size reached max in 13 sec

echo 128 > /sys/module/fib_trie/parameters/sync_pages
Total size reached max in 10 sec

echo 256 > /sys/module/fib_trie/parameters/sync_pages
Total size reached max in 10 sec

And for sync_pages > 256 the time for propagating routes is always 10 sec.

Also, today I had many messages in dmesg like this:
Fix inflate_threshold_root. Now=25 size=11 bits
:)

And after tuning:
/sys/module/fib_trie/parameters/inflate_threshold_root
no more messages :)

Regards
Paweł Staszewski

> net/ipv4/fib_trie.c | 18 ++++++++++++++++--
> 1 files changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index 00a54b2..e8fca11 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -71,6 +71,7 @@
>  #include <linux/netlink.h>
>  #include <linux/init.h>
>  #include <linux/list.h>
> +#include <linux/moduleparam.h>
>  #include <net/net_namespace.h>
>  #include <net/ip.h>
>  #include <net/protocol.h>
> @@ -164,6 +165,10 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn);
>  static struct tnode *halve(struct trie *t, struct tnode *tn);
>  /* tnodes to free after resize(); protected by RTNL */
>  static struct tnode *tnode_free_head;
> +static size_t tnode_free_size;
> +
> +static int sync_pages __read_mostly = 1000;
> +module_param(sync_pages, int, 0640);
>
>  static struct kmem_cache *fn_alias_kmem __read_mostly;
>  static struct kmem_cache *trie_leaf_kmem __read_mostly;
> @@ -316,9 +321,11 @@ static inline void check_tnode(const struct tnode *tn)
>
>  static const int halve_threshold = 25;
>  static const int inflate_threshold = 50;
> -static const int halve_threshold_root = 15;
> -static const int inflate_threshold_root = 25;
>
> +static int inflate_threshold_root __read_mostly = 25;
> +module_param(inflate_threshold_root, int, 0640);
> +
> +#define halve_threshold_root (inflate_threshold_root / 2 + 1)
>
>  static void __alias_free_mem(struct rcu_head *head)
>  {
> @@ -393,6 +400,8 @@ static void tnode_free_safe(struct tnode *tn)
>  	BUG_ON(IS_LEAF(tn));
>  	tn->tnode_free = tnode_free_head;
>  	tnode_free_head = tn;
> +	tnode_free_size += sizeof(struct tnode) +
> +			   (sizeof(struct node *) << tn->bits);
>  }
>
>  static void tnode_free_flush(void)
> @@ -404,6 +413,11 @@ static void tnode_free_flush(void)
>  		tn->tnode_free = NULL;
>  		tnode_free(tn);
>  	}
> +
> +	if (tnode_free_size >= PAGE_SIZE * sync_pages) {
> +		tnode_free_size = 0;
> +		synchronize_rcu();
> +	}
>  }
>
>  static struct leaf *leaf_new(void)
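The timings in this message come from watching "Total size" in /proc/net/fib_triestat until it stops growing. The test script itself was not posted, so the following is only a sketch of the one step it must contain: pulling the first "Total size:" value (the Main table in the dumps shown in this thread) out of the text. first_total_size_kb() is a hypothetical helper; it takes the file contents as a string so that it does not depend on the proc file being present.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Return the first "Total size: N kB" value found in fib_triestat text,
 * or -1 if no such line is present or it cannot be parsed.
 */
long first_total_size_kb(const char *text)
{
	const char *p;
	long kb;

	assert(text != NULL);
	p = strstr(text, "Total size:");
	if (p == NULL)
		return -1;
	if (sscanf(p, "Total size: %ld kB", &kb) != 1)
		return -1;
	return kb;
}
```

A polling loop would call this about once per second, start the clock when the value first rises above its idle level, and stop it when two consecutive readings are equal.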
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-07-07 22:56 ` Paweł Staszewski
@ 2009-07-07 23:50 ` Jarek Poplawski
2009-07-09 20:34 ` Paweł Staszewski
0 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-07-07 23:50 UTC (permalink / raw)
To: Paweł Staszewski
Cc: Paul E. McKenney, Linux Network Development list, Robert Olsson

On Wed, Jul 08, 2009 at 12:56:13AM +0200, Paweł Staszewski wrote:
> Jarek Poplawski wrote:
...
>> ---> (synchronize take 9; apply on top of the 2.6.29.x with the last
>> all-in-one patch, or net-2.6)
>>
>
> Applied to 2.6.29.5 preempt/no-preempt and tested: - with preempt i make
> only one test with sync_pages = 256 to check that is working :)
>
> So here are some tests for different sync_pages size.
...
> So i see in prev tests i make something wrong in time counting
> So i modify test script and make tests again:
>
> echo 64 > /sys/module/fib_trie/parameters/sync_pages
> Total size reach max in 13 sec
>
> echo 128 > /sys/module/fib_trie/parameters/sync_pages
> Total size reach max in 10 sec
>
> echo 256 > /sys/module/fib_trie/parameters/sync_pages
> Total size reach max in 10 sec
>
> And for sync_pages >256 time for propagating routes is always 10sec.

So this means sync_pages 128 or 256 is reasonable.

>
> Also today i have many messages in dmesg like this:
> Fix inflate_threshold_root. Now=25 size=11 bits
> :)

This is something new and a bit surprising to me: the same threshold
in previous tests didn't generate this? Do you mean more than:
"Fix inflate_threshold_root. Now=15 size=11 bits" before?

> And after tune :
> /sys/module/fib_trie/parameters/inflate_threshold_root
> no more info :)

With what value?

Pawel, let's say that current defaults are:
inflate_threshold_root 25 sync_pages 256

I'd like you to try to check if e.g.:
inflate_threshold_root 15 sync_pages 256
can give you any visible or subjective difference worth tweaking it
at all? (These stats from the next messages don't show this enough.)
You don't need to hurry with this...

Thanks,
Jarek P.
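When inflate_threshold_root is changed through the module parameter, the take 9 patch moves the root halve threshold with it, because halve_threshold_root became a macro of inflate_threshold_root. The sketch below only evaluates that expression for the values discussed here; halve_threshold_root_for() is a hypothetical function name.

```c
#include <assert.h>

/* Same expression as the halve_threshold_root macro in the take 9 patch. */
int halve_threshold_root_for(int inflate_threshold_root)
{
	assert(inflate_threshold_root >= 0);
	return inflate_threshold_root / 2 + 1;
}
```

For the default of 25 this gives 13 rather than the earlier constant 15; for 15 it gives 8, and for 20 it gives 11.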
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-07-07 23:50 ` Jarek Poplawski
@ 2009-07-09 20:34 ` Paweł Staszewski
2009-07-14 19:41 ` [PATCH net-next] " Jarek Poplawski
0 siblings, 1 reply; 99+ messages in thread
From: Paweł Staszewski @ 2009-07-09 20:34 UTC (permalink / raw)
To: Jarek Poplawski
Cc: Paul E. McKenney, Linux Network Development list, Robert Olsson

Jarek Poplawski wrote:
> On Wed, Jul 08, 2009 at 12:56:13AM +0200, Paweł Staszewski wrote:
>> Jarek Poplawski wrote:
> ...
>>> ---> (synchronize take 9; apply on top of the 2.6.29.x with the last
>>> all-in-one patch, or net-2.6)
>>>
>> Applied to 2.6.29.5 preempt/no-preempt and tested: - with preempt i make
>> only one test with sync_pages = 256 to check that is working :)
>>
>> So here are some tests for different sync_pages size.
> ...
>> So i see in prev tests i make something wrong in time counting
>> So i modify test script and make tests again:
>>
>> echo 64 > /sys/module/fib_trie/parameters/sync_pages
>> Total size reach max in 13 sec
>>
>> echo 128 > /sys/module/fib_trie/parameters/sync_pages
>> Total size reach max in 10 sec
>>
>> echo 256 > /sys/module/fib_trie/parameters/sync_pages
>> Total size reach max in 10 sec
>>
>> And for sync_pages >256 time for propagating routes is always 10sec.
>
> So this means sync_pages 128 or 256 is reasonable.
>
>> Also today i have many messages in dmesg like this:
>> Fix inflate_threshold_root. Now=25 size=11 bits
>> :)
>
> This is something new and a bit surprising to me: the same threshold
> in previous tests didn't generate this? Do you mean more than:
> "Fix inflate_threshold_root. Now=15 size=11 bits" before?
>

Yes. Sorry for that - these messages did not appear all day, only for
about 5 minutes while I was running tests.
They were reported only when all iBGP peers went down/up quickly.

>> And after tune :
>> /sys/module/fib_trie/parameters/inflate_threshold_root
>> no more info :)
>
> With what value?
>

When I set inflate_threshold_root to 35 there were no messages even when
all iBGP peers went down/up.
But I started searching once I got the "Fix inflate_threshold_root"
messages, and I see that setting this to 20 is best for me: with it I get
no messages in normal router operation, i.e. without bgp peers going
down/up many times in a short time.

> Pawel, let's say that current defaults are:
> inflate_threshold_root 25 sync_pages 256
>
> I'd like you to try to check if e.g.:
> inflate_threshold_root 15 sync_pages 256
> can give you any visible or subjective difference worth tweaking it
> at all? (These stats from the next messages don't show this enough.)
> You don't need to hurry with this...
>

I will try to make more accurate tests at the weekend.

Regards
Paweł Staszewski

> Thanks,
> Jarek P.
* [PATCH net-next] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-07-09 20:34 ` Paweł Staszewski
@ 2009-07-14 19:41 ` Jarek Poplawski
2009-07-15 7:43 ` Robert Olsson
2009-07-20 14:41 ` David Miller
0 siblings, 2 replies; 99+ messages in thread
From: Jarek Poplawski @ 2009-07-14 19:41 UTC (permalink / raw)
To: David Miller
Cc: Paweł Staszewski, Linux Network Development list, Robert Olsson, Jorge Boncompte [DTI2]

On Thu, Jul 09, 2009 at 10:34:17PM +0200, Paweł Staszewski wrote:
> Jarek Poplawski wrote:
>> On Wed, Jul 08, 2009 at 12:56:13AM +0200, Paweł Staszewski wrote:
...
>>> Also today i have many messages in dmesg like this:
>>> Fix inflate_threshold_root. Now=25 size=11 bits
>>> :)
>>>
>>
>> This is something new and a bit surprising to me: the same threshold
>> in previous tests didn't generate this? Do you mean more than: "Fix
>> inflate_threshold_root. Now=15 size=11 bits" before?
>>
>>
> Yes. Sorry for that - this info was not all the day but only 5 minutes
> when i was making tests.
> This info was reported only when all iBGP peers was down/up fast.
>
>>> And after tune :
>>> /sys/module/fib_trie/parameters/inflate_threshold_root
>>> no more info :)
>>>
>>
>> With what value?
>>
>>
> When i set 35 as inflate_threshold_root there was no info even if all
> iBGP peers was down/up.

So it looks like the patch tested earlier could be still useful; after
changing the inflate_threshold_root it seems these warnings should be
very rare but there is no reason to alarm users with something they
can't fix optimally, anyway.

Thanks,
Jarek P.
--------------------->
ipv4: Fix inflate_threshold_root automatically

During large updates there could be triggered warnings like: "Fix
inflate_threshold_root. Now=25 size=11 bits" if inflate() of the root
node isn't finished in 10 loops. It should be much rarer now, after
changing the threshold from 15 to 25, and a temporary problem, so
this patch tries to handle it automatically using a fix variable to
increase by one inflate threshold for next root resizes (up to the 35
limit, max fix = 10). The fix variable is decreased when root's
inflate() finishes below 7 loops (even if some other, smaller table/
trie is updated -- for simplicity the fix variable is global for now).

Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
Reported-by: Jorge Boncompte [DTI2] <jorge@dti2.net>
Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---

diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c	2009-07-13 13:32:53.000000000 +0200
+++ b/net/ipv4/fib_trie.c	2009-07-13 15:16:18.000000000 +0200
@@ -327,6 +327,8 @@ static const int inflate_threshold = 50;
 static const int halve_threshold_root = 15;
 static const int inflate_threshold_root = 25;

+static int inflate_threshold_root_fix;
+#define INFLATE_FIX_MAX 10	/* a comment in resize() */

 static void __alias_free_mem(struct rcu_head *head)
 {
@@ -617,7 +619,8 @@ static struct node *resize(struct trie *
 	/* Keep root node larger */

 	if (!tn->parent)
-		inflate_threshold_use = inflate_threshold_root;
+		inflate_threshold_use = inflate_threshold_root +
+					inflate_threshold_root_fix;
 	else
 		inflate_threshold_use = inflate_threshold;

@@ -641,15 +644,27 @@ static struct node *resize(struct trie *
 	}

 	if (max_resize < 0) {
-		if (!tn->parent)
-			pr_warning("Fix inflate_threshold_root."
-				   " Now=%d size=%d bits\n",
-				   inflate_threshold_root, tn->bits);
-		else
+		if (!tn->parent) {
+			/*
+			 * It was observed that during large updates even
+			 * inflate_threshold_root = 35 might be needed to avoid
+			 * this warning; but it should be temporary, so let's
+			 * try to handle this automatically.
+			 */
+			if (inflate_threshold_root_fix < INFLATE_FIX_MAX)
+				inflate_threshold_root_fix++;
+			else
+				pr_warning("Fix inflate_threshold_root."
+					   " Now=%d size=%d bits fix=%d\n",
+					   inflate_threshold_root, tn->bits,
+					   inflate_threshold_root_fix);
+		} else {
 			pr_warning("Fix inflate_threshold."
 				   " Now=%d size=%d bits\n",
 				   inflate_threshold, tn->bits);
-	}
+		}
+	} else if (max_resize > 3 && !tn->parent && inflate_threshold_root_fix)
+		inflate_threshold_root_fix--;

 	check_tnode(tn);

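The self-adjusting behaviour described in the commit message can be traced with a small user-space model. This is only a sketch under stated assumptions, not the kernel's resize(): root_threshold_in_use() and root_fix_update() are hypothetical names, and max_resize_left is assumed to model the value of max_resize after the inflate loop, negative when all 10 attempts were used up, which is what the `max_resize < 0` test in the patch checks.

```c
#include <assert.h>

#define INFLATE_THRESHOLD_ROOT 25 /* default shown in the patch context */
#define INFLATE_FIX_MAX        10 /* as in the patch */

/* Threshold that resize() would use for the root node. */
int root_threshold_in_use(int fix)
{
	return INFLATE_THRESHOLD_ROOT + fix;
}

/*
 * Update the fix value after one resize of the root node.
 * Returns 1 when the "Fix inflate_threshold_root" warning would still
 * be printed, 0 otherwise.
 */
int root_fix_update(int *fix, int max_resize_left)
{
	assert(*fix >= 0 && *fix <= INFLATE_FIX_MAX);

	if (max_resize_left < 0) {
		/* inflate() did not finish: raise the threshold if allowed */
		if (*fix < INFLATE_FIX_MAX) {
			(*fix)++;
			return 0;
		}
		return 1;
	}
	/* finished with more than 3 attempts to spare: relax the fix */
	if (max_resize_left > 3 && *fix > 0)
		(*fix)--;
	return 0;
}
```

In this model ten unfinished root resizes in a row raise the threshold in use from 25 to 35 without any warning, the eleventh would warn, and a later resize that leaves more than 3 attempts unused lowers the fix by one again.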
* [PATCH net-next] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-07-14 19:41 ` [PATCH net-next] " Jarek Poplawski
@ 2009-07-15 7:43 ` Robert Olsson
2009-07-15 13:05 ` Jarek Poplawski
2009-07-20 14:41 ` David Miller
1 sibling, 1 reply; 99+ messages in thread
From: Robert Olsson @ 2009-07-15 7:43 UTC (permalink / raw)
To: Jarek Poplawski
Cc: David Miller, Paweł Staszewski, Linux Network Development list, Robert Olsson, Jorge Boncompte [DTI2]

Jarek Poplawski writes:

Looks good. Maybe we're getting close to some generic solution: take
a very optimistic approach wrt thresholds for the root node and adjust the
settings without the warning. Or maybe now even remove the warning totally,
with a stats counter?

Can we even consider some other, different strategy for bumping up the root
node?

We need all the lookup performance we can get now that we try to route
without the route cache. And we probably need to evaluate the cost of the
multiple lookups again, at least for LOCAL and MAIN, when we're talking
routing, well, at least straightforward simple routing. (Semantic change)

I think I've got ~6.2 Gbit/s for simplex forwarding using traffic patterns
we see in/close to the Internet core. This is w/o route cache on our hi-end
opterons with 8 CPU cores using niu and ixgbe. I'll test again, and your
patches too, when I'm back from vacation.

Cheers
--ro

> So it looks like the patch tested earlier could be still useful; after
> changing the inflate_threshold_root it seems these warnings should be
> very rare but there is no reason to alarm users with something they
> can't fix optimally, anyway.
>
> Thanks,
> Jarek P.
> --------------------->
> ipv4: Fix inflate_threshold_root automatically
>
> During large updates there could be triggered warnings like: "Fix
> inflate_threshold_root. Now=25 size=11 bits" if inflate() of the root
> node isn't finished in 10 loops. It should be much rarer now, after
> changing the threshold from 15 to 25, and a temporary problem, so
> this patch tries to handle it automatically using a fix variable to
> increase by one inflate threshold for next root resizes (up to the 35
> limit, max fix = 10). The fix variable is decreased when root's
> inflate() finishes below 7 loops (even if some other, smaller table/
> trie is updated -- for simplicity the fix variable is global for now).
>
> Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
> Reported-by: Jorge Boncompte [DTI2] <jorge@dti2.net>
> Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
> ---
>
> diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> --- a/net/ipv4/fib_trie.c	2009-07-13 13:32:53.000000000 +0200
> +++ b/net/ipv4/fib_trie.c	2009-07-13 15:16:18.000000000 +0200
> @@ -327,6 +327,8 @@ static const int inflate_threshold = 50;
>  static const int halve_threshold_root = 15;
>  static const int inflate_threshold_root = 25;
>
> +static int inflate_threshold_root_fix;
> +#define INFLATE_FIX_MAX 10	/* a comment in resize() */
>
>  static void __alias_free_mem(struct rcu_head *head)
>  {
> @@ -617,7 +619,8 @@ static struct node *resize(struct trie *
>  	/* Keep root node larger */
>
>  	if (!tn->parent)
> -		inflate_threshold_use = inflate_threshold_root;
> +		inflate_threshold_use = inflate_threshold_root +
> +					inflate_threshold_root_fix;
>  	else
>  		inflate_threshold_use = inflate_threshold;
>
> @@ -641,15 +644,27 @@ static struct node *resize(struct trie *
>  	}
>
>  	if (max_resize < 0) {
> -		if (!tn->parent)
> -			pr_warning("Fix inflate_threshold_root."
> -				   " Now=%d size=%d bits\n",
> -				   inflate_threshold_root, tn->bits);
> -		else
> +		if (!tn->parent) {
> +			/*
> +			 * It was observed that during large updates even
> +			 * inflate_threshold_root = 35 might be needed to avoid
> +			 * this warning; but it should be temporary, so let's
> +			 * try to handle this automatically.
> +			 */
> +			if (inflate_threshold_root_fix < INFLATE_FIX_MAX)
> +				inflate_threshold_root_fix++;
> +			else
> +				pr_warning("Fix inflate_threshold_root."
> +					   " Now=%d size=%d bits fix=%d\n",
> +					   inflate_threshold_root, tn->bits,
> +					   inflate_threshold_root_fix);
> +		} else {
>  			pr_warning("Fix inflate_threshold."
>  				   " Now=%d size=%d bits\n",
>  				   inflate_threshold, tn->bits);
> -	}
> +		}
> +	} else if (max_resize > 3 && !tn->parent && inflate_threshold_root_fix)
> +		inflate_threshold_root_fix--;
>
>  	check_tnode(tn);
>
* Re: [PATCH net-next] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-07-15 7:43 ` Robert Olsson
@ 2009-07-15 13:05 ` Jarek Poplawski
2009-07-17 8:08 ` Robert Olsson
0 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-07-15 13:05 UTC (permalink / raw)
To: Robert Olsson
Cc: David Miller, Paweł Staszewski, Linux Network Development list, Robert Olsson, Jorge Boncompte [DTI2]

On Wed, Jul 15, 2009 at 09:43:11AM +0200, Robert Olsson wrote:
>
> Jarek Poplawski writes:
>
>
> Looks good. Maybe we're getting close to some generic solution to take
> a very optimistic approach wrt thresholds for root node and adjust to
> settings without the warning. Or maybe now even remove warning totally
> with stata counter?

I guess, we could, but maybe let's wait a bit to make sure there is
nothing surprising?

>
> Can we even consider some other different strategy for bumping up the root
> node.
>
> We need all lookup performance we can get when we now try to route without
> the route cache. And we probably need to evaluate the cost for the multiple
> lookups again at least for LOCAL and MAIN when we talking routing well at
> least straight-forward simple routing. (Semantic change)
>
> I think I've got ~6.2 Gbit/s for simplex forwarding using traffic patterns
> we see in/close to Internet core. This w/o route cache on our hi-end opterons
> with 8 CPU cores using niu and ixgbe. I'll test again and your patches when
> I'm back from vacation.
>

Sure, I was mainly aiming at safe defaults (wrt. memory usage), but if
tests show there is a better strategy we should go for it.

Thanks,
Jarek P.
* Re: [PATCH net-next] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-07-15 13:05 ` Jarek Poplawski
@ 2009-07-17 8:08 ` Robert Olsson
0 siblings, 0 replies; 99+ messages in thread
From: Robert Olsson @ 2009-07-17 8:08 UTC (permalink / raw)
To: Jarek Poplawski
Cc: Robert Olsson, David Miller, Paweł Staszewski, Linux Network Development list, Jorge Boncompte [DTI2]

Jarek Poplawski writes:
> On Wed, Jul 15, 2009 at 09:43:11AM +0200, Robert Olsson wrote:
> > a very optimistic approach wrt thresholds for root node and adjust to
> > settings without the warning. Or maybe now even remove warning totally
> > with stata counter?
>
> I guess, we could, but maybe let's wait a bit to make sure there is
> nothing surprising?

Yes, if Pawel is running it we'll get reports. I have no chance to
upgrade any of our routers now. I've seen this printout on one of our
routers, but we don't do "clear ip bgp *" too often, and besides we try
to use soft reconfiguration inbound.

> > I think I've got ~6.2 Gbit/s for simplex forwarding using traffic patterns
> > we see in/close to Internet core. This w/o route cache on our hi-end opterons
> > with 8 CPU cores using niu and ixgbe. I'll test again and your patches when
> > I'm back from vacation.
> >
> Sure, I was mainly aiming at safe defaults (wrt. memory usage), but if
> tests show there is a better strategy we should go for it.

Routing without the route cache is a "new" area, probably for the
minority of systems where caching is not possible. Read: BGP routers in
the core. Yes, we should have safe defaults.

Thanks for all your work.

Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>

Cheers
--ro
* Re: [PATCH net-next] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-07-14 19:41 ` [PATCH net-next] " Jarek Poplawski
2009-07-15 7:43 ` Robert Olsson
@ 2009-07-20 14:41 ` David Miller
1 sibling, 0 replies; 99+ messages in thread
From: David Miller @ 2009-07-20 14:41 UTC (permalink / raw)
To: jarkao2; +Cc: pstaszewski, netdev, robert, jorge

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Tue, 14 Jul 2009 21:41:00 +0200

> ipv4: Fix inflate_threshold_root automatically
>
> During large updates there could be triggered warnings like: "Fix
> inflate_threshold_root. Now=25 size=11 bits" if inflate() of the root
> node isn't finished in 10 loops. It should be much rarer now, after
> changing the threshold from 15 to 25, and a temporary problem, so
> this patch tries to handle it automatically using a fix variable to
> increase by one inflate threshold for next root resizes (up to the 35
> limit, max fix = 10). The fix variable is decreased when root's
> inflate() finishes below 7 loops (even if some other, smaller table/
> trie is updated -- for simplicity the fix variable is global for now).
>
> Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
> Reported-by: Jorge Boncompte [DTI2] <jorge@dti2.net>
> Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>

Applied.
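The bookkeeping described in the commit message above can be sketched in
userspace C. This is only an illustration under stated assumptions, not the
applied kernel code: the base threshold 25 and the maximum fix 10 come from
the message, and the "max_resize > 3" test mirrors the diff quoted earlier in
the thread, but the function names root_threshold() and
root_resize_feedback(), and the max_resize_left argument (resize budget left
after a root resize, negative when the 10-loop budget ran out), are
hypothetical.

```c
#include <assert.h>

enum {
	INFLATE_THRESHOLD_ROOT = 25,	/* base root threshold, from the message */
	INFLATE_FIX_MAX = 10,		/* "max fix = 10", so the limit is 35 */
};

/* One global counter, as the message says the real fix variable is global. */
static int inflate_threshold_root_fix;

/* Threshold that the next root resize would use. */
static int root_threshold(void)
{
	return INFLATE_THRESHOLD_ROOT + inflate_threshold_root_fix;
}

/*
 * Feed back the outcome of one root resize (hypothetical helper).
 * Returns 1 when the warning would still be printed, 0 otherwise.
 */
static int root_resize_feedback(int max_resize_left)
{
	assert(max_resize_left <= 10);

	if (max_resize_left < 0) {
		/* Budget exhausted: relax the threshold first, warn only at the cap. */
		if (inflate_threshold_root_fix < INFLATE_FIX_MAX) {
			inflate_threshold_root_fix++;
			return 0;
		}
		return 1;
	}

	/* Finished with budget to spare: tighten the threshold again. */
	if (max_resize_left > 3 && inflate_threshold_root_fix)
		inflate_threshold_root_fix--;
	return 0;
}
```

Under these assumptions ten exhausted resizes in a row raise the effective
threshold from 25 to 35 without a warning; only the eleventh would warn, and
a later resize that finishes quickly lowers the threshold by one.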
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits 2009-07-06 9:02 ` Jarek Poplawski 2009-07-07 22:56 ` Paweł Staszewski @ 2009-07-07 23:23 ` Paweł Staszewski 2009-07-07 23:30 ` Paweł Staszewski 1 sibling, 1 reply; 99+ messages in thread From: Paweł Staszewski @ 2009-07-07 23:23 UTC (permalink / raw) To: Jarek Poplawski Cc: Paul E. McKenney, Linux Network Development list, Robert Olsson [-- Attachment #1: Type: text/plain, Size: 4397 bytes --] Jarek Poplawski pisze: > On Mon, Jul 06, 2009 at 01:53:49AM +0200, Paweł Staszewski wrote: > ... > >> So i make tests with changing sync_pages >> And >> >> #################################### >> sync_pages: 64 >> total size reach maximum in 17sec >> > ... > >> ###################################### >> sync_pages: 128 >> Fib trie Total size reach max in 14sec >> > ... > >> ######################################### >> sync_pages: 256 >> hmm no difference also in 10sec >> > > 14 == 10!? ;-) > ... > :) i miss one test >> And with sync_pages higher that 256 time of filling kernel routes is the >> same approx 10sec. >> > > Hmm... So, it's better than I expected; syncing after 128 or 256 pages > could be quite reasonable. But then it would be interesting to find > out if with such a safety we could go back to more aggressive values > for possibly better performance. So here is 'the same' patch (so the > previous, take 8, should be reverted), but with additional possibility > to change: > /sys/module/fib_trie/parameters/inflate_threshold_root > > I guess, you could try e.g. if: sync_pages 256, inflate_threshold_root 15 > can give faster lookups (or lower cpu loads); with this these inflate > warnings could be back btw.; or maybe you'll find something in between > like inflate_threshold_root 20 is optimal for you. 
(I think it should be > enough to try this only for PREEMPT_NONE unless you have spare time ;-) > > And i can't make good tests with cpu load because of problem that i have from "weird problem" emails It depend when i make mpstat to check cpu load and for what time because every 15 sec i have 1 do 3 % of cpu and after 15 sec i have almost 40% cpu load for next 15 sec. I try to make mpstat -P ALL 1 60 but after 15 sec of 1 to 3 % cpu load this next higher cpu load if different everytime it balance from 30 to 50% so i make test shorter when cpu load is 1 to 3 % - "mpstat -P ALL 1 10" output in attached file Regards Paweł Staszewski > Thanks, > Jarek P. > ---> (synchronize take 9; apply on top of the 2.6.29.x with the last > all-in-one patch, or net-2.6) > > net/ipv4/fib_trie.c | 18 ++++++++++++++++-- > 1 files changed, 16 insertions(+), 2 deletions(-) > > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c > index 00a54b2..e8fca11 100644 > --- a/net/ipv4/fib_trie.c > +++ b/net/ipv4/fib_trie.c > @@ -71,6 +71,7 @@ > #include <linux/netlink.h> > #include <linux/init.h> > #include <linux/list.h> > +#include <linux/moduleparam.h> > #include <net/net_namespace.h> > #include <net/ip.h> > #include <net/protocol.h> > @@ -164,6 +165,10 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn); > static struct tnode *halve(struct trie *t, struct tnode *tn); > /* tnodes to free after resize(); protected by RTNL */ > static struct tnode *tnode_free_head; > +static size_t tnode_free_size; > + > +static int sync_pages __read_mostly = 1000; > +module_param(sync_pages, int, 0640); > > static struct kmem_cache *fn_alias_kmem __read_mostly; > static struct kmem_cache *trie_leaf_kmem __read_mostly; > @@ -316,9 +321,11 @@ static inline void check_tnode(const struct tnode *tn) > > static const int halve_threshold = 25; > static const int inflate_threshold = 50; > -static const int halve_threshold_root = 15; > -static const int inflate_threshold_root = 25; > > +static int 
inflate_threshold_root __read_mostly = 25;
> +module_param(inflate_threshold_root, int, 0640);
> +
> +#define halve_threshold_root	(inflate_threshold_root / 2 + 1)
>
>  static void __alias_free_mem(struct rcu_head *head)
>  {
> @@ -393,6 +400,8 @@ static void tnode_free_safe(struct tnode *tn)
>  	BUG_ON(IS_LEAF(tn));
>  	tn->tnode_free = tnode_free_head;
>  	tnode_free_head = tn;
> +	tnode_free_size += sizeof(struct tnode) +
> +			   (sizeof(struct node *) << tn->bits);
>  }
>
>  static void tnode_free_flush(void)
> @@ -404,6 +413,11 @@ static void tnode_free_flush(void)
>  		tn->tnode_free = NULL;
>  		tnode_free(tn);
>  	}
> +
> +	if (tnode_free_size >= PAGE_SIZE * sync_pages) {
> +		tnode_free_size = 0;
> +		synchronize_rcu();
> +	}
>  }
>
>  static struct leaf *leaf_new(void)
>

[-- Attachment #2: sync_pages.txt --]
[-- Type: text/plain, Size: 3200 bytes --]

sync_pages: 256
inflate_threshold_root: 10
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
Average:     all    0.00    0.00    0.00    0.00    0.00    0.60    0.00    0.00   99.40
Average:       0    0.00    0.00    0.00    0.00    0.00    0.60    0.00    0.00   99.40
Average:       1    0.00    0.00    0.00    0.00    0.00    0.40    0.00    0.00   99.60

sync_pages: 256
inflate_threshold_root: 15
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
Average:     all    0.00    0.00    0.10    0.00    0.00    0.70    0.00    0.00   99.20
Average:       0    0.00    0.00    0.00    0.00    0.20    0.80    0.00    0.00   99.00
Average:       1    0.00    0.00    0.20    0.00    0.00    0.61    0.00    0.00   99.19

sync_pages: 256
inflate_threshold_root: 20
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
Average:     all    0.00    0.00    0.00    0.00    0.10    0.80    0.00    0.00   99.10
Average:       0    0.00    0.00    0.00    0.00    0.00    1.00    0.00    0.00   99.00
Average:       1    0.00    0.00    0.00    0.00    0.00    0.61    0.00    0.00   99.39

sync_pages: 256
inflate_threshold_root: 25
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
Average:     all    0.00    0.00    0.00    0.00    0.00    0.70    0.00    0.00   99.30
Average:       0    0.00    0.00    0.00    0.00    0.20    1.00    0.00    0.00   98.80
Average:       1    0.00    0.00    0.00    0.00    0.00    0.40    0.00    0.00   99.60

sync_pages: 512
inflate_threshold_root: 10
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
Average:     all    0.00    0.00    0.10    0.00    0.10    0.60    0.00    0.00   99.20
Average:       0    0.00    0.00    0.20    0.00    0.00    1.00    0.00    0.00   98.80
Average:       1    0.00    0.00    0.00    0.00    0.00    0.40    0.00    0.00   99.60

sync_pages: 512
inflate_threshold_root: 15
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
Average:     all    0.00    0.00    0.20    0.00    0.00    1.10    0.00    0.00   98.70
Average:       0    0.00    0.00    0.40    0.00    0.00    1.00    0.00    0.00   98.60
Average:       1    0.00    0.00    0.00    0.00    0.00    1.01    0.00    0.00   98.99

sync_pages: 512
inflate_threshold_root: 20
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
Average:     all    0.00    0.00    0.10    0.00    0.10    1.01    0.00    0.00   98.79
Average:       0    0.00    0.00    0.20    0.00    0.20    1.40    0.00    0.00   98.20
Average:       1    0.00    0.00    0.00    0.00    0.00    0.61    0.00    0.00   99.39

sync_pages: 512
inflate_threshold_root: 25
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
Average:     all    0.00    0.00    0.00    0.00    0.10    0.90    0.00    0.00   99.00
Average:       0    0.00    0.00    0.00    0.00    0.00    1.00    0.00    0.00   99.00
Average:       1    0.00    0.00    0.00    0.00    0.20    0.80    0.00    0.00   99.00
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits 2009-07-07 23:23 ` [PATCH net-2.6] " Paweł Staszewski @ 2009-07-07 23:30 ` Paweł Staszewski 0 siblings, 0 replies; 99+ messages in thread From: Paweł Staszewski @ 2009-07-07 23:30 UTC (permalink / raw) To: Jarek Poplawski Cc: Paul E. McKenney, Linux Network Development list, Robert Olsson Paweł Staszewski pisze: > Jarek Poplawski pisze: >> On Mon, Jul 06, 2009 at 01:53:49AM +0200, Paweł Staszewski wrote: >> ... >> >>> So i make tests with changing sync_pages >>> And >>> >>> #################################### >>> sync_pages: 64 >>> total size reach maximum in 17sec >>> >> ... >> >>> ###################################### >>> sync_pages: 128 >>> Fib trie Total size reach max in 14sec >>> >> ... >> >>> ######################################### >>> sync_pages: 256 >>> hmm no difference also in 10sec >>> >> >> 14 == 10!? ;-) >> ... >> > :) i miss one test >>> And with sync_pages higher that 256 time of filling kernel routes is >>> the same approx 10sec. >>> >> >> Hmm... So, it's better than I expected; syncing after 128 or 256 pages >> could be quite reasonable. But then it would be interesting to find >> out if with such a safety we could go back to more aggressive values >> for possibly better performance. So here is 'the same' patch (so the >> previous, take 8, should be reverted), but with additional possibility >> to change: >> /sys/module/fib_trie/parameters/inflate_threshold_root >> >> I guess, you could try e.g. if: sync_pages 256, >> inflate_threshold_root 15 >> can give faster lookups (or lower cpu loads); with this these inflate >> warnings could be back btw.; or maybe you'll find something in between >> like inflate_threshold_root 20 is optimal for you. 
(I think it should be >> enough to try this only for PREEMPT_NONE unless you have spare time ;-) >> >> > And i can't make good tests with cpu load because of problem that i > have from "weird problem" emails > It depend when i make mpstat to check cpu load and for what time > because every 15 sec i have 1 do 3 % of cpu and after 15 sec i have > almost 40% cpu load for next 15 sec. > I try to make mpstat -P ALL 1 60 > but after 15 sec of 1 to 3 % cpu load this next higher cpu load if > different everytime it balance from 30 to 50% > > so i make test shorter when cpu load is 1 to 3 % - "mpstat -P ALL 1 10" > output in attached file > > Regards > Paweł Staszewski > i forgot to add: Traffic when i make test was +/- 10Mbit/s in next tests: eth0: RX: 231.21 Mb/s TX: 287.40 Mb/s eth1: RX: 289.19 Mb/s TX: 231.35 Mb/s > >> Thanks, >> Jarek P. >> ---> (synchronize take 9; apply on top of the 2.6.29.x with the last >> all-in-one patch, or net-2.6) >> >> net/ipv4/fib_trie.c | 18 ++++++++++++++++-- >> 1 files changed, 16 insertions(+), 2 deletions(-) >> >> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c >> index 00a54b2..e8fca11 100644 >> --- a/net/ipv4/fib_trie.c >> +++ b/net/ipv4/fib_trie.c >> @@ -71,6 +71,7 @@ >> #include <linux/netlink.h> >> #include <linux/init.h> >> #include <linux/list.h> >> +#include <linux/moduleparam.h> >> #include <net/net_namespace.h> >> #include <net/ip.h> >> #include <net/protocol.h> >> @@ -164,6 +165,10 @@ static struct tnode *inflate(struct trie *t, >> struct tnode *tn); >> static struct tnode *halve(struct trie *t, struct tnode *tn); >> /* tnodes to free after resize(); protected by RTNL */ >> static struct tnode *tnode_free_head; >> +static size_t tnode_free_size; >> + >> +static int sync_pages __read_mostly = 1000; >> +module_param(sync_pages, int, 0640); >> >> static struct kmem_cache *fn_alias_kmem __read_mostly; >> static struct kmem_cache *trie_leaf_kmem __read_mostly; >> @@ -316,9 +321,11 @@ static inline void check_tnode(const 
struct >> tnode *tn) >> >> static const int halve_threshold = 25; >> static const int inflate_threshold = 50; >> -static const int halve_threshold_root = 15; >> -static const int inflate_threshold_root = 25; >> >> +static int inflate_threshold_root __read_mostly = 25; >> +module_param(inflate_threshold_root, int, 0640); >> + >> +#define halve_threshold_root (inflate_threshold_root / 2 + 1) >> >> static void __alias_free_mem(struct rcu_head *head) >> { >> @@ -393,6 +400,8 @@ static void tnode_free_safe(struct tnode *tn) >> BUG_ON(IS_LEAF(tn)); >> tn->tnode_free = tnode_free_head; >> tnode_free_head = tn; >> + tnode_free_size += sizeof(struct tnode) + >> + (sizeof(struct node *) << tn->bits); >> } >> >> static void tnode_free_flush(void) >> @@ -404,6 +413,11 @@ static void tnode_free_flush(void) >> tn->tnode_free = NULL; >> tnode_free(tn); >> } >> + >> + if (tnode_free_size >= PAGE_SIZE * sync_pages) { >> + tnode_free_size = 0; >> + synchronize_rcu(); >> + } >> } >> >> static struct leaf *leaf_new(void) >> -- >> To unsubscribe from this list: send the line "unsubscribe netdev" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> >> > ^ permalink raw reply [flat|nested] 99+ messages in thread
* [PATCH net-next] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits 2009-07-05 21:32 ` Paul E. McKenney 2009-07-05 22:23 ` Jarek Poplawski @ 2009-07-14 18:33 ` Jarek Poplawski 2009-07-20 14:41 ` David Miller 2009-07-14 21:20 ` [PATCH net-next] ipv4: fib_trie: Use tnode_get_child_rcu() and node_parent_rcu() in lookups Jarek Poplawski 2 siblings, 1 reply; 99+ messages in thread From: Jarek Poplawski @ 2009-07-14 18:33 UTC (permalink / raw) To: David Miller Cc: Paul E. McKenney, Paweł Staszewski, Linux Network Development list, Robert Olsson On Sun, Jul 05, 2009 at 02:32:32PM -0700, Paul E. McKenney wrote: > On Sun, Jul 05, 2009 at 07:32:08PM +0200, Jarek Poplawski wrote: > > On Sun, Jul 05, 2009 at 06:20:03PM +0200, Jarek Poplawski wrote: > > > On Sun, Jul 05, 2009 at 02:30:03AM +0200, Paweł Staszewski wrote: > > > > Oh > > > > > > > > I forgot - please Jarek give me patch with sync rcu and i will make test > > > > on preempt kernel > > > > > > Probably non-preempt kernel might need something like this more, but > > > comparing is always interesting. This patch is based on Paul's > > > suggestion (I hope). > > > > Hold on ;-) Here is something even better... Syncing after 128 pages > > might be still too slow, so here is a higher initial value, 1000, plus > > you can change this while testing in: > > > > /sys/module/fib_trie/parameters/sync_pages > > > > It would be interesting to find the lowest acceptable value. > > Looks like a promising approach to me! > > Thanx, Paul Below is a simpler version of this patch, without the sysfs parameter. (I left the previous version quoted for comparison.) Thanks. > > Jarek P. 
> > ---> (synchronize take 8; apply on top of the 2.6.29.x with the last > > all-in-one patch, or net-2.6) > > > > net/ipv4/fib_trie.c | 12 ++++++++++++ > > 1 files changed, 12 insertions(+), 0 deletions(-) > > > > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c > > index 00a54b2..decc8d0 100644 > > --- a/net/ipv4/fib_trie.c > > +++ b/net/ipv4/fib_trie.c > > @@ -71,6 +71,7 @@ > > #include <linux/netlink.h> > > #include <linux/init.h> > > #include <linux/list.h> > > +#include <linux/moduleparam.h> > > #include <net/net_namespace.h> > > #include <net/ip.h> > > #include <net/protocol.h> > > @@ -164,6 +165,10 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn); > > static struct tnode *halve(struct trie *t, struct tnode *tn); > > /* tnodes to free after resize(); protected by RTNL */ > > static struct tnode *tnode_free_head; > > +static size_t tnode_free_size; > > + > > +static int sync_pages __read_mostly = 1000; > > +module_param(sync_pages, int, 0640); > > > > static struct kmem_cache *fn_alias_kmem __read_mostly; > > static struct kmem_cache *trie_leaf_kmem __read_mostly; > > @@ -393,6 +398,8 @@ static void tnode_free_safe(struct tnode *tn) > > BUG_ON(IS_LEAF(tn)); > > tn->tnode_free = tnode_free_head; > > tnode_free_head = tn; > > + tnode_free_size += sizeof(struct tnode) + > > + (sizeof(struct node *) << tn->bits); > > } > > > > static void tnode_free_flush(void) > > @@ -404,6 +411,11 @@ static void tnode_free_flush(void) > > tn->tnode_free = NULL; > > tnode_free(tn); > > } > > + > > + if (tnode_free_size >= PAGE_SIZE * sync_pages) { > > + tnode_free_size = 0; > > + synchronize_rcu(); > > + } > > } > > > > static struct leaf *leaf_new(void) > > -- ------------------------> ipv4: Use synchronize_rcu() during trie_rebalance() During trie_rebalance() we free memory after resizing with call_rcu(), but large updates, especially with PREEMPT_NONE configs, can cause memory stresses, so this patch calls synchronize_rcu() in tnode_free_flush() after 
each sync_pages to guarantee such freeing (especially before resizing the root node). The value of sync_pages = 128 is based on Pawel Staszewski's tests as the lowest which doesn't hinder updating times. (For testing purposes there was a sysfs module parameter to change it on demand, but it's removed until we're sure it could be really useful.) The patch is based on suggestions by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reported-by: Pawel Staszewski <pstaszewski@itcare.pl> Tested-by: Pawel Staszewski <pstaszewski@itcare.pl> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> --- net/ipv4/fib_trie.c | 15 +++++++++++++++ 1 files changed, 15 insertions(+), 0 deletions(-) diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c index 63c2fa7..58ba9f4 100644 --- a/net/ipv4/fib_trie.c +++ b/net/ipv4/fib_trie.c @@ -164,6 +164,14 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn); static struct tnode *halve(struct trie *t, struct tnode *tn); /* tnodes to free after resize(); protected by RTNL */ static struct tnode *tnode_free_head; +static size_t tnode_free_size; + +/* + * synchronize_rcu after call_rcu for that many pages; it should be especially + * useful before resizing the root node with PREEMPT_NONE configs; the value was + * obtained experimentally, aiming to avoid visible slowdown. 
+ */ +static const int sync_pages = 128; static struct kmem_cache *fn_alias_kmem __read_mostly; static struct kmem_cache *trie_leaf_kmem __read_mostly; @@ -393,6 +401,8 @@ static void tnode_free_safe(struct tnode *tn) BUG_ON(IS_LEAF(tn)); tn->tnode_free = tnode_free_head; tnode_free_head = tn; + tnode_free_size += sizeof(struct tnode) + + (sizeof(struct node *) << tn->bits); } static void tnode_free_flush(void) @@ -404,6 +414,11 @@ static void tnode_free_flush(void) tn->tnode_free = NULL; tnode_free(tn); } + + if (tnode_free_size >= PAGE_SIZE * sync_pages) { + tnode_free_size = 0; + synchronize_rcu(); + } } static struct leaf *leaf_new(void) ^ permalink raw reply related [flat|nested] 99+ messages in thread
* Re: [PATCH net-next] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-07-14 18:33 ` [PATCH net-next] " Jarek Poplawski
@ 2009-07-20 14:41 ` David Miller
0 siblings, 0 replies; 99+ messages in thread
From: David Miller @ 2009-07-20 14:41 UTC (permalink / raw)
To: jarkao2; +Cc: paulmck, pstaszewski, netdev, robert

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Tue, 14 Jul 2009 20:33:08 +0200

> ipv4: Use synchronize_rcu() during trie_rebalance()
>
> During trie_rebalance() we free memory after resizing with call_rcu(),
> but large updates, especially with PREEMPT_NONE configs, can cause
> memory stresses, so this patch calls synchronize_rcu() in
> tnode_free_flush() after each sync_pages to guarantee such freeing
> (especially before resizing the root node).
>
> The value of sync_pages = 128 is based on Pawel Staszewski's tests as
> the lowest which doesn't hinder updating times. (For testing purposes
> there was a sysfs module parameter to change it on demand, but it's
> removed until we're sure it could be really useful.)
>
> The patch is based on suggestions by: Paul E. McKenney
> <paulmck@linux.vnet.ibm.com>
>
> Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
> Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>

Applied.
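The accounting that decides when the patch above synchronizes can be tried
out in userspace with a rough sketch like the one below. It is not the kernel
code. The size formula (tnode header plus one child pointer per slot) and
sync_pages = 128 come from the patch; the 4096-byte page, the 8-byte pointer,
the 56-byte header (the "size of tnode" from the first report in this thread)
and every name ending in _sketch are assumptions made for illustration, and a
plain counter stands in for synchronize_rcu().

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */
#define SYNC_PAGES 128u		/* value chosen in the patch */
#define TNODE_HEADER 56u	/* "size of tnode" in the first report */
#define PTR_SIZE 8u		/* assumed pointer size */

static size_t tnode_free_size;	/* bytes queued for freeing since the last sync */
static unsigned int sync_calls;	/* how often the synchronize stand-in ran */

/* Account for one queued tnode that has 2^bits child slots. */
static void tnode_free_safe_sketch(unsigned int bits)
{
	assert(bits < 32);
	tnode_free_size += TNODE_HEADER + ((size_t)PTR_SIZE << bits);
}

/* Run after the queued tnodes were handed over for deferred freeing. */
static void tnode_free_flush_sketch(void)
{
	if (tnode_free_size >= (size_t)SKETCH_PAGE_SIZE * SYNC_PAGES) {
		tnode_free_size = 0;
		sync_calls++;	/* the patch calls synchronize_rcu() here */
	}
}
```

With these assumed sizes the threshold is 524288 bytes. A 1-bit tnode
accounts for 72 bytes, so 7282 of them are needed to trigger a sync, while a
single 18-bit root (about 2 MB, roughly 512 pages) crosses the threshold on
its own, which fits the remark about syncing before resizing the root node.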
* [PATCH net-next] ipv4: fib_trie: Use tnode_get_child_rcu() and node_parent_rcu() in lookups
2009-07-05 21:32 ` Paul E. McKenney
2009-07-05 22:23 ` Jarek Poplawski
2009-07-14 18:33 ` [PATCH net-next] " Jarek Poplawski
@ 2009-07-14 21:20 ` Jarek Poplawski
2009-07-20 14:41 ` David Miller
2 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-07-14 21:20 UTC (permalink / raw)
To: David Miller
Cc: Paul E. McKenney, Paweł Staszewski, Linux Network Development list, Robert Olsson

While looking for other fib_trie problems reported by Pawel Staszewski
I noticed there are a few uses of tnode_get_child() and node_parent()
in lookups instead of their rcu versions.

Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---
(this patch was prepared on top of my 2 today's fib_trie patches)

diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c	2009-07-14 20:40:39.000000000 +0200
+++ b/net/ipv4/fib_trie.c	2009-07-14 22:41:26.000000000 +0200
@@ -1465,7 +1465,7 @@ static int fn_trie_lookup(struct fib_tab
 			cindex = tkey_extract_bits(mask_pfx(key, current_prefix_length),
 						   pos, bits);
 
-		n = tnode_get_child(pn, cindex);
+		n = tnode_get_child_rcu(pn, cindex);
 
 		if (n == NULL) {
 #ifdef CONFIG_IP_FIB_TRIE_STATS
@@ -1600,7 +1600,7 @@ backtrace:
 		if (chopped_off <= pn->bits) {
 			cindex &= ~(1 << (chopped_off-1));
 		} else {
-			struct tnode *parent = node_parent((struct node *) pn);
+			struct tnode *parent = node_parent_rcu((struct node *) pn);
 			if (!parent)
 				goto failed;
 
@@ -1813,7 +1813,7 @@ static struct leaf *trie_firstleaf(struc
 static struct leaf *trie_nextleaf(struct leaf *l)
 {
 	struct node *c = (struct node *) l;
-	struct tnode *p = node_parent(c);
+	struct tnode *p = node_parent_rcu(c);
 
 	if (!p)
 		return NULL;	/* trie with just one leaf */
* Re: [PATCH net-next] ipv4: fib_trie: Use tnode_get_child_rcu() and node_parent_rcu() in lookups
2009-07-14 21:20 ` [PATCH net-next] ipv4: fib_trie: Use tnode_get_child_rcu() and node_parent_rcu() in lookups Jarek Poplawski
@ 2009-07-20 14:41 ` David Miller
0 siblings, 0 replies; 99+ messages in thread
From: David Miller @ 2009-07-20 14:41 UTC (permalink / raw)
To: jarkao2; +Cc: paulmck, pstaszewski, netdev, robert

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Tue, 14 Jul 2009 23:20:32 +0200

>
> While looking for other fib_trie problems reported by Pawel Staszewski
> I noticed there are a few uses of tnode_get_child() and node_parent()
> in lookups instead of their rcu versions.
>
> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>

Applied.
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits 2009-07-05 0:26 ` Paweł Staszewski 2009-07-05 0:30 ` Paweł Staszewski @ 2009-07-05 0:31 ` Paweł Staszewski 2009-07-05 12:56 ` [PATCH -stable] " Jarek Poplawski 2009-07-05 13:08 ` [PATCH v2 " Jarek Poplawski 3 siblings, 0 replies; 99+ messages in thread From: Paweł Staszewski @ 2009-07-05 0:31 UTC (permalink / raw) To: Jarek Poplawski; +Cc: Linux Network Development list, Robert Olsson [-- Attachment #1: Type: text/plain, Size: 2866 bytes --] Sorry again no attachement. Paweł Staszewski pisze: > Jarek Poplawski pisze: >> On Thu, Jul 02, 2009 at 07:43:25AM +0200, Paweł Staszewski wrote: >> >>> Jarek Poplawski pisze: >>> >>>> On Thu, Jul 02, 2009 at 12:17:19AM +0200, Paweł Staszewski wrote: >>>> >>>>> Jarek Poplawski pisze: >>>>> >>>> ... >>>> >>>>>> So, after your findings I'm about to recommend sending to -stable >>>>>> 3 patches from net-2.6, with additional lowering of threshold_root >>>>>> settings, but it would be nice if you could give it a try with >>>>>> CONFIG_PREEMPT instead of CONFIG_PREEMPT_NONE (if it doesn't break >>>>>> your other apps!) It is expected to work this time...;-) Maybe a >>>>>> bit slower. >>>>>> >>>>>> > Ok kernel configured with CONFIG_PREEMPT > and all this day work without any problems (with Jarek last patch). > > > So in attached file trere is fib_tirestats > I dont see any big change of (cpu load or faster/slower > routing/propagating routes from bgpd or something else) - in avg there > is from 2% to 3% more of CPU load i dont know why but it is - i change > from "preempt" to "no preempt" 3 times and check this my "mpstat -P > ALL 1 30" > always avg cpu load was from 2 to 3% more compared to "no preempt" > > Regards > Paweł Staszewski > > >>>>>> >>>>> Patch applied to 2.6.29.5 with CONFIG_PREEMPT_NONE >>>>> And working :) >>>>> >>>> Hmm... It should, because you tested very similar patch already;-) >>>> Sorry if I didn't make it clear. 
>>>> >>>> >>> Yes i know there was almost identical one. >>> And i see this was without sync rcu :) >>> >> >> Yes, it looks like we can't free memory so simple because of such huge >> latencies. >> >>>>> fib_triestats in attached file >>>>> >>>>> I think I can test it with PREEMPT enabled but first i must make >>>>> some other tests of my apps that are on server. >>>>> >>>> It could probably matter only if you're using some broken out-of-tree >>>> patches. Otherwise the kernel is expected to work OK. >>>> >>>> >>> Im a little confused about using of PREEMPT kernel because of past >>> there was many oopses / lockups :) but yes that was a little long >>> time ago. >>> I will try to make this test today. >>> >>> >>>> Btw., it would be also interesting to check if there is any difference >>>> wrt. these route cache problems while PREEMPT is enabled. >>>> >> >> And you're very right! The place we're fixing is the best example. On >> the other hand, I hope there is not many such places yet. But if we >> test/fix it there will be one less... >> >> Jarek P. >> >> >> > > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > [-- Attachment #2: fib_triestats.txt --] [-- Type: text/plain, Size: 929 bytes --] cat /proc/net/fib_triestat Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes. 
Main:
Aver depth: 2.44
Max depth: 6
Leaves: 277814
Prefixes: 291306
Internal nodes: 66420
1: 32737 2: 14850 3: 10332 4: 4871 5: 2313 6: 942 7: 371 8: 3 17: 1
Pointers: 599098
Null ptrs: 254865
Total size: 18067 kB
Counters:
---------
gets = 2003686
backtracks = 78789
semantic match passed = 1977687
semantic match miss = 112
null node hit= 1470619
skipped node resize = 0
Local:
Aver depth: 3.75
Max depth: 5
Leaves: 12
Prefixes: 13
Internal nodes: 10
1: 9 2: 1
Pointers: 22
Null ptrs: 1
Total size: 2 kB
Counters:
---------
gets = 2008497
backtracks = 1417179
semantic match passed = 4823
semantic match miss = 0
null node hit= 197044
skipped node resize = 0
* [PATCH -stable] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-07-05 0:26 ` Paweł Staszewski
2009-07-05 0:30 ` Paweł Staszewski
2009-07-05 0:31 ` [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits Paweł Staszewski
@ 2009-07-05 12:56 ` Jarek Poplawski
2009-07-05 13:08 ` [PATCH v2 " Jarek Poplawski
3 siblings, 0 replies; 99+ messages in thread
From: Jarek Poplawski @ 2009-07-05 12:56 UTC (permalink / raw)
To: David Miller
Cc: Paweł Staszewski, Linux Network Development list, Robert Olsson, Jorge Boncompte [DTI2]

David & Robert, below are my recommendations for -stable plus one
more patch:

On Sun, Jul 05, 2009 at 02:26:54AM +0200, Paweł Staszewski wrote:
...
>>>>> Jarek Poplawski wrote:
...
>>>>>> So, after your findings I'm about to recommend sending to -stable
>>>>>> 3 patches from net-2.6, with additional lowering of threshold_root
>>>>>> settings, but it would be nice if you could give it a try with
>>>>>> CONFIG_PREEMPT instead of CONFIG_PREEMPT_NONE (if it doesn't break
>>>>>> your other apps!) It is expected to work this time...;-) Maybe a
>>>>>> bit slower.
>>>>>>
>>>>>>
> Ok kernel configured with CONFIG_PREEMPT
> and all this day work without any problems (with Jarek last patch).
>
>
> So in attached file there is fib_triestats
> I dont see any big change of (cpu load or faster/slower
> routing/propagating routes from bgpd or something else) - in avg there
> is from 2% to 3% more of CPU load i dont know why but it is - i change
> from "preempt" to "no preempt" 3 times and check this my "mpstat -P ALL
> 1 30"
> always avg cpu load was from 2 to 3% more compared to "no preempt"
>
> Regards
> Paweł Staszewski

So after these patches from net-2.6 are tested both for PREEMPT and
PREEMPT_NONE I think they should go to -stable:

2.6.30 needs:
-------------
commit e0f7cb8c8cc6cccce28d2ce39ad8c60d23c3799f
Author: Jarek Poplawski <jarkao2@gmail.com>
Date: Mon Jun 15 02:31:29 2009 -0700
    ipv4: Fix fib_trie rebalancing

commit 7b85576d15bf2574b0a451108f59f9ad4170dd3f
Author: Jarek Poplawski <jarkao2@gmail.com>
Date: Thu Jun 18 00:28:51 2009 -0700
    ipv4: Fix fib_trie rebalancing, part 2

commit 008440e3ad4b72f5048d1b1f6f5ed894fdc5ad08
Author: Jarek Poplawski <jarkao2@gmail.com>
Date: Tue Jun 30 12:47:19 2009 -0700
    ipv4: Fix fib_trie rebalancing, part 3

plus the new patch below
    ipv4: Fix fib_trie rebalancing, part 4 (root thresholds)

2.6.29 needs:
-------------
this patch from 2.6.30:
commit 3ed18d76d959e5cbfa5d70c8f7ba95476582a556
Author: Robert Olsson <robert.olsson@its.uu.se>
Date: Thu May 21 15:20:59 2009 -0700
    ipv4: Fix oops with FIB_TRIE

plus the above-mentioned patches for 2.6.30 (part 1 - 4)
-----------------

David, if possible, please add to all these "Fix... part 1 - 4":
Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>

This new patch below is intended only for -stable (and later for
net-next), because it doesn't meet rules of the current -rc. Anyway,
it's not critical (but it actually fixes a regression from 2.6.22).

Thanks,
Jarek P.
---------------->
ipv4: Fix fib_trie rebalancing, part 4 (root thresholds)

Pawel Staszewski wrote:
<blockquote>
Some time ago i report this:
http://bugzilla.kernel.org/show_bug.cgi?id=6648
and now with 2.6.29 / 2.6.29.1 / 2.6.29.3 and 2.6.30 it back
dmesg output:
oprofile: using NMI interrupt.
Fix inflate_threshold_root. Now=15 size=11 bits
...
Fix inflate_threshold_root. Now=15 size=11 bits
cat /proc/net/fib_triestat
Basic info: size of leaf: 40 bytes, size of tnode: 56 bytes.
Main:
Aver depth: 2.28
Max depth: 6
Leaves: 276539
Prefixes: 289922
Internal nodes: 66762
1: 35046 2: 13824 3: 9508 4: 4897 5: 2331 6: 1149 7: 5
9: 1 18: 1
Pointers: 691228
Null ptrs: 347928
Total size: 35709 kB
</blockquote>

It seems, the current threshold for root resizing is too aggressive,
and it causes misleading warnings during big updates, but it might be
also responsible for memory problems, especially with non-preempt
configs, when RCU freeing is delayed long after call_rcu.

It should be also mentionned that because of non-atomic changes during
resizing/rebalancing the current lookup algorithm can miss valid leafs
so it's additional argument to shorten these activities even at a cost
of a minimally longer searching.

This patch restores values before the patch "[IPV4]: fib_trie root
node settings", commit: 965ffea43d4ebe8cd7b9fee78d651268dd7d23c5 from
v2.6.22.

Pawel's report:
<blockquote>
I dont see any big change of (cpu load or faster/slower
routing/propagating routes from bgpd or something else) - in avg there
is from 2% to 3% more of CPU load i dont know why but it is - i change
from "preempt" to "no preempt" 3 times and check this my "mpstat -P ALL
1 30"
always avg cpu load was from 2 to 3% more compared to "no preempt"
[...]
cat /proc/net/fib_triestat
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
Aver depth: 2.44
Max depth: 6
Leaves: 277814
Prefixes: 291306
Internal nodes: 66420
1: 32737 2: 14850 3: 10332 4: 4871 5: 2313 6: 942 7: 371 8: 3 17: 1
Pointers: 599098
Null ptrs: 254865
Total size: 18067 kB
</blockquote>

According to this and other similar reports average depth is slightly
increased (~0.2), and root nodes are shorter (log 17 vs. 18), but
there is no visible performance decrease. So, until memory handling is
improved or added parameters for changing this individually, this
patch resets to safer defaults.

Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
Reported-by: Jorge Boncompte [DTI2] <jorge@dti2.net>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
---
 net/ipv4/fib_trie.c | 4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 00a54b2..63c2fa7 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -316,8 +316,8 @@ static inline void check_tnode(const struct tnode *tn)

 static const int halve_threshold = 25;
 static const int inflate_threshold = 50;
-static const int halve_threshold_root = 8;
-static const int inflate_threshold_root = 15;
+static const int halve_threshold_root = 15;
+static const int inflate_threshold_root = 25;

 static void __alias_free_mem(struct rcu_head *head)
^ permalink raw reply related [flat|nested] 99+ messages in thread
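The two fib_triestat reports quoted in the changelog can be compared by the share of NULL child slots in the Main trie. The helper below is only an illustrative sketch: the function name is made up here, and the inputs are the "Pointers" and "Null ptrs" figures from the reports above. With those figures it gives roughly 50% NULL slots before the patch and roughly 43% after. The two reports also show different leaf and tnode sizes (40/56 bytes vs. 20/36 bytes), so the "Total size" lines presumably come from different builds and are not directly comparable.

```c
/* Percentage of child slots that are NULL, computed from the
 * "Pointers" and "Null ptrs" lines of /proc/net/fib_triestat.
 * Hypothetical helper, not part of the kernel. */
double null_ptr_percent(unsigned long pointers, unsigned long null_ptrs)
{
    if (pointers == 0)
        return 0.0; /* avoid division by zero on an empty trie */
    return 100.0 * (double)null_ptrs / (double)pointers;
}
```

For example, null_ptr_percent(691228, 347928) covers the first report and null_ptr_percent(599098, 254865) the second.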
* [PATCH v2 -stable] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-07-05 0:26 ` Paweł Staszewski
` (2 preceding siblings ...)
2009-07-05 12:56 ` [PATCH -stable] " Jarek Poplawski
@ 2009-07-05 13:08 ` Jarek Poplawski
2009-07-08 2:42 ` David Miller
3 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-07-05 13:08 UTC (permalink / raw)
To: David Miller
Cc: Paweł Staszewski, Linux Network Development list, Robert Olsson, Jorge Boncompte [DTI2]

(Take 2: Changelog spelling fixes, sorry.)

David & Robert, below are my recommendations for -stable plus one
more patch:

On Sun, Jul 05, 2009 at 02:26:54AM +0200, Paweł Staszewski wrote:
...
>>>>> Jarek Poplawski wrote:
...
>>>>>> So, after your findings I'm about to recommend sending to -stable
>>>>>> 3 patches from net-2.6, with additional lowering of threshold_root
>>>>>> settings, but it would be nice if you could give it a try with
>>>>>> CONFIG_PREEMPT instead of CONFIG_PREEMPT_NONE (if it doesn't break
>>>>>> your other apps!) It is expected to work this time...;-) Maybe a
>>>>>> bit slower.
>>>>>>
>>>>>>
> Ok kernel configured with CONFIG_PREEMPT
> and all this day work without any problems (with Jarek last patch).
>
>
> So in attached file there is fib_triestats
> I dont see any big change of (cpu load or faster/slower
> routing/propagating routes from bgpd or something else) - in avg there
> is from 2% to 3% more of CPU load i dont know why but it is - i change
> from "preempt" to "no preempt" 3 times and check this my "mpstat -P ALL
> 1 30"
> always avg cpu load was from 2 to 3% more compared to "no preempt"
>
> Regards
> Paweł Staszewski

So after these patches from net-2.6 are tested both for PREEMPT and
PREEMPT_NONE I think they should go to -stable:

2.6.30 needs:
-------------
commit e0f7cb8c8cc6cccce28d2ce39ad8c60d23c3799f
Author: Jarek Poplawski <jarkao2@gmail.com>
Date: Mon Jun 15 02:31:29 2009 -0700
    ipv4: Fix fib_trie rebalancing

commit 7b85576d15bf2574b0a451108f59f9ad4170dd3f
Author: Jarek Poplawski <jarkao2@gmail.com>
Date: Thu Jun 18 00:28:51 2009 -0700
    ipv4: Fix fib_trie rebalancing, part 2

commit 008440e3ad4b72f5048d1b1f6f5ed894fdc5ad08
Author: Jarek Poplawski <jarkao2@gmail.com>
Date: Tue Jun 30 12:47:19 2009 -0700
    ipv4: Fix fib_trie rebalancing, part 3

plus the new patch below
    ipv4: Fix fib_trie rebalancing, part 4 (root thresholds)

2.6.29 needs:
-------------
this patch from 2.6.30:
commit 3ed18d76d959e5cbfa5d70c8f7ba95476582a556
Author: Robert Olsson <robert.olsson@its.uu.se>
Date: Thu May 21 15:20:59 2009 -0700
    ipv4: Fix oops with FIB_TRIE

plus the above-mentioned patches for 2.6.30 (part 1 - 4)
-----------------

David, if possible, please add to all these "Fix... part 1 - 4":
Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>

This new patch below is intended only for -stable (and later for
net-next), because it doesn't meet rules of the current -rc. Anyway,
it's not critical (but it actually fixes a regression from 2.6.22).

Thanks,
Jarek P.
---------------->
ipv4: Fix fib_trie rebalancing, part 4 (root thresholds)

Pawel Staszewski wrote:
<blockquote>
Some time ago i report this:
http://bugzilla.kernel.org/show_bug.cgi?id=6648
and now with 2.6.29 / 2.6.29.1 / 2.6.29.3 and 2.6.30 it back
dmesg output:
oprofile: using NMI interrupt.
Fix inflate_threshold_root. Now=15 size=11 bits
...
Fix inflate_threshold_root. Now=15 size=11 bits
cat /proc/net/fib_triestat
Basic info: size of leaf: 40 bytes, size of tnode: 56 bytes.
Main:
Aver depth: 2.28
Max depth: 6
Leaves: 276539
Prefixes: 289922
Internal nodes: 66762
1: 35046 2: 13824 3: 9508 4: 4897 5: 2331 6: 1149 7: 5
9: 1 18: 1
Pointers: 691228
Null ptrs: 347928
Total size: 35709 kB
</blockquote>

It seems, the current threshold for root resizing is too aggressive,
and it causes misleading warnings during big updates, but it might be
also responsible for memory problems, especially with non-preempt
configs, when RCU freeing is delayed long after call_rcu.

It should be also mentioned that because of non-atomic changes during
resizing/rebalancing the current lookup algorithm can miss valid leaves
so it's additional argument to shorten these activities even at a cost
of a minimally longer searching.

This patch restores values before the patch "[IPV4]: fib_trie root
node settings", commit: 965ffea43d4ebe8cd7b9fee78d651268dd7d23c5 from
v2.6.22.

Pawel's report:
<blockquote>
I dont see any big change of (cpu load or faster/slower
routing/propagating routes from bgpd or something else) - in avg there
is from 2% to 3% more of CPU load i dont know why but it is - i change
from "preempt" to "no preempt" 3 times and check this my "mpstat -P ALL
1 30"
always avg cpu load was from 2 to 3% more compared to "no preempt"
[...]
cat /proc/net/fib_triestat
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
Aver depth: 2.44
Max depth: 6
Leaves: 277814
Prefixes: 291306
Internal nodes: 66420
1: 32737 2: 14850 3: 10332 4: 4871 5: 2313 6: 942 7: 371 8: 3 17: 1
Pointers: 599098
Null ptrs: 254865
Total size: 18067 kB
</blockquote>

According to this and other similar reports average depth is slightly
increased (~0.2), and root nodes are shorter (log 17 vs. 18), but
there is no visible performance decrease. So, until memory handling is
improved or added parameters for changing this individually, this
patch resets to safer defaults.

Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
Reported-by: Jorge Boncompte [DTI2] <jorge@dti2.net>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
---
 net/ipv4/fib_trie.c | 4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 00a54b2..63c2fa7 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -316,8 +316,8 @@ static inline void check_tnode(const struct tnode *tn)

 static const int halve_threshold = 25;
 static const int inflate_threshold = 50;
-static const int halve_threshold_root = 8;
-static const int inflate_threshold_root = 15;
+static const int halve_threshold_root = 15;
+static const int inflate_threshold_root = 25;

 static void __alias_free_mem(struct rcu_head *head)
^ permalink raw reply related [flat|nested] 99+ messages in thread
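To see why raising the two root thresholds makes the root resize less eagerly, here is a minimal userspace sketch of the inflate/halve tests. It is written from memory of how resize() in net/ipv4/fib_trie.c compared node occupancy against the thresholds around 2.6.30, so treat the exact formulas as an assumption rather than a copy of the kernel code. The struct and function names are hypothetical stand-ins, not the kernel's struct tnode.

```c
#include <stdbool.h>

/* Simplified stand-in for a trie node; field names are illustrative. */
struct tnode_stats {
    unsigned int bits;           /* node has 2^bits child slots */
    unsigned int full_children;  /* children that are full internal nodes */
    unsigned int empty_children; /* NULL child slots */
};

unsigned long child_length(const struct tnode_stats *tn)
{
    return 1UL << tn->bits;
}

/* Assumed inflate test: doubling is worthwhile when the node that
 * would result is still filled above the threshold (in percent). */
bool should_inflate(const struct tnode_stats *tn, int threshold)
{
    unsigned long len = child_length(tn);

    return tn->full_children > 0 &&
           50UL * (tn->full_children + len - tn->empty_children) >=
               (unsigned long)threshold * len;
}

/* Assumed halve test: shrink when the share of non-empty slots
 * drops below the threshold (in percent). */
bool should_halve(const struct tnode_stats *tn, int threshold)
{
    unsigned long len = child_length(tn);

    return tn->bits > 1 &&
           100UL * (len - tn->empty_children) <
               (unsigned long)threshold * len;
}
```

Under these assumed formulas, an 11-bit root (2048 slots) with 100 full children and 1500 empty slots would still be inflated with inflate_threshold_root = 15 but not with 25, and a root with 1800 empty slots would be halved with halve_threshold_root = 15 but not with 8. That matches the intent stated in the changelog: a smaller, less frequently resized root at the cost of slightly deeper lookups.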
* Re: [PATCH v2 -stable] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-07-05 13:08 ` [PATCH v2 " Jarek Poplawski
@ 2009-07-08 2:42 ` David Miller
2009-07-08 6:44 ` Jarek Poplawski
0 siblings, 1 reply; 99+ messages in thread
From: David Miller @ 2009-07-08 2:42 UTC (permalink / raw)
To: jarkao2; +Cc: pstaszewski, netdev, robert, jorge

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Sun, 5 Jul 2009 15:08:28 +0200

> This new patch below is intended only for -stable (and later for
> net-next), because it doesn't meet rules of the current -rc. Anyway,
> it's not critical (but it actually fixes a regression from 2.6.22).

I think if we're going to toss this into -stable, we should
put it into net-2.6 too, and that's what I'm going to do.

Once this makes its way to Linus I'll work on the -stable
submissions.

And I'll make sure to add the tested-by tags, as you mentioned.

Thanks!
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH v2 -stable] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-07-08 2:42 ` David Miller
@ 2009-07-08 6:44 ` Jarek Poplawski
0 siblings, 0 replies; 99+ messages in thread
From: Jarek Poplawski @ 2009-07-08 6:44 UTC (permalink / raw)
To: David Miller; +Cc: pstaszewski, netdev, robert, jorge

On Tue, Jul 07, 2009 at 07:42:08PM -0700, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Sun, 5 Jul 2009 15:08:28 +0200
>
> > This new patch below is intended only for -stable (and later for
> > net-next), because it doesn't meet rules of the current -rc. Anyway,
> > it's not critical (but it actually fixes a regression from 2.6.22).
>
> I think if we're going to toss this into -stable, we should
> put it into net-2.6 too, and that's what I'm going to do.

It's your decision: I don't think this patch is worth any arguing
about (de)stabilizing. Btw., since -stable rules are less strict, it
seems natural that such bug-fix patches should rather go the
net-next -> -stable way, unless I'm missing something?

Thanks,
Jarek P.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-29 9:51 ` Paweł Staszewski
2009-06-29 10:47 ` Jarek Poplawski
@ 2009-06-29 10:58 ` Jarek Poplawski
2009-06-30 19:48 ` David Miller
1 sibling, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-29 10:58 UTC (permalink / raw)
To: Paweł Staszewski
Cc: David Miller, Robert Olsson, Robert Olsson, Jorge Boncompte [DTI2], Eric Dumazet, Robert Olsson, Linux Network Development list

On Mon, Jun 29, 2009 at 11:51:52AM +0200, Paweł Staszewski wrote:
> I apply this patch
>
> fib_triestats in attached file :)
>
>> ------------------->
>> ipv4: Fix fib_trie rebalancing, part 3
>>
>> Alas current delaying of freeing old tnodes by RCU in trie_rebalance
>> is still not enough because we can free a top tnode before updating a
>> t->trie pointer.
>>
>> Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
>> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
>> ---

David, I guess you could add:

Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>

Thanks,
Jarek P.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-29 10:58 ` [PATCH net-2.6] " Jarek Poplawski
@ 2009-06-30 19:48 ` David Miller
2009-06-30 20:14 ` Jarek Poplawski
2009-07-10 15:29 ` Stephen Hemminger
0 siblings, 2 replies; 99+ messages in thread
From: David Miller @ 2009-06-30 19:48 UTC (permalink / raw)
To: jarkao2
Cc: pstaszewski, robert, Robert.Olsson, jorge, dada1, robert.olsson, netdev

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Mon, 29 Jun 2009 10:58:20 +0000

> On Mon, Jun 29, 2009 at 11:51:52AM +0200, Paweł Staszewski wrote:
>> I apply this patch
>>
>> fib_triestats in attached file :)
>>
>>> ------------------->
>>> ipv4: Fix fib_trie rebalancing, part 3
>>>
>>> Alas current delaying of freeing old tnodes by RCU in trie_rebalance
>>> is still not enough because we can free a top tnode before updating a
>>> t->trie pointer.
>>>
>>> Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
>>> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
>>> ---
>
> David, I guess you could add:
>
> Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>

Done, and applied, thanks Jarek.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-30 19:48 ` David Miller
@ 2009-06-30 20:14 ` Jarek Poplawski
2009-07-10 15:29 ` Stephen Hemminger
1 sibling, 0 replies; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-30 20:14 UTC (permalink / raw)
To: David Miller
Cc: pstaszewski, robert, Robert.Olsson, jorge, dada1, robert.olsson, netdev

On Tue, Jun 30, 2009 at 12:48:49PM -0700, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Mon, 29 Jun 2009 10:58:20 +0000
>
> > On Mon, Jun 29, 2009 at 11:51:52AM +0200, Paweł Staszewski wrote:
> >> I apply this patch
> >>
> >> fib_triestats in attached file :)
> >>
> >>> ------------------->
> >>> ipv4: Fix fib_trie rebalancing, part 3
> >>>
> >>> Alas current delaying of freeing old tnodes by RCU in trie_rebalance
> >>> is still not enough because we can free a top tnode before updating a
> >>> t->trie pointer.
> >>>
> >>> Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
> >>> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
> >>> ---
> >
> > David, I guess you could add:
> >
> > Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
>
> Done, and applied, thanks Jarek.

Btw., a little comment: there are still some issues when trying to
reclaim memory after synchronize_rcu, which means either the algorithm
is buggy, or the RCU use is still buggy, or maybe there is some timing
problem caused by synchronize_rcu. Anyway, fib_trie still seems to be
safe only with CONFIG_PREEMPT_NONE, so I have no idea how this should
be fixed in -stable (or why people don't report this BUG in 2.6.30
more often)...

Thanks,
Jarek P.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
2009-06-30 19:48 ` David Miller
2009-06-30 20:14 ` Jarek Poplawski
@ 2009-07-10 15:29 ` Stephen Hemminger
1 sibling, 0 replies; 99+ messages in thread
From: Stephen Hemminger @ 2009-07-10 15:29 UTC (permalink / raw)
To: David Miller
Cc: jarkao2, pstaszewski, robert, Robert.Olsson, jorge, dada1, robert.olsson, netdev

On Tue, 30 Jun 2009 12:48:49 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Mon, 29 Jun 2009 10:58:20 +0000
>
> > On Mon, Jun 29, 2009 at 11:51:52AM +0200, Paweł Staszewski wrote:
> >> I apply this patch
> >>
> >> fib_triestats in attached file :)
> >>
> >>> ------------------->
> >>> ipv4: Fix fib_trie rebalancing, part 3
> >>>
> >>> Alas current delaying of freeing old tnodes by RCU in trie_rebalance
> >>> is still not enough because we can free a top tnode before updating a
> >>> t->trie pointer.
> >>>
> >>> Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
> >>> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
> >>> ---
> >
> > David, I guess you could add:
> >
> > Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
>
> Done, and applied, thanks Jarek.
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

This is probably in kernel bugzilla as well, so someone should update:
http://bugzilla.kernel.org/show_bug.cgi?id=6648
--
^ permalink raw reply [flat|nested] 99+ messages in thread
end of thread, other threads:[~2009-07-20 14:41 UTC | newest]
Thread overview: 99+ messages
2009-06-25 15:48 rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits Paweł Staszewski
2009-06-25 21:19 ` Eric Dumazet
2009-06-25 21:52 ` Paweł Staszewski
2009-06-25 22:54 ` Eric Dumazet
2009-06-26 10:06 ` Paweł Staszewski
2009-06-26 10:34 ` Eric Dumazet
2009-06-26 10:47 ` Paweł Staszewski
2009-06-26 10:52 ` Eric Dumazet
2009-06-26 17:26 ` Paweł Staszewski
2009-06-26 8:03 ` Jarek Poplawski
2009-06-26 9:19 ` Robert Olsson
2009-06-26 9:37 ` Jarek Poplawski
2009-06-26 10:26 ` Jorge Boncompte [DTI2]
2009-06-26 12:42 ` Robert Olsson
2009-06-26 12:54 ` Jarek Poplawski
2009-06-26 13:28 ` Jarek Poplawski
2009-06-26 13:52 ` Robert Olsson
2009-06-26 15:10 ` Jarek Poplawski
2009-06-26 15:30 ` Paul E. McKenney
2009-06-26 15:54 ` Jarek Poplawski
2009-06-26 16:15 ` Jarek Poplawski
2009-06-26 16:23 ` Paul E. McKenney
2009-06-26 16:45 ` Jarek Poplawski
2009-06-26 17:05 ` Paul E. McKenney
2009-06-26 18:05 ` Jarek Poplawski
2009-06-26 18:21 ` Paul E. McKenney
2009-06-26 20:19 ` Jarek Poplawski
2009-06-26 20:26 ` Robert Olsson
2009-06-26 20:37 ` Jarek Poplawski
2009-06-26 21:20 ` Jarek Poplawski
2009-06-27 19:20 ` Jarek Poplawski
2009-06-27 20:51 ` Jarek Poplawski
2009-06-28 0:28 ` Paweł Staszewski
2009-06-28 11:11 ` Robert Olsson
2009-06-29 7:57 ` Paweł Staszewski
2009-06-28 11:04 ` Robert Olsson
2009-06-28 12:03 ` Jarek Poplawski
2009-06-28 14:35 ` Jarek Poplawski
2009-06-28 15:32 ` Paweł Staszewski
2009-06-28 15:48 ` Paweł Staszewski
2009-06-28 19:56 ` Jarek Poplawski
2009-06-28 21:36 ` Jarek Poplawski
2009-06-29 8:08 ` Paweł Staszewski
2009-06-29 8:47 ` Paweł Staszewski
2009-06-29 9:27 ` Jarek Poplawski
2009-06-29 9:43 ` Paweł Staszewski
2009-06-29 8:33 ` [PATCH net-2.6] " Jarek Poplawski
2009-06-29 9:51 ` Paweł Staszewski
2009-06-29 10:47 ` Jarek Poplawski
2009-06-29 16:24 ` Paweł Staszewski
2009-06-29 17:09 ` Jarek Poplawski
2009-06-30 7:09 ` Jarek Poplawski
2009-06-30 20:16 ` Paweł Staszewski
2009-06-30 20:41 ` Jarek Poplawski
2009-06-30 23:31 ` Paweł Staszewski
2009-07-01 6:36 ` Jarek Poplawski
[not found] ` <20090701072409.GA12592@ff.dom.local>
2009-07-01 9:43 ` Paweł Staszewski
2009-07-01 9:50 ` Paweł Staszewski
2009-07-01 10:13 ` Jarek Poplawski
2009-07-01 11:04 ` Jarek Poplawski
2009-07-01 22:17 ` Paweł Staszewski
2009-07-02 5:32 ` Jarek Poplawski
2009-07-02 5:43 ` Paweł Staszewski
2009-07-02 6:00 ` Jarek Poplawski
2009-07-02 15:31 ` Robert Olsson
2009-07-02 19:06 ` Jarek Poplawski
2009-07-02 21:32 ` Robert Olsson
2009-07-02 22:13 ` Jarek Poplawski
2009-07-05 0:26 ` Paweł Staszewski
2009-07-05 0:30 ` Paweł Staszewski
2009-07-05 16:20 ` Jarek Poplawski
2009-07-05 17:32 ` Jarek Poplawski
2009-07-05 21:32 ` Paul E. McKenney
2009-07-05 22:23 ` Jarek Poplawski
2009-07-05 23:53 ` Paweł Staszewski
2009-07-06 9:02 ` Jarek Poplawski
2009-07-07 22:56 ` Paweł Staszewski
2009-07-07 23:50 ` Jarek Poplawski
2009-07-09 20:34 ` Paweł Staszewski
2009-07-14 19:41 ` [PATCH net-next] " Jarek Poplawski
2009-07-15 7:43 ` Robert Olsson
2009-07-15 13:05 ` Jarek Poplawski
2009-07-17 8:08 ` Robert Olsson
2009-07-20 14:41 ` David Miller
2009-07-07 23:23 ` [PATCH net-2.6] " Paweł Staszewski
2009-07-07 23:30 ` Paweł Staszewski
2009-07-14 18:33 ` [PATCH net-next] " Jarek Poplawski
2009-07-20 14:41 ` David Miller
2009-07-14 21:20 ` [PATCH net-next] ipv4: fib_trie: Use tnode_get_child_rcu() and node_parent_rcu() in lookups Jarek Poplawski
2009-07-20 14:41 ` David Miller
2009-07-05 0:31 ` [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits Paweł Staszewski
2009-07-05 12:56 ` [PATCH -stable] " Jarek Poplawski
2009-07-05 13:08 ` [PATCH v2 " Jarek Poplawski
2009-07-08 2:42 ` David Miller
2009-07-08 6:44 ` Jarek Poplawski
2009-06-29 10:58 ` [PATCH net-2.6] " Jarek Poplawski
2009-06-30 19:48 ` David Miller
2009-06-30 20:14 ` Jarek Poplawski
2009-07-10 15:29 ` Stephen Hemminger