* debugging kernel during packet drops @ 2010-03-22 10:41 Jorrit Kronjee 2010-03-22 17:16 ` Patrick McHardy 2010-03-23 17:04 ` James King 0 siblings, 2 replies; 29+ messages in thread From: Jorrit Kronjee @ 2010-03-22 10:41 UTC (permalink / raw) To: netfilter-devel Dear list, I've asked this question on the kernelnewbies forum, but I haven't got any responses. I hope someone here is able to help me. I'm trying to build a setup that allows me to limit the amount of packets/s per destination IP address. The setup I use for this is as follows: [ DoS machine ] -> [ bridging firewall ] -> [ receiving network ] I used brctl to build the bridge. The DoS machine has a custom built tool that allows me to send small packets at very fast rates. I've discovered that bridging still works reliably at around 300 kpackets/s (notice the 'k' in there). However, as said before, I was trying to limit the amount of packets/s, so I used netfilter's hashlimit module. This is when packet drops started to appear. At around 300 kpps, the amount of packet drops is 40 kpps. For me, this amount is too significant to ignore. I see the load average go from a comfortable 0.00 to 1.78, mainly caused by ksoftirqd processes. At 200 kpps, the average amount of packet drops is 23 kpps. At 100 kpps, it's still 2 kpps. When I disable the hashlimit module the packet drops disappear again. Now I know that hashlimit is made for more than one thing, namely limiting packets based on source/destination host and source/destination port, so it's not as efficient as it could be for my purposes. I could rewrite it, but before I do that, I would like to know if the module itself is really what's causing it, or if there's some underlying cause that I'm not seeing. So my question in short: how can I discover why it's dropping packets? Some details about the machine: network controllers: 00:19.0 Ethernet controller: Intel Corporation 82566DM-2 Gigabit Network Connection (rev 02) 04:02.0 Ethernet controller: Intel Corporation 82541GI Gigabit Ethernet Controller (rev 05) drivers: driver: e1000e version: 1.1.2.1a-NAPI firmware-version: 1.3-0 bus-info: 0000:00:19.0 driver: e1000 version: 7.3.21-k3-NAPI firmware-version: N/A bus-info: 0000:04:02.0 CPU: Intel Xeon CPU X3330 @ 2.66 GHz Regards, Jorrit Kronjee ^ permalink raw reply [flat|nested] 29+ messages in thread
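For context, the kind of hashlimit rule involved here looks roughly like the sketch below; the rate, burst and table name are illustrative guesses rather than values taken from the thread, and --hashlimit-above assumes an iptables recent enough to support hashlimit revision 1. On a brctl bridge the packets reach the iptables FORWARD chain because bridge-netfilter hands bridged IPv4 traffic to ip_tables.

    # Illustrative only: drop traffic to any single destination IP above ~1000 pkt/s
    iptables -A FORWARD -m hashlimit \
            --hashlimit-name per-dst \
            --hashlimit-mode dstip \
            --hashlimit-above 1000/sec \
            --hashlimit-burst 100 \
            -j DROP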
* Re: debugging kernel during packet drops 2010-03-22 10:41 debugging kernel during packet drops Jorrit Kronjee @ 2010-03-22 17:16 ` Patrick McHardy 2010-03-22 17:53 ` Jan Engelhardt 2010-03-23 15:14 ` Jorrit Kronjee 2010-03-23 17:04 ` James King 1 sibling, 2 replies; 29+ messages in thread From: Patrick McHardy @ 2010-03-22 17:16 UTC (permalink / raw) To: Jorrit Kronjee; +Cc: netfilter-devel Jorrit Kronjee wrote: > Dear list, > > I've asked this question on the kernelnewbies forum, but I haven't got > any responses. I hope someone here is able to help me. I'm trying to > build a setup that allows me to limit the amount of packets/s per > destination IP address. The setup I use for this is as follows: > > [ DoS machine ] -> [ bridging firewall ] -> [ receiving network ] > > I used brctl to build the bridge. The DoS machine has a custom built > tool that allows me to send small packets at very fast rates. I've > discovered that bridging still works reliably at around 300 kpackets/s > (notice the 'k' in there). However, as said before, I was trying to > limit the amount of packets/s, so I used netfilter's hashlimit module. > This is when packet drops started to appear. > > At around 300 kpps, the amount of packet drops is 40 kpps. For me, this > amount is too significant to ignore. I see the load average go from a > comfortable 0.00 to 1.78, mainly caused by ksoftirqd processes. At 200 > kpps, the average amount of packet drops is 23 kpps. At 100 kpps, it's > still 2 kpps. > > When I disable the hashlimit module the packet drops disappear again. > Now I know that hashlimit is made for more than one thing, namely > limiting packets based on source/destination host and source/destination > port, so it's not as efficient as it could be for my purposes. I could > rewrite it, but before I do that, I would like to know if the module > itself is really what's causing it, or if there's some underlying cause > that I'm not seeing. So my question in short: how can I discover why > it's dropping packets? > A couple of suggestions: - try the limit module in case you don't actually need per-source/dest etc. limiting but just a global limit - try using TBF or ingress policing. Both limit and hashlimit suffer of problems regarding the resolution of the applied TBF. I don't remember the exact range of values it is able to handle, but IIRC you should be able to find it in the netfilter bugzilla. - if you use TBF or ingress policing and don't need ip_tables specific modules, disabling bridge netfilter invocation of ip_tables through /proc should increase performance. ^ permalink raw reply [flat|nested] 29+ messages in thread
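A rough sketch of the last two suggestions; device names and rates are placeholders rather than values from this thread, and both TBF and an ingress policer limit the aggregate rate, not a per-destination rate:

    # Keep bridged traffic out of ip_tables entirely (only if no iptables rules are needed):
    echo 0 > /proc/sys/net/bridge/bridge-nf-call-iptables

    # Egress shaping with TBF on the bridge's outbound port:
    tc qdisc add dev eth1 root tbf rate 50mbit burst 64kb latency 50ms

    # ...or policing at ingress on the inbound port:
    tc qdisc add dev eth0 handle ffff: ingress
    tc filter add dev eth0 parent ffff: protocol ip u32 \
            match ip src 0.0.0.0/0 \
            police rate 50mbit burst 64k drop flowid :1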
* Re: debugging kernel during packet drops 2010-03-22 17:16 ` Patrick McHardy @ 2010-03-22 17:53 ` Jan Engelhardt 2010-03-22 18:02 ` Patrick McHardy 2010-03-23 15:14 ` Jorrit Kronjee 1 sibling, 1 reply; 29+ messages in thread From: Jan Engelhardt @ 2010-03-22 17:53 UTC (permalink / raw) To: Patrick McHardy; +Cc: Jorrit Kronjee, netfilter-devel On Monday 2010-03-22 18:16, Patrick McHardy wrote: >> >> I used brctl to build the bridge. The DoS machine has a custom built >> tool that allows me to send small packets at very fast rates. I've >> discovered that bridging still works reliably at around 300 kpackets/s >> (notice the 'k' in there). However, as said before, I was trying to >> limit the amount of packets/s, so I used netfilter's hashlimit module. >> This is when packet drops started to appear. >> >> At around 300 kpps, the amount of packet drops is 40 kpps. For me, this >> amount is too significant to ignore. I see the load average go from a >> comfortable 0.00 to 1.78, mainly caused by ksoftirqd processes. At 200 >> kpps, the average amount of packet drops is 23 kpps. At 100 kpps, it's >> still 2 kpps. >A couple of suggestions: > >- try the limit module in case you don't actually need per-source/dest etc. > limiting but just a global limit The token-per-jiffy math logic used in xt_limit and some other modules is known to be inaccurate at high speeds. My suggestion is therefore to try xt_rateest instead which has a somewhat different logic. ^ permalink raw reply [flat|nested] 29+ messages in thread
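xt_rateest works in two parts: a RATEEST target that feeds a rate estimator, and a rateest match that compares traffic against it. A minimal sketch with made-up names and thresholds; the option spellings below are assumptions to be checked against the installed iptables version:

    # Maintain an estimator over forwarded traffic:
    iptables -t mangle -A POSTROUTING -j RATEEST --rateest-name fwd \
            --rateest-interval 250ms --rateest-ewmalog 500ms

    # Act once the estimated packet rate exceeds a threshold:
    iptables -A FORWARD -m rateest --rateest fwd --rateest-gt \
            --rateest-pps 100000 -j DROP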
* Re: debugging kernel during packet drops 2010-03-22 17:53 ` Jan Engelhardt @ 2010-03-22 18:02 ` Patrick McHardy 0 siblings, 0 replies; 29+ messages in thread From: Patrick McHardy @ 2010-03-22 18:02 UTC (permalink / raw) To: Jan Engelhardt; +Cc: Jorrit Kronjee, netfilter-devel Jan Engelhardt wrote: > On Monday 2010-03-22 18:16, Patrick McHardy wrote: >>> I used brctl to build the bridge. The DoS machine has a custom built >>> tool that allows me to send small packets at very fast rates. I've >>> discovered that bridging still works reliably at around 300 kpackets/s >>> (notice the 'k' in there). However, as said before, I was trying to >>> limit the amount of packets/s, so I used netfilter's hashlimit module. >>> This is when packet drops started to appear. >>> >>> At around 300 kpps, the amount of packet drops is 40 kpps. For me, this >>> amount is too significant to ignore. I see the load average go from a >>> comfortable 0.00 to 1.78, mainly caused by ksoftirqd processes. At 200 >>> kpps, the average amount of packet drops is 23 kpps. At 100 kpps, it's >>> still 2 kpps. > >> A couple of suggestions: >> >> - try the limit module in case you don't actually need per-source/dest etc. >> limiting but just a global limit > > The token-per-jiffy math logic used in xt_limit and some other > modules is known to be inaccurate at high speeds. > > My suggestion is therefore to try xt_rateest instead which has > a somewhat different logic. Good point, I forgot about xt_rateest :) ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: debugging kernel during packet drops 2010-03-22 17:16 ` Patrick McHardy 2010-03-22 17:53 ` Jan Engelhardt @ 2010-03-23 15:14 ` Jorrit Kronjee 2010-03-23 15:39 ` Patrick McHardy 2010-03-23 17:21 ` Eric Dumazet 1 sibling, 2 replies; 29+ messages in thread From: Jorrit Kronjee @ 2010-03-23 15:14 UTC (permalink / raw) To: Patrick McHardy, netfilter-devel On 3/22/2010 6:16 PM, Patrick McHardy wrote: > Jorrit Kronjee wrote: > >> Dear list, >> >> I've asked this question on the kernelnewbies forum, but I haven't got >> any responses. I hope someone here is able to help me. I'm trying to >> build a setup that allows me to limit the amount of packets/s per >> destination IP address. The setup I use for this is as follows: >> >> [ DoS machine ] -> [ bridging firewall ] -> [ receiving network ] >> >> I used brctl to build the bridge. The DoS machine has a custom built >> tool that allows me to send small packets at very fast rates. I've >> discovered that bridging still works reliably at around 300 kpackets/s >> (notice the 'k' in there). However, as said before, I was trying to >> limit the amount of packets/s, so I used netfilter's hashlimit module. >> This is when packet drops started to appear. >> >> At around 300 kpps, the amount of packet drops is 40 kpps. For me, this >> amount is too significant to ignore. I see the load average go from a >> comfortable 0.00 to 1.78, mainly caused by ksoftirqd processes. At 200 >> kpps, the average amount of packet drops is 23 kpps. At 100 kpps, it's >> still 2 kpps. >> >> When I disable the hashlimit module the packet drops disappear again. >> Now I know that hashlimit is made for more than one thing, namely >> limiting packets based on source/destination host and source/destination >> port, so it's not as efficient as it could be for my purposes. I could >> rewrite it, but before I do that, I would like to know if the module >> itself is really what's causing it, or if there's some underlying cause >> that I'm not seeing. So my question in short: how can I discover why >> it's dropping packets? >> >> > A couple of suggestions: > > - try the limit module in case you don't actually need per-source/dest etc. > limiting but just a global limit > > - try using TBF or ingress policing. Both limit and hashlimit suffer of > problems > regarding the resolution of the applied TBF. I don't remember the > exact range > of values it is able to handle, but IIRC you should be able to find it > in the > netfilter bugzilla. > > - if you use TBF or ingress policing and don't need ip_tables specific > modules, > disabling bridge netfilter invocation of ip_tables through /proc > should increase > performance. > > Patrick, Although these are good suggestions, I really need to be able to limit per destination. The receiving network is a /15 which means I have to use something like a hashtable to keep track of destination IP addresses. Neither rateest or limit can do that. OTOH, that's also the only thing I need. This would make a low-cost ISP-grade DDoS filter, which is why I'm interested in it. The bug you're referring to is this one, I think: http://bugzilla.netfilter.org/show_bug.cgi?id=523 but I'm not entirely sure if that is related to my problems. Is there any way I can figure out why ifconfig is reporting dropped packets? Thanks for all the help so far! Regards, Jorrit Kronjee ^ permalink raw reply [flat|nested] 29+ messages in thread
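On the question of where the drops reported by ifconfig come from, the usual counters to compare are these (interface names are examples):

    ip -s -s link show eth0                      # kernel per-interface rx/tx drop and error breakdown
    ethtool -S eth0 | grep -iE 'drop|miss|err'   # NIC/driver statistics; rx_missed_errors usually means ring/FIFO overflow
    cat /proc/net/softnet_stat                   # per-CPU; the 2nd column counts packets dropped from the softnet backlog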
* Re: debugging kernel during packet drops 2010-03-23 15:14 ` Jorrit Kronjee @ 2010-03-23 15:39 ` Patrick McHardy 2010-03-23 17:21 ` Eric Dumazet 1 sibling, 0 replies; 29+ messages in thread From: Patrick McHardy @ 2010-03-23 15:39 UTC (permalink / raw) To: Jorrit Kronjee; +Cc: netfilter-devel Jorrit Kronjee wrote: > On 3/22/2010 6:16 PM, Patrick McHardy wrote: > >>> When I disable the hashlimit module the packet drops disappear again. >>> Now I know that hashlimit is made for more than one thing, namely >>> limiting packets based on source/destination host and source/destination >>> port, so it's not as efficient as it could be for my purposes. I could >>> rewrite it, but before I do that, I would like to know if the module >>> itself is really what's causing it, or if there's some underlying cause >>> that I'm not seeing. So my question in short: how can I discover why >>> it's dropping packets? >>> >>> >>> >> A couple of suggestions: >> >> - try the limit module in case you don't actually need per-source/dest etc. >> limiting but just a global limit >> >> - try using TBF or ingress policing. Both limit and hashlimit suffer of >> problems >> regarding the resolution of the applied TBF. I don't remember the >> exact range >> of values it is able to handle, but IIRC you should be able to find it >> in the >> netfilter bugzilla. >> >> - if you use TBF or ingress policing and don't need ip_tables specific >> modules, >> disabling bridge netfilter invocation of ip_tables through /proc >> should increase >> performance. >> >> >> > Patrick, > > Although these are good suggestions, I really need to be able to limit > per destination. The receiving network is a /15 which means I have to > use something like a hashtable to keep track of destination IP > addresses. Neither rateest or limit can do that. OTOH, that's also the > only thing I need. This would make a low-cost ISP-grade DDoS filter, > which is why I'm interested in it. > > The bug you're referring to is this one, I think: > http://bugzilla.netfilter.org/show_bug.cgi?id=523 but I'm not entirely > sure if that is related to my problems. > Yes, that's the one. Not specifically related to your problem, but to the general hashlimit limitations. > Is there any way I can figure out why ifconfig is reporting dropped > packets? Based on your description I'd say that the CPU simply can't keep up with traffic. Perhaps the hash is undersized. The default size depends on the amount of memory, you could try setting it manually to something appropriate for the amount of addresses you expect to see. If that doesn't help, I'd start by profiling the kernel. ^ permalink raw reply [flat|nested] 29+ messages in thread
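The knobs Patrick refers to are the htable options of the hashlimit match. Purely as an illustration of the syntax, a table sized for roughly a /15 worth of destinations might be declared like this (the numbers are guesses, not recommendations):

    iptables -A FORWARD -m hashlimit \
            --hashlimit-name per-dst \
            --hashlimit-mode dstip \
            --hashlimit-above 1000/sec \
            --hashlimit-htable-size 131072 \
            --hashlimit-htable-max 262144 \
            --hashlimit-htable-expire 10000 \
            --hashlimit-htable-gcinterval 5000 \
            -j DROP

--hashlimit-htable-expire and --hashlimit-htable-gcinterval are in milliseconds; a larger table trades memory for shorter hash chains in lookups.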
* Re: debugging kernel during packet drops 2010-03-23 15:14 ` Jorrit Kronjee 2010-03-23 15:39 ` Patrick McHardy @ 2010-03-23 17:21 ` Eric Dumazet 2010-03-23 20:07 ` Eric Dumazet 2010-03-24 15:20 ` Jorrit Kronjee 1 sibling, 2 replies; 29+ messages in thread From: Eric Dumazet @ 2010-03-23 17:21 UTC (permalink / raw) To: Jorrit Kronjee; +Cc: Patrick McHardy, netfilter-devel Le mardi 23 mars 2010 à 16:14 +0100, Jorrit Kronjee a écrit : > > Patrick, > > Although these are good suggestions, I really need to be able to limit > per destination. The receiving network is a /15 which means I have to > use something like a hashtable to keep track of destination IP > addresses. Neither rateest or limit can do that. OTOH, that's also the > only thing I need. This would make a low-cost ISP-grade DDoS filter, > which is why I'm interested in it. > > The bug you're referring to is this one, I think: > http://bugzilla.netfilter.org/show_bug.cgi?id=523 but I'm not entirely > sure if that is related to my problems. > > Is there any way I can figure out why ifconfig is reporting dropped > packets? > > Thanks for all the help so far! > Could you post more information about your machine ? cat /proc/interrupts If running a recent kernel, a "perf top" would be useful Maybe RPS will help your setup (included in net-next-2.6 tree) -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 29+ messages in thread
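If "perf top" is awkward to capture for a mail, a recorded system-wide profile gives the same information (the 10 s duration is arbitrary):

    perf record -a -g -- sleep 10
    perf report --sort dso,symbol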
* Re: debugging kernel during packet drops 2010-03-23 17:21 ` Eric Dumazet @ 2010-03-23 20:07 ` Eric Dumazet 0 siblings, 0 replies; 29+ messages in thread From: Eric Dumazet @ 2010-03-23 20:07 UTC (permalink / raw) To: Jorrit Kronjee; +Cc: Patrick McHardy, netfilter-devel Le mardi 23 mars 2010 à 18:21 +0100, Eric Dumazet a écrit : > Le mardi 23 mars 2010 à 16:14 +0100, Jorrit Kronjee a écrit : > > > > Patrick, > > > > Although these are good suggestions, I really need to be able to limit > > per destination. The receiving network is a /15 which means I have to > > use something like a hashtable to keep track of destination IP > > addresses. Neither rateest or limit can do that. OTOH, that's also the > > only thing I need. This would make a low-cost ISP-grade DDoS filter, > > which is why I'm interested in it. > > > > The bug you're referring to is this one, I think: > > http://bugzilla.netfilter.org/show_bug.cgi?id=523 but I'm not entirely > > sure if that is related to my problems. > > > > Is there any way I can figure out why ifconfig is reporting dropped > > packets? > > > > Thanks for all the help so far! > > > > Could you post more information about your machine ? > > cat /proc/interrupts > > If running a recent kernel, a "perf top" would be useful > > Maybe RPS will help your setup (included in net-next-2.6 tree) > If your NICs are multiqueue enabled, we might upgrade hashlimit to a more scalable module, using several locks instead of one per table, or maybe RCU. One known problem with hashlimit is its garbage collector, run from timer softirq, holding this central lock for a possible long period. -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: debugging kernel during packet drops 2010-03-23 17:21 ` Eric Dumazet 2010-03-23 20:07 ` Eric Dumazet @ 2010-03-24 15:20 ` Jorrit Kronjee 2010-03-24 16:21 ` Eric Dumazet 1 sibling, 1 reply; 29+ messages in thread From: Jorrit Kronjee @ 2010-03-24 15:20 UTC (permalink / raw) To: Eric Dumazet, netfilter-devel On 3/23/2010 6:21 PM, Eric Dumazet wrote: > > Could you post more information about your machine ? > > cat /proc/interrupts > > If running a recent kernel, a "perf top" would be useful > > Maybe RPS will help your setup (included in net-next-2.?6 tree) > Eric, To make things easier, I just installed the latest net-next tree. Traffic flows from eth3 (e1000 7.3.21-k5-NAPI) to eth4 (e1000e 1.0.2-k2). After a reboot and 5 minutes of flooding it with 300 kpps, perftop shows this: ------------------------------------------------------------------------------------------------------------------------------------------------ PerfTop: 918 irqs/sec kernel:99.6% [1000Hz cycles], (all, 4 CPUs) ------------------------------------------------------------------------------------------------------------------------------------------------ samples pcnt function DSO _______ _____ ________________________ _______________________________________________________________________ 1588.00 11.4% __slab_free /lib/modules/2.6.34-rc1-net-next/build/vmlinux 1571.00 11.3% dsthash_find /lib/modules/2.6.34-rc1-net-next/kernel/net/netfilter/xt_hashlimit.ko 1117.00 8.0% _raw_spin_lock /lib/modules/2.6.34-rc1-net-next/build/vmlinux 899.00 6.4% e1000_clean_tx_irq /lib/modules/2.6.34-rc1-net-next/kernel/drivers/net/e1000e/e1000e.ko 702.00 5.0% skb_release_head_state /lib/modules/2.6.34-rc1-net-next/build/vmlinux 650.00 4.7% kfree /lib/modules/2.6.34-rc1-net-next/build/vmlinux 540.00 3.9% __slab_alloc /lib/modules/2.6.34-rc1-net-next/build/vmlinux 514.00 3.7% memcpy /lib/modules/2.6.34-rc1-net-next/build/vmlinux 481.00 3.4% e1000_xmit_frame /lib/modules/2.6.34-rc1-net-next/kernel/drivers/net/e1000e/e1000e.ko 335.00 2.4% nf_iterate /lib/modules/2.6.34-rc1-net-next/build/vmlinux 285.00 2.0% e1000_clean /lib/modules/2.6.34-rc1-net-next/kernel/drivers/net/e1000e/e1000e.ko 264.00 1.9% kmem_cache_free /lib/modules/2.6.34-rc1-net-next/build/vmlinux 258.00 1.8% nf_hook_slow /lib/modules/2.6.34-rc1-net-next/build/vmlinux 257.00 1.8% e1000_intr_msi /lib/modules/2.6.34-rc1-net-next/kernel/drivers/net/e1000e/e1000e.ko 207.00 1.5% ipt_do_table /lib/modules/2.6.34-rc1-net-next/kernel/net/ipv4/netfilter/ip_tables.ko 202.00 1.4% dev_queue_xmit /lib/modules/2.6.34-rc1-net-next/build/vmlinux 180.00 1.3% memset /lib/modules/2.6.34-rc1-net-next/build/vmlinux 173.00 1.2% __alloc_skb /lib/modules/2.6.34-rc1-net-next/build/vmlinux 165.00 1.2% br_nf_pre_routing /lib/modules/2.6.34-rc1-net-next/kernel/net/bridge/bridge.ko 159.00 1.1% __kmalloc_track_caller /lib/modules/2.6.34-rc1-net-next/build/vmlinux 159.00 1.1% br_nf_pre_routing_finish /lib/modules/2.6.34-rc1-net-next/kernel/net/bridge/bridge.ko 158.00 1.1% e1000_clean_rx_irq /lib/modules/2.6.34-rc1-net-next/kernel/drivers/net/e1000/e1000.ko 147.00 1.1% kmem_cache_alloc /lib/modules/2.6.34-rc1-net-next/build/vmlinux 140.00 1.0% kmem_cache_alloc_notrace /lib/modules/2.6.34-rc1-net-next/build/vmlinux 130.00 0.9% _raw_spin_lock_bh /lib/modules/2.6.34-rc1-net-next/build/vmlinux 122.00 0.9% hashlimit_mt_init /lib/modules/2.6.34-rc1-net-next/kernel/net/netfilter/xt_hashlimit.ko 101.00 0.7% br_fdb_update /lib/modules/2.6.34-rc1-net-next/kernel/net/bridge/bridge.ko 101.00 0.7% br_handle_frame 
/lib/modules/2.6.34-rc1-net-next/kernel/net/bridge/bridge.ko 100.00 0.7% __netif_receive_skb /lib/modules/2.6.34-rc1-net-next/build/vmlinux 90.00 0.6% __netdev_alloc_skb /lib/modules/2.6.34-rc1-net-next/build/vmlinux 88.00 0.6% irq_entries_start /lib/modules/2.6.34-rc1-net-next/build/vmlinux 85.00 0.6% br_handle_frame_finish /lib/modules/2.6.34-rc1-net-next/kernel/net/bridge/bridge.ko 83.00 0.6% br_nf_post_routing /lib/modules/2.6.34-rc1-net-next/kernel/net/bridge/bridge.ko 76.00 0.5% __br_fdb_get /lib/modules/2.6.34-rc1-net-next/kernel/net/bridge/bridge.ko 64.00 0.5% eth_type_trans /lib/modules/2.6.34-rc1-net-next/build/vmlinux 59.00 0.4% br_nf_forward_ip /lib/modules/2.6.34-rc1-net-next/kernel/net/bridge/bridge.ko 55.00 0.4% add_partial /lib/modules/2.6.34-rc1-net-next/build/vmlinux 52.00 0.4% __kfree_skb /lib/modules/2.6.34-rc1-net-next/build/vmlinux 50.00 0.4% e1000_put_txbuf /lib/modules/2.6.34-rc1-net-next/kernel/drivers/net/e1000e/e1000e.ko 50.00 0.4% local_bh_disable /lib/modules/2.6.34-rc1-net-next/build/vmlinux 50.00 0.4% e1000_alloc_rx_buffers /lib/modules/2.6.34-rc1-net-next/kernel/drivers/net/e1000/e1000.ko 49.00 0.4% htable_selective_cleanup /lib/modules/2.6.34-rc1-net-next/kernel/net/netfilter/xt_hashlimit.ko And /proc/interrupts shows this: CPU0 CPU1 CPU2 CPU3 0: 46 0 1 0 IO-APIC-edge timer 1: 0 1 0 1 IO-APIC-edge i8042 6: 0 1 1 0 IO-APIC-edge floppy 8: 0 0 1 0 IO-APIC-edge rtc0 9: 0 0 0 0 IO-APIC-fasteoi acpi 12: 1 1 1 1 IO-APIC-edge i8042 14: 25 20 20 21 IO-APIC-edge ata_piix 15: 0 0 0 0 IO-APIC-edge ata_piix 16: 622 640 655 667 IO-APIC-fasteoi arcmsr 17: 0 0 0 0 IO-APIC-fasteoi ehci_hcd:usb1 18: 31149 31209 31023 30680 IO-APIC-fasteoi uhci_hcd:usb3, uhci_hcd:usb7, eth3 19: 0 0 0 0 IO-APIC-fasteoi uhci_hcd:usb6 21: 0 0 0 0 IO-APIC-fasteoi ata_piix, uhci_hcd:usb4 23: 1 1 0 0 IO-APIC-fasteoi ehci_hcd:usb2, uhci_hcd:usb5 27: 541048 540974 541145 541475 PCI-MSI-edge eth4 NMI: 80763 83546 37524 37703 Non-maskable interrupts LOC: 26176 24807 10336 13595 Local timer interrupts SPU: 0 0 0 0 Spurious interrupts PMI: 80763 83546 37524 37703 Performance monitoring interrupts PND: 79733 82513 36495 36674 Performance pending work RES: 34 196 110 93 Rescheduling interrupts CAL: 801 566 54 53 Function call interrupts TLB: 145 152 89 72 TLB shootdowns TRM: 0 0 0 0 Thermal event interrupts THR: 0 0 0 0 Threshold APIC interrupts MCE: 0 0 0 0 Machine check exceptions MCP: 2 2 2 2 Machine check polls ERR: 3 MIS: 0 I hope this helps! Is there anything special I need to do to use RPS? Regards, Jorrit Kronjee ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: debugging kernel during packet drops 2010-03-24 15:20 ` Jorrit Kronjee @ 2010-03-24 16:21 ` Eric Dumazet 2010-03-24 16:28 ` Jan Engelhardt 2010-03-25 9:32 ` Eric Dumazet 0 siblings, 2 replies; 29+ messages in thread From: Eric Dumazet @ 2010-03-24 16:21 UTC (permalink / raw) To: Jorrit Kronjee; +Cc: netfilter-devel Le mercredi 24 mars 2010 à 16:20 +0100, Jorrit Kronjee a écrit : > On 3/23/2010 6:21 PM, Eric Dumazet wrote: > > > > Could you post more information about your machine ? > > > > cat /proc/interrupts > > > > If running a recent kernel, a "perf top" would be useful > > > > Maybe RPS will help your setup (included in net-next-2.?6 tree) > > > Eric, > > To make things easier, I just installed the latest net-next tree. > Traffic flows from eth3 (e1000 7.3.21-k5-NAPI) to eth4 (e1000e > 1.0.2-k2). After a reboot and 5 minutes of flooding it with 300 kpps, > perftop shows this: > > ------------------------------------------------------------------------------------------------------------------------------------------------ > PerfTop: 918 irqs/sec kernel:99.6% [1000Hz cycles], (all, 4 CPUs) > ------------------------------------------------------------------------------------------------------------------------------------------------ > > samples pcnt function DSO > _______ _____ ________________________ > _______________________________________________________________________ > > 1588.00 11.4% __slab_free > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 1571.00 11.3% dsthash_find > /lib/modules/2.6.34-rc1-net-next/kernel/net/netfilter/xt_hashlimit.ko > 1117.00 8.0% _raw_spin_lock > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 899.00 6.4% e1000_clean_tx_irq > /lib/modules/2.6.34-rc1-net-next/kernel/drivers/net/e1000e/e1000e.ko > 702.00 5.0% skb_release_head_state > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 650.00 4.7% kfree > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 540.00 3.9% __slab_alloc > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 514.00 3.7% memcpy > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 481.00 3.4% e1000_xmit_frame > /lib/modules/2.6.34-rc1-net-next/kernel/drivers/net/e1000e/e1000e.ko > 335.00 2.4% nf_iterate > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 285.00 2.0% e1000_clean > /lib/modules/2.6.34-rc1-net-next/kernel/drivers/net/e1000e/e1000e.ko > 264.00 1.9% kmem_cache_free > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 258.00 1.8% nf_hook_slow > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 257.00 1.8% e1000_intr_msi > /lib/modules/2.6.34-rc1-net-next/kernel/drivers/net/e1000e/e1000e.ko > 207.00 1.5% ipt_do_table > /lib/modules/2.6.34-rc1-net-next/kernel/net/ipv4/netfilter/ip_tables.ko > 202.00 1.4% dev_queue_xmit > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 180.00 1.3% memset > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 173.00 1.2% __alloc_skb > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 165.00 1.2% br_nf_pre_routing > /lib/modules/2.6.34-rc1-net-next/kernel/net/bridge/bridge.ko > 159.00 1.1% __kmalloc_track_caller > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 159.00 1.1% br_nf_pre_routing_finish > /lib/modules/2.6.34-rc1-net-next/kernel/net/bridge/bridge.ko > 158.00 1.1% e1000_clean_rx_irq > /lib/modules/2.6.34-rc1-net-next/kernel/drivers/net/e1000/e1000.ko > 147.00 1.1% kmem_cache_alloc > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 140.00 1.0% kmem_cache_alloc_notrace > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 130.00 0.9% _raw_spin_lock_bh > 
/lib/modules/2.6.34-rc1-net-next/build/vmlinux > 122.00 0.9% hashlimit_mt_init > /lib/modules/2.6.34-rc1-net-next/kernel/net/netfilter/xt_hashlimit.ko > 101.00 0.7% br_fdb_update > /lib/modules/2.6.34-rc1-net-next/kernel/net/bridge/bridge.ko > 101.00 0.7% br_handle_frame > /lib/modules/2.6.34-rc1-net-next/kernel/net/bridge/bridge.ko > 100.00 0.7% __netif_receive_skb > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 90.00 0.6% __netdev_alloc_skb > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 88.00 0.6% irq_entries_start > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 85.00 0.6% br_handle_frame_finish > /lib/modules/2.6.34-rc1-net-next/kernel/net/bridge/bridge.ko > 83.00 0.6% br_nf_post_routing > /lib/modules/2.6.34-rc1-net-next/kernel/net/bridge/bridge.ko > 76.00 0.5% __br_fdb_get > /lib/modules/2.6.34-rc1-net-next/kernel/net/bridge/bridge.ko > 64.00 0.5% eth_type_trans > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 59.00 0.4% br_nf_forward_ip > /lib/modules/2.6.34-rc1-net-next/kernel/net/bridge/bridge.ko > 55.00 0.4% add_partial > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 52.00 0.4% __kfree_skb > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 50.00 0.4% e1000_put_txbuf > /lib/modules/2.6.34-rc1-net-next/kernel/drivers/net/e1000e/e1000e.ko > 50.00 0.4% local_bh_disable > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 50.00 0.4% e1000_alloc_rx_buffers > /lib/modules/2.6.34-rc1-net-next/kernel/drivers/net/e1000/e1000.ko > 49.00 0.4% htable_selective_cleanup > /lib/modules/2.6.34-rc1-net-next/kernel/net/netfilter/xt_hashlimit.ko > > And /proc/interrupts shows this: > > CPU0 CPU1 CPU2 CPU3 > 0: 46 0 1 0 IO-APIC-edge timer > 1: 0 1 0 1 IO-APIC-edge i8042 > 6: 0 1 1 0 IO-APIC-edge floppy > 8: 0 0 1 0 IO-APIC-edge rtc0 > 9: 0 0 0 0 IO-APIC-fasteoi acpi > 12: 1 1 1 1 IO-APIC-edge i8042 > 14: 25 20 20 21 IO-APIC-edge > ata_piix > 15: 0 0 0 0 IO-APIC-edge > ata_piix > 16: 622 640 655 667 IO-APIC-fasteoi arcmsr > 17: 0 0 0 0 IO-APIC-fasteoi > ehci_hcd:usb1 > 18: 31149 31209 31023 30680 IO-APIC-fasteoi > uhci_hcd:usb3, uhci_hcd:usb7, eth3 > 19: 0 0 0 0 IO-APIC-fasteoi > uhci_hcd:usb6 > 21: 0 0 0 0 IO-APIC-fasteoi > ata_piix, uhci_hcd:usb4 > 23: 1 1 0 0 IO-APIC-fasteoi > ehci_hcd:usb2, uhci_hcd:usb5 > 27: 541048 540974 541145 541475 PCI-MSI-edge eth4 > NMI: 80763 83546 37524 37703 Non-maskable interrupts > LOC: 26176 24807 10336 13595 Local timer interrupts > SPU: 0 0 0 0 Spurious interrupts > PMI: 80763 83546 37524 37703 Performance > monitoring interrupts > PND: 79733 82513 36495 36674 Performance pending work > RES: 34 196 110 93 Rescheduling interrupts > CAL: 801 566 54 53 Function call interrupts > TLB: 145 152 89 72 TLB shootdowns > TRM: 0 0 0 0 Thermal event interrupts > THR: 0 0 0 0 Threshold APIC interrupts > MCE: 0 0 0 0 Machine check exceptions > MCP: 2 2 2 2 Machine check polls > ERR: 3 > MIS: 0 > > > I hope this helps! Is there anything special I need to do to use RPS? > Sure this helps a lot ! You might try RPS by doing : echo f >/sys/class/net/eth3/queues/rx-0/rps_cpus (But you'll also need a new xt_hashlimit module to make it more scalable, I can work on this this week if necessary) Also, you might try to cpu affine eth3 interrupts : echo 1 >/proc/irq/18/smp_affinity You might also cpu affine eth4 interrupts (Tx completions) echo 4 >/proc/irq/27/smp_affinity Playing with "ethtool -C eth4 ..." might help to reduce number of interrupts (TX completions) and batch them. 
Please send "ethtool -c eth4" dsthash_find() takes a lot of cpu, it's a sign your hash table params might be suboptimal. Please send us "iptables -nvL" -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 29+ messages in thread
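The coalescing settings in question look like the following; which parameters a driver actually honours varies (e1000e, for instance, folds most of them into a single interrupt throttle rate), so the value is only an example:

    ethtool -c eth4                  # show the current interrupt coalescing parameters
    ethtool -C eth4 rx-usecs 100     # e.g. cap the NIC at roughly 10000 interrupts/sec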
* Re: debugging kernel during packet drops 2010-03-24 16:21 ` Eric Dumazet @ 2010-03-24 16:28 ` Jan Engelhardt 2010-03-24 17:04 ` Eric Dumazet 2010-03-25 9:32 ` Eric Dumazet 1 sibling, 1 reply; 29+ messages in thread From: Jan Engelhardt @ 2010-03-24 16:28 UTC (permalink / raw) To: Eric Dumazet; +Cc: Jorrit Kronjee, netfilter-devel On Wednesday 2010-03-24 17:21, Eric Dumazet wrote: >dsthash_find() takes a lot of cpu, its a sign your hash table params >might be suboptimal. > >Please send us "iptables -nvL" Consider using "iptables-save" instead, because -nvL is just an abridged tabular display. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: debugging kernel during packet drops 2010-03-24 16:28 ` Jan Engelhardt @ 2010-03-24 17:04 ` Eric Dumazet 2010-03-24 17:25 ` Jan Engelhardt 0 siblings, 1 reply; 29+ messages in thread From: Eric Dumazet @ 2010-03-24 17:04 UTC (permalink / raw) To: Jan Engelhardt; +Cc: Jorrit Kronjee, netfilter-devel Le mercredi 24 mars 2010 à 17:28 +0100, Jan Engelhardt a écrit : > On Wednesday 2010-03-24 17:21, Eric Dumazet wrote: > >dsthash_find() takes a lot of cpu, its a sign your hash table params > >might be suboptimal. > > > >Please send us "iptables -nvL" > > Consider using "iptables-save" instead, because -nvL is just > an abridged tabular display. Yes you are right, but as I was also interested in the rules and numbers (packets/bytes) of matches, please Jorrit post: iptables-save iptables -nvL Thanks -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: debugging kernel during packet drops 2010-03-24 17:04 ` Eric Dumazet @ 2010-03-24 17:25 ` Jan Engelhardt 0 siblings, 0 replies; 29+ messages in thread From: Jan Engelhardt @ 2010-03-24 17:25 UTC (permalink / raw) To: Eric Dumazet; +Cc: Jorrit Kronjee, netfilter-devel On Wednesday 2010-03-24 18:04, Eric Dumazet wrote: >Le mercredi 24 mars 2010 à 17:28 +0100, Jan Engelhardt a écrit : >> On Wednesday 2010-03-24 17:21, Eric Dumazet wrote: >> >dsthash_find() takes a lot of cpu, its a sign your hash table params >> >might be suboptimal. >> > >> >Please send us "iptables -nvL" >> >> Consider using "iptables-save" instead, because -nvL is just >> an abridged tabular display. > >Yes you are right, but as I was also interested by rules and numbers >(packets/bytes) of matches, please Jorrit post : > >iptables-save >iptables -nvL The numbers you get with iptables-save -c. There really is not anything that save does not output, after all, it's meant to store the exact state. -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: debugging kernel during packet drops 2010-03-24 16:21 ` Eric Dumazet 2010-03-24 16:28 ` Jan Engelhardt @ 2010-03-25 9:32 ` Eric Dumazet 2010-03-25 10:35 ` Patrick McHardy 2010-03-26 10:41 ` Jorrit Kronjee 1 sibling, 2 replies; 29+ messages in thread From: Eric Dumazet @ 2010-03-25 9:32 UTC (permalink / raw) To: Jorrit Kronjee; +Cc: netfilter-devel, netdev, Patrick McHardy Le mercredi 24 mars 2010 à 17:22 +0100, Eric Dumazet a écrit : > Le mercredi 24 mars 2010 à 16:20 +0100, Jorrit Kronjee a écrit : > > > > I hope this helps! Is there anything special I need to do to use RPS? > > > > Sure this helps a lot ! > > You might try RPS by doing : > > echo f >/sys/class/net/eth3/queues/rx-0/rps_cpus > > (But you'll also need a new xt_hashlimit module to make it more > scalable, I can work on this this week if necessary) > Here is patch I cooked for xt_hashlimit (on top of net-next-2.6) to make it use RCU and scale better in your case (allowing several concurrent cpus once RPS is activated), but also on more general cases. [PATCH] xt_hashlimit: RCU conversion xt_hashlimit uses a central lock per hash table and suffers from contention on some workloads. After RCU conversion, central lock is only used when a writer wants to add or delete an entry. For 'readers', updating an existing entry, they use an individual lock per entry. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> --- net/netfilter/xt_hashlimit.c | 290 +++++++++++++++++---------------- 1 file changed, 151 insertions(+), 139 deletions(-) diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c index 9e9c489..a4f1b69 100644 --- a/net/netfilter/xt_hashlimit.c +++ b/net/netfilter/xt_hashlimit.c @@ -41,6 +41,54 @@ MODULE_DESCRIPTION("Xtables: per hash-bucket rate-limit match"); MODULE_ALIAS("ipt_hashlimit"); MODULE_ALIAS("ip6t_hashlimit"); +/* The algorithm used is the Simple Token Bucket Filter (TBF) + * see net/sched/sch_tbf.c in the linux source tree + */ + +/* Rusty: This is my (non-mathematically-inclined) understanding of + this algorithm. The `average rate' in jiffies becomes your initial + amount of credit `credit' and the most credit you can ever have + `credit_cap'. The `peak rate' becomes the cost of passing the + test, `cost'. + + `prev' tracks the last packet hit: you gain one credit per jiffy. + If you get credit balance more than this, the extra credit is + discarded. Every time the match passes, you lose `cost' credits; + if you don't have that many, the test fails. + + See Alexey's formal explanation in net/sched/sch_tbf.c. + + To get the maximum range, we multiply by this factor (ie. you get N + credits per jiffy). We want to allow a rate as low as 1 per day + (slowest userspace tool allows), which means + CREDITS_PER_JIFFY*HZ*60*60*24 < 2^32 ie. +*/ +#define MAX_CPJ (0xFFFFFFFF / (HZ*60*60*24)) + +/* Repeated shift and or gives us all 1s, final shift and add 1 gives + * us the power of 2 below the theoretical max, so GCC simply does a + * shift. */ +#define _POW2_BELOW2(x) ((x)|((x)>>1)) +#define _POW2_BELOW4(x) (_POW2_BELOW2(x)|_POW2_BELOW2((x)>>2)) +#define _POW2_BELOW8(x) (_POW2_BELOW4(x)|_POW2_BELOW4((x)>>4)) +#define _POW2_BELOW16(x) (_POW2_BELOW8(x)|_POW2_BELOW8((x)>>8)) +#define _POW2_BELOW32(x) (_POW2_BELOW16(x)|_POW2_BELOW16((x)>>16)) +#define POW2_BELOW32(x) ((_POW2_BELOW32(x)>>1) + 1) + +#define CREDITS_PER_JIFFY POW2_BELOW32(MAX_CPJ) + +/* Precision saver. */ +static inline u_int32_t +user2credits(u_int32_t user) +{ + /* If multiplying would overflow... 
*/ + if (user > 0xFFFFFFFF / (HZ*CREDITS_PER_JIFFY)) + /* Divide first. */ + return (user / XT_HASHLIMIT_SCALE) * HZ * CREDITS_PER_JIFFY; + + return (user * HZ * CREDITS_PER_JIFFY) / XT_HASHLIMIT_SCALE; +} + struct hashlimit_net { struct hlist_head htables; struct proc_dir_entry *ipt_hashlimit; @@ -80,12 +128,14 @@ struct dsthash_ent { struct dsthash_dst dst; /* modified structure members in the end */ + spinlock_t lock; unsigned long expires; /* precalculated expiry time */ struct { unsigned long prev; /* last modification */ u_int32_t credit; - u_int32_t credit_cap, cost; + u_int32_t cost; } rateinfo; + struct rcu_head rcu; }; struct xt_hashlimit_htable { @@ -95,10 +145,12 @@ struct xt_hashlimit_htable { bool rnd_initialized; struct hashlimit_cfg1 cfg; /* config */ + u_int32_t credit_cap; /* used internally */ - spinlock_t lock; /* lock for list_head */ u_int32_t rnd; /* random seed for hash */ + + spinlock_t hlock; /* lock for writes to hash[]/count/rnd */ unsigned int count; /* number entries in table */ struct timer_list timer; /* timer for gc */ @@ -142,55 +194,27 @@ dsthash_find(const struct xt_hashlimit_htable *ht, u_int32_t hash = hash_dst(ht, dst); if (!hlist_empty(&ht->hash[hash])) { - hlist_for_each_entry(ent, pos, &ht->hash[hash], node) - if (dst_cmp(ent, dst)) + hlist_for_each_entry_rcu(ent, pos, &ht->hash[hash], node) + if (dst_cmp(ent, dst)) { + spin_lock(&ent->lock); return ent; + } } return NULL; } -/* allocate dsthash_ent, initialize dst, put in htable and lock it */ -static struct dsthash_ent * -dsthash_alloc_init(struct xt_hashlimit_htable *ht, - const struct dsthash_dst *dst) -{ - struct dsthash_ent *ent; - - /* initialize hash with random val at the time we allocate - * the first hashtable entry */ - if (!ht->rnd_initialized) { - get_random_bytes(&ht->rnd, sizeof(ht->rnd)); - ht->rnd_initialized = true; - } - - if (ht->cfg.max && ht->count >= ht->cfg.max) { - /* FIXME: do something. question is what.. */ - if (net_ratelimit()) - printk(KERN_WARNING - "xt_hashlimit: max count of %u reached\n", - ht->cfg.max); - return NULL; - } - ent = kmem_cache_alloc(hashlimit_cachep, GFP_ATOMIC); - if (!ent) { - if (net_ratelimit()) - printk(KERN_ERR - "xt_hashlimit: can't allocate dsthash_ent\n"); - return NULL; - } - memcpy(&ent->dst, dst, sizeof(ent->dst)); +static void dsthash_free_rcu(struct rcu_head *head) +{ + struct dsthash_ent *ent = container_of(head, struct dsthash_ent, rcu); - hlist_add_head(&ent->node, &ht->hash[hash_dst(ht, dst)]); - ht->count++; - return ent; + kmem_cache_free(hashlimit_cachep, ent); } -static inline void -dsthash_free(struct xt_hashlimit_htable *ht, struct dsthash_ent *ent) +static void dsthash_free(struct xt_hashlimit_htable *ht, struct dsthash_ent *ent) { - hlist_del(&ent->node); - kmem_cache_free(hashlimit_cachep, ent); + hlist_del_rcu(&ent->node); + call_rcu_bh(&ent->rcu, dsthash_free_rcu); ht->count--; } static void htable_gc(unsigned long htlong); @@ -243,11 +267,12 @@ static int htable_create_v0(struct net *net, struct xt_hashlimit_info *minfo, u_ for (i = 0; i < hinfo->cfg.size; i++) INIT_HLIST_HEAD(&hinfo->hash[i]); + hinfo->credit_cap = user2credits(hinfo->cfg.avg * hinfo->cfg.burst); hinfo->use = 1; hinfo->count = 0; hinfo->family = family; hinfo->rnd_initialized = false; - spin_lock_init(&hinfo->lock); + spin_lock_init(&hinfo->hlock); hinfo->pde = proc_create_data(minfo->name, 0, (family == NFPROTO_IPV4) ? 
hashlimit_net->ipt_hashlimit : hashlimit_net->ip6t_hashlimit, @@ -305,11 +330,12 @@ static int htable_create(struct net *net, struct xt_hashlimit_mtinfo1 *minfo, for (i = 0; i < hinfo->cfg.size; i++) INIT_HLIST_HEAD(&hinfo->hash[i]); + hinfo->credit_cap = user2credits(hinfo->cfg.avg * hinfo->cfg.burst); hinfo->use = 1; hinfo->count = 0; hinfo->family = family; hinfo->rnd_initialized = false; - spin_lock_init(&hinfo->lock); + spin_lock_init(&hinfo->hlock); hinfo->pde = proc_create_data(minfo->name, 0, (family == NFPROTO_IPV4) ? @@ -349,7 +375,7 @@ static void htable_selective_cleanup(struct xt_hashlimit_htable *ht, unsigned int i; /* lock hash table and iterate over it */ - spin_lock_bh(&ht->lock); + spin_lock_bh(&ht->hlock); for (i = 0; i < ht->cfg.size; i++) { struct dsthash_ent *dh; struct hlist_node *pos, *n; @@ -358,7 +384,7 @@ static void htable_selective_cleanup(struct xt_hashlimit_htable *ht, dsthash_free(ht, dh); } } - spin_unlock_bh(&ht->lock); + spin_unlock_bh(&ht->hlock); } /* hash table garbage collector, run by timer */ @@ -417,59 +443,12 @@ static void htable_put(struct xt_hashlimit_htable *hinfo) mutex_unlock(&hashlimit_mutex); } -/* The algorithm used is the Simple Token Bucket Filter (TBF) - * see net/sched/sch_tbf.c in the linux source tree - */ - -/* Rusty: This is my (non-mathematically-inclined) understanding of - this algorithm. The `average rate' in jiffies becomes your initial - amount of credit `credit' and the most credit you can ever have - `credit_cap'. The `peak rate' becomes the cost of passing the - test, `cost'. - - `prev' tracks the last packet hit: you gain one credit per jiffy. - If you get credit balance more than this, the extra credit is - discarded. Every time the match passes, you lose `cost' credits; - if you don't have that many, the test fails. - - See Alexey's formal explanation in net/sched/sch_tbf.c. - - To get the maximum range, we multiply by this factor (ie. you get N - credits per jiffy). We want to allow a rate as low as 1 per day - (slowest userspace tool allows), which means - CREDITS_PER_JIFFY*HZ*60*60*24 < 2^32 ie. -*/ -#define MAX_CPJ (0xFFFFFFFF / (HZ*60*60*24)) - -/* Repeated shift and or gives us all 1s, final shift and add 1 gives - * us the power of 2 below the theoretical max, so GCC simply does a - * shift. */ -#define _POW2_BELOW2(x) ((x)|((x)>>1)) -#define _POW2_BELOW4(x) (_POW2_BELOW2(x)|_POW2_BELOW2((x)>>2)) -#define _POW2_BELOW8(x) (_POW2_BELOW4(x)|_POW2_BELOW4((x)>>4)) -#define _POW2_BELOW16(x) (_POW2_BELOW8(x)|_POW2_BELOW8((x)>>8)) -#define _POW2_BELOW32(x) (_POW2_BELOW16(x)|_POW2_BELOW16((x)>>16)) -#define POW2_BELOW32(x) ((_POW2_BELOW32(x)>>1) + 1) - -#define CREDITS_PER_JIFFY POW2_BELOW32(MAX_CPJ) - -/* Precision saver. */ -static inline u_int32_t -user2credits(u_int32_t user) -{ - /* If multiplying would overflow... */ - if (user > 0xFFFFFFFF / (HZ*CREDITS_PER_JIFFY)) - /* Divide first. 
*/ - return (user / XT_HASHLIMIT_SCALE) * HZ * CREDITS_PER_JIFFY; - - return (user * HZ * CREDITS_PER_JIFFY) / XT_HASHLIMIT_SCALE; -} - -static inline void rateinfo_recalc(struct dsthash_ent *dh, unsigned long now) +static inline void rateinfo_recalc(struct dsthash_ent *dh, unsigned long now, + struct xt_hashlimit_htable *hinfo) { dh->rateinfo.credit += (now - dh->rateinfo.prev) * CREDITS_PER_JIFFY; - if (dh->rateinfo.credit > dh->rateinfo.credit_cap) - dh->rateinfo.credit = dh->rateinfo.credit_cap; + if (dh->rateinfo.credit > hinfo->credit_cap) + dh->rateinfo.credit = hinfo->credit_cap; dh->rateinfo.prev = now; } @@ -563,9 +542,7 @@ hashlimit_init_dst(const struct xt_hashlimit_htable *hinfo, &_ports); break; default: - _ports[0] = _ports[1] = 0; - ports = _ports; - break; + return 0; } if (!ports) return -1; @@ -576,6 +553,48 @@ hashlimit_init_dst(const struct xt_hashlimit_htable *hinfo, return 0; } +/* allocate dsthash_ent, initialize dst, put in htable and lock it */ +static struct dsthash_ent * +dsthash_alloc_init(struct xt_hashlimit_htable *ht, + const struct dsthash_dst *dst) +{ + struct dsthash_ent *ent = NULL; + + spin_lock(&ht->hlock); + /* initialize hash with random val at the time we allocate + * the first hashtable entry */ + if (unlikely(!ht->rnd_initialized)) { + get_random_bytes(&ht->rnd, sizeof(ht->rnd)); + ht->rnd_initialized = true; + } + + if (ht->cfg.max && ht->count >= ht->cfg.max) { + /* FIXME: do something. question is what.. */ + if (net_ratelimit()) + printk(KERN_WARNING + "xt_hashlimit: max count of %u reached\n", + ht->cfg.max); + } else + ent = kmem_cache_alloc(hashlimit_cachep, GFP_ATOMIC); + if (!ent) { + if (net_ratelimit()) + pr_err("xt_hashlimit: can't allocate dsthash_ent\n"); + } else { + memcpy(&ent->dst, dst, sizeof(ent->dst)); + spin_lock_init(&ent->lock); + ent->expires = jiffies + msecs_to_jiffies(ht->cfg.expire); + ent->rateinfo.prev = jiffies; + ent->rateinfo.credit = ht->credit_cap; + ent->rateinfo.cost = user2credits(ht->cfg.avg); + spin_lock(&ent->lock); + + hlist_add_head_rcu(&ent->node, &ht->hash[hash_dst(ht, dst)]); + ht->count++; + } + spin_unlock(&ht->hlock); + return ent; +} + static bool hashlimit_mt_v0(const struct sk_buff *skb, const struct xt_match_param *par) { @@ -588,38 +607,30 @@ hashlimit_mt_v0(const struct sk_buff *skb, const struct xt_match_param *par) if (hashlimit_init_dst(hinfo, &dst, skb, par->thoff) < 0) goto hotdrop; - spin_lock_bh(&hinfo->lock); + rcu_read_lock_bh(); dh = dsthash_find(hinfo, &dst); if (!dh) { dh = dsthash_alloc_init(hinfo, &dst); if (!dh) { - spin_unlock_bh(&hinfo->lock); + rcu_read_unlock_bh(); goto hotdrop; } - - dh->expires = jiffies + msecs_to_jiffies(hinfo->cfg.expire); - dh->rateinfo.prev = jiffies; - dh->rateinfo.credit = user2credits(hinfo->cfg.avg * - hinfo->cfg.burst); - dh->rateinfo.credit_cap = user2credits(hinfo->cfg.avg * - hinfo->cfg.burst); - dh->rateinfo.cost = user2credits(hinfo->cfg.avg); } else { /* update expiration timeout */ dh->expires = now + msecs_to_jiffies(hinfo->cfg.expire); - rateinfo_recalc(dh, now); + rateinfo_recalc(dh, now, hinfo); } if (dh->rateinfo.credit >= dh->rateinfo.cost) { /* We're underlimit. 
*/ dh->rateinfo.credit -= dh->rateinfo.cost; - spin_unlock_bh(&hinfo->lock); + spin_unlock(&dh->lock); + rcu_read_unlock_bh(); return true; } - spin_unlock_bh(&hinfo->lock); - - /* default case: we're overlimit, thus don't match */ + spin_unlock(&dh->lock); + rcu_read_unlock_bh(); return false; hotdrop: @@ -639,36 +650,31 @@ hashlimit_mt(const struct sk_buff *skb, const struct xt_match_param *par) if (hashlimit_init_dst(hinfo, &dst, skb, par->thoff) < 0) goto hotdrop; - spin_lock_bh(&hinfo->lock); + rcu_read_lock_bh(); dh = dsthash_find(hinfo, &dst); if (dh == NULL) { dh = dsthash_alloc_init(hinfo, &dst); if (dh == NULL) { - spin_unlock_bh(&hinfo->lock); + rcu_read_unlock_bh(); goto hotdrop; } - dh->expires = jiffies + msecs_to_jiffies(hinfo->cfg.expire); - dh->rateinfo.prev = jiffies; - dh->rateinfo.credit = user2credits(hinfo->cfg.avg * - hinfo->cfg.burst); - dh->rateinfo.credit_cap = user2credits(hinfo->cfg.avg * - hinfo->cfg.burst); - dh->rateinfo.cost = user2credits(hinfo->cfg.avg); } else { /* update expiration timeout */ dh->expires = now + msecs_to_jiffies(hinfo->cfg.expire); - rateinfo_recalc(dh, now); + rateinfo_recalc(dh, now, hinfo); } if (dh->rateinfo.credit >= dh->rateinfo.cost) { /* below the limit */ dh->rateinfo.credit -= dh->rateinfo.cost; - spin_unlock_bh(&hinfo->lock); + spin_unlock(&dh->lock); + rcu_read_unlock_bh(); return !(info->cfg.mode & XT_HASHLIMIT_INVERT); } - spin_unlock_bh(&hinfo->lock); + spin_unlock(&dh->lock); + rcu_read_unlock_bh(); /* default match is underlimit - so over the limit, we need to invert */ return info->cfg.mode & XT_HASHLIMIT_INVERT; @@ -843,12 +849,12 @@ static struct xt_match hashlimit_mt_reg[] __read_mostly = { /* PROC stuff */ static void *dl_seq_start(struct seq_file *s, loff_t *pos) - __acquires(htable->lock) + __acquires(htable->hlock) { struct xt_hashlimit_htable *htable = s->private; unsigned int *bucket; - spin_lock_bh(&htable->lock); + spin_lock_bh(&htable->hlock); if (*pos >= htable->cfg.size) return NULL; @@ -874,46 +880,52 @@ static void *dl_seq_next(struct seq_file *s, void *v, loff_t *pos) } static void dl_seq_stop(struct seq_file *s, void *v) - __releases(htable->lock) + __releases(htable->hlock) { struct xt_hashlimit_htable *htable = s->private; unsigned int *bucket = (unsigned int *)v; kfree(bucket); - spin_unlock_bh(&htable->lock); + spin_unlock_bh(&htable->hlock); } static int dl_seq_real_show(struct dsthash_ent *ent, u_int8_t family, - struct seq_file *s) + struct seq_file *s, struct xt_hashlimit_htable *htable) { + int res = 0; + + spin_lock(&ent->lock); /* recalculate to show accurate numbers */ - rateinfo_recalc(ent, jiffies); + rateinfo_recalc(ent, jiffies, htable); switch (family) { case NFPROTO_IPV4: - return seq_printf(s, "%ld %pI4:%u->%pI4:%u %u %u %u\n", + res = seq_printf(s, "%ld %pI4:%u->%pI4:%u %u %u %u\n", (long)(ent->expires - jiffies)/HZ, &ent->dst.ip.src, ntohs(ent->dst.src_port), &ent->dst.ip.dst, ntohs(ent->dst.dst_port), - ent->rateinfo.credit, ent->rateinfo.credit_cap, + ent->rateinfo.credit, htable->credit_cap, ent->rateinfo.cost); + break; #if defined(CONFIG_IP6_NF_IPTABLES) || defined(CONFIG_IP6_NF_IPTABLES_MODULE) case NFPROTO_IPV6: - return seq_printf(s, "%ld %pI6:%u->%pI6:%u %u %u %u\n", + res = seq_printf(s, "%ld %pI6:%u->%pI6:%u %u %u %u\n", (long)(ent->expires - jiffies)/HZ, &ent->dst.ip6.src, ntohs(ent->dst.src_port), &ent->dst.ip6.dst, ntohs(ent->dst.dst_port), - ent->rateinfo.credit, ent->rateinfo.credit_cap, + ent->rateinfo.credit, htable->credit_cap, ent->rateinfo.cost); + break; 
#endif default: BUG(); - return 0; } + spin_unlock(&ent->lock); + return res; } static int dl_seq_show(struct seq_file *s, void *v) @@ -925,7 +937,7 @@ static int dl_seq_show(struct seq_file *s, void *v) if (!hlist_empty(&htable->hash[*bucket])) { hlist_for_each_entry(ent, pos, &htable->hash[*bucket], node) - if (dl_seq_real_show(ent, htable->family, s)) + if (dl_seq_real_show(ent, htable->family, s, htable)) return -1; } return 0; @@ -1037,9 +1049,9 @@ err1: static void __exit hashlimit_mt_exit(void) { - kmem_cache_destroy(hashlimit_cachep); xt_unregister_matches(hashlimit_mt_reg, ARRAY_SIZE(hashlimit_mt_reg)); unregister_pernet_subsys(&hashlimit_net_ops); + kmem_cache_destroy(hashlimit_cachep); } module_init(hashlimit_mt_init); -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 29+ messages in thread
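For testing a change like this, rebuilding and reloading only the one module is usually enough; a sketch, assuming a configured net-next source tree and that the rules referencing hashlimit are removed first so the old module can be unloaded:

    cd ~/net-next-2.6                       # wherever the patched tree lives
    make net/netfilter/xt_hashlimit.ko
    iptables -F FORWARD                     # drop the rules that pin the module
    rmmod xt_hashlimit
    insmod net/netfilter/xt_hashlimit.ko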
* Re: debugging kernel during packet drops 2010-03-25 9:32 ` Eric Dumazet @ 2010-03-25 10:35 ` Patrick McHardy 2010-03-25 11:02 ` Eric Dumazet ` (2 more replies) 2010-03-26 10:41 ` Jorrit Kronjee 1 sibling, 3 replies; 29+ messages in thread From: Patrick McHardy @ 2010-03-25 10:35 UTC (permalink / raw) To: Eric Dumazet; +Cc: Jorrit Kronjee, netfilter-devel, netdev Eric Dumazet wrote: > Here is patch I cooked for xt_hashlimit (on top of net-next-2.6) to make > it use RCU and scale better in your case (allowing several concurrent > cpus once RPS is activated), but also on more general cases. > > [PATCH] xt_hashlimit: RCU conversion > > xt_hashlimit uses a central lock per hash table and suffers from > contention on some workloads. > > After RCU conversion, central lock is only used when a writer wants to > add or delete an entry. For 'readers', updating an existing entry, they > use an individual lock per entry. This clashes with some recent cleanups in nf-next-2.6.git. I'm also expecting a patch from Jan to remove the old v0 revision very soon (probably today). Please rediff once I've pushed that out. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: debugging kernel during packet drops 2010-03-25 10:35 ` Patrick McHardy @ 2010-03-25 11:02 ` Eric Dumazet 2010-03-31 12:23 ` [PATCH nf-next-2.6] xt_hashlimit: RCU conversion Eric Dumazet 2010-03-25 12:42 ` debugging kernel during packet drops Jan Engelhardt 2010-03-30 12:06 ` Jan Engelhardt 2 siblings, 1 reply; 29+ messages in thread From: Eric Dumazet @ 2010-03-25 11:02 UTC (permalink / raw) To: Patrick McHardy; +Cc: Jorrit Kronjee, netfilter-devel, netdev Le jeudi 25 mars 2010 à 11:35 +0100, Patrick McHardy a écrit : > Eric Dumazet wrote: > > Here is patch I cooked for xt_hashlimit (on top of net-next-2.6) to make > > it use RCU and scale better in your case (allowing several concurrent > > cpus once RPS is activated), but also on more general cases. > > > > [PATCH] xt_hashlimit: RCU conversion > > > > xt_hashlimit uses a central lock per hash table and suffers from > > contention on some workloads. > > > > After RCU conversion, central lock is only used when a writer wants to > > add or delete an entry. For 'readers', updating an existing entry, they > > use an individual lock per entry. > > This clashes with some recent cleanups in nf-next-2.6.git. I'm > also expecting a patch from Jan to remove the old v0 revision > very soon (probably today). Please rediff once I've pushed that out. Sure, this patch was mainly for Jorrit tests (and he uses net-next-2.6), I will submit several patches for mainline :) Thanks -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH nf-next-2.6] xt_hashlimit: RCU conversion 2010-03-25 11:02 ` Eric Dumazet @ 2010-03-31 12:23 ` Eric Dumazet 2010-04-01 11:03 ` Patrick McHardy 0 siblings, 1 reply; 29+ messages in thread From: Eric Dumazet @ 2010-03-31 12:23 UTC (permalink / raw) To: Patrick McHardy; +Cc: Jorrit Kronjee, netfilter-devel, netdev xt_hashlimit uses a central lock per hash table and suffers from contention on some workloads. (Multiqueue NIC or if RPS is enabled) After RCU conversion, central lock is only used when a writer wants to add or delete an entry. For 'readers', updating an existing entry, they use an individual lock per entry. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> --- net/netfilter/xt_hashlimit.c | 115 ++++++++++++++++++--------------- 1 file changed, 66 insertions(+), 49 deletions(-) diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c index 5470bb0..f245de6 100644 --- a/net/netfilter/xt_hashlimit.c +++ b/net/netfilter/xt_hashlimit.c @@ -81,12 +81,14 @@ struct dsthash_ent { struct dsthash_dst dst; /* modified structure members in the end */ + spinlock_t lock; unsigned long expires; /* precalculated expiry time */ struct { unsigned long prev; /* last modification */ u_int32_t credit; u_int32_t credit_cap, cost; } rateinfo; + struct rcu_head rcu; }; struct xt_hashlimit_htable { @@ -143,54 +145,30 @@ dsthash_find(const struct xt_hashlimit_htable *ht, u_int32_t hash = hash_dst(ht, dst); if (!hlist_empty(&ht->hash[hash])) { - hlist_for_each_entry(ent, pos, &ht->hash[hash], node) - if (dst_cmp(ent, dst)) + hlist_for_each_entry_rcu(ent, pos, &ht->hash[hash], node) + if (dst_cmp(ent, dst)) { + spin_lock(&ent->lock); return ent; + } } return NULL; } -/* allocate dsthash_ent, initialize dst, put in htable and lock it */ -static struct dsthash_ent * -dsthash_alloc_init(struct xt_hashlimit_htable *ht, - const struct dsthash_dst *dst) +static void dsthash_free_rcu(struct rcu_head *head) { - struct dsthash_ent *ent; + struct dsthash_ent *ent = container_of(head, struct dsthash_ent, rcu); - /* initialize hash with random val at the time we allocate - * the first hashtable entry */ - if (!ht->rnd_initialized) { - get_random_bytes(&ht->rnd, sizeof(ht->rnd)); - ht->rnd_initialized = true; - } - - if (ht->cfg.max && ht->count >= ht->cfg.max) { - /* FIXME: do something. question is what.. 
*/ - if (net_ratelimit()) - pr_err("max count of %u reached\n", ht->cfg.max); - return NULL; - } - - ent = kmem_cache_alloc(hashlimit_cachep, GFP_ATOMIC); - if (!ent) { - if (net_ratelimit()) - pr_err("cannot allocate dsthash_ent\n"); - return NULL; - } - memcpy(&ent->dst, dst, sizeof(ent->dst)); - - hlist_add_head(&ent->node, &ht->hash[hash_dst(ht, dst)]); - ht->count++; - return ent; + kmem_cache_free(hashlimit_cachep, ent); } -static inline void -dsthash_free(struct xt_hashlimit_htable *ht, struct dsthash_ent *ent) +static void dsthash_free(struct xt_hashlimit_htable *ht, struct dsthash_ent *ent) { - hlist_del(&ent->node); - kmem_cache_free(hashlimit_cachep, ent); + hlist_del_rcu(&ent->node); + call_rcu_bh(&ent->rcu, dsthash_free_rcu); ht->count--; } + + static void htable_gc(unsigned long htlong); static int htable_create(struct net *net, struct xt_hashlimit_mtinfo1 *minfo, @@ -500,6 +478,49 @@ hashlimit_init_dst(const struct xt_hashlimit_htable *hinfo, return 0; } +/* allocate dsthash_ent, initialize dst, put in htable and lock it */ +static struct dsthash_ent * +dsthash_alloc_init(struct xt_hashlimit_htable *ht, + const struct dsthash_dst *dst) +{ + struct dsthash_ent *ent; + + spin_lock(&ht->lock); + /* initialize hash with random val at the time we allocate + * the first hashtable entry */ + if (unlikely(!ht->rnd_initialized)) { + get_random_bytes(&ht->rnd, sizeof(ht->rnd)); + ht->rnd_initialized = true; + } + + if (ht->cfg.max && ht->count >= ht->cfg.max) { + /* FIXME: do something. question is what.. */ + if (net_ratelimit()) + pr_err("max count of %u reached\n", ht->cfg.max); + ent = NULL; + } else + ent = kmem_cache_alloc(hashlimit_cachep, GFP_ATOMIC); + if (!ent) { + if (net_ratelimit()) + pr_err("cannot allocate dsthash_ent\n"); + } else { + memcpy(&ent->dst, dst, sizeof(ent->dst)); + spin_lock_init(&ent->lock); + + ent->expires = jiffies + msecs_to_jiffies(ht->cfg.expire); + ent->rateinfo.prev = jiffies; + ent->rateinfo.credit = user2credits(ht->cfg.avg * + ht->cfg.burst); + ent->rateinfo.credit_cap = user2credits(ht->cfg.avg * + ht->cfg.burst); + ent->rateinfo.cost = user2credits(ht->cfg.avg); + spin_lock(&ent->lock); + hlist_add_head_rcu(&ent->node, &ht->hash[hash_dst(ht, dst)]); + ht->count++; + } + spin_unlock(&ht->lock); + return ent; +} static bool hashlimit_mt(const struct sk_buff *skb, const struct xt_match_param *par) { @@ -512,22 +533,14 @@ hashlimit_mt(const struct sk_buff *skb, const struct xt_match_param *par) if (hashlimit_init_dst(hinfo, &dst, skb, par->thoff) < 0) goto hotdrop; - spin_lock_bh(&hinfo->lock); + rcu_read_lock_bh(); dh = dsthash_find(hinfo, &dst); if (dh == NULL) { dh = dsthash_alloc_init(hinfo, &dst); if (dh == NULL) { - spin_unlock_bh(&hinfo->lock); + rcu_read_unlock_bh(); goto hotdrop; } - - dh->expires = jiffies + msecs_to_jiffies(hinfo->cfg.expire); - dh->rateinfo.prev = jiffies; - dh->rateinfo.credit = user2credits(hinfo->cfg.avg * - hinfo->cfg.burst); - dh->rateinfo.credit_cap = user2credits(hinfo->cfg.avg * - hinfo->cfg.burst); - dh->rateinfo.cost = user2credits(hinfo->cfg.avg); } else { /* update expiration timeout */ dh->expires = now + msecs_to_jiffies(hinfo->cfg.expire); @@ -537,11 +550,13 @@ hashlimit_mt(const struct sk_buff *skb, const struct xt_match_param *par) if (dh->rateinfo.credit >= dh->rateinfo.cost) { /* below the limit */ dh->rateinfo.credit -= dh->rateinfo.cost; - spin_unlock_bh(&hinfo->lock); + spin_unlock(&dh->lock); + rcu_read_unlock_bh(); return !(info->cfg.mode & XT_HASHLIMIT_INVERT); } - 
spin_unlock_bh(&hinfo->lock); + spin_unlock(&dh->lock); + rcu_read_unlock_bh(); /* default match is underlimit - so over the limit, we need to invert */ return info->cfg.mode & XT_HASHLIMIT_INVERT; @@ -817,9 +832,11 @@ err1: static void __exit hashlimit_mt_exit(void) { - kmem_cache_destroy(hashlimit_cachep); xt_unregister_matches(hashlimit_mt_reg, ARRAY_SIZE(hashlimit_mt_reg)); unregister_pernet_subsys(&hashlimit_net_ops); + + rcu_barrier_bh(); + kmem_cache_destroy(hashlimit_cachep); } module_init(hashlimit_mt_init); ^ permalink raw reply related [flat|nested] 29+ messages in thread
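One detail of the exit-path hunk above is easy to miss: because entries are now freed through call_rcu_bh(), module unload has to wait for the pending callbacks before tearing down the slab cache, which is why kmem_cache_destroy() moves below a new rcu_barrier_bh(). Schematically (a commented restatement of the exit path in the patch, not additional code):

	static void __exit hashlimit_mt_exit(void)
	{
		xt_unregister_matches(hashlimit_mt_reg, ARRAY_SIZE(hashlimit_mt_reg));
		unregister_pernet_subsys(&hashlimit_net_ops); /* remaining entries queued for freeing via call_rcu_bh() */
		rcu_barrier_bh();                     /* wait until every queued dsthash_free_rcu() has run */
		kmem_cache_destroy(hashlimit_cachep); /* now safe: no callback references the cache anymore */
	}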
* Re: [PATCH nf-next-2.6] xt_hashlimit: RCU conversion 2010-03-31 12:23 ` [PATCH nf-next-2.6] xt_hashlimit: RCU conversion Eric Dumazet @ 2010-04-01 11:03 ` Patrick McHardy 2010-04-01 12:10 ` Eric Dumazet 0 siblings, 1 reply; 29+ messages in thread From: Patrick McHardy @ 2010-04-01 11:03 UTC (permalink / raw) To: Eric Dumazet; +Cc: Jorrit Kronjee, netfilter-devel, netdev Eric Dumazet wrote: > xt_hashlimit uses a central lock per hash table and suffers from > contention on some workloads. (Multiqueue NIC or if RPS is enabled) > > After RCU conversion, central lock is only used when a writer wants to > add or delete an entry. > > For 'readers', updating an existing entry, they use an individual lock > per entry. Looks good to me, thanks Eric. > -/* allocate dsthash_ent, initialize dst, put in htable and lock it */ > -static struct dsthash_ent * > -dsthash_alloc_init(struct xt_hashlimit_htable *ht, > - const struct dsthash_dst *dst) Is there a reason for moving this function downwards in the file? That unnecessarily increases the diff and makes the patch harder to review. For review purposes I moved it back up, resulting in 42 lines less diff. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH nf-next-2.6] xt_hashlimit: RCU conversion 2010-04-01 11:03 ` Patrick McHardy @ 2010-04-01 12:10 ` Eric Dumazet 2010-04-01 12:36 ` Patrick McHardy 0 siblings, 1 reply; 29+ messages in thread From: Eric Dumazet @ 2010-04-01 12:10 UTC (permalink / raw) To: Patrick McHardy; +Cc: Jorrit Kronjee, netfilter-devel, netdev Le jeudi 01 avril 2010 à 13:03 +0200, Patrick McHardy a écrit : > Eric Dumazet wrote: > > xt_hashlimit uses a central lock per hash table and suffers from > > contention on some workloads. (Multiqueue NIC or if RPS is enabled) > > > > After RCU conversion, central lock is only used when a writer wants to > > add or delete an entry. > > > > For 'readers', updating an existing entry, they use an individual lock > > per entry. > > Looks good to me, thanks Eric. > > > -/* allocate dsthash_ent, initialize dst, put in htable and lock it */ > > -static struct dsthash_ent * > > -dsthash_alloc_init(struct xt_hashlimit_htable *ht, > > - const struct dsthash_dst *dst) > > Is there a reason for moving this function downwards in the file? > That unnecessarily increases the diff and makes the patch harder to > review. For review purposes I moved it back up, resulting in 42 > lines less diff. > -- Well, this is because I had to move inside this function various initializations and these inits use user2credits() which was defined after dsthash_alloc_init(). But we can avoid this since we hold the entry spinlock, and before hash insertion. Only the lookup fields and the spinlock MUST be committed to memory before the insert. Other fields can be initialized later by caller. Here is V2 of patch, I added locking as well in dl_seq_real_show() because we call rateinfo_recalc(). htable main lock doesnt anymore protects each entry rateinfo. Thanks [PATCH nf-next-2.6] xt_hashlimit: RCU conversion xt_hashlimit uses a central lock per hash table and suffers from contention on some workloads. (Multiqueue NIC or if RPS is enabled) After RCU conversion, central lock is only used when a writer wants to add or delete an entry. For 'readers', updating an existing entry, they use an individual lock per entry. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> --- diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c index 5470bb0..453178d 100644 --- a/net/netfilter/xt_hashlimit.c +++ b/net/netfilter/xt_hashlimit.c @@ -81,12 +81,14 @@ struct dsthash_ent { struct dsthash_dst dst; /* modified structure members in the end */ + spinlock_t lock; unsigned long expires; /* precalculated expiry time */ struct { unsigned long prev; /* last modification */ u_int32_t credit; u_int32_t credit_cap, cost; } rateinfo; + struct rcu_head rcu; }; struct xt_hashlimit_htable { @@ -143,9 +145,11 @@ dsthash_find(const struct xt_hashlimit_htable *ht, u_int32_t hash = hash_dst(ht, dst); if (!hlist_empty(&ht->hash[hash])) { - hlist_for_each_entry(ent, pos, &ht->hash[hash], node) - if (dst_cmp(ent, dst)) + hlist_for_each_entry_rcu(ent, pos, &ht->hash[hash], node) + if (dst_cmp(ent, dst)) { + spin_lock(&ent->lock); return ent; + } } return NULL; } @@ -157,9 +161,10 @@ dsthash_alloc_init(struct xt_hashlimit_htable *ht, { struct dsthash_ent *ent; + spin_lock(&ht->lock); /* initialize hash with random val at the time we allocate * the first hashtable entry */ - if (!ht->rnd_initialized) { + if (unlikely(!ht->rnd_initialized)) { get_random_bytes(&ht->rnd, sizeof(ht->rnd)); ht->rnd_initialized = true; } @@ -168,27 +173,36 @@ dsthash_alloc_init(struct xt_hashlimit_htable *ht, /* FIXME: do something. 
question is what.. */ if (net_ratelimit()) pr_err("max count of %u reached\n", ht->cfg.max); - return NULL; - } - - ent = kmem_cache_alloc(hashlimit_cachep, GFP_ATOMIC); + ent = NULL; + } else + ent = kmem_cache_alloc(hashlimit_cachep, GFP_ATOMIC); if (!ent) { if (net_ratelimit()) pr_err("cannot allocate dsthash_ent\n"); - return NULL; - } - memcpy(&ent->dst, dst, sizeof(ent->dst)); + } else { + memcpy(&ent->dst, dst, sizeof(ent->dst)); + spin_lock_init(&ent->lock); - hlist_add_head(&ent->node, &ht->hash[hash_dst(ht, dst)]); - ht->count++; + spin_lock(&ent->lock); + hlist_add_head_rcu(&ent->node, &ht->hash[hash_dst(ht, dst)]); + ht->count++; + } + spin_unlock(&ht->lock); return ent; } +static void dsthash_free_rcu(struct rcu_head *head) +{ + struct dsthash_ent *ent = container_of(head, struct dsthash_ent, rcu); + + kmem_cache_free(hashlimit_cachep, ent); +} + static inline void dsthash_free(struct xt_hashlimit_htable *ht, struct dsthash_ent *ent) { - hlist_del(&ent->node); - kmem_cache_free(hashlimit_cachep, ent); + hlist_del_rcu(&ent->node); + call_rcu_bh(&ent->rcu, dsthash_free_rcu); ht->count--; } static void htable_gc(unsigned long htlong); @@ -512,15 +526,14 @@ hashlimit_mt(const struct sk_buff *skb, const struct xt_match_param *par) if (hashlimit_init_dst(hinfo, &dst, skb, par->thoff) < 0) goto hotdrop; - spin_lock_bh(&hinfo->lock); + rcu_read_lock_bh(); dh = dsthash_find(hinfo, &dst); if (dh == NULL) { dh = dsthash_alloc_init(hinfo, &dst); if (dh == NULL) { - spin_unlock_bh(&hinfo->lock); + rcu_read_unlock_bh(); goto hotdrop; } - dh->expires = jiffies + msecs_to_jiffies(hinfo->cfg.expire); dh->rateinfo.prev = jiffies; dh->rateinfo.credit = user2credits(hinfo->cfg.avg * @@ -537,11 +550,13 @@ hashlimit_mt(const struct sk_buff *skb, const struct xt_match_param *par) if (dh->rateinfo.credit >= dh->rateinfo.cost) { /* below the limit */ dh->rateinfo.credit -= dh->rateinfo.cost; - spin_unlock_bh(&hinfo->lock); + spin_unlock(&dh->lock); + rcu_read_unlock_bh(); return !(info->cfg.mode & XT_HASHLIMIT_INVERT); } - spin_unlock_bh(&hinfo->lock); + spin_unlock(&dh->lock); + rcu_read_unlock_bh(); /* default match is underlimit - so over the limit, we need to invert */ return info->cfg.mode & XT_HASHLIMIT_INVERT; @@ -666,12 +681,15 @@ static void dl_seq_stop(struct seq_file *s, void *v) static int dl_seq_real_show(struct dsthash_ent *ent, u_int8_t family, struct seq_file *s) { + int res; + + spin_lock(&ent->lock); /* recalculate to show accurate numbers */ rateinfo_recalc(ent, jiffies); switch (family) { case NFPROTO_IPV4: - return seq_printf(s, "%ld %pI4:%u->%pI4:%u %u %u %u\n", + res = seq_printf(s, "%ld %pI4:%u->%pI4:%u %u %u %u\n", (long)(ent->expires - jiffies)/HZ, &ent->dst.ip.src, ntohs(ent->dst.src_port), @@ -679,9 +697,10 @@ static int dl_seq_real_show(struct dsthash_ent *ent, u_int8_t family, ntohs(ent->dst.dst_port), ent->rateinfo.credit, ent->rateinfo.credit_cap, ent->rateinfo.cost); + break; #if defined(CONFIG_IP6_NF_IPTABLES) || defined(CONFIG_IP6_NF_IPTABLES_MODULE) case NFPROTO_IPV6: - return seq_printf(s, "%ld %pI6:%u->%pI6:%u %u %u %u\n", + res = seq_printf(s, "%ld %pI6:%u->%pI6:%u %u %u %u\n", (long)(ent->expires - jiffies)/HZ, &ent->dst.ip6.src, ntohs(ent->dst.src_port), @@ -689,11 +708,14 @@ static int dl_seq_real_show(struct dsthash_ent *ent, u_int8_t family, ntohs(ent->dst.dst_port), ent->rateinfo.credit, ent->rateinfo.credit_cap, ent->rateinfo.cost); + break; #endif default: BUG(); - return 0; + res = 0; } + spin_unlock(&ent->lock); + return res; } static int 
dl_seq_show(struct seq_file *s, void *v) @@ -817,9 +839,11 @@ err1: static void __exit hashlimit_mt_exit(void) { - kmem_cache_destroy(hashlimit_cachep); xt_unregister_matches(hashlimit_mt_reg, ARRAY_SIZE(hashlimit_mt_reg)); unregister_pernet_subsys(&hashlimit_net_ops); + + rcu_barrier_bh(); + kmem_cache_destroy(hashlimit_cachep); } module_init(hashlimit_mt_init); ^ permalink raw reply related [flat|nested] 29+ messages in thread
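The point about initialization order in the V2 patch can be read as a small publication protocol: before hlist_add_head_rcu() makes the entry visible to RCU readers, only the lookup key and the entry spinlock are written, and the spinlock is already held; everything else (expires, rateinfo) is filled in afterwards by the caller, which is safe because any reader that finds the new entry blocks on its lock until the caller is done. In outline (a sketch of the ordering, names as in the patch):

	memcpy(&ent->dst, dst, sizeof(ent->dst));   /* the lookup fields readers compare against */
	spin_lock_init(&ent->lock);
	spin_lock(&ent->lock);                      /* held across publication */
	hlist_add_head_rcu(&ent->node, &ht->hash[hash_dst(ht, dst)]);
	ht->count++;
	/* caller now sets ent->expires and ent->rateinfo, then drops ent->lock */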
* Re: [PATCH nf-next-2.6] xt_hashlimit: RCU conversion 2010-04-01 12:10 ` Eric Dumazet @ 2010-04-01 12:36 ` Patrick McHardy 0 siblings, 0 replies; 29+ messages in thread From: Patrick McHardy @ 2010-04-01 12:36 UTC (permalink / raw) To: Eric Dumazet; +Cc: Jorrit Kronjee, netfilter-devel, netdev Eric Dumazet wrote: > Le jeudi 01 avril 2010 à 13:03 +0200, Patrick McHardy a écrit : >>> -/* allocate dsthash_ent, initialize dst, put in htable and lock it */ >>> -static struct dsthash_ent * >>> -dsthash_alloc_init(struct xt_hashlimit_htable *ht, >>> - const struct dsthash_dst *dst) >> Is there a reason for moving this function downwards in the file? >> That unnecessarily increases the diff and makes the patch harder to >> review. For review purposes I moved it back up, resulting in 42 >> lines less diff. >> -- > > Well, this is because I had to move inside this function various > initializations and these inits use user2credits() which was defined > after dsthash_alloc_init(). I thought I also tried to compile it after my change, but apparently I made some mistake :) > But we can avoid this since we hold the entry spinlock, and before hash > insertion. > > Only the lookup fields and the spinlock MUST be committed to memory > before the insert. Other fields can be initialized later by caller. > > Here is V2 of patch, I added locking as well in dl_seq_real_show() > because we call rateinfo_recalc(). htable main lock doesnt anymore > protects each entry rateinfo. > > Thanks > > [PATCH nf-next-2.6] xt_hashlimit: RCU conversion > > xt_hashlimit uses a central lock per hash table and suffers from > contention on some workloads. (Multiqueue NIC or if RPS is enabled) > > After RCU conversion, central lock is only used when a writer wants to > add or delete an entry. > > For 'readers', updating an existing entry, they use an individual lock > per entry. Applied, thanks a lot Eric. -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: debugging kernel during packet drops 2010-03-25 10:35 ` Patrick McHardy 2010-03-25 11:02 ` Eric Dumazet @ 2010-03-25 12:42 ` Jan Engelhardt 2010-03-30 12:06 ` Jan Engelhardt 2 siblings, 0 replies; 29+ messages in thread From: Jan Engelhardt @ 2010-03-25 12:42 UTC (permalink / raw) To: Patrick McHardy; +Cc: Eric Dumazet, Jorrit Kronjee, netfilter-devel, netdev On Thursday 2010-03-25 11:35, Patrick McHardy wrote: >Eric Dumazet wrote: >> Here is patch I cooked for xt_hashlimit (on top of net-next-2.6) to make >> it use RCU and scale better in your case (allowing several concurrent >> cpus once RPS is activated), but also on more general cases. >> >> [PATCH] xt_hashlimit: RCU conversion > >This clashes with some recent cleanups in nf-next-2.6.git. I'm >also expecting a patch from Jan to remove the old v0 revision >very soon (probably today). I was waiting for you to process http://marc.info/?l=netfilter-devel&m=126936212932683&w=2 which is a prerequisite for the hashlimit v0 removal. (I expected that you might miss some mails due to trash.net having been down yesterday. No biggie.) ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: debugging kernel during packet drops 2010-03-25 10:35 ` Patrick McHardy 2010-03-25 11:02 ` Eric Dumazet 2010-03-25 12:42 ` debugging kernel during packet drops Jan Engelhardt @ 2010-03-30 12:06 ` Jan Engelhardt 2010-03-30 14:12 ` Patrick McHardy 2 siblings, 1 reply; 29+ messages in thread From: Jan Engelhardt @ 2010-03-30 12:06 UTC (permalink / raw) To: Patrick McHardy; +Cc: Eric Dumazet, Jorrit Kronjee, netfilter-devel, netdev On Thursday 2010-03-25 11:35, Patrick McHardy wrote: >Eric Dumazet wrote: >> Here is patch I cooked for xt_hashlimit (on top of net-next-2.6) to make >> it use RCU and scale better in your case (allowing several concurrent >> cpus once RPS is activated), but also on more general cases. >> >> [PATCH] xt_hashlimit: RCU conversion >> >> xt_hashlimit uses a central lock per hash table and suffers from >> contention on some workloads. >> >> After RCU conversion, central lock is only used when a writer wants to >> add or delete an entry. For 'readers', updating an existing entry, they >> use an individual lock per entry. > >This clashes with some recent cleanups in nf-next-2.6.git. I'm >also expecting a patch from Jan to remove the old v0 revision >very soon (probably today). Please rediff once I've pushed that out. One 12-series request has been sitting there for a while. Was there something not in order with it? ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: debugging kernel during packet drops 2010-03-30 12:06 ` Jan Engelhardt @ 2010-03-30 14:12 ` Patrick McHardy 0 siblings, 0 replies; 29+ messages in thread From: Patrick McHardy @ 2010-03-30 14:12 UTC (permalink / raw) To: Jan Engelhardt; +Cc: Eric Dumazet, Jorrit Kronjee, netfilter-devel, netdev Jan Engelhardt wrote: > On Thursday 2010-03-25 11:35, Patrick McHardy wrote: > >> This clashes with some recent cleanups in nf-next-2.6.git. I'm >> also expecting a patch from Jan to remove the old v0 revision >> very soon (probably today). Please rediff once I've pushed that out. > > One 12-series request has been sitting there for a while. Was there > something not in order with it? No, I'm currently a bit backlogged due to getting sick. I'm running your series for testing, though; I'll probably push it out tomorrow. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: debugging kernel during packet drops 2010-03-25 9:32 ` Eric Dumazet 2010-03-25 10:35 ` Patrick McHardy @ 2010-03-26 10:41 ` Jorrit Kronjee 2010-03-26 11:21 ` Eric Dumazet 2010-03-26 14:17 ` Eric Dumazet 1 sibling, 2 replies; 29+ messages in thread From: Jorrit Kronjee @ 2010-03-26 10:41 UTC (permalink / raw) To: Eric Dumazet, netfilter-devel On 3/25/2010 10:32 AM, Eric Dumazet wrote: > Le mercredi 24 mars 2010 à 17:22 +0100, Eric Dumazet a écrit : > >> Sure this helps a lot ! >> >> You might try RPS by doing : >> >> echo f >/sys/class/net/eth3/queues/rx-0/rps_cpus >> >> (But you'll also need a new xt_hashlimit module to make it more >> scalable, I can work on this this week if necessary) >> >> > Here is patch I cooked for xt_hashlimit (on top of net-next-2.6) to make > it use RCU and scale better in your case (allowing several concurrent > cpus once RPS is activated), but also on more general cases. > > [PATCH] xt_hashlimit: RCU conversion > > xt_hashlimit uses a central lock per hash table and suffers from > contention on some workloads. > > After RCU conversion, central lock is only used when a writer wants to > add or delete an entry. For 'readers', updating an existing entry, they > use an individual lock per entry. > Eric, Awesome work, thanks for the effort! I've tried the patch and got some results. The drop rate was reduced dramatically after I activated RPS. I did the same test I did before, namely I rebooted and started flooding the machine immediately after with 300 kpps. After 5 minutes, perf top looked like this: ------------------------------------------------------------------------------------------------------------------------- PerfTop: 1962 irqs/sec kernel:99.3% [1000Hz cycles], (all, 4 CPUs) ------------------------------------------------------------------------------------------------------------------------- samples pcnt function DSO _______ _____ ________________________ _____________________________________________________________________ 4501.00 14.0% __ticket_spin_lock /lib/modules/2.6.34-rc1-net-next/build/vmlinux 2985.00 9.3% dsthash_find /lib/modules/2.6.34-rc1-net-next/kernel/net/netfilter/xt_hashlimit.ko 2346.00 7.3% __ticket_spin_unlock /lib/modules/2.6.34-rc1-net-next/build/vmlinux 1354.00 4.2% e1000_xmit_frame /lib/modules/2.6.34-rc1-net-next/kernel/drivers/net/e1000e/e1000e.ko 1070.00 3.3% __slab_free /lib/modules/2.6.34-rc1-net-next/build/vmlinux 997.00 3.1% memcpy /lib/modules/2.6.34-rc1-net-next/build/vmlinux 809.00 2.5% dev_queue_xmit /lib/modules/2.6.34-rc1-net-next/build/vmlinux 791.00 2.5% nf_iterate /lib/modules/2.6.34-rc1-net-next/build/vmlinux 705.00 2.2% e1000_clean_tx_irq /lib/modules/2.6.34-rc1-net-next/kernel/drivers/net/e1000e/e1000e.ko 634.00 2.0% nf_hook_slow /lib/modules/2.6.34-rc1-net-next/build/vmlinux 624.00 1.9% skb_release_head_state /lib/modules/2.6.34-rc1-net-next/build/vmlinux 584.00 1.8% e1000_intr /lib/modules/2.6.34-rc1-net-next/kernel/drivers/net/e1000/e1000.ko 536.00 1.7% br_nf_pre_routing_finish /lib/modules/2.6.34-rc1-net-next/kernel/net/bridge/bridge.ko 528.00 1.6% nommu_map_page /lib/modules/2.6.34-rc1-net-next/build/vmlinux 499.00 1.6% kfree /lib/modules/2.6.34-rc1-net-next/build/vmlinux 494.00 1.5% __netif_receive_skb /lib/modules/2.6.34-rc1-net-next/build/vmlinux 472.00 1.5% __alloc_skb /lib/modules/2.6.34-rc1-net-next/build/vmlinux 448.00 1.4% br_fdb_update /lib/modules/2.6.34-rc1-net-next/kernel/net/bridge/bridge.ko 437.00 1.4% __slab_alloc /lib/modules/2.6.34-rc1-net-next/build/vmlinux 428.00 1.3% 
ipt_do_table [ip_tables] 403.00 1.3% memset /lib/modules/2.6.34-rc1-net-next/build/vmlinux 402.00 1.3% br_handle_frame /lib/modules/2.6.34-rc1-net-next/kernel/net/bridge/bridge.ko 389.00 1.2% e1000_clean_rx_irq /lib/modules/2.6.34-rc1-net-next/kernel/drivers/net/e1000/e1000.ko 388.00 1.2% e1000_clean /lib/modules/2.6.34-rc1-net-next/kernel/drivers/net/e1000/e1000.ko 381.00 1.2% uhci_irq /lib/modules/2.6.34-rc1-net-next/build/vmlinux 366.00 1.1% get_rps_cpu /lib/modules/2.6.34-rc1-net-next/build/vmlinux 365.00 1.1% br_nf_pre_routing /lib/modules/2.6.34-rc1-net-next/kernel/net/bridge/bridge.ko 349.00 1.1% dst_release /lib/modules/2.6.34-rc1-net-next/build/vmlinux And iptables-save -c produced this: # Generated by iptables-save v1.4.4 on Fri Mar 26 11:24:59 2010 *filter :INPUT ACCEPT [1043:60514] :FORWARD ACCEPT [0:0] :OUTPUT ACCEPT [942:282723] [99563191:3783420610] -A FORWARD -m hashlimit --hashlimit-upto 10000/sec --hashlimit-burst 100 --hashlimit-mode dstip --hashlimit-name hashtable --hashlimit-htable-max 131072 --hashlimit-htable-expire 1000 -j ACCEPT [0:0] -A FORWARD -m limit --limit 5/sec -j LOG --log-prefix "HASHLIMITED -- " [0:0] -A FORWARD -j DROP COMMIT # Completed on Fri Mar 26 11:24:59 2010 And /proc/interrupts looked like this: CPU0 CPU1 CPU2 CPU3 0: 47 0 1 0 IO-APIC-edge timer 1: 0 1 0 1 IO-APIC-edge i8042 6: 1 1 0 0 IO-APIC-edge floppy 8: 1 0 0 0 IO-APIC-edge rtc0 9: 0 0 0 0 IO-APIC-fasteoi acpi 12: 0 1 1 2 IO-APIC-edge i8042 14: 21 22 22 21 IO-APIC-edge ata_piix 15: 0 0 0 0 IO-APIC-edge ata_piix 16: 492 464 463 474 IO-APIC-fasteoi arcmsr 17: 0 0 0 0 IO-APIC-fasteoi ehci_hcd:usb1 18: 971171 971391 948171 948663 IO-APIC-fasteoi uhci_hcd:usb3, uhci_hcd:usb7, eth3 19: 0 0 0 0 IO-APIC-fasteoi uhci_hcd:usb6 21: 0 0 0 0 IO-APIC-fasteoi ata_piix, uhci_hcd:usb4 23: 1 0 1 0 IO-APIC-fasteoi ehci_hcd:usb2, uhci_hcd:usb5 27: 1003145 1002952 1026174 1025671 PCI-MSI-edge eth4 NMI: 202553 185135 134999 185071 Non-maskable interrupts LOC: 20270 19227 17387 23282 Local timer interrupts SPU: 0 0 0 0 Spurious interrupts PMI: 202553 185135 134999 185071 Performance monitoring interrupts PND: 201464 183939 134067 184098 Performance pending work RES: 2216 2449 1212 1432 Rescheduling interrupts CAL: 2223380 2226493 2233481 2228957 Function call interrupts TLB: 606 584 1274 1216 TLB shootdowns TRM: 0 0 0 0 Thermal event interrupts THR: 0 0 0 0 Threshold APIC interrupts MCE: 0 0 0 0 Machine check exceptions MCP: 2 2 2 2 Machine check polls ERR: 3 MIS: 0 ifconfig reported only 2 drops after these 5 minutes. I'm thinking about removing/changing the hashing algorithm to make dsthash_find faster. All I need after all is a match against a destination IP address. Also, I'd like the limit of 10kpps to be a bit higher. I'll see if I can work on that during the weekend. Thanks again for everything! Regards, Jorrit Kronjee -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: debugging kernel during packet drops 2010-03-26 10:41 ` Jorrit Kronjee @ 2010-03-26 11:21 ` Eric Dumazet 2010-03-26 14:17 ` Eric Dumazet 1 sibling, 0 replies; 29+ messages in thread From: Eric Dumazet @ 2010-03-26 11:21 UTC (permalink / raw) To: Jorrit Kronjee; +Cc: netfilter-devel Le vendredi 26 mars 2010 à 11:41 +0100, Jorrit Kronjee a écrit : > > Eric, > Hi Jorrit > Awesome work, thanks for the effort! I've tried the patch and got some > results. The drop rate was reduced dramatically after I activated RPS. > > I did the same test I did before, namely I rebooted and started flooding > the machine immediately after with 300 kpps. After 5 minutes, perf top > looked like this: > > ------------------------------------------------------------------------------------------------------------------------- > PerfTop: 1962 irqs/sec kernel:99.3% [1000Hz cycles], (all, 4 CPUs) > ------------------------------------------------------------------------------------------------------------------------- > we can see cpu power is better used now > samples pcnt function DSO > _______ _____ ________________________ > _____________________________________________________________________ > > 4501.00 14.0% __ticket_spin_lock > /lib/modules/2.6.34-rc1-net-next/build/vmlinux strange... it could be goot to try a callgraph report to check where we spend so much time in spinlock > 2985.00 9.3% dsthash_find > /lib/modules/2.6.34-rc1-net-next/kernel/net/netfilter/xt_hashlimit.ko > 2346.00 7.3% __ticket_spin_unlock > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 1354.00 4.2% e1000_xmit_frame > /lib/modules/2.6.34-rc1-net-next/kernel/drivers/net/e1000e/e1000e.ko > 1070.00 3.3% __slab_free > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 997.00 3.1% memcpy > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 809.00 2.5% dev_queue_xmit > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 791.00 2.5% nf_iterate > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 705.00 2.2% e1000_clean_tx_irq > /lib/modules/2.6.34-rc1-net-next/kernel/drivers/net/e1000e/e1000e.ko > 634.00 2.0% nf_hook_slow > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 624.00 1.9% skb_release_head_state > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 584.00 1.8% e1000_intr > /lib/modules/2.6.34-rc1-net-next/kernel/drivers/net/e1000/e1000.ko > 536.00 1.7% br_nf_pre_routing_finish > /lib/modules/2.6.34-rc1-net-next/kernel/net/bridge/bridge.ko > 528.00 1.6% nommu_map_page > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 499.00 1.6% kfree > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 494.00 1.5% __netif_receive_skb > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 472.00 1.5% __alloc_skb > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 448.00 1.4% br_fdb_update > /lib/modules/2.6.34-rc1-net-next/kernel/net/bridge/bridge.ko > 437.00 1.4% __slab_alloc > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 428.00 1.3% ipt_do_table [ip_tables] > 403.00 1.3% memset > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 402.00 1.3% br_handle_frame > /lib/modules/2.6.34-rc1-net-next/kernel/net/bridge/bridge.ko > 389.00 1.2% e1000_clean_rx_irq > /lib/modules/2.6.34-rc1-net-next/kernel/drivers/net/e1000/e1000.ko > 388.00 1.2% e1000_clean > /lib/modules/2.6.34-rc1-net-next/kernel/drivers/net/e1000/e1000.ko > 381.00 1.2% uhci_irq > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 366.00 1.1% get_rps_cpu > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > 365.00 1.1% br_nf_pre_routing > /lib/modules/2.6.34-rc1-net-next/kernel/net/bridge/bridge.ko > 349.00 
1.1% dst_release > /lib/modules/2.6.34-rc1-net-next/build/vmlinux > > And iptables-save -c produced this: > # Generated by iptables-save v1.4.4 on Fri Mar 26 11:24:59 2010 > *filter > :INPUT ACCEPT [1043:60514] > :FORWARD ACCEPT [0:0] > :OUTPUT ACCEPT [942:282723] > [99563191:3783420610] -A FORWARD -m hashlimit --hashlimit-upto 10000/sec > --hashlimit-burst 100 --hashlimit-mode dstip --hashlimit-name hashtable > --hashlimit-htable-max 131072 --hashlimit-htable-expire 1000 -j ACCEPT Hmm, maybe you should add --hashlimit-htable-size 32768 (or 65536) Because the default is probably too small for your case. > [0:0] -A FORWARD -m limit --limit 5/sec -j LOG --log-prefix "HASHLIMITED > -- " > [0:0] -A FORWARD -j DROP > COMMIT > # Completed on Fri Mar 26 11:24:59 2010 > > And /proc/interrupts looked like this: > CPU0 CPU1 CPU2 CPU3 > 0: 47 0 1 0 IO-APIC-edge timer > 1: 0 1 0 1 IO-APIC-edge i8042 > 6: 1 1 0 0 IO-APIC-edge floppy > 8: 1 0 0 0 IO-APIC-edge rtc0 > 9: 0 0 0 0 IO-APIC-fasteoi acpi > 12: 0 1 1 2 IO-APIC-edge i8042 > 14: 21 22 22 21 IO-APIC-edge > ata_piix > 15: 0 0 0 0 IO-APIC-edge > ata_piix > 16: 492 464 463 474 IO-APIC-fasteoi arcmsr > 17: 0 0 0 0 IO-APIC-fasteoi > ehci_hcd:usb1 > 18: 971171 971391 948171 948663 IO-APIC-fasteoi > uhci_hcd:usb3, uhci_hcd:usb7, eth3 > 19: 0 0 0 0 IO-APIC-fasteoi > uhci_hcd:usb6 > 21: 0 0 0 0 IO-APIC-fasteoi > ata_piix, uhci_hcd:usb4 > 23: 1 0 1 0 IO-APIC-fasteoi > ehci_hcd:usb2, uhci_hcd:usb5 > 27: 1003145 1002952 1026174 1025671 PCI-MSI-edge eth4 > NMI: 202553 185135 134999 185071 Non-maskable interrupts > LOC: 20270 19227 17387 23282 Local timer interrupts > SPU: 0 0 0 0 Spurious interrupts > PMI: 202553 185135 134999 185071 Performance > monitoring interrupts > PND: 201464 183939 134067 184098 Performance pending work > RES: 2216 2449 1212 1432 Rescheduling interrupts > CAL: 2223380 2226493 2233481 2228957 Function call interrupts > TLB: 606 584 1274 1216 TLB shootdowns > TRM: 0 0 0 0 Thermal event interrupts > THR: 0 0 0 0 Threshold APIC interrupts > MCE: 0 0 0 0 Machine check exceptions > MCP: 2 2 2 2 Machine check polls > ERR: 3 > MIS: 0 > > ifconfig reported only 2 drops after these 5 minutes. I'm thinking about > removing/changing the hashing algorithm to make dsthash_find faster. All > I need after all is a match against a destination IP address. Also, I'd > like the limit of 10kpps to be a bit higher. I'll see if I can work on > that during the weekend. > Nice ! I have another patch version (but not yet tested) that now is able to make dynamic key, so that its length is shortened. In your case, only dst is matched, key length is one u32. So time to build key is short, time to compare is also short (not using slow memcmp()) Its probably buggy as hell, dont even try it ! diff --git a/net/netfilter/xt_hashlimit.c b/net/netfilter/xt_hashlimit.c index 9e9c489..09211fa 100644 --- a/net/netfilter/xt_hashlimit.c +++ b/net/netfilter/xt_hashlimit.c @@ -41,6 +41,54 @@ MODULE_DESCRIPTION("Xtables: per hash-bucket rate-limit match"); MODULE_ALIAS("ipt_hashlimit"); MODULE_ALIAS("ip6t_hashlimit"); +/* The algorithm used is the Simple Token Bucket Filter (TBF) + * see net/sched/sch_tbf.c in the linux source tree + */ + +/* Rusty: This is my (non-mathematically-inclined) understanding of + this algorithm. The `average rate' in jiffies becomes your initial + amount of credit `credit' and the most credit you can ever have + `credit_cap'. The `peak rate' becomes the cost of passing the + test, `cost'. 
+ + `prev' tracks the last packet hit: you gain one credit per jiffy. + If you get credit balance more than this, the extra credit is + discarded. Every time the match passes, you lose `cost' credits; + if you don't have that many, the test fails. + + See Alexey's formal explanation in net/sched/sch_tbf.c. + + To get the maximum range, we multiply by this factor (ie. you get N + credits per jiffy). We want to allow a rate as low as 1 per day + (slowest userspace tool allows), which means + CREDITS_PER_JIFFY*HZ*60*60*24 < 2^32 ie. +*/ +#define MAX_CPJ (0xFFFFFFFF / (HZ*60*60*24)) + +/* Repeated shift and or gives us all 1s, final shift and add 1 gives + * us the power of 2 below the theoretical max, so GCC simply does a + * shift. */ +#define _POW2_BELOW2(x) ((x)|((x)>>1)) +#define _POW2_BELOW4(x) (_POW2_BELOW2(x)|_POW2_BELOW2((x)>>2)) +#define _POW2_BELOW8(x) (_POW2_BELOW4(x)|_POW2_BELOW4((x)>>4)) +#define _POW2_BELOW16(x) (_POW2_BELOW8(x)|_POW2_BELOW8((x)>>8)) +#define _POW2_BELOW32(x) (_POW2_BELOW16(x)|_POW2_BELOW16((x)>>16)) +#define POW2_BELOW32(x) ((_POW2_BELOW32(x)>>1) + 1) + +#define CREDITS_PER_JIFFY POW2_BELOW32(MAX_CPJ) + +/* Precision saver. */ +static inline u_int32_t +user2credits(u_int32_t user) +{ + /* If multiplying would overflow... */ + if (user > 0xFFFFFFFF / (HZ*CREDITS_PER_JIFFY)) + /* Divide first. */ + return (user / XT_HASHLIMIT_SCALE) * HZ * CREDITS_PER_JIFFY; + + return (user * HZ * CREDITS_PER_JIFFY) / XT_HASHLIMIT_SCALE; +} + struct hashlimit_net { struct hlist_head htables; struct proc_dir_entry *ipt_hashlimit; @@ -56,36 +104,23 @@ static inline struct hashlimit_net *hashlimit_pernet(struct net *net) /* need to declare this at the top */ static const struct file_operations dl_file_ops; -/* hash table crap */ -struct dsthash_dst { - union { - struct { - __be32 src; - __be32 dst; - } ip; #if defined(CONFIG_IP6_NF_IPTABLES) || defined(CONFIG_IP6_NF_IPTABLES_MODULE) - struct { - __be32 src[4]; - __be32 dst[4]; - } ip6; +#define MAXKEYSIZE (4 + 4 + 1) +#else +#define MAXKEYSIZE (1 + 1 + 1) #endif - }; - __be16 src_port; - __be16 dst_port; -}; struct dsthash_ent { - /* static / read-only parts in the beginning */ - struct hlist_node node; - struct dsthash_dst dst; - - /* modified structure members in the end */ + struct rcu_head rcu; + spinlock_t lock; unsigned long expires; /* precalculated expiry time */ struct { unsigned long prev; /* last modification */ u_int32_t credit; - u_int32_t credit_cap, cost; + u_int32_t cost; } rateinfo; + struct hlist_node node; + u32 key[MAXKEYSIZE]; }; struct xt_hashlimit_htable { @@ -95,10 +130,13 @@ struct xt_hashlimit_htable { bool rnd_initialized; struct hashlimit_cfg1 cfg; /* config */ - + u_int32_t credit_cap; + __be32 ndstmask; + __be32 nsrcmask; /* used internally */ - spinlock_t lock; /* lock for list_head */ u_int32_t rnd; /* random seed for hash */ + unsigned int sz32; /* number of u32 in key[0] */ + spinlock_t hlock; /* lock for writes to hash[]/count/rnd */ unsigned int count; /* number entries in table */ struct timer_list timer; /* timer for gc */ @@ -112,18 +150,19 @@ struct xt_hashlimit_htable { static DEFINE_MUTEX(hashlimit_mutex); /* protects htables list */ static struct kmem_cache *hashlimit_cachep __read_mostly; -static inline bool dst_cmp(const struct dsthash_ent *ent, - const struct dsthash_dst *b) +static bool key_cmp(const struct dsthash_ent *ent, const u32 *key, + unsigned int sz32) { - return !memcmp(&ent->dst, b, sizeof(ent->dst)); + unsigned int i; + u32 res = 0; + for (i = 0; i < sz32; i++) + res |= 
(ent->key[i] ^ key[i]); + return res == 0; } -static u_int32_t -hash_dst(const struct xt_hashlimit_htable *ht, const struct dsthash_dst *dst) +static u32 hash_func(const struct xt_hashlimit_htable *ht, const u32 *key) { - u_int32_t hash = jhash2((const u32 *)dst, - sizeof(*dst)/sizeof(u32), - ht->rnd); + u32 hash = jhash2(key, ht->sz32, ht->rnd); /* * Instead of returning hash % ht->cfg.size (implying a divide) * we return the high 32 bits of the (hash * ht->cfg.size) that will @@ -135,62 +174,32 @@ hash_dst(const struct xt_hashlimit_htable *ht, const struct dsthash_dst *dst) static struct dsthash_ent * dsthash_find(const struct xt_hashlimit_htable *ht, - const struct dsthash_dst *dst) + const u32 *key, u32 hash) { struct dsthash_ent *ent; struct hlist_node *pos; - u_int32_t hash = hash_dst(ht, dst); - if (!hlist_empty(&ht->hash[hash])) { - hlist_for_each_entry(ent, pos, &ht->hash[hash], node) - if (dst_cmp(ent, dst)) - return ent; - } + hlist_for_each_entry_rcu(ent, pos, &ht->hash[hash], node) + if (key_cmp(ent, key, ht->sz32)) { + spin_lock(&ent->lock); + return ent; + } + return NULL; } -/* allocate dsthash_ent, initialize dst, put in htable and lock it */ -static struct dsthash_ent * -dsthash_alloc_init(struct xt_hashlimit_htable *ht, - const struct dsthash_dst *dst) -{ - struct dsthash_ent *ent; - - /* initialize hash with random val at the time we allocate - * the first hashtable entry */ - if (!ht->rnd_initialized) { - get_random_bytes(&ht->rnd, sizeof(ht->rnd)); - ht->rnd_initialized = true; - } - - if (ht->cfg.max && ht->count >= ht->cfg.max) { - /* FIXME: do something. question is what.. */ - if (net_ratelimit()) - printk(KERN_WARNING - "xt_hashlimit: max count of %u reached\n", - ht->cfg.max); - return NULL; - } - ent = kmem_cache_alloc(hashlimit_cachep, GFP_ATOMIC); - if (!ent) { - if (net_ratelimit()) - printk(KERN_ERR - "xt_hashlimit: can't allocate dsthash_ent\n"); - return NULL; - } - memcpy(&ent->dst, dst, sizeof(ent->dst)); +static void dsthash_free_rcu(struct rcu_head *head) +{ + struct dsthash_ent *ent = container_of(head, struct dsthash_ent, rcu); - hlist_add_head(&ent->node, &ht->hash[hash_dst(ht, dst)]); - ht->count++; - return ent; + kmem_cache_free(hashlimit_cachep, ent); } -static inline void -dsthash_free(struct xt_hashlimit_htable *ht, struct dsthash_ent *ent) +static void dsthash_free(struct xt_hashlimit_htable *ht, struct dsthash_ent *ent) { - hlist_del(&ent->node); - kmem_cache_free(hashlimit_cachep, ent); + hlist_del_rcu(&ent->node); + call_rcu_bh(&ent->rcu, dsthash_free_rcu); ht->count--; } static void htable_gc(unsigned long htlong); @@ -216,7 +225,7 @@ static int htable_create_v0(struct net *net, struct xt_hashlimit_info *minfo, u_ hinfo = vmalloc(sizeof(struct xt_hashlimit_htable) + sizeof(struct list_head) * size); if (!hinfo) { - printk(KERN_ERR "xt_hashlimit: unable to create hashtable\n"); + pr_err("xt_hashlimit: unable to create hashtable\n"); return -1; } minfo->hinfo = hinfo; @@ -229,9 +238,10 @@ static int htable_create_v0(struct net *net, struct xt_hashlimit_info *minfo, u_ hinfo->cfg.gc_interval = minfo->cfg.gc_interval; hinfo->cfg.expire = minfo->cfg.expire; - if (family == NFPROTO_IPV4) + if (family == NFPROTO_IPV4) { hinfo->cfg.srcmask = hinfo->cfg.dstmask = 32; - else + hinfo->nsrcmask = hinfo->ndstmask = 0xFFFFFFFF; + } else hinfo->cfg.srcmask = hinfo->cfg.dstmask = 128; hinfo->cfg.size = size; @@ -243,11 +253,16 @@ static int htable_create_v0(struct net *net, struct xt_hashlimit_info *minfo, u_ for (i = 0; i < hinfo->cfg.size; i++) 
INIT_HLIST_HEAD(&hinfo->hash[i]); + hinfo->credit_cap = user2credits(hinfo->cfg.avg * hinfo->cfg.burst); hinfo->use = 1; hinfo->count = 0; hinfo->family = family; hinfo->rnd_initialized = false; - spin_lock_init(&hinfo->lock); + + hinfo->sz32 = (family == NFPROTO_IPV4) ? (1 + 1) : (4 + 4); + if (hinfo->cfg.mode & (XT_HASHLIMIT_HASH_DPT | XT_HASHLIMIT_HASH_SPT)) + hinfo->sz32++; + spin_lock_init(&hinfo->hlock); hinfo->pde = proc_create_data(minfo->name, 0, (family == NFPROTO_IPV4) ? hashlimit_net->ipt_hashlimit : hashlimit_net->ip6t_hashlimit, @@ -289,7 +304,7 @@ static int htable_create(struct net *net, struct xt_hashlimit_mtinfo1 *minfo, hinfo = vmalloc(sizeof(struct xt_hashlimit_htable) + sizeof(struct list_head) * size); if (hinfo == NULL) { - printk(KERN_ERR "xt_hashlimit: unable to create hashtable\n"); + pr_err("xt_hashlimit: unable to create hashtable\n"); return -1; } minfo->hinfo = hinfo; @@ -302,14 +317,22 @@ static int htable_create(struct net *net, struct xt_hashlimit_mtinfo1 *minfo, else if (hinfo->cfg.max < hinfo->cfg.size) hinfo->cfg.max = hinfo->cfg.size; + if (family == NFPROTO_IPV4) { + hinfo->nsrcmask = htonl(~0 << (32 - hinfo->cfg.srcmask)); + hinfo->ndstmask = htonl(~0 << (32 - hinfo->cfg.dstmask)); + } for (i = 0; i < hinfo->cfg.size; i++) INIT_HLIST_HEAD(&hinfo->hash[i]); + hinfo->credit_cap = user2credits(hinfo->cfg.avg * hinfo->cfg.burst); hinfo->use = 1; hinfo->count = 0; hinfo->family = family; hinfo->rnd_initialized = false; - spin_lock_init(&hinfo->lock); + hinfo->sz32 = (family == NFPROTO_IPV4) ? (1 + 1) : (4 + 4); + if (hinfo->cfg.mode & (XT_HASHLIMIT_HASH_DPT | XT_HASHLIMIT_HASH_SPT)) + hinfo->sz32++; + spin_lock_init(&hinfo->hlock); hinfo->pde = proc_create_data(minfo->name, 0, (family == NFPROTO_IPV4) ? @@ -349,7 +372,7 @@ static void htable_selective_cleanup(struct xt_hashlimit_htable *ht, unsigned int i; /* lock hash table and iterate over it */ - spin_lock_bh(&ht->lock); + spin_lock_bh(&ht->hlock); for (i = 0; i < ht->cfg.size; i++) { struct dsthash_ent *dh; struct hlist_node *pos, *n; @@ -358,7 +381,7 @@ static void htable_selective_cleanup(struct xt_hashlimit_htable *ht, dsthash_free(ht, dh); } } - spin_unlock_bh(&ht->lock); + spin_unlock_bh(&ht->hlock); } /* hash table garbage collector, run by timer */ @@ -417,109 +440,54 @@ static void htable_put(struct xt_hashlimit_htable *hinfo) mutex_unlock(&hashlimit_mutex); } -/* The algorithm used is the Simple Token Bucket Filter (TBF) - * see net/sched/sch_tbf.c in the linux source tree - */ - -/* Rusty: This is my (non-mathematically-inclined) understanding of - this algorithm. The `average rate' in jiffies becomes your initial - amount of credit `credit' and the most credit you can ever have - `credit_cap'. The `peak rate' becomes the cost of passing the - test, `cost'. - - `prev' tracks the last packet hit: you gain one credit per jiffy. - If you get credit balance more than this, the extra credit is - discarded. Every time the match passes, you lose `cost' credits; - if you don't have that many, the test fails. - - See Alexey's formal explanation in net/sched/sch_tbf.c. - - To get the maximum range, we multiply by this factor (ie. you get N - credits per jiffy). We want to allow a rate as low as 1 per day - (slowest userspace tool allows), which means - CREDITS_PER_JIFFY*HZ*60*60*24 < 2^32 ie. -*/ -#define MAX_CPJ (0xFFFFFFFF / (HZ*60*60*24)) - -/* Repeated shift and or gives us all 1s, final shift and add 1 gives - * us the power of 2 below the theoretical max, so GCC simply does a - * shift. 
*/ -#define _POW2_BELOW2(x) ((x)|((x)>>1)) -#define _POW2_BELOW4(x) (_POW2_BELOW2(x)|_POW2_BELOW2((x)>>2)) -#define _POW2_BELOW8(x) (_POW2_BELOW4(x)|_POW2_BELOW4((x)>>4)) -#define _POW2_BELOW16(x) (_POW2_BELOW8(x)|_POW2_BELOW8((x)>>8)) -#define _POW2_BELOW32(x) (_POW2_BELOW16(x)|_POW2_BELOW16((x)>>16)) -#define POW2_BELOW32(x) ((_POW2_BELOW32(x)>>1) + 1) - -#define CREDITS_PER_JIFFY POW2_BELOW32(MAX_CPJ) - -/* Precision saver. */ -static inline u_int32_t -user2credits(u_int32_t user) -{ - /* If multiplying would overflow... */ - if (user > 0xFFFFFFFF / (HZ*CREDITS_PER_JIFFY)) - /* Divide first. */ - return (user / XT_HASHLIMIT_SCALE) * HZ * CREDITS_PER_JIFFY; - - return (user * HZ * CREDITS_PER_JIFFY) / XT_HASHLIMIT_SCALE; -} - -static inline void rateinfo_recalc(struct dsthash_ent *dh, unsigned long now) +static inline void rateinfo_recalc(struct dsthash_ent *dh, unsigned long now, + struct xt_hashlimit_htable *hinfo) { dh->rateinfo.credit += (now - dh->rateinfo.prev) * CREDITS_PER_JIFFY; - if (dh->rateinfo.credit > dh->rateinfo.credit_cap) - dh->rateinfo.credit = dh->rateinfo.credit_cap; + if (dh->rateinfo.credit > hinfo->credit_cap) + dh->rateinfo.credit = hinfo->credit_cap; dh->rateinfo.prev = now; } -static inline __be32 maskl(__be32 a, unsigned int l) +#if defined(CONFIG_IP6_NF_IPTABLES) || defined(CONFIG_IP6_NF_IPTABLES_MODULE) +static u32 *ipv6_addr_partial_copy(u32 *key, __be32 *addr, + unsigned int leading_words, __be32 lastmask) { - return l ? htonl(ntohl(a) & ~0 << (32 - l)) : 0; + switch (leading_words) { + case 3: *key++ = *addr++; + case 2: *key++ = *addr++; + case 1: *key++ = *addr++; + } + *key++ = lastmask & *addr; + return key; } -#if defined(CONFIG_IP6_NF_IPTABLES) || defined(CONFIG_IP6_NF_IPTABLES_MODULE) -static void hashlimit_ipv6_mask(__be32 *i, unsigned int p) +static const u32 *ipv6_addr_rebuild(const u32 *key, __be32 *addr, unsigned int leading_words) { - switch (p) { - case 0 ... 31: - i[0] = maskl(i[0], p); - i[1] = i[2] = i[3] = 0; - break; - case 32 ... 63: - i[1] = maskl(i[1], p - 32); - i[2] = i[3] = 0; - break; - case 64 ... 95: - i[2] = maskl(i[2], p - 64); - i[3] = 0; - case 96 ... 
127: - i[3] = maskl(i[3], p - 96); - break; - case 128: - break; + memset(addr, 0, 16); + + while (leading_words) { + *addr++ = *key++; + leading_words--; } + *addr = *key++; + return key; } #endif static int -hashlimit_init_dst(const struct xt_hashlimit_htable *hinfo, - struct dsthash_dst *dst, +hashlimit_init_key(const struct xt_hashlimit_htable *hinfo, u32 *key, const struct sk_buff *skb, unsigned int protoff) { __be16 _ports[2], *ports; u8 nexthdr; - memset(dst, 0, sizeof(*dst)); - switch (hinfo->family) { case NFPROTO_IPV4: if (hinfo->cfg.mode & XT_HASHLIMIT_HASH_DIP) - dst->ip.dst = maskl(ip_hdr(skb)->daddr, - hinfo->cfg.dstmask); + *key++ = ip_hdr(skb)->daddr & hinfo->ndstmask; if (hinfo->cfg.mode & XT_HASHLIMIT_HASH_SIP) - dst->ip.src = maskl(ip_hdr(skb)->saddr, - hinfo->cfg.srcmask); + *key++ = ip_hdr(skb)->saddr & hinfo->nsrcmask; if (!(hinfo->cfg.mode & (XT_HASHLIMIT_HASH_DPT | XT_HASHLIMIT_HASH_SPT))) @@ -528,16 +496,14 @@ hashlimit_init_dst(const struct xt_hashlimit_htable *hinfo, break; #if defined(CONFIG_IP6_NF_IPTABLES) || defined(CONFIG_IP6_NF_IPTABLES_MODULE) case NFPROTO_IPV6: - if (hinfo->cfg.mode & XT_HASHLIMIT_HASH_DIP) { - memcpy(&dst->ip6.dst, &ipv6_hdr(skb)->daddr, - sizeof(dst->ip6.dst)); - hashlimit_ipv6_mask(dst->ip6.dst, hinfo->cfg.dstmask); - } - if (hinfo->cfg.mode & XT_HASHLIMIT_HASH_SIP) { - memcpy(&dst->ip6.src, &ipv6_hdr(skb)->saddr, - sizeof(dst->ip6.src)); - hashlimit_ipv6_mask(dst->ip6.src, hinfo->cfg.srcmask); - } + if (hinfo->cfg.mode & XT_HASHLIMIT_HASH_DIP) + key = ipv6_addr_partial_copy(key, ipv6_hdr(skb)->daddr.s6_addr32, + hinfo->cfg.dstmask >> 5, + hinfo->ndstmask); + if (hinfo->cfg.mode & XT_HASHLIMIT_HASH_SIP) + key = ipv6_addr_partial_copy(key, ipv6_hdr(skb)->saddr.s6_addr32, + hinfo->cfg.srcmask >> 5, + hinfo->nsrcmask); if (!(hinfo->cfg.mode & (XT_HASHLIMIT_HASH_DPT | XT_HASHLIMIT_HASH_SPT))) @@ -563,19 +529,60 @@ hashlimit_init_dst(const struct xt_hashlimit_htable *hinfo, &_ports); break; default: - _ports[0] = _ports[1] = 0; - ports = _ports; - break; + return 0; } if (!ports) return -1; - if (hinfo->cfg.mode & XT_HASHLIMIT_HASH_SPT) - dst->src_port = ports[0]; + *key = (hinfo->cfg.mode & XT_HASHLIMIT_HASH_SPT) ? (ports[0] << 16) : 0; if (hinfo->cfg.mode & XT_HASHLIMIT_HASH_DPT) - dst->dst_port = ports[1]; + *key |= ports[1]; return 0; } +/* allocate dsthash_ent, initialize dst, put in htable and lock it */ +static struct dsthash_ent * +dsthash_alloc_init(struct xt_hashlimit_htable *ht, + const u32 *key, u32 hash) +{ + struct dsthash_ent *ent = NULL; + + spin_lock(&ht->hlock); + + /* initialize hash with random val at the time we allocate + * the first hashtable entry. In this case, recompute hash value. + */ + if (unlikely(!ht->rnd_initialized)) { + get_random_bytes(&ht->rnd, sizeof(ht->rnd)); + ht->rnd_initialized = true; + hash = hash_func(ht, key); + } + + if (ht->cfg.max && ht->count >= ht->cfg.max) { + /* FIXME: do something. question is what.. 
*/ + if (net_ratelimit()) + pr_warning("xt_hashlimit: max count of %u reached\n", + ht->cfg.max); + } else + ent = kmem_cache_alloc(hashlimit_cachep, GFP_ATOMIC); + if (!ent) { + if (net_ratelimit()) + pr_err("xt_hashlimit: can't allocate dsthash_ent\n"); + } else { + memcpy(&ent->key, key, ht->sz32 * sizeof(u32)); + spin_lock_init(&ent->lock); + ent->expires = jiffies + msecs_to_jiffies(ht->cfg.expire); + ent->rateinfo.prev = jiffies; + ent->rateinfo.credit = ht->credit_cap; + ent->rateinfo.cost = user2credits(ht->cfg.avg); + spin_lock(&ent->lock); + + hlist_add_head_rcu(&ent->node, &ht->hash[hash]); + ht->count++; + } + spin_unlock(&ht->hlock); + return ent; +} + static bool hashlimit_mt_v0(const struct sk_buff *skb, const struct xt_match_param *par) { @@ -583,43 +590,36 @@ hashlimit_mt_v0(const struct sk_buff *skb, const struct xt_match_param *par) struct xt_hashlimit_htable *hinfo = r->hinfo; unsigned long now = jiffies; struct dsthash_ent *dh; - struct dsthash_dst dst; + u32 key[MAXKEYSIZE], hash; - if (hashlimit_init_dst(hinfo, &dst, skb, par->thoff) < 0) + if (hashlimit_init_key(hinfo, key, skb, par->thoff) < 0) goto hotdrop; - spin_lock_bh(&hinfo->lock); - dh = dsthash_find(hinfo, &dst); + hash = hash_func(hinfo, key); + rcu_read_lock_bh(); + dh = dsthash_find(hinfo, key, hash); if (!dh) { - dh = dsthash_alloc_init(hinfo, &dst); + dh = dsthash_alloc_init(hinfo, key, hash); if (!dh) { - spin_unlock_bh(&hinfo->lock); + rcu_read_unlock_bh(); goto hotdrop; } - - dh->expires = jiffies + msecs_to_jiffies(hinfo->cfg.expire); - dh->rateinfo.prev = jiffies; - dh->rateinfo.credit = user2credits(hinfo->cfg.avg * - hinfo->cfg.burst); - dh->rateinfo.credit_cap = user2credits(hinfo->cfg.avg * - hinfo->cfg.burst); - dh->rateinfo.cost = user2credits(hinfo->cfg.avg); } else { /* update expiration timeout */ dh->expires = now + msecs_to_jiffies(hinfo->cfg.expire); - rateinfo_recalc(dh, now); + rateinfo_recalc(dh, now, hinfo); } if (dh->rateinfo.credit >= dh->rateinfo.cost) { /* We're underlimit. 
*/ dh->rateinfo.credit -= dh->rateinfo.cost; - spin_unlock_bh(&hinfo->lock); + spin_unlock(&dh->lock); + rcu_read_unlock_bh(); return true; } - spin_unlock_bh(&hinfo->lock); - - /* default case: we're overlimit, thus don't match */ + spin_unlock(&dh->lock); + rcu_read_unlock_bh(); return false; hotdrop: @@ -634,45 +634,41 @@ hashlimit_mt(const struct sk_buff *skb, const struct xt_match_param *par) struct xt_hashlimit_htable *hinfo = info->hinfo; unsigned long now = jiffies; struct dsthash_ent *dh; - struct dsthash_dst dst; + u32 key[MAXKEYSIZE], hash; - if (hashlimit_init_dst(hinfo, &dst, skb, par->thoff) < 0) + if (hashlimit_init_key(hinfo, key, skb, par->thoff) < 0) goto hotdrop; - spin_lock_bh(&hinfo->lock); - dh = dsthash_find(hinfo, &dst); + hash = hash_func(hinfo, key); + rcu_read_lock_bh(); + dh = dsthash_find(hinfo, key, hash); if (dh == NULL) { - dh = dsthash_alloc_init(hinfo, &dst); + dh = dsthash_alloc_init(hinfo, key, hash); if (dh == NULL) { - spin_unlock_bh(&hinfo->lock); + rcu_read_unlock_bh(); goto hotdrop; } - dh->expires = jiffies + msecs_to_jiffies(hinfo->cfg.expire); - dh->rateinfo.prev = jiffies; - dh->rateinfo.credit = user2credits(hinfo->cfg.avg * - hinfo->cfg.burst); - dh->rateinfo.credit_cap = user2credits(hinfo->cfg.avg * - hinfo->cfg.burst); - dh->rateinfo.cost = user2credits(hinfo->cfg.avg); } else { /* update expiration timeout */ dh->expires = now + msecs_to_jiffies(hinfo->cfg.expire); - rateinfo_recalc(dh, now); + rateinfo_recalc(dh, now, hinfo); } if (dh->rateinfo.credit >= dh->rateinfo.cost) { /* below the limit */ dh->rateinfo.credit -= dh->rateinfo.cost; - spin_unlock_bh(&hinfo->lock); + spin_unlock(&dh->lock); + rcu_read_unlock_bh(); return !(info->cfg.mode & XT_HASHLIMIT_INVERT); } - spin_unlock_bh(&hinfo->lock); + spin_unlock(&dh->lock); + rcu_read_unlock_bh(); /* default match is underlimit - so over the limit, we need to invert */ return info->cfg.mode & XT_HASHLIMIT_INVERT; - hotdrop: +hotdrop: *par->hotdrop = true; return false; } @@ -685,7 +681,7 @@ static bool hashlimit_mt_check_v0(const struct xt_mtchk_param *par) /* Check for overflow. 
*/ if (r->cfg.burst == 0 || user2credits(r->cfg.avg * r->cfg.burst) < user2credits(r->cfg.avg)) { - printk(KERN_ERR "xt_hashlimit: overflow, try lower: %u/%u\n", + pr_err("xt_hashlimit: overflow, try lower: %u/%u\n", r->cfg.avg, r->cfg.burst); return false; } @@ -722,7 +718,7 @@ static bool hashlimit_mt_check(const struct xt_mtchk_param *par) if (info->cfg.burst == 0 || user2credits(info->cfg.avg * info->cfg.burst) < user2credits(info->cfg.avg)) { - printk(KERN_ERR "xt_hashlimit: overflow, try lower: %u/%u\n", + pr_err("xt_hashlimit: overflow, try lower: %u/%u\n", info->cfg.avg, info->cfg.burst); return false; } @@ -843,16 +839,17 @@ static struct xt_match hashlimit_mt_reg[] __read_mostly = { /* PROC stuff */ static void *dl_seq_start(struct seq_file *s, loff_t *pos) - __acquires(htable->lock) + __acquires(htable->hlock) { struct xt_hashlimit_htable *htable = s->private; - unsigned int *bucket; + unsigned int *bucket = kmalloc(sizeof(unsigned int), GFP_KERNEL); - spin_lock_bh(&htable->lock); - if (*pos >= htable->cfg.size) - return NULL; + spin_lock_bh(&htable->hlock); - bucket = kmalloc(sizeof(unsigned int), GFP_ATOMIC); + if (*pos >= htable->cfg.size) { + kfree(bucket); + return NULL; + } if (!bucket) return ERR_PTR(-ENOMEM); @@ -874,46 +871,68 @@ static void *dl_seq_next(struct seq_file *s, void *v, loff_t *pos) } static void dl_seq_stop(struct seq_file *s, void *v) - __releases(htable->lock) + __releases(htable->hlock) { struct xt_hashlimit_htable *htable = s->private; unsigned int *bucket = (unsigned int *)v; - kfree(bucket); - spin_unlock_bh(&htable->lock); + if (!IS_ERR(bucket)) + kfree(bucket); + spin_unlock_bh(&htable->hlock); } static int dl_seq_real_show(struct dsthash_ent *ent, u_int8_t family, - struct seq_file *s) + struct seq_file *s, struct xt_hashlimit_htable *htable) { + int res = 0; + __be32 ports; + const u32 *key = ent->key; + + spin_lock(&ent->lock); /* recalculate to show accurate numbers */ - rateinfo_recalc(ent, jiffies); + rateinfo_recalc(ent, jiffies, htable); switch (family) { case NFPROTO_IPV4: - return seq_printf(s, "%ld %pI4:%u->%pI4:%u %u %u %u\n", + { + __be32 src = (htable->cfg.mode & XT_HASHLIMIT_HASH_DIP) ? *key++ : 0; + __be32 dst = (htable->cfg.mode & XT_HASHLIMIT_HASH_SIP) ? *key++ : 0; + + ports = (htable->cfg.mode & (XT_HASHLIMIT_HASH_DPT | XT_HASHLIMIT_HASH_SPT)) ? + *key : 0; + res = seq_printf(s, "%ld %pI4:%u->%pI4:%u %u %u %u\n", (long)(ent->expires - jiffies)/HZ, - &ent->dst.ip.src, - ntohs(ent->dst.src_port), - &ent->dst.ip.dst, - ntohs(ent->dst.dst_port), - ent->rateinfo.credit, ent->rateinfo.credit_cap, + &src, ntohs(ports >> 16), + &dst, ntohs(ports & 0xFFFF), + ent->rateinfo.credit, htable->credit_cap, ent->rateinfo.cost); + break; + } #if defined(CONFIG_IP6_NF_IPTABLES) || defined(CONFIG_IP6_NF_IPTABLES_MODULE) case NFPROTO_IPV6: - return seq_printf(s, "%ld %pI6:%u->%pI6:%u %u %u %u\n", + { + struct in6_addr src; + struct in6_addr dst; + if (htable->cfg.mode & XT_HASHLIMIT_HASH_DIP) + key = ipv6_addr_rebuild(key, src.s6_addr32, htable->cfg.dstmask >> 5); + if (htable->cfg.mode & XT_HASHLIMIT_HASH_SIP) + key = ipv6_addr_rebuild(key, dst.s6_addr32, htable->cfg.srcmask >> 5); + ports = (htable->cfg.mode & (XT_HASHLIMIT_HASH_DPT | XT_HASHLIMIT_HASH_SPT)) ? 
+ *key : 0; + res = seq_printf(s, "%ld %pI6:%u->%pI6:%u %u %u %u\n", (long)(ent->expires - jiffies)/HZ, - &ent->dst.ip6.src, - ntohs(ent->dst.src_port), - &ent->dst.ip6.dst, - ntohs(ent->dst.dst_port), - ent->rateinfo.credit, ent->rateinfo.credit_cap, + &src, ntohs(ports >> 16), + &dst, ntohs(ports & 0xFFFF), + ent->rateinfo.credit, htable->credit_cap, ent->rateinfo.cost); + break; + } #endif default: BUG(); - return 0; } + spin_unlock(&ent->lock); + return res; } static int dl_seq_show(struct seq_file *s, void *v) @@ -925,7 +944,7 @@ static int dl_seq_show(struct seq_file *s, void *v) if (!hlist_empty(&htable->hash[*bucket])) { hlist_for_each_entry(ent, pos, &htable->hash[*bucket], node) - if (dl_seq_real_show(ent, htable->family, s)) + if (dl_seq_real_show(ent, htable->family, s, htable)) return -1; } return 0; @@ -1022,7 +1041,7 @@ static int __init hashlimit_mt_init(void) sizeof(struct dsthash_ent), 0, 0, NULL); if (!hashlimit_cachep) { - printk(KERN_ERR "xt_hashlimit: unable to create slab cache\n"); + pr_err("xt_hashlimit: unable to create slab cache\n"); goto err2; } return 0; @@ -1037,9 +1056,9 @@ err1: static void __exit hashlimit_mt_exit(void) { - kmem_cache_destroy(hashlimit_cachep); xt_unregister_matches(hashlimit_mt_reg, ARRAY_SIZE(hashlimit_mt_reg)); unregister_pernet_subsys(&hashlimit_net_ops); + kmem_cache_destroy(hashlimit_cachep); } module_init(hashlimit_mt_init); -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 29+ messages in thread
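Two micro-optimizations in the experimental sketch above are worth spelling out. Bucket selection (already present in hash_dst() and kept in hash_func()) avoids a divide: instead of computing hash % size, the high 32 bits of the 32x32 multiply hash * size map a uniform hash into [0, size) with the same distribution. And with the key reduced to a few 32-bit words, the comparison drops memcmp() in favour of an OR of per-word XORs. Illustratively (plain C, same idea as hash_func() and key_cmp() above):

	/* bucket index without a division: high word of hash * size */
	u32 bucket = (u32)(((u64)hash * ht->cfg.size) >> 32);

	/* word-wise compare over the short key, no memcmp() */
	u32 diff = 0;
	unsigned int i;

	for (i = 0; i < ht->sz32; i++)
		diff |= ent->key[i] ^ key[i];
	/* diff == 0 iff the entry's key matches the packet's key */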
* Re: debugging kernel during packet drops 2010-03-26 10:41 ` Jorrit Kronjee 2010-03-26 11:21 ` Eric Dumazet @ 2010-03-26 14:17 ` Eric Dumazet 2010-03-26 15:54 ` Jorrit Kronjee 1 sibling, 1 reply; 29+ messages in thread From: Eric Dumazet @ 2010-03-26 14:17 UTC (permalink / raw) To: Jorrit Kronjee; +Cc: netfilter-devel Le vendredi 26 mars 2010 à 11:41 +0100, Jorrit Kronjee a écrit : > And iptables-save -c produced this: > # Generated by iptables-save v1.4.4 on Fri Mar 26 11:24:59 2010 > *filter > :INPUT ACCEPT [1043:60514] > :FORWARD ACCEPT [0:0] > :OUTPUT ACCEPT [942:282723] > [99563191:3783420610] -A FORWARD -m hashlimit --hashlimit-upto 10000/sec > --hashlimit-burst 100 --hashlimit-mode dstip --hashlimit-name hashtable > --hashlimit-htable-max 131072 --hashlimit-htable-expire 1000 -j ACCEPT > [0:0] -A FORWARD -m limit --limit 5/sec -j LOG --log-prefix "HASHLIMITED > -- " Hmm, --hashlimit-htable-expire 1000 & gcinterval 1000 (default) are very aggressive. That might explain high number of spinlocks/unlocks (many entries are inserted/deleted per second) I would let entries forever in table (no more expensive locks/unlocks) --hashlimit-htable-expire 100000 --hashlimit-htable-gcinterval 3600000 (garbage collect every hour) --hashlimit-htable-size 65536 -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 29+ messages in thread
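Spelled out against the rule from the iptables-save output quoted above, the suggested tuning gives a rule along these lines (a sketch only: the rate, burst, mode and hash name come from the original rule, the htable values from the message above, and none of it has been verified on this particular setup):

  iptables -A FORWARD -m hashlimit \
          --hashlimit-upto 10000/sec --hashlimit-burst 100 \
          --hashlimit-mode dstip --hashlimit-name hashtable \
          --hashlimit-htable-size 65536 \
          --hashlimit-htable-max 131072 \
          --hashlimit-htable-expire 100000 \
          --hashlimit-htable-gcinterval 3600000 \
          -j ACCEPT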
* Re: debugging kernel during packet drops 2010-03-26 14:17 ` Eric Dumazet @ 2010-03-26 15:54 ` Jorrit Kronjee 0 siblings, 0 replies; 29+ messages in thread From: Jorrit Kronjee @ 2010-03-26 15:54 UTC (permalink / raw) To: Eric Dumazet, netfilter-devel Eric, Changing the "expire"-value doesn't seem to have much effect, since the traffic I'm sending updates the expiration value too regularly anyway. However, changing the garbage collector interval made the amount of interrupts drop from ~1900 irqs/s to ~50 irqs/s according to perf top. I tried cranking up the traffic to see how far I can push it, but I'm starting to reach the limitations of my DoS machine. I can now bridge about 390 kpps without any packet drops. Regards, Jorrit Kronjee On 3/26/2010 3:17 PM, Eric Dumazet wrote: > Le vendredi 26 mars 2010 à 11:41 +0100, Jorrit Kronjee a écrit : > > >> And iptables-save -c produced this: >> # Generated by iptables-save v1.4.4 on Fri Mar 26 11:24:59 2010 >> *filter >> :INPUT ACCEPT [1043:60514] >> :FORWARD ACCEPT [0:0] >> :OUTPUT ACCEPT [942:282723] >> [99563191:3783420610] -A FORWARD -m hashlimit --hashlimit-upto 10000/sec >> --hashlimit-burst 100 --hashlimit-mode dstip --hashlimit-name hashtable >> --hashlimit-htable-max 131072 --hashlimit-htable-expire 1000 -j ACCEPT >> [0:0] -A FORWARD -m limit --limit 5/sec -j LOG --log-prefix "HASHLIMITED >> -- " >> > Hmm, --hashlimit-htable-expire 1000 & gcinterval 1000 (default) are very > aggressive. > > That might explain high number of spinlocks/unlocks (many entries are > inserted/deleted per second) > > I would let entries forever in table (no more expensive locks/unlocks) > > --hashlimit-htable-expire 100000 > --hashlimit-htable-gcinterval 3600000 (garbage collect every hour) > --hashlimit-htable-size 65536 > > > -- Manager ICT Infopact Network Solutions Hoogvlietsekerkweg 170 3194 AM Rotterdam Hoogvliet tel. +31 (0)88 - 4636700 fax. +31 (0)88 - 4636799 j.kronjee@infopact.nl http://www.infopact.nl/ -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 29+ messages in thread
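As a footnote on measuring drops like these, two generic counters are worth watching alongside perf top while the traffic generator runs (eth0 is a placeholder for the bridge port; exact counter names vary per driver):

  # per-CPU softirq statistics; the second hex column counts packets
  # dropped because the input backlog queue was full
  cat /proc/net/softnet_stat

  # interface-level RX statistics, including drops seen by the stack
  ip -s link show dev eth0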
* Re: debugging kernel during packet drops 2010-03-22 10:41 debugging kernel during packet drops Jorrit Kronjee 2010-03-22 17:16 ` Patrick McHardy @ 2010-03-23 17:04 ` James King 2010-03-23 17:23 ` Eric Dumazet 1 sibling, 1 reply; 29+ messages in thread From: James King @ 2010-03-23 17:04 UTC (permalink / raw) To: Jorrit Kronjee; +Cc: netfilter-devel On Mon, Mar 22, 2010 at 3:41 AM, Jorrit Kronjee <j.kronjee@infopact.nl> wrote: > At around 300 kpps, the amount of packet drops is 40 kpps. For me, this > amount is too significant to ignore. I see the load average go from a > comfortable 0.00 to 1.78, mainly caused by ksoftirqd processes. At 200 > kpps, the average amount of packet drops is 23 kpps. At 100 kpps, it's > still 2 kpps. > > When I disable the hashlimit module the packet drops disappear again. > Now I know that hashlimit is made for more than one thing, namely > limiting packets based on source/destination host and source/destination > port, so it's not as efficient as it could be for my purposes. I could > rewrite it, but before I do that, I would like to know if the module > itself is really what's causing it, or if there's some underlying cause > that I'm not seeing. So my question in short: how can I discover why > it's dropping packets? I'm not sure whether it's even related to the problem you're having, but I had a similar problem on a bnx2 interface with high packet rates when using l7-filter. ifconfig reported huge numbers of dropped packets, corresponding to rx_fw_discards from "ethtool -S ethX" output. I resolved this by bumping up the driver RX ring size (which was defaulting to 100 out of a maximum possible size of 1020). ^ permalink raw reply [flat|nested] 29+ messages in thread
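For reference, a rough sketch of checking and applying the same ring-size workaround (ethX is a placeholder; the preset maximum, 1020 on the bnx2 mentioned above, depends on the NIC and driver):

  # show current and preset maximum ring sizes
  ethtool -g ethX

  # look for firmware/ring discards such as rx_fw_discards
  ethtool -S ethX | grep -i discard

  # grow the RX ring toward the reported maximum
  ethtool -G ethX rx 1020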
* Re: debugging kernel during packet drops 2010-03-23 17:04 ` James King @ 2010-03-23 17:23 ` Eric Dumazet 0 siblings, 0 replies; 29+ messages in thread From: Eric Dumazet @ 2010-03-23 17:23 UTC (permalink / raw) To: James King; +Cc: Jorrit Kronjee, netfilter-devel Le mardi 23 mars 2010 à 10:04 -0700, James King a écrit : > On Mon, Mar 22, 2010 at 3:41 AM, Jorrit Kronjee <j.kronjee@infopact.nl> wrote: > > At around 300 kpps, the amount of packet drops is 40 kpps. For me, this > > amount is too significant to ignore. I see the load average go from a > > comfortable 0.00 to 1.78, mainly caused by ksoftirqd processes. At 200 > > kpps, the average amount of packet drops is 23 kpps. At 100 kpps, it's > > still 2 kpps. > > > > When I disable the hashlimit module the packet drops disappear again. > > Now I know that hashlimit is made for more than one thing, namely > > limiting packets based on source/destination host and source/destination > > port, so it's not as efficient as it could be for my purposes. I could > > rewrite it, but before I do that, I would like to know if the module > > itself is really what's causing it, or if there's some underlying cause > > that I'm not seeing. So my question in short: how can I discover why > > it's dropping packets? > > I'm not sure whether it's even related to the problem you're having, > but I had a similar problem on a bnx2 interface with high packet rates > when using l7-filter. ifconfig reported huge numbers of dropped > packets, corresponding to rx_fw_discards from "ethtool -S ethX" > output. I resolved this by bumping up the driver RX ring size (which > was defaulting to 100 out of a maximum possible size of 1020). > - If increasing latencies is OK, then yes, increasing queue lengthes is an answer. Sometime, its better to drop packets after a given threshold of workload. -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 29+ messages in thread
Thread overview: 29+ messages (end of thread; newest: 2010-04-01 12:36 UTC)
2010-03-22 10:41 debugging kernel during packet drops Jorrit Kronjee
2010-03-22 17:16 ` Patrick McHardy
2010-03-22 17:53 ` Jan Engelhardt
2010-03-22 18:02 ` Patrick McHardy
2010-03-23 15:14 ` Jorrit Kronjee
2010-03-23 15:39 ` Patrick McHardy
2010-03-23 17:21 ` Eric Dumazet
2010-03-23 20:07 ` Eric Dumazet
2010-03-24 15:20 ` Jorrit Kronjee
2010-03-24 16:21 ` Eric Dumazet
2010-03-24 16:28 ` Jan Engelhardt
2010-03-24 17:04 ` Eric Dumazet
2010-03-24 17:25 ` Jan Engelhardt
2010-03-25 9:32 ` Eric Dumazet
2010-03-25 10:35 ` Patrick McHardy
2010-03-25 11:02 ` Eric Dumazet
2010-03-31 12:23 ` [PATCH nf-next-2.6] xt_hashlimit: RCU conversion Eric Dumazet
2010-04-01 11:03 ` Patrick McHardy
2010-04-01 12:10 ` Eric Dumazet
2010-04-01 12:36 ` Patrick McHardy
2010-03-25 12:42 ` debugging kernel during packet drops Jan Engelhardt
2010-03-30 12:06 ` Jan Engelhardt
2010-03-30 14:12 ` Patrick McHardy
2010-03-26 10:41 ` Jorrit Kronjee
2010-03-26 11:21 ` Eric Dumazet
2010-03-26 14:17 ` Eric Dumazet
2010-03-26 15:54 ` Jorrit Kronjee
2010-03-23 17:04 ` James King
2010-03-23 17:23 ` Eric Dumazet