* Possible regression: Packet drops during iptables calls @ 2010-12-14 14:46 Jesper Dangaard Brouer 2010-12-14 15:31 ` Eric Dumazet 0 siblings, 1 reply; 26+ messages in thread From: Jesper Dangaard Brouer @ 2010-12-14 14:46 UTC (permalink / raw) To: Stephen Hemminger, netfilter-devel; +Cc: netdev I'm experiencing RX packet drops during calls to iptables on my production servers. Further investigation showed that it is only the CPU executing the iptables command that experiences packet drops!? Thus, a quick fix was to force the iptables command to run on one of the idle CPUs (this can be achieved with the "taskset" command). I have a 2x Xeon 5550 CPU system, thus 16 CPUs (with HT enabled). We only use 8 CPUs due to a multiqueue limitation of 8 queues in the 1Gbit/s NICs (82576 chips). CPUs 0 to 7 are assigned to packet processing via smp_affinity. Can someone explain why the packet drops only occur on the CPU executing the iptables command? What can we do to solve this issue? I should note that I have a very large ruleset on this machine, and the production machine is routing around 800 Mbit/s in each direction. The issue occurs on a simple iptables rule listing. I think (untested) the problem is related to this kernel git commit: commit 942e4a2bd680c606af0211e64eb216be2e19bf61 Author: Stephen Hemminger <shemminger@vyatta.com> Date: Tue Apr 28 22:36:33 2009 -0700 netfilter: revised locking for x_tables The x_tables are organized with a table structure and a per-cpu copies of the counters and rules. On older kernels there was a reader/writer lock per table which was a performance bottleneck. In 2.6.30-rc, this was converted to use RCU and the counters/rules which solved the performance problems for do_table but made replacing rules much slower because of the necessary RCU grace period. This version uses a per-cpu set of spinlocks and counters to allow to table processing to proceed without the cache thrashing of a global reader lock and keeps the same performance for table updates. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net> -- Med venlig hilsen / Best regards Jesper Brouer ComX Networks A/S Linux Network Kernel Developer Cand. Scient Datalog / MSc.CS Author of http://adsl-optimizer.dk LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Possible regression: Packet drops during iptables calls 2010-12-14 14:46 Possible regression: Packet drops during iptables calls Jesper Dangaard Brouer @ 2010-12-14 15:31 ` Eric Dumazet 2010-12-14 16:09 ` Jesper Dangaard Brouer 0 siblings, 1 reply; 26+ messages in thread From: Eric Dumazet @ 2010-12-14 15:31 UTC (permalink / raw) To: Jesper Dangaard Brouer; +Cc: Stephen Hemminger, netfilter-devel, netdev Le mardi 14 décembre 2010 à 15:46 +0100, Jesper Dangaard Brouer a écrit : > I'm experiencing RX packet drops during call to iptables, on my > production servers. > > Further investigations showed, that its only the CPU executing the > iptables command that experience packet drops!? Thus, a quick fix was > to force the iptables command to run on one of the idle CPUs (This can > be achieved with the "taskset" command). > > I have a 2x Xeon 5550 CPU system, thus 16 CPUs (with HT enabled). We > only use 8 CPUs due to a multiqueue limitation of 8 queues in the > 1Gbit/s NICs (82576 chips). CPUs 0 to 7 is assigned for packet > processing via smp_affinity. > > Can someone explain why the packet drops only occur on the CPU > executing the iptables command? > > It blocks BH take a look at commits : 24b36f0193467fa727b85b4c004016a8dae999b9 netfilter: {ip,ip6,arp}_tables: dont block bottom half more than necessary 001389b9581c13fe5fc357a0f89234f85af4215d netfilter: {ip,ip6,arp}_tables: avoid lockdep false positive for attempts to let BH fly ... Unfortunately, lockdep rules :( > What can we do to solve this issue? > > > I should note that I have a very large ruleset on this machine, and > the production machine is routing around 800 Mbit/s, in each > direction. The issue occurs on a simple iptables rule listing. > > > I think (untested) the problem is related to kernel git commit: > > commit 942e4a2bd680c606af0211e64eb216be2e19bf61 > Author: Stephen Hemminger <shemminger@vyatta.com> > Date: Tue Apr 28 22:36:33 2009 -0700 > > netfilter: revised locking for x_tables > > The x_tables are organized with a table structure and a per-cpu copies > of the counters and rules. On older kernels there was a reader/writer > lock per table which was a performance bottleneck. In 2.6.30-rc, this > was converted to use RCU and the counters/rules which solved the performance > problems for do_table but made replacing rules much slower because of > the necessary RCU grace period. > > This version uses a per-cpu set of spinlocks and counters to allow to > table processing to proceed without the cache thrashing of a global > reader lock and keeps the same performance for table updates. > > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> > Acked-by: Linus Torvalds <torvalds@linux-foundation.org> > Signed-off-by: David S. Miller <davem@davemloft.net> > -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 26+ messages in thread
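To make the "It blocks BH" point concrete, here is a simplified sketch of roughly how get_counters() looked in the 2.6.35-era kernel Jesper is running (condensed from the code the patches later in this thread rework; not a verbatim copy of the kernel source). Bottom halves stay disabled on the executing CPU while every rule is read for every possible CPU, so the NIC RX softirq on that CPU is starved for the whole walk:

  /* Simplified sketch (not verbatim 2.6.35 source) of the counter snapshot
   * path behind "iptables -L".  counters[] is the user-visible snapshot;
   * e->counters are the live per-cpu counters bumped by ipt_do_table() in
   * softirq context, which is why BH is kept off while reading them.
   */
  static void get_counters(const struct xt_table_info *t,
                           struct xt_counters counters[])
  {
          struct ipt_entry *iter;
          unsigned int cpu, i, curcpu;

          local_bh_disable();                 /* softirqs off on this CPU... */
          curcpu = smp_processor_id();

          i = 0;
          xt_entry_foreach(iter, t->entries[curcpu], t->size) {
                  SET_COUNTER(counters[i], iter->counters.bcnt,
                              iter->counters.pcnt);
                  ++i;
          }

          for_each_possible_cpu(cpu) {        /* ...while every rule is read */
                  if (cpu == curcpu)          /*    for every possible CPU   */
                          continue;
                  i = 0;
                  xt_info_wrlock(cpu);
                  xt_entry_foreach(iter, t->entries[cpu], t->size) {
                          ADD_COUNTER(counters[i], iter->counters.bcnt,
                                      iter->counters.pcnt);
                          ++i;
                  }
                  xt_info_wrunlock(cpu);
          }
          local_bh_enable();                  /* NIC RX softirq may run again */
  }

With a ruleset of ~81,000 rules and 32 possible CPUs this inner loop body runs a couple of million times before local_bh_enable() is reached, which matches the ~100 ms BH-off window measured with ftrace later in the thread.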
* Re: Possible regression: Packet drops during iptables calls 2010-12-14 15:31 ` Eric Dumazet @ 2010-12-14 16:09 ` Jesper Dangaard Brouer 2010-12-14 16:24 ` Eric Dumazet 0 siblings, 1 reply; 26+ messages in thread From: Jesper Dangaard Brouer @ 2010-12-14 16:09 UTC (permalink / raw) To: Eric Dumazet; +Cc: Stephen Hemminger, netfilter-devel, netdev On Tue, 2010-12-14 at 16:31 +0100, Eric Dumazet wrote: > Le mardi 14 décembre 2010 à 15:46 +0100, Jesper Dangaard Brouer a > écrit : > > I'm experiencing RX packet drops during call to iptables, on my > > production servers. > > > > Further investigations showed, that its only the CPU executing the > > iptables command that experience packet drops!? Thus, a quick fix was > > to force the iptables command to run on one of the idle CPUs (This can > > be achieved with the "taskset" command). > > > > I have a 2x Xeon 5550 CPU system, thus 16 CPUs (with HT enabled). We > > only use 8 CPUs due to a multiqueue limitation of 8 queues in the > > 1Gbit/s NICs (82576 chips). CPUs 0 to 7 is assigned for packet > > processing via smp_affinity. > > > > Can someone explain why the packet drops only occur on the CPU > > executing the iptables command? > > > > It blocks BH > > take a look at commits : > > 24b36f0193467fa727b85b4c004016a8dae999b9 > netfilter: {ip,ip6,arp}_tables: dont block bottom half more than > necessary > > 001389b9581c13fe5fc357a0f89234f85af4215d > netfilter: {ip,ip6,arp}_tables: avoid lockdep false positive > > for attempts to let BH fly ... > > Unfortunately, lockdep rules :( Is the lockdep check a false positive? Could I run with 24b36f0193 in production, to fix my problem? I forgot to mention I run kernel 2.6.35.8-comx01+ (based on Greg's stable kernel tree). $ git describe --contains 24b36f019346 v2.6.36-rc1~571^2~46^2~7 $ git describe --contains 001389b9581c1 v2.6.36-rc3~2^2~42 > > What can we do to solve this issue? Any ideas how we can proceed? Looking closer at the two combined code changes, I see that the code path has been improved (a bit), as the local BH is only disabled inside the for_each_possible_cpu(cpu) loop. Before, local_bh was disabled for the whole function. Guess I need to reproduce this in my testlab. Thanks for your 'ninja' input ;-) -- Med venlig hilsen / Best regards Jesper Brouer ComX Networks A/S Linux Network Kernel Developer Cand. Scient Datalog / MSc.CS Author of http://adsl-optimizer.dk LinkedIn: http://www.linkedin.com/in/brouer -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Possible regression: Packet drops during iptables calls 2010-12-14 16:09 ` Jesper Dangaard Brouer @ 2010-12-14 16:24 ` Eric Dumazet 2010-12-16 14:04 ` Jesper Dangaard Brouer 0 siblings, 1 reply; 26+ messages in thread From: Eric Dumazet @ 2010-12-14 16:24 UTC (permalink / raw) To: Jesper Dangaard Brouer; +Cc: Stephen Hemminger, netfilter-devel, netdev Le mardi 14 décembre 2010 à 17:09 +0100, Jesper Dangaard Brouer a écrit : > On Tue, 2010-12-14 at 16:31 +0100, Eric Dumazet wrote: > > Le mardi 14 décembre 2010 à 15:46 +0100, Jesper Dangaard Brouer a > > écrit : > > > I'm experiencing RX packet drops during call to iptables, on my > > > production servers. > > > > > > Further investigations showed, that its only the CPU executing the > > > iptables command that experience packet drops!? Thus, a quick fix was > > > to force the iptables command to run on one of the idle CPUs (This can > > > be achieved with the "taskset" command). > > > > > > I have a 2x Xeon 5550 CPU system, thus 16 CPUs (with HT enabled). We > > > only use 8 CPUs due to a multiqueue limitation of 8 queues in the > > > 1Gbit/s NICs (82576 chips). CPUs 0 to 7 is assigned for packet > > > processing via smp_affinity. > > > > > > Can someone explain why the packet drops only occur on the CPU > > > executing the iptables command? > > > > > > > It blocks BH > > > > take a look at commits : > > > > 24b36f0193467fa727b85b4c004016a8dae999b9 > > netfilter: {ip,ip6,arp}_tables: dont block bottom half more than > > necessary > > > > 001389b9581c13fe5fc357a0f89234f85af4215d > > netfilter: {ip,ip6,arp}_tables: avoid lockdep false positive > > > > for attempts to let BH fly ... > > > > Unfortunately, lockdep rules :( > > Is the lockdep check a false positive? Yes its a false positive. > Could I run with 24b36f0193 in production, to fix my problem? > Yes, but you could also run a kernel with both commits: We now block BH for each cpu we are "summing", instead of blocking BH for the whole 16 possible cpus summation. (so BH should be blocked for smaller amount of time) > I forgot to mention I run kernel 2.6.35.8-comx01+ (based on Greg's stable kernel tree). > > $ git describe --contains 24b36f019346 > v2.6.36-rc1~571^2~46^2~7 > $ git describe --contains 001389b9581c1 > v2.6.36-rc3~2^2~42 > > > > > What can we do to solve this issue? > > Any ideas how we can proceed? > > Looking closer at the two combined code change, I see that the code path > has been improved (a bit), as the local BH is only disabled inside the > for_each_possible_cpu(cpu). Before local_bh was disabled for the hole > function. Guess I need to reproduce this in my testlab. > Yes, so current kernel is a bit better. Note that even with the 'false positive' problem, we had to blocks BH for the current cpu sum, so the max BH latency is probably the same with or without 001389b9581c13fe5 ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Possible regression: Packet drops during iptables calls 2010-12-14 16:24 ` Eric Dumazet @ 2010-12-16 14:04 ` Jesper Dangaard Brouer 2010-12-16 14:12 ` Eric Dumazet ` (2 more replies) 0 siblings, 3 replies; 26+ messages in thread From: Jesper Dangaard Brouer @ 2010-12-16 14:04 UTC (permalink / raw) To: Arnaldo Carvalho de Melo, Steven Rostedt, Eric Dumazet, Alexander Duyck Cc: Stephen Hemminger, netfilter-devel, netdev, Peter P Waskiewicz Jr On Tue, 2010-12-14 at 17:24 +0100, Eric Dumazet wrote: > Le mardi 14 décembre 2010 à 17:09 +0100, Jesper Dangaard Brouer a écrit : > > On Tue, 2010-12-14 at 16:31 +0100, Eric Dumazet wrote: > > > Le mardi 14 décembre 2010 à 15:46 +0100, Jesper Dangaard Brouer a > > > écrit : > > > > I'm experiencing RX packet drops during call to iptables, on my > > > > production servers. > > > > > > > > Further investigations showed, that its only the CPU executing the > > > > iptables command that experience packet drops!? Thus, a quick fix was > > > > to force the iptables command to run on one of the idle CPUs (This can > > > > be achieved with the "taskset" command). > > > > > > > > I have a 2x Xeon 5550 CPU system, thus 16 CPUs (with HT enabled). We > > > > only use 8 CPUs due to a multiqueue limitation of 8 queues in the > > > > 1Gbit/s NICs (82576 chips). CPUs 0 to 7 is assigned for packet > > > > processing via smp_affinity. > > > > > > > > Can someone explain why the packet drops only occur on the CPU > > > > executing the iptables command? > > > > > > > > > > It blocks BH > > > > > > take a look at commits : > > > > > > 24b36f0193467fa727b85b4c004016a8dae999b9 > > > netfilter: {ip,ip6,arp}_tables: dont block bottom half more than > > > necessary > > > > > > 001389b9581c13fe5fc357a0f89234f85af4215d > > > netfilter: {ip,ip6,arp}_tables: avoid lockdep false positiv <... cut ...> > > > > Looking closer at the two combined code change, I see that the code path > > has been improved (a bit), as the local BH is only disabled inside the > > for_each_possible_cpu(cpu). Before local_bh was disabled for the hole > > function. Guess I need to reproduce this in my testlab. To do some further investigation into the unfortunate behavior of the iptables get_counters() function I started to use "ftrace". This is a really useful tool (thanks Steven Rostedt). # Select the tracer "function_graph" echo function_graph > /sys/kernel/debug/tracing/current_tracer # Limit the number of functions we look at: echo local_bh_\* > /sys/kernel/debug/tracing/set_ftrace_filter echo get_counters >> /sys/kernel/debug/tracing/set_ftrace_filter # Enable tracing while calling iptables cd /sys/kernel/debug/tracing echo 0 > trace echo 1 > tracing_enabled; taskset 1 iptables -vnL > /dev/null ; echo 0 > tracing_enabled cat trace | less The reduced output: # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 2) 2.772 us | local_bh_disable(); .... 0) 0.228 us | local_bh_enable(); 0) | get_counters() { 0) 0.232 us | local_bh_disable(); 0) 7.919 us | local_bh_enable(); 0) ! 109467.1 us | } 0) 2.344 us | local_bh_disable(); The output shows that we spend no less than 100 ms with local BH disabled. So, no wonder that this causes packet drops in the NIC (attached to this CPU). The iptables ruleset in question is also very large; it contains: Chains: 20929 Rules: 81239 The vmalloc size is approx 19 MB (19,820,544 bytes) (see /proc/vmallocinfo). 
Looking through vmallocinfo I realized that even though I only have 16 CPUs, there are 32 allocated rulesets "xt_alloc_table_info" (for the filter table). Thus, I have approx 634 MB of iptables filter rules in the kernel, half of which is totally unused. Guess this is because we use "for_each_possible_cpu" instead of "for_each_online_cpu". (Feel free to fix this, or point me to some documentation of this CPU hotplug stuff... I see we are missing get_cpu() and put_cpu() in a lot of places). The GOOD NEWS is that moving the local BH disable section into the "for_each_possible_cpu" loop fixed the problem with packet drops during iptables calls. I wanted to profile with ftrace on the new code, but I cannot get the measurement I want. Perhaps Steven or Acme can help? Now I want to measure the time used between local_bh_disable() and local_bh_enable(), within the loop. I cannot figure out how to do that. The new trace looks almost the same as before, just with a lot of local_bh_* calls inside the get_counters() function call. My guess is that the time spent is: 100 ms / 32 = 3.125 ms. Now I just need to calculate how large a NIC buffer I need to buffer 3.125 ms at 1 Gbit/s. 3.125 ms * 1 Gbit/s = 390625 bytes Can this be correct? How much buffer does each queue have in the 82576 NIC? (I hope Alexander Duyck can answer this one?) -- Med venlig hilsen / Best regards Jesper Brouer ComX Networks A/S Linux Network Kernel Developer Cand. Scient Datalog / MSc.CS Author of http://adsl-optimizer.dk LinkedIn: http://www.linkedin.com/in/brouer -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 26+ messages in thread
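A quick sanity check on that buffer figure (assuming the link really runs at the full 1 Gbit/s and, for the frame count, full-size 1500-byte frames; both are illustrative assumptions, not measurements from the thread):

  1 Gbit/s = 125,000,000 bytes/s
  125,000,000 bytes/s * 0.003125 s = 390,625 bytes  (~381 KiB)
  390,625 bytes / 1500 bytes/frame = ~260 full-size frames

So 390625 bytes is arithmetically correct: for each 3.125 ms stall the affected RX queue has to absorb on the order of a few hundred full-size frames, and more if the traffic mix contains smaller packets.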
* Re: Possible regression: Packet drops during iptables calls 2010-12-16 14:04 ` Jesper Dangaard Brouer @ 2010-12-16 14:12 ` Eric Dumazet 2010-12-16 14:24 ` Jesper Dangaard Brouer 2010-12-16 14:13 ` Possible regression: Packet drops during iptables calls Eric Dumazet 2010-12-16 14:20 ` Steven Rostedt 2 siblings, 1 reply; 26+ messages in thread From: Eric Dumazet @ 2010-12-16 14:12 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Arnaldo Carvalho de Melo, Steven Rostedt, Alexander Duyck, Stephen Hemminger, netfilter-devel, netdev, Peter P Waskiewicz Jr Le jeudi 16 décembre 2010 à 15:04 +0100, Jesper Dangaard Brouer a écrit : > On Tue, 2010-12-14 at 17:24 +0100, Eric Dumazet wrote: > > Le mardi 14 décembre 2010 à 17:09 +0100, Jesper Dangaard Brouer a écrit : > > > On Tue, 2010-12-14 at 16:31 +0100, Eric Dumazet wrote: > > > > Le mardi 14 décembre 2010 à 15:46 +0100, Jesper Dangaard Brouer a > > > > écrit : > > > > > I'm experiencing RX packet drops during call to iptables, on my > > > > > production servers. > > > > > > > > > > Further investigations showed, that its only the CPU executing the > > > > > iptables command that experience packet drops!? Thus, a quick fix was > > > > > to force the iptables command to run on one of the idle CPUs (This can > > > > > be achieved with the "taskset" command). > > > > > > > > > > I have a 2x Xeon 5550 CPU system, thus 16 CPUs (with HT enabled). We > > > > > only use 8 CPUs due to a multiqueue limitation of 8 queues in the > > > > > 1Gbit/s NICs (82576 chips). CPUs 0 to 7 is assigned for packet > > > > > processing via smp_affinity. > > > > > > > > > > Can someone explain why the packet drops only occur on the CPU > > > > > executing the iptables command? > > > > > > > > > > > > > It blocks BH > > > > > > > > take a look at commits : > > > > > > > > 24b36f0193467fa727b85b4c004016a8dae999b9 > > > > netfilter: {ip,ip6,arp}_tables: dont block bottom half more than > > > > necessary > > > > > > > > 001389b9581c13fe5fc357a0f89234f85af4215d > > > > netfilter: {ip,ip6,arp}_tables: avoid lockdep false positiv > <... cut ...> > > > > > > Looking closer at the two combined code change, I see that the code path > > > has been improved (a bit), as the local BH is only disabled inside the > > > for_each_possible_cpu(cpu). Before local_bh was disabled for the hole > > > function. Guess I need to reproduce this in my testlab. > > > To do some further investigation into the unfortunate behavior of the > iptables get_counters() function I started to use "ftrace". This is a > really useful tool (thanks Steven Rostedt). > > # Select the tracer "function_graph" > echo function_graph > /sys/kernel/debug/tracing/current_tracer > > # Limit the number of function we look at: > echo local_bh_\* > /sys/kernel/debug/tracing/set_ftrace_filter > echo get_counters >> /sys/kernel/debug/tracing/set_ftrace_filter > > # Enable tracing while calling iptables > cd /sys/kernel/debug/tracing > echo 0 > trace > echo 1 > tracing_enabled; > taskset 1 iptables -vnL > /dev/null ; > echo 0 > tracing_enabled > cat trace | less > > > The reduced output: > > # tracer: function_graph > # > # CPU DURATION FUNCTION CALLS > # | | | | | | | > 2) 2.772 us | local_bh_disable(); > .... > 0) 0.228 us | local_bh_enable(); > 0) | get_counters() { > 0) 0.232 us | local_bh_disable(); > 0) 7.919 us | local_bh_enable(); > 0) ! 109467.1 us | } > 0) 2.344 us | local_bh_disable(); > > > The output show that we spend no less that 100 ms with local BH > disabled. 
So, no wonder that this causes packet drops in the NIC > (attached to this CPU). > > My iptables rule set in question is also very large, it contains: > Chains: 20929 > Rules: 81239 > > The vmalloc size is approx 19 MB (19.820.544 bytes) (see > /proc/vmallocinfo). Looking through vmallocinfo I realized that > even-though I only have 16 CPUs, there is 32 allocated rulesets > "xt_alloc_table_info" (for the filter table). Thus, I have approx > 634MB iptables filter rules in the kernel, half of which is totally > unused. Boot your machine with : "maxcpus=16 possible_cpus=16", it will be much better ;) > > Guess this is because we use: "for_each_possible_cpu" instead of > "for_each_online_cpu". (Feel free to fix this, or point me to some > documentation of this CPU hotplug stuff... I see we are missing > get_cpu() and put_cpu() a lot of places). Are you really using cpu hotplug ? If not, the "maxcpus=16 possible_cpus=16" trick should be enough for you. > > > The GOOD NEWS, is that moving the local BH disable section into the > "for_each_possible_cpu" fixed the problem with packet drops during > iptables calls. > > I wanted to profile with ftrace on the new code, but I cannot get the > measurement I want. Perhaps Steven or Acme can help? > > Now I want to measure the time used between the local_bh_disable() and > local_bh_enable, within the loop. I cannot figure out howto do that? > The new trace looks almost the same as before, just a lot of > local_bh_* inside the get_counters() function call. > > Guess is that the time spend is: 100 ms / 32 = 3.125 ms. > yes, approximatly. In order to accelerate, you could eventually pre-fill cpu cache before the local_bh_disable() (just reading the table). So that critical section is short, because mostly in your cpu cache. > Now I just need to calculate, how large a NIC buffer I need to buffer > 3.125 ms at 1Gbit/s. > > 3.125 ms * 1Gbit/s = 390625 bytes > > Can this be correct? > > How much buffer does each queue have in the 82576 NIC? > (Hope Alexander Duyck can answer this one?) > -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Possible regression: Packet drops during iptables calls 2010-12-16 14:12 ` Eric Dumazet @ 2010-12-16 14:24 ` Jesper Dangaard Brouer 2010-12-16 14:29 ` Eric Dumazet 0 siblings, 1 reply; 26+ messages in thread From: Jesper Dangaard Brouer @ 2010-12-16 14:24 UTC (permalink / raw) To: Eric Dumazet Cc: Arnaldo Carvalho de Melo, Steven Rostedt, Alexander Duyck, Stephen Hemminger, netfilter-devel, netdev, Peter P Waskiewicz Jr On Thu, 2010-12-16 at 15:12 +0100, Eric Dumazet wrote: > Le jeudi 16 décembre 2010 à 15:04 +0100, Jesper Dangaard Brouer a > > The vmalloc size is approx 19 MB (19.820.544 bytes) (see > > /proc/vmallocinfo). Looking through vmallocinfo I realized that > > even-though I only have 16 CPUs, there is 32 allocated rulesets > > "xt_alloc_table_info" (for the filter table). Thus, I have approx > > 634MB iptables filter rules in the kernel, half of which is totally > > unused. > > Boot your machine with : "maxcpus=16 possible_cpus=16", it will be much > better ;) Good trick, I'll use that. > > Guess this is because we use: "for_each_possible_cpu" instead of > > "for_each_online_cpu". (Feel free to fix this, or point me to some > > documentation of this CPU hotplug stuff... I see we are missing > > get_cpu() and put_cpu() a lot of places). > > Are you really using cpu hotplug ? If not, the "maxcpus=16 > possible_cpus=16" trick should be enough for you. No, I'm not using hotplug CPUs. It's just a pity that we waste kernel memory on this for everyone who does not know the "maxcpus=16 possible_cpus=16" trick... But as I don't have a hotplug CPU system, I have no chance of testing an eventual code fix/patch. > > > > In order to accelerate, you could eventually pre-fill cpu cache before > the local_bh_disable() (just reading the table). So that critical > section is short, because mostly in your cpu cache. In my case I think this will not help. I'll kill the cache anyway, as the ruleset is 19 MB and my CPU cache is 8 MB. -- Med venlig hilsen / Best regards Jesper Brouer ComX Networks A/S Linux Network Kernel Developer Cand. Scient Datalog / MSc.CS Author of http://adsl-optimizer.dk LinkedIn: http://www.linkedin.com/in/brouer -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Possible regression: Packet drops during iptables calls 2010-12-16 14:24 ` Jesper Dangaard Brouer @ 2010-12-16 14:29 ` Eric Dumazet 2010-12-16 15:02 ` Eric Dumazet 0 siblings, 1 reply; 26+ messages in thread From: Eric Dumazet @ 2010-12-16 14:29 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Arnaldo Carvalho de Melo, Steven Rostedt, Alexander Duyck, Stephen Hemminger, netfilter-devel, netdev, Peter P Waskiewicz Jr Le jeudi 16 décembre 2010 à 15:24 +0100, Jesper Dangaard Brouer a écrit : > In my case I think this will not help. I'll kill the cache anyways, as > the ruleset is 19MB and my CPU cache is 8MB. > > Yep ;) By the way, you speak of a 'possible regression', but we always masked BH while doing get_counters(). Only very recent kernels are masking them for each unit (cpu) of work. There was an attempt to use a lockless read for each counter (using a seqlock), but it was not completed. I guess we could do something to resurrect this idea. -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 26+ messages in thread
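The lockless-read idea Eric refers to, shown in isolation before the actual patches below (a minimal sketch of the generic Linux seqcount pattern; the posted patches differ in detail, and the per-cpu variable name counters_seq is simply the one Eric's first draft uses): the packet path, which is already serialized per-cpu, bumps a sequence count around its counter update, and the dump path retries its unlocked read of the 64-bit pair whenever the sequence changed underneath it, so it never needs to disable BH:

  /* Minimal sketch of the seqcount idea, not the exact posted patch.
   * Writer runs in the packet path (softirq, per-cpu, already serialized);
   * reader runs in process context from the "iptables -L" dump path.
   */
  static DEFINE_PER_CPU(seqcount_t, counters_seq);

  /* writer side, in ipt_do_table() */
  seqcount_t *myseq = &__get_cpu_var(counters_seq);

  write_seqcount_begin(myseq);
  ADD_COUNTER(e->counters, skb->len, 1);     /* 2 x u64 update, not atomic */
  write_seqcount_end(myseq);

  /* reader side, in get_counters(), once per rule and per cpu */
  seqcount_t *seq = &per_cpu(counters_seq, cpu);
  u64 bcnt, pcnt;
  unsigned int start;

  do {
          start = read_seqcount_begin(seq);  /* snapshot the sequence      */
          bcnt  = iter->counters.bcnt;
          pcnt  = iter->counters.pcnt;
  } while (read_seqcount_retry(seq, start)); /* retry if a writer ran      */

  ADD_COUNTER(counters[i], bcnt, pcnt);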
* Re: Possible regression: Packet drops during iptables calls 2010-12-16 14:29 ` Eric Dumazet @ 2010-12-16 15:02 ` Eric Dumazet 2010-12-16 16:07 ` [PATCH net-next-2.6] netfilter: ip_tables: dont block BH while reading counters Eric Dumazet 0 siblings, 1 reply; 26+ messages in thread From: Eric Dumazet @ 2010-12-16 15:02 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Arnaldo Carvalho de Melo, Steven Rostedt, Alexander Duyck, Stephen Hemminger, netfilter-devel, netdev, Peter P Waskiewicz Jr Le jeudi 16 décembre 2010 à 15:29 +0100, Eric Dumazet a écrit : > Le jeudi 16 décembre 2010 à 15:24 +0100, Jesper Dangaard Brouer a > écrit : > > > In my case I think this will not help. I'll kill the cache anyways, as > > the ruleset is 19MB and my CPU cache is 8MB. > > > > > > Yep ;) > > By the way, you speak of a 'possible regression', but we always masked > BH while doing get_counters(). > > Only very recent kernels are masking them for each unit (cpu) of work. > > There was attempt to use a lockless read for each counter (using a > seqlock), but it was not completed. I guess we could do something to > ressurect this idea. > > Something like following patch : net/ipv4/netfilter/ip_tables.c | 51 +++++++++++++------------------ 1 files changed, 22 insertions(+), 29 deletions(-) diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c index a846d63..ed54f80 100644 --- a/net/ipv4/netfilter/ip_tables.c +++ b/net/ipv4/netfilter/ip_tables.c @@ -293,6 +293,8 @@ struct ipt_entry *ipt_next_entry(const struct ipt_entry *entry) return (void *)entry + entry->next_offset; } +static DEFINE_PER_CPU(seqcount_t, counters_seq); + /* Returns one of the generic firewall policies, like NF_ACCEPT. */ unsigned int ipt_do_table(struct sk_buff *skb, @@ -311,6 +313,7 @@ ipt_do_table(struct sk_buff *skb, unsigned int *stackptr, origptr, cpu; const struct xt_table_info *private; struct xt_action_param acpar; + seqcount_t *seq; /* Initialization */ ip = ip_hdr(skb); @@ -364,7 +367,11 @@ ipt_do_table(struct sk_buff *skb, goto no_match; } + seq = &__get_cpu_var(counters_seq); + /* could be faster if we had this_cpu_write_seqcount_begin() */ + write_seqcount_begin(seq); ADD_COUNTER(e->counters, skb->len, 1); + write_seqcount_end(seq); t = ipt_get_target(e); IP_NF_ASSERT(t->u.kernel.target); @@ -877,6 +884,7 @@ translate_table(struct net *net, struct xt_table_info *newinfo, void *entry0, return ret; } + static void get_counters(const struct xt_table_info *t, struct xt_counters counters[]) @@ -884,42 +892,27 @@ get_counters(const struct xt_table_info *t, struct ipt_entry *iter; unsigned int cpu; unsigned int i; - unsigned int curcpu = get_cpu(); - - /* Instead of clearing (by a previous call to memset()) - * the counters and using adds, we set the counters - * with data used by 'current' CPU. 
- * - * Bottom half has to be disabled to prevent deadlock - * if new softirq were to run and call ipt_do_table - */ - local_bh_disable(); - i = 0; - xt_entry_foreach(iter, t->entries[curcpu], t->size) { - SET_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); - ++i; - } - local_bh_enable(); - /* Processing counters from other cpus, we can let bottom half enabled, - * (preemption is disabled) - */ + + memset(counters, 0, sizeof(struct xt_counters) * t->size); for_each_possible_cpu(cpu) { - if (cpu == curcpu) - continue; + seqcount_t *seq = &per_cpu(counters_seq, cpu); + i = 0; - local_bh_disable(); - xt_info_wrlock(cpu); xt_entry_foreach(iter, t->entries[cpu], t->size) { - ADD_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); + u64 bcnt, pcnt; + unsigned int start; + + do { + start = read_seqcount_begin(seq); + bcnt = iter->counters.bcnt; + pcnt = iter->counters.pcnt; + } while (read_seqcount_retry(seq, start)); + + ADD_COUNTER(counters[i], bcnt, pcnt); ++i; /* macro does multi eval of i */ } - xt_info_wrunlock(cpu); - local_bh_enable(); } - put_cpu(); } static struct xt_counters *alloc_counters(const struct xt_table *table) -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH net-next-2.6] netfilter: ip_tables: dont block BH while reading counters 2010-12-16 15:02 ` Eric Dumazet @ 2010-12-16 16:07 ` Eric Dumazet 2010-12-16 16:53 ` [PATCH v2 " Eric Dumazet 0 siblings, 1 reply; 26+ messages in thread From: Eric Dumazet @ 2010-12-16 16:07 UTC (permalink / raw) To: Jesper Dangaard Brouer, Patrick McHardy Cc: Arnaldo Carvalho de Melo, Steven Rostedt, Alexander Duyck, Stephen Hemminger, netfilter-devel, netdev, Peter P Waskiewicz Jr Le jeudi 16 décembre 2010 à 16:02 +0100, Eric Dumazet a écrit : > Le jeudi 16 décembre 2010 à 15:29 +0100, Eric Dumazet a écrit : > > Le jeudi 16 décembre 2010 à 15:24 +0100, Jesper Dangaard Brouer a > > écrit : > > > > > In my case I think this will not help. I'll kill the cache anyways, as > > > the ruleset is 19MB and my CPU cache is 8MB. > > > > > > > > > > Yep ;) > > > > By the way, you speak of a 'possible regression', but we always masked > > BH while doing get_counters(). > > > > Only very recent kernels are masking them for each unit (cpu) of work. > > > > There was attempt to use a lockless read for each counter (using a > > seqlock), but it was not completed. I guess we could do something to > > ressurect this idea. > > > > > > Something like following patch : Here is a tested version : no need for a (buggy in previous patch) memset() if we use vzalloc() Note : We miss a this_cpu_write_seqcount_begin() interface. I'll bug lkml to get it asap. Thanks [PATCH net-next-2.6] netfilter: ip_tables: dont block BH while reading counters Using "iptables -L" with a lot of rules might have a too big BH latency. Jesper mentioned ~6 ms and worried of frame drops. Switch to a per_cpu seqcount scheme, so that taking a snapshot of counters doesnt need to block BH (for this cpu, but also other cpus). This slow down a bit each counter updates, using two extra increments. Note : We miss a this_cpu_write_seqcount_begin() interface, so we are forced to compute the address of our per_cpu seqcount to call write_seqcount_begin(). Once available, overhead will be exactly two "incl %gs:counters_seq" instructions on x86 Reported-by: Jesper Dangaard Brouer <hawk@comx.dk> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> --- net/ipv4/netfilter/ip_tables.c | 52 ++++++++++++------------------- 1 file changed, 21 insertions(+), 31 deletions(-) diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c index a846d63..ae18ead 100644 --- a/net/ipv4/netfilter/ip_tables.c +++ b/net/ipv4/netfilter/ip_tables.c @@ -293,6 +293,8 @@ struct ipt_entry *ipt_next_entry(const struct ipt_entry *entry) return (void *)entry + entry->next_offset; } +static DEFINE_PER_CPU(seqcount_t, counters_seq); + /* Returns one of the generic firewall policies, like NF_ACCEPT. 
*/ unsigned int ipt_do_table(struct sk_buff *skb, @@ -311,6 +313,7 @@ ipt_do_table(struct sk_buff *skb, unsigned int *stackptr, origptr, cpu; const struct xt_table_info *private; struct xt_action_param acpar; + seqcount_t *seq; /* Initialization */ ip = ip_hdr(skb); @@ -364,7 +367,11 @@ ipt_do_table(struct sk_buff *skb, goto no_match; } + seq = &__get_cpu_var(counters_seq); + /* could be faster if we had this_cpu_write_seqcount_begin() */ + write_seqcount_begin(seq); ADD_COUNTER(e->counters, skb->len, 1); + write_seqcount_end(seq); t = ipt_get_target(e); IP_NF_ASSERT(t->u.kernel.target); @@ -884,42 +891,25 @@ get_counters(const struct xt_table_info *t, struct ipt_entry *iter; unsigned int cpu; unsigned int i; - unsigned int curcpu = get_cpu(); - - /* Instead of clearing (by a previous call to memset()) - * the counters and using adds, we set the counters - * with data used by 'current' CPU. - * - * Bottom half has to be disabled to prevent deadlock - * if new softirq were to run and call ipt_do_table - */ - local_bh_disable(); - i = 0; - xt_entry_foreach(iter, t->entries[curcpu], t->size) { - SET_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); - ++i; - } - local_bh_enable(); - /* Processing counters from other cpus, we can let bottom half enabled, - * (preemption is disabled) - */ for_each_possible_cpu(cpu) { - if (cpu == curcpu) - continue; + seqcount_t *seq = &per_cpu(counters_seq, cpu); + i = 0; - local_bh_disable(); - xt_info_wrlock(cpu); xt_entry_foreach(iter, t->entries[cpu], t->size) { - ADD_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); + u64 bcnt, pcnt; + unsigned int start; + + do { + start = read_seqcount_begin(seq); + bcnt = iter->counters.bcnt; + pcnt = iter->counters.pcnt; + } while (read_seqcount_retry(seq, start)); + + ADD_COUNTER(counters[i], bcnt, pcnt); ++i; /* macro does multi eval of i */ } - xt_info_wrunlock(cpu); - local_bh_enable(); } - put_cpu(); } static struct xt_counters *alloc_counters(const struct xt_table *table) @@ -932,7 +922,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) (other than comefrom, which userspace doesn't care about). */ countersize = sizeof(struct xt_counters) * private->number; - counters = vmalloc(countersize); + counters = vzalloc(countersize); if (counters == NULL) return ERR_PTR(-ENOMEM); @@ -1203,7 +1193,7 @@ __do_replace(struct net *net, const char *name, unsigned int valid_hooks, struct ipt_entry *iter; ret = 0; - counters = vmalloc(num_counters * sizeof(struct xt_counters)); + counters = vzalloc(num_counters * sizeof(struct xt_counters)); if (!counters) { ret = -ENOMEM; goto out; -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v2 net-next-2.6] netfilter: ip_tables: dont block BH while reading counters 2010-12-16 16:07 ` [PATCH net-next-2.6] netfilter: ip_tables: dont block BH while reading counters Eric Dumazet @ 2010-12-16 16:53 ` Eric Dumazet 2010-12-16 17:31 ` Stephen Hemminger 0 siblings, 1 reply; 26+ messages in thread From: Eric Dumazet @ 2010-12-16 16:53 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Patrick McHardy, Arnaldo Carvalho de Melo, Steven Rostedt, Alexander Duyck, Stephen Hemminger, netfilter-devel, netdev, Peter P Waskiewicz Jr Le jeudi 16 décembre 2010 à 17:07 +0100, Eric Dumazet a écrit : > Here is a tested version : no need for a (buggy in previous patch) > memset() if we use vzalloc() > > Note : We miss a this_cpu_write_seqcount_begin() interface. > I'll bug lkml to get it asap. Well, we have a faster solution : Add seqcount in "struct xt_info_lock" so that we make the increment pair once per table, not once per rule, and we already have the seq address, so no need for this_cpu_write_seqcount_begin() interface. [PATCH v2 net-next-2.6] netfilter: ip_tables: dont block BH while reading counters Using "iptables -L" with a lot of rules might have a too big BH latency. Jesper mentioned ~6 ms and worried of frame drops. Switch to a per_cpu seqcount scheme, so that taking a snapshot of counters doesnt need to block BH (for this cpu, but also other cpus). Reported-by: Jesper Dangaard Brouer <hawk@comx.dk> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> --- include/linux/netfilter/x_tables.h | 9 ++++- net/ipv4/netfilter/ip_tables.c | 45 ++++++++------------------- 2 files changed, 21 insertions(+), 33 deletions(-) diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h index 742bec0..7027762 100644 --- a/include/linux/netfilter/x_tables.h +++ b/include/linux/netfilter/x_tables.h @@ -473,6 +473,7 @@ extern void xt_free_table_info(struct xt_table_info *info); */ struct xt_info_lock { spinlock_t lock; + seqcount_t seq; unsigned char readers; }; DECLARE_PER_CPU(struct xt_info_lock, xt_info_locks); @@ -496,16 +497,20 @@ static inline void xt_info_rdlock_bh(void) local_bh_disable(); lock = &__get_cpu_var(xt_info_locks); - if (likely(!lock->readers++)) + if (likely(!lock->readers++)) { spin_lock(&lock->lock); + write_seqcount_begin(&lock->seq); + } } static inline void xt_info_rdunlock_bh(void) { struct xt_info_lock *lock = &__get_cpu_var(xt_info_locks); - if (likely(!--lock->readers)) + if (likely(!--lock->readers)) { + write_seqcount_end(&lock->seq); spin_unlock(&lock->lock); + } local_bh_enable(); } diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c index a846d63..7fe3d7c 100644 --- a/net/ipv4/netfilter/ip_tables.c +++ b/net/ipv4/netfilter/ip_tables.c @@ -884,42 +884,25 @@ get_counters(const struct xt_table_info *t, struct ipt_entry *iter; unsigned int cpu; unsigned int i; - unsigned int curcpu = get_cpu(); - - /* Instead of clearing (by a previous call to memset()) - * the counters and using adds, we set the counters - * with data used by 'current' CPU. 
- * - * Bottom half has to be disabled to prevent deadlock - * if new softirq were to run and call ipt_do_table - */ - local_bh_disable(); - i = 0; - xt_entry_foreach(iter, t->entries[curcpu], t->size) { - SET_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); - ++i; - } - local_bh_enable(); - /* Processing counters from other cpus, we can let bottom half enabled, - * (preemption is disabled) - */ for_each_possible_cpu(cpu) { - if (cpu == curcpu) - continue; + seqcount_t *seq = &per_cpu(xt_info_locks, cpu).seq; + i = 0; - local_bh_disable(); - xt_info_wrlock(cpu); xt_entry_foreach(iter, t->entries[cpu], t->size) { - ADD_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); + u64 bcnt, pcnt; + unsigned int start; + + do { + start = read_seqcount_begin(seq); + bcnt = iter->counters.bcnt; + pcnt = iter->counters.pcnt; + } while (read_seqcount_retry(seq, start)); + + ADD_COUNTER(counters[i], bcnt, pcnt); ++i; /* macro does multi eval of i */ } - xt_info_wrunlock(cpu); - local_bh_enable(); } - put_cpu(); } static struct xt_counters *alloc_counters(const struct xt_table *table) @@ -932,7 +915,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) (other than comefrom, which userspace doesn't care about). */ countersize = sizeof(struct xt_counters) * private->number; - counters = vmalloc(countersize); + counters = vzalloc(countersize); if (counters == NULL) return ERR_PTR(-ENOMEM); @@ -1203,7 +1186,7 @@ __do_replace(struct net *net, const char *name, unsigned int valid_hooks, struct ipt_entry *iter; ret = 0; - counters = vmalloc(num_counters * sizeof(struct xt_counters)); + counters = vzalloc(num_counters * sizeof(struct xt_counters)); if (!counters) { ret = -ENOMEM; goto out; ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH v2 net-next-2.6] netfilter: ip_tables: dont block BH while reading counters 2010-12-16 16:53 ` [PATCH v2 " Eric Dumazet @ 2010-12-16 17:31 ` Stephen Hemminger 2010-12-16 17:53 ` [PATCH v3 net-next-2.6] netfilter: x_tables: " Eric Dumazet 0 siblings, 1 reply; 26+ messages in thread From: Stephen Hemminger @ 2010-12-16 17:31 UTC (permalink / raw) To: Eric Dumazet Cc: Jesper Dangaard Brouer, Patrick McHardy, Arnaldo Carvalho de Melo, Steven Rostedt, Alexander Duyck, netfilter-devel, netdev, Peter P Waskiewicz Jr On Thu, 16 Dec 2010 17:53:56 +0100 Eric Dumazet <eric.dumazet@gmail.com> wrote: > spinlock_t lock; > + seqcount_t seq; Since lock and seqcount_t are associated together, why isn't this a seqlock instead? -- ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH v3 net-next-2.6] netfilter: x_tables: dont block BH while reading counters 2010-12-16 17:31 ` Stephen Hemminger @ 2010-12-16 17:53 ` Eric Dumazet 2010-12-16 17:57 ` Stephen Hemminger ` (2 more replies) 0 siblings, 3 replies; 26+ messages in thread From: Eric Dumazet @ 2010-12-16 17:53 UTC (permalink / raw) To: Stephen Hemminger Cc: Jesper Dangaard Brouer, Patrick McHardy, Arnaldo Carvalho de Melo, Steven Rostedt, Alexander Duyck, netfilter-devel, netdev, Peter P Waskiewicz Jr Le jeudi 16 décembre 2010 à 09:31 -0800, Stephen Hemminger a écrit : > On Thu, 16 Dec 2010 17:53:56 +0100 > Eric Dumazet <eric.dumazet@gmail.com> wrote: > > > spinlock_t lock; > > + seqcount_t seq; > > Since lock and seqcount_t are associated together, why isn't this a seqlock instead? > Absolutely, I found this a bit later while preparing final patch. Thanks ! [PATCH v3 net-next-2.6] netfilter: x_tables: dont block BH while reading counters Using "iptables -L" with a lot of rules might have a too big BH latency. Jesper mentioned ~6 ms and worried of frame drops. Switch to a per_cpu seqcount scheme, so that taking a snapshot of counters doesnt need to block BH (for this cpu, but also other cpus). Reported-by: Jesper Dangaard Brouer <hawk@comx.dk> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> --- include/linux/netfilter/x_tables.h | 10 +++--- net/ipv4/netfilter/arp_tables.c | 45 ++++++++------------------- net/ipv4/netfilter/ip_tables.c | 45 ++++++++------------------- net/ipv6/netfilter/ip6_tables.c | 45 ++++++++------------------- 4 files changed, 47 insertions(+), 98 deletions(-) diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h index 742bec0..6712e71 100644 --- a/include/linux/netfilter/x_tables.h +++ b/include/linux/netfilter/x_tables.h @@ -472,7 +472,7 @@ extern void xt_free_table_info(struct xt_table_info *info); * necessary for reading the counters. 
*/ struct xt_info_lock { - spinlock_t lock; + seqlock_t lock; unsigned char readers; }; DECLARE_PER_CPU(struct xt_info_lock, xt_info_locks); @@ -497,7 +497,7 @@ static inline void xt_info_rdlock_bh(void) local_bh_disable(); lock = &__get_cpu_var(xt_info_locks); if (likely(!lock->readers++)) - spin_lock(&lock->lock); + write_seqlock(&lock->lock); } static inline void xt_info_rdunlock_bh(void) @@ -505,7 +505,7 @@ static inline void xt_info_rdunlock_bh(void) struct xt_info_lock *lock = &__get_cpu_var(xt_info_locks); if (likely(!--lock->readers)) - spin_unlock(&lock->lock); + write_sequnlock(&lock->lock); local_bh_enable(); } @@ -516,12 +516,12 @@ static inline void xt_info_rdunlock_bh(void) */ static inline void xt_info_wrlock(unsigned int cpu) { - spin_lock(&per_cpu(xt_info_locks, cpu).lock); + write_seqlock(&per_cpu(xt_info_locks, cpu).lock); } static inline void xt_info_wrunlock(unsigned int cpu) { - spin_unlock(&per_cpu(xt_info_locks, cpu).lock); + write_sequnlock(&per_cpu(xt_info_locks, cpu).lock); } /* diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c index 3fac340..e855fff 100644 --- a/net/ipv4/netfilter/arp_tables.c +++ b/net/ipv4/netfilter/arp_tables.c @@ -710,42 +710,25 @@ static void get_counters(const struct xt_table_info *t, struct arpt_entry *iter; unsigned int cpu; unsigned int i; - unsigned int curcpu = get_cpu(); - - /* Instead of clearing (by a previous call to memset()) - * the counters and using adds, we set the counters - * with data used by 'current' CPU - * - * Bottom half has to be disabled to prevent deadlock - * if new softirq were to run and call ipt_do_table - */ - local_bh_disable(); - i = 0; - xt_entry_foreach(iter, t->entries[curcpu], t->size) { - SET_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); - ++i; - } - local_bh_enable(); - /* Processing counters from other cpus, we can let bottom half enabled, - * (preemption is disabled) - */ for_each_possible_cpu(cpu) { - if (cpu == curcpu) - continue; + seqlock_t *lock = &per_cpu(xt_info_locks, cpu).lock; + i = 0; - local_bh_disable(); - xt_info_wrlock(cpu); xt_entry_foreach(iter, t->entries[cpu], t->size) { - ADD_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); + u64 bcnt, pcnt; + unsigned int start; + + do { + start = read_seqbegin(lock); + bcnt = iter->counters.bcnt; + pcnt = iter->counters.pcnt; + } while (read_seqretry(lock, start)); + + ADD_COUNTER(counters[i], bcnt, pcnt); ++i; } - xt_info_wrunlock(cpu); - local_bh_enable(); } - put_cpu(); } static struct xt_counters *alloc_counters(const struct xt_table *table) @@ -759,7 +742,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) * about). 
*/ countersize = sizeof(struct xt_counters) * private->number; - counters = vmalloc(countersize); + counters = vzalloc(countersize); if (counters == NULL) return ERR_PTR(-ENOMEM); @@ -1007,7 +990,7 @@ static int __do_replace(struct net *net, const char *name, struct arpt_entry *iter; ret = 0; - counters = vmalloc(num_counters * sizeof(struct xt_counters)); + counters = vzalloc(num_counters * sizeof(struct xt_counters)); if (!counters) { ret = -ENOMEM; goto out; diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c index a846d63..652efea 100644 --- a/net/ipv4/netfilter/ip_tables.c +++ b/net/ipv4/netfilter/ip_tables.c @@ -884,42 +884,25 @@ get_counters(const struct xt_table_info *t, struct ipt_entry *iter; unsigned int cpu; unsigned int i; - unsigned int curcpu = get_cpu(); - - /* Instead of clearing (by a previous call to memset()) - * the counters and using adds, we set the counters - * with data used by 'current' CPU. - * - * Bottom half has to be disabled to prevent deadlock - * if new softirq were to run and call ipt_do_table - */ - local_bh_disable(); - i = 0; - xt_entry_foreach(iter, t->entries[curcpu], t->size) { - SET_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); - ++i; - } - local_bh_enable(); - /* Processing counters from other cpus, we can let bottom half enabled, - * (preemption is disabled) - */ for_each_possible_cpu(cpu) { - if (cpu == curcpu) - continue; + seqlock_t *lock = &per_cpu(xt_info_locks, cpu).lock; + i = 0; - local_bh_disable(); - xt_info_wrlock(cpu); xt_entry_foreach(iter, t->entries[cpu], t->size) { - ADD_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); + u64 bcnt, pcnt; + unsigned int start; + + do { + start = read_seqbegin(lock); + bcnt = iter->counters.bcnt; + pcnt = iter->counters.pcnt; + } while (read_seqretry(lock, start)); + + ADD_COUNTER(counters[i], bcnt, pcnt); ++i; /* macro does multi eval of i */ } - xt_info_wrunlock(cpu); - local_bh_enable(); } - put_cpu(); } static struct xt_counters *alloc_counters(const struct xt_table *table) @@ -932,7 +915,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) (other than comefrom, which userspace doesn't care about). 
*/ countersize = sizeof(struct xt_counters) * private->number; - counters = vmalloc(countersize); + counters = vzalloc(countersize); if (counters == NULL) return ERR_PTR(-ENOMEM); @@ -1203,7 +1186,7 @@ __do_replace(struct net *net, const char *name, unsigned int valid_hooks, struct ipt_entry *iter; ret = 0; - counters = vmalloc(num_counters * sizeof(struct xt_counters)); + counters = vzalloc(num_counters * sizeof(struct xt_counters)); if (!counters) { ret = -ENOMEM; goto out; diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c index 4555823..7d227c6 100644 --- a/net/ipv6/netfilter/ip6_tables.c +++ b/net/ipv6/netfilter/ip6_tables.c @@ -897,42 +897,25 @@ get_counters(const struct xt_table_info *t, struct ip6t_entry *iter; unsigned int cpu; unsigned int i; - unsigned int curcpu = get_cpu(); - - /* Instead of clearing (by a previous call to memset()) - * the counters and using adds, we set the counters - * with data used by 'current' CPU - * - * Bottom half has to be disabled to prevent deadlock - * if new softirq were to run and call ipt_do_table - */ - local_bh_disable(); - i = 0; - xt_entry_foreach(iter, t->entries[curcpu], t->size) { - SET_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); - ++i; - } - local_bh_enable(); - /* Processing counters from other cpus, we can let bottom half enabled, - * (preemption is disabled) - */ for_each_possible_cpu(cpu) { - if (cpu == curcpu) - continue; + seqlock_t *lock = &per_cpu(xt_info_locks, cpu).lock; + i = 0; - local_bh_disable(); - xt_info_wrlock(cpu); xt_entry_foreach(iter, t->entries[cpu], t->size) { - ADD_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); + u64 bcnt, pcnt; + unsigned int start; + + do { + start = read_seqbegin(lock); + bcnt = iter->counters.bcnt; + pcnt = iter->counters.pcnt; + } while (read_seqretry(lock, start)); + + ADD_COUNTER(counters[i], bcnt, pcnt); ++i; } - xt_info_wrunlock(cpu); - local_bh_enable(); } - put_cpu(); } static struct xt_counters *alloc_counters(const struct xt_table *table) @@ -945,7 +928,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) (other than comefrom, which userspace doesn't care about). */ countersize = sizeof(struct xt_counters) * private->number; - counters = vmalloc(countersize); + counters = vzalloc(countersize); if (counters == NULL) return ERR_PTR(-ENOMEM); @@ -1216,7 +1199,7 @@ __do_replace(struct net *net, const char *name, unsigned int valid_hooks, struct ip6t_entry *iter; ret = 0; - counters = vmalloc(num_counters * sizeof(struct xt_counters)); + counters = vzalloc(num_counters * sizeof(struct xt_counters)); if (!counters) { ret = -ENOMEM; goto out; -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH v3 net-next-2.6] netfilter: x_tables: dont block BH while reading counters 2010-12-16 17:53 ` [PATCH v3 net-next-2.6] netfilter: x_tables: " Eric Dumazet @ 2010-12-16 17:57 ` Stephen Hemminger 2010-12-16 19:58 ` Eric Dumazet 2010-12-16 17:57 ` Stephen Hemminger 2010-12-18 4:29 ` [PATCH v4 " Eric Dumazet 2 siblings, 1 reply; 26+ messages in thread From: Stephen Hemminger @ 2010-12-16 17:57 UTC (permalink / raw) To: Eric Dumazet Cc: Jesper Dangaard Brouer, Patrick McHardy, Arnaldo Carvalho de Melo, Steven Rostedt, Alexander Duyck, netfilter-devel, netdev, Peter P Waskiewicz Jr On Thu, 16 Dec 2010 18:53:06 +0100 Eric Dumazet <eric.dumazet@gmail.com> wrote: > @@ -759,7 +742,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) > * about). > */ > countersize = sizeof(struct xt_counters) * private->number; > - counters = vmalloc(countersize); > + counters = vzalloc(countersize); > > if (counters == NULL) > return ERR_PTR(-ENOMEM); > @@ -1007,7 +990,7 @@ static int __do_replace(struct net *net, const char *name, > struct arpt_entry *iter; > > ret = 0; > - counters = vmalloc(num_counters * sizeof(struct xt_counters)); > + counters = vzalloc(num_counters * sizeof(struct xt_counters)); > if (!counters) { > ret = -ENOMEM; > goto out; This seems like a different and unrelated change. -- ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 net-next-2.6] netfilter: x_tables: dont block BH while reading counters 2010-12-16 17:57 ` Stephen Hemminger @ 2010-12-16 19:58 ` Eric Dumazet 2010-12-16 20:12 ` Stephen Hemminger 0 siblings, 1 reply; 26+ messages in thread From: Eric Dumazet @ 2010-12-16 19:58 UTC (permalink / raw) To: Stephen Hemminger Cc: Jesper Dangaard Brouer, Patrick McHardy, Arnaldo Carvalho de Melo, Steven Rostedt, Alexander Duyck, netfilter-devel, netdev, Peter P Waskiewicz Jr Le jeudi 16 décembre 2010 à 09:57 -0800, Stephen Hemminger a écrit : > On Thu, 16 Dec 2010 18:53:06 +0100 > Eric Dumazet <eric.dumazet@gmail.com> wrote: > > > @@ -759,7 +742,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) > > * about). > > */ > > countersize = sizeof(struct xt_counters) * private->number; > > - counters = vmalloc(countersize); > > + counters = vzalloc(countersize); > > > > if (counters == NULL) > > return ERR_PTR(-ENOMEM); > > @@ -1007,7 +990,7 @@ static int __do_replace(struct net *net, const char *name, > > struct arpt_entry *iter; > > > > ret = 0; > > - counters = vmalloc(num_counters * sizeof(struct xt_counters)); > > + counters = vzalloc(num_counters * sizeof(struct xt_counters)); > > if (!counters) { > > ret = -ENOMEM; > > goto out; > > This seems like a different and unrelated change. > Since you later Acked the patch, you probably know that we now provide to get_counters() a zeroed area, since we do a sum for each possible cpu, I am answering anyway for other readers ;) Thanks for reviewing ! -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 net-next-2.6] netfilter: x_tables: dont block BH while reading counters 2010-12-16 19:58 ` Eric Dumazet @ 2010-12-16 20:12 ` Stephen Hemminger 2010-12-16 20:40 ` Eric Dumazet 0 siblings, 1 reply; 26+ messages in thread From: Stephen Hemminger @ 2010-12-16 20:12 UTC (permalink / raw) To: Eric Dumazet Cc: Jesper Dangaard Brouer, Patrick McHardy, Arnaldo Carvalho de Melo, Steven Rostedt, Alexander Duyck, netfilter-devel, netdev, Peter P Waskiewicz Jr On Thu, 16 Dec 2010 20:58:39 +0100 Eric Dumazet <eric.dumazet@gmail.com> wrote: > Le jeudi 16 décembre 2010 à 09:57 -0800, Stephen Hemminger a écrit : > > On Thu, 16 Dec 2010 18:53:06 +0100 > > Eric Dumazet <eric.dumazet@gmail.com> wrote: > > > > > @@ -759,7 +742,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) > > > * about). > > > */ > > > countersize = sizeof(struct xt_counters) * private->number; > > > - counters = vmalloc(countersize); > > > + counters = vzalloc(countersize); > > > > > > if (counters == NULL) > > > return ERR_PTR(-ENOMEM); > > > @@ -1007,7 +990,7 @@ static int __do_replace(struct net *net, const char *name, > > > struct arpt_entry *iter; > > > > > > ret = 0; > > > - counters = vmalloc(num_counters * sizeof(struct xt_counters)); > > > + counters = vzalloc(num_counters * sizeof(struct xt_counters)); > > > if (!counters) { > > > ret = -ENOMEM; > > > goto out; > > > > This seems like a different and unrelated change. > > > > Since you later Acked the patch, you probably know that we now provide > to get_counters() a zeroed area, since we do a sum for each possible > cpu, I am answering anyway for other readers ;) > > Thanks for reviewing ! You changed from: get local cpu counters then sum other cpus to: sum all cpu's. This is fine, and will give the same answer. -- ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 net-next-2.6] netfilter: x_tables: dont block BH while reading counters 2010-12-16 20:12 ` Stephen Hemminger @ 2010-12-16 20:40 ` Eric Dumazet 0 siblings, 0 replies; 26+ messages in thread From: Eric Dumazet @ 2010-12-16 20:40 UTC (permalink / raw) To: Stephen Hemminger Cc: Jesper Dangaard Brouer, Patrick McHardy, Arnaldo Carvalho de Melo, Steven Rostedt, Alexander Duyck, netfilter-devel, netdev, Peter P Waskiewicz Jr Le jeudi 16 décembre 2010 à 12:12 -0800, Stephen Hemminger a écrit : > You changed from: > get local cpu counters > then sum other cpus > to: > sum all cpu's. > > This is fine, and will give the same answer. > Yes, I did that because the fetch of individual counters is now more complex. I tried to make code smaller and less complex. Eventually we are now allowed to be preempted and even migrated to another cpu during the process. -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 net-next-2.6] netfilter: x_tables: dont block BH while reading counters 2010-12-16 17:53 ` [PATCH v3 net-next-2.6] netfilter: x_tables: " Eric Dumazet 2010-12-16 17:57 ` Stephen Hemminger @ 2010-12-16 17:57 ` Stephen Hemminger 2010-12-18 4:29 ` [PATCH v4 " Eric Dumazet 2 siblings, 0 replies; 26+ messages in thread From: Stephen Hemminger @ 2010-12-16 17:57 UTC (permalink / raw) To: Eric Dumazet Cc: Jesper Dangaard Brouer, Patrick McHardy, Arnaldo Carvalho de Melo, Steven Rostedt, Alexander Duyck, netfilter-devel, netdev, Peter P Waskiewicz Jr On Thu, 16 Dec 2010 18:53:06 +0100 Eric Dumazet <eric.dumazet@gmail.com> wrote: > [PATCH v3 net-next-2.6] netfilter: x_tables: dont block BH while reading counters > > Using "iptables -L" with a lot of rules might have a too big BH latency. > Jesper mentioned ~6 ms and worried of frame drops. > > Switch to a per_cpu seqcount scheme, so that taking a snapshot of > counters doesnt need to block BH (for this cpu, but also other cpus). > > Reported-by: Jesper Dangaard Brouer <hawk@comx.dk> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> -- ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH v4 net-next-2.6] netfilter: x_tables: dont block BH while reading counters 2010-12-16 17:53 ` [PATCH v3 net-next-2.6] netfilter: x_tables: " Eric Dumazet 2010-12-16 17:57 ` Stephen Hemminger 2010-12-16 17:57 ` Stephen Hemminger @ 2010-12-18 4:29 ` Eric Dumazet 2010-12-20 13:42 ` Jesper Dangaard Brouer 2011-01-08 16:45 ` Eric Dumazet 2 siblings, 2 replies; 26+ messages in thread From: Eric Dumazet @ 2010-12-18 4:29 UTC (permalink / raw) To: Patrick McHardy Cc: Jesper Dangaard Brouer, netfilter-devel, netdev, Stephen Hemminger Using "iptables -L" with a lot of rules have a too big BH latency. Jesper mentioned ~6 ms and worried of frame drops. Switch to a per_cpu seqlock scheme, so that taking a snapshot of counters doesnt need to block BH (for this cpu, but also other cpus). This adds two increments on seqlock sequence per ipt_do_table() call, its a reasonable cost for allowing "iptables -L" not block BH processing. Reported-by: Jesper Dangaard Brouer <hawk@comx.dk> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> --- v4: I forgot net/netfilter/x_tables.c change in v3 include/linux/netfilter/x_tables.h | 10 +++--- net/ipv4/netfilter/arp_tables.c | 45 ++++++++------------------- net/ipv4/netfilter/ip_tables.c | 45 ++++++++------------------- net/ipv6/netfilter/ip6_tables.c | 45 ++++++++------------------- net/netfilter/x_tables.c | 3 + 5 files changed, 49 insertions(+), 99 deletions(-) diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h index 742bec0..6712e71 100644 --- a/include/linux/netfilter/x_tables.h +++ b/include/linux/netfilter/x_tables.h @@ -472,7 +472,7 @@ extern void xt_free_table_info(struct xt_table_info *info); * necessary for reading the counters. 
*/ struct xt_info_lock { - spinlock_t lock; + seqlock_t lock; unsigned char readers; }; DECLARE_PER_CPU(struct xt_info_lock, xt_info_locks); @@ -497,7 +497,7 @@ static inline void xt_info_rdlock_bh(void) local_bh_disable(); lock = &__get_cpu_var(xt_info_locks); if (likely(!lock->readers++)) - spin_lock(&lock->lock); + write_seqlock(&lock->lock); } static inline void xt_info_rdunlock_bh(void) @@ -505,7 +505,7 @@ static inline void xt_info_rdunlock_bh(void) struct xt_info_lock *lock = &__get_cpu_var(xt_info_locks); if (likely(!--lock->readers)) - spin_unlock(&lock->lock); + write_sequnlock(&lock->lock); local_bh_enable(); } @@ -516,12 +516,12 @@ static inline void xt_info_rdunlock_bh(void) */ static inline void xt_info_wrlock(unsigned int cpu) { - spin_lock(&per_cpu(xt_info_locks, cpu).lock); + write_seqlock(&per_cpu(xt_info_locks, cpu).lock); } static inline void xt_info_wrunlock(unsigned int cpu) { - spin_unlock(&per_cpu(xt_info_locks, cpu).lock); + write_sequnlock(&per_cpu(xt_info_locks, cpu).lock); } /* diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c index 3fac340..e855fff 100644 --- a/net/ipv4/netfilter/arp_tables.c +++ b/net/ipv4/netfilter/arp_tables.c @@ -710,42 +710,25 @@ static void get_counters(const struct xt_table_info *t, struct arpt_entry *iter; unsigned int cpu; unsigned int i; - unsigned int curcpu = get_cpu(); - - /* Instead of clearing (by a previous call to memset()) - * the counters and using adds, we set the counters - * with data used by 'current' CPU - * - * Bottom half has to be disabled to prevent deadlock - * if new softirq were to run and call ipt_do_table - */ - local_bh_disable(); - i = 0; - xt_entry_foreach(iter, t->entries[curcpu], t->size) { - SET_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); - ++i; - } - local_bh_enable(); - /* Processing counters from other cpus, we can let bottom half enabled, - * (preemption is disabled) - */ for_each_possible_cpu(cpu) { - if (cpu == curcpu) - continue; + seqlock_t *lock = &per_cpu(xt_info_locks, cpu).lock; + i = 0; - local_bh_disable(); - xt_info_wrlock(cpu); xt_entry_foreach(iter, t->entries[cpu], t->size) { - ADD_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); + u64 bcnt, pcnt; + unsigned int start; + + do { + start = read_seqbegin(lock); + bcnt = iter->counters.bcnt; + pcnt = iter->counters.pcnt; + } while (read_seqretry(lock, start)); + + ADD_COUNTER(counters[i], bcnt, pcnt); ++i; } - xt_info_wrunlock(cpu); - local_bh_enable(); } - put_cpu(); } static struct xt_counters *alloc_counters(const struct xt_table *table) @@ -759,7 +742,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) * about). 
*/ countersize = sizeof(struct xt_counters) * private->number; - counters = vmalloc(countersize); + counters = vzalloc(countersize); if (counters == NULL) return ERR_PTR(-ENOMEM); @@ -1007,7 +990,7 @@ static int __do_replace(struct net *net, const char *name, struct arpt_entry *iter; ret = 0; - counters = vmalloc(num_counters * sizeof(struct xt_counters)); + counters = vzalloc(num_counters * sizeof(struct xt_counters)); if (!counters) { ret = -ENOMEM; goto out; diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c index a846d63..652efea 100644 --- a/net/ipv4/netfilter/ip_tables.c +++ b/net/ipv4/netfilter/ip_tables.c @@ -884,42 +884,25 @@ get_counters(const struct xt_table_info *t, struct ipt_entry *iter; unsigned int cpu; unsigned int i; - unsigned int curcpu = get_cpu(); - - /* Instead of clearing (by a previous call to memset()) - * the counters and using adds, we set the counters - * with data used by 'current' CPU. - * - * Bottom half has to be disabled to prevent deadlock - * if new softirq were to run and call ipt_do_table - */ - local_bh_disable(); - i = 0; - xt_entry_foreach(iter, t->entries[curcpu], t->size) { - SET_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); - ++i; - } - local_bh_enable(); - /* Processing counters from other cpus, we can let bottom half enabled, - * (preemption is disabled) - */ for_each_possible_cpu(cpu) { - if (cpu == curcpu) - continue; + seqlock_t *lock = &per_cpu(xt_info_locks, cpu).lock; + i = 0; - local_bh_disable(); - xt_info_wrlock(cpu); xt_entry_foreach(iter, t->entries[cpu], t->size) { - ADD_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); + u64 bcnt, pcnt; + unsigned int start; + + do { + start = read_seqbegin(lock); + bcnt = iter->counters.bcnt; + pcnt = iter->counters.pcnt; + } while (read_seqretry(lock, start)); + + ADD_COUNTER(counters[i], bcnt, pcnt); ++i; /* macro does multi eval of i */ } - xt_info_wrunlock(cpu); - local_bh_enable(); } - put_cpu(); } static struct xt_counters *alloc_counters(const struct xt_table *table) @@ -932,7 +915,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) (other than comefrom, which userspace doesn't care about). 
*/ countersize = sizeof(struct xt_counters) * private->number; - counters = vmalloc(countersize); + counters = vzalloc(countersize); if (counters == NULL) return ERR_PTR(-ENOMEM); @@ -1203,7 +1186,7 @@ __do_replace(struct net *net, const char *name, unsigned int valid_hooks, struct ipt_entry *iter; ret = 0; - counters = vmalloc(num_counters * sizeof(struct xt_counters)); + counters = vzalloc(num_counters * sizeof(struct xt_counters)); if (!counters) { ret = -ENOMEM; goto out; diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c index 4555823..7d227c6 100644 --- a/net/ipv6/netfilter/ip6_tables.c +++ b/net/ipv6/netfilter/ip6_tables.c @@ -897,42 +897,25 @@ get_counters(const struct xt_table_info *t, struct ip6t_entry *iter; unsigned int cpu; unsigned int i; - unsigned int curcpu = get_cpu(); - - /* Instead of clearing (by a previous call to memset()) - * the counters and using adds, we set the counters - * with data used by 'current' CPU - * - * Bottom half has to be disabled to prevent deadlock - * if new softirq were to run and call ipt_do_table - */ - local_bh_disable(); - i = 0; - xt_entry_foreach(iter, t->entries[curcpu], t->size) { - SET_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); - ++i; - } - local_bh_enable(); - /* Processing counters from other cpus, we can let bottom half enabled, - * (preemption is disabled) - */ for_each_possible_cpu(cpu) { - if (cpu == curcpu) - continue; + seqlock_t *lock = &per_cpu(xt_info_locks, cpu).lock; + i = 0; - local_bh_disable(); - xt_info_wrlock(cpu); xt_entry_foreach(iter, t->entries[cpu], t->size) { - ADD_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); + u64 bcnt, pcnt; + unsigned int start; + + do { + start = read_seqbegin(lock); + bcnt = iter->counters.bcnt; + pcnt = iter->counters.pcnt; + } while (read_seqretry(lock, start)); + + ADD_COUNTER(counters[i], bcnt, pcnt); ++i; } - xt_info_wrunlock(cpu); - local_bh_enable(); } - put_cpu(); } static struct xt_counters *alloc_counters(const struct xt_table *table) @@ -945,7 +928,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) (other than comefrom, which userspace doesn't care about). */ countersize = sizeof(struct xt_counters) * private->number; - counters = vmalloc(countersize); + counters = vzalloc(countersize); if (counters == NULL) return ERR_PTR(-ENOMEM); @@ -1216,7 +1199,7 @@ __do_replace(struct net *net, const char *name, unsigned int valid_hooks, struct ip6t_entry *iter; ret = 0; - counters = vmalloc(num_counters * sizeof(struct xt_counters)); + counters = vzalloc(num_counters * sizeof(struct xt_counters)); if (!counters) { ret = -ENOMEM; goto out; diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c index 8046350..c942376 100644 --- a/net/netfilter/x_tables.c +++ b/net/netfilter/x_tables.c @@ -1325,7 +1325,8 @@ static int __init xt_init(void) for_each_possible_cpu(i) { struct xt_info_lock *lock = &per_cpu(xt_info_locks, i); - spin_lock_init(&lock->lock); + + seqlock_init(&lock->lock); lock->readers = 0; } ^ permalink raw reply related [flat|nested] 26+ messages in thread
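For readers who want to play with the read side outside the kernel, here is a rough single-writer user-space analogue of the read_seqbegin()/read_seqretry() snapshot loop added above. It uses C11 atomics and fences where the kernel uses seqlock_t with smp_wmb()/smp_rmb(), it ignores the nested-softirq "readers" counting, and all names are made up for the sketch:

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

struct seq_counter {
    atomic_uint seq;              /* even: stable, odd: update in progress */
    _Atomic uint64_t bcnt, pcnt;  /* atomics only to keep the sketch race-free */
};

/* packet-path side: exactly one writer per counter (per cpu in the kernel) */
static void counter_add(struct seq_counter *c, uint64_t bytes)
{
    unsigned int s = atomic_load_explicit(&c->seq, memory_order_relaxed);

    atomic_store_explicit(&c->seq, s + 1, memory_order_relaxed);   /* begin */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&c->bcnt,
        atomic_load_explicit(&c->bcnt, memory_order_relaxed) + bytes,
        memory_order_relaxed);
    atomic_store_explicit(&c->pcnt,
        atomic_load_explicit(&c->pcnt, memory_order_relaxed) + 1,
        memory_order_relaxed);
    atomic_store_explicit(&c->seq, s + 2, memory_order_release);   /* end */
}

/* "iptables -L" side: lockless snapshot, retried if a writer raced with us */
static void counter_snapshot(struct seq_counter *c, uint64_t *b, uint64_t *p)
{
    unsigned int start;

    do {
        start = atomic_load_explicit(&c->seq, memory_order_acquire);
        *b = atomic_load_explicit(&c->bcnt, memory_order_relaxed);
        *p = atomic_load_explicit(&c->pcnt, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
    } while (atomic_load_explicit(&c->seq, memory_order_relaxed) != start ||
             (start & 1));
}

int main(void)
{
    struct seq_counter c = { 0 };
    uint64_t b, p;

    counter_add(&c, 1500);
    counter_snapshot(&c, &b, &p);
    printf("%llu bytes, %llu packets\n",
           (unsigned long long)b, (unsigned long long)p);
    return 0;
}

The reader never blocks the writer; it simply retries the copy if the sequence number changed underneath it, which is why the BH-disabled section disappears from the "iptables -L" path.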
* Re: [PATCH v4 net-next-2.6] netfilter: x_tables: dont block BH while reading counters 2010-12-18 4:29 ` [PATCH v4 " Eric Dumazet @ 2010-12-20 13:42 ` Jesper Dangaard Brouer 2010-12-20 14:45 ` Eric Dumazet 2011-01-08 16:45 ` Eric Dumazet 1 sibling, 1 reply; 26+ messages in thread From: Jesper Dangaard Brouer @ 2010-12-20 13:42 UTC (permalink / raw) To: Eric Dumazet; +Cc: Patrick McHardy, netfilter-devel, netdev, Stephen Hemminger I have tested the patch on 2.6.35. Which implies I also needed to cherry-pick 870f67dcf7ef, which implements the vzalloc() call. (Latest net-next has a problem with my HP CCISS driver/controller or the PCI layout, and will not boot) According to the function_graph trace, the execution time of get_counters() has increased (approx) from 109 ms to 120 ms, which is the expected result. The results are not all positive, but I think its related to the debugging options I have enabled. Because I now see packet drops if my 1Gbit/s pktgen script are sending packet with a packet size below 512 bytes, which is "only" approx 230 kpps (this is 1Gbit/s on my 10G labsetup where I have seen 5 Mpps). There is no packet overruns/drops, iif I run "iptables -vnL > /dev/null" without tracing enabled and only 1Gbit/s pktgen at 512 bytes packets. If I enable tracing while calling iptables I see packet drops/overruns. So I guess this is caused by the tracing overhead. I'll try to rerun my test without all the lock debugging options enabled. -- Med venlig hilsen / Best regards Jesper Brouer ComX Networks A/S Linux Network Kernel Developer Cand. Scient Datalog / MSc.CS Author of http://adsl-optimizer.dk LinkedIn: http://www.linkedin.com/in/brouer On Sat, 2010-12-18 at 05:29 +0100, Eric Dumazet wrote: > Using "iptables -L" with a lot of rules have a too big BH latency. > Jesper mentioned ~6 ms and worried of frame drops. > > Switch to a per_cpu seqlock scheme, so that taking a snapshot of > counters doesnt need to block BH (for this cpu, but also other cpus). > > This adds two increments on seqlock sequence per ipt_do_table() call, > its a reasonable cost for allowing "iptables -L" not block BH > processing. > > Reported-by: Jesper Dangaard Brouer <hawk@comx.dk> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> > Acked-by: Stephen Hemminger <shemminger@vyatta.com> > --- ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v4 net-next-2.6] netfilter: x_tables: dont block BH while reading counters 2010-12-20 13:42 ` Jesper Dangaard Brouer @ 2010-12-20 14:45 ` Eric Dumazet 2010-12-21 16:48 ` Jesper Dangaard Brouer 0 siblings, 1 reply; 26+ messages in thread From: Eric Dumazet @ 2010-12-20 14:45 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Patrick McHardy, netfilter-devel, netdev, Stephen Hemminger Le lundi 20 décembre 2010 à 14:42 +0100, Jesper Dangaard Brouer a écrit : > I have tested the patch on 2.6.35. Which implies I also needed to > cherry-pick 870f67dcf7ef, which implements the vzalloc() call. (Latest > net-next has a problem with my HP CCISS driver/controller or the PCI > layout, and will not boot) > Ah wait, you need to switch from cciss to hpsa driver, I hit same problem some weeks ago ;) (and eventually rename your partitions to /dev/sdaX instead of /dev/cciss/c0d0pX) > According to the function_graph trace, the execution time of > get_counters() has increased (approx) from 109 ms to 120 ms, which is > the expected result. > > The results are not all positive, but I think its related to the > debugging options I have enabled. > > Because I now see packet drops if my 1Gbit/s pktgen script are sending > packet with a packet size below 512 bytes, which is "only" approx 230 > kpps (this is 1Gbit/s on my 10G labsetup where I have seen 5 Mpps). > > There is no packet overruns/drops, iif I run "iptables -vnL > > /dev/null" without tracing enabled and only 1Gbit/s pktgen at 512 > bytes packets. If I enable tracing while calling iptables I see > packet drops/overruns. So I guess this is caused by the tracing > overhead. yes, probably :) > > I'll try to rerun my test without all the lock debugging options > enabled. > Thanks ! -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v4 net-next-2.6] netfilter: x_tables: dont block BH while reading counters 2010-12-20 14:45 ` Eric Dumazet @ 2010-12-21 16:48 ` Jesper Dangaard Brouer 0 siblings, 0 replies; 26+ messages in thread From: Jesper Dangaard Brouer @ 2010-12-21 16:48 UTC (permalink / raw) To: Eric Dumazet; +Cc: Patrick McHardy, netfilter-devel, netdev, Stephen Hemminger On Mon, 2010-12-20 at 15:45 +0100, Eric Dumazet wrote: ... > > There is no packet overruns/drops, iif I run "iptables -vnL > > > /dev/null" without tracing enabled and only 1Gbit/s pktgen at 512 > > bytes packets. If I enable tracing while calling iptables I see > > packet drops/overruns. So I guess this is caused by the tracing > > overhead. > > yes, probably :) > > > > > I'll try to rerun my test without all the lock debugging options > > enabled. Results are much better without the kernel debugging options enabled. I took the .config from production and enabled tracer "function_graph". And applied your patches (plus vzalloc) on top of 2.6.36-stable tree. I can now hit the system with a pktgen at 128 bytes, and see no drops/overruns while running iptables. (This packet load at 128bytes is 822 kpps and 840Mbit/s) (iptables ruleset is the big chains: 20929 rules: 81239). If I reduce the ftrace filter to only track get_counters, I can even run a trace without any drops. echo get_counters > /sys/kernel/debug/tracing/set_ftrace_filter Some trace funny stats on get_counters(), under the packet storm. When running iptables on a CPU not processing packets (via taskset), the execution time is increased to 124ms. If I force iptables to run on a CPU processing packets, the execution time is increased to 1308ms, which is large but the expected behavior. Acked-by: Jesper Dangaard Brouer <hawk@comx.dk> -- Med venlig hilsen / Best regards Jesper Brouer ComX Networks A/S Linux Network Kernel Developer Cand. Scient Datalog / MSc.CS Author of http://adsl-optimizer.dk LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v4 net-next-2.6] netfilter: x_tables: dont block BH while reading counters 2010-12-18 4:29 ` [PATCH v4 " Eric Dumazet 2010-12-20 13:42 ` Jesper Dangaard Brouer @ 2011-01-08 16:45 ` Eric Dumazet 2011-01-09 21:31 ` Pablo Neira Ayuso 1 sibling, 1 reply; 26+ messages in thread From: Eric Dumazet @ 2011-01-08 16:45 UTC (permalink / raw) To: Patrick McHardy, David Miller Cc: Jesper Dangaard Brouer, netfilter-devel, netdev, Stephen Hemminger David, I am resending this patch, sent 3 weeks ago, Patrick gave no answer. I believe it should be included in linux-2.6.38 and stable kernels. Some people found they had to change NIC RX ring sizes in order not missing frames (from 1024 to 2048), while root cause of the problem was this. Quoting Jesper : "I can now hit the system with a pktgen at 128 bytes, and see no drops/overruns while running iptables. (This packet load at 128bytes is 822 kpps and 840Mbit/s) (iptables ruleset is the big chains: 20929 rules: 81239)." Thanks [PATCH v4] netfilter: x_tables: dont block BH while reading counters Using "iptables -L" with a lot of rules have a too big BH latency. Jesper mentioned ~6 ms and worried of frame drops. Switch to a per_cpu seqlock scheme, so that taking a snapshot of counters doesnt need to block BH (for this cpu, but also other cpus). This adds two increments on seqlock sequence per ipt_do_table() call, its a reasonable cost for allowing "iptables -L" not block BH processing. Reported-by: Jesper Dangaard Brouer <hawk@comx.dk> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Patrick McHardy <kaber@trash.net> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Acked-by: Jesper Dangaard Brouer <hawk@comx.dk> --- include/linux/netfilter/x_tables.h | 10 +++--- net/ipv4/netfilter/arp_tables.c | 45 ++++++++------------------- net/ipv4/netfilter/ip_tables.c | 45 ++++++++------------------- net/ipv6/netfilter/ip6_tables.c | 45 ++++++++------------------- net/netfilter/x_tables.c | 3 + 5 files changed, 49 insertions(+), 99 deletions(-) diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h index 742bec0..6712e71 100644 --- a/include/linux/netfilter/x_tables.h +++ b/include/linux/netfilter/x_tables.h @@ -472,7 +472,7 @@ extern void xt_free_table_info(struct xt_table_info *info); * necessary for reading the counters. 
*/ struct xt_info_lock { - spinlock_t lock; + seqlock_t lock; unsigned char readers; }; DECLARE_PER_CPU(struct xt_info_lock, xt_info_locks); @@ -497,7 +497,7 @@ static inline void xt_info_rdlock_bh(void) local_bh_disable(); lock = &__get_cpu_var(xt_info_locks); if (likely(!lock->readers++)) - spin_lock(&lock->lock); + write_seqlock(&lock->lock); } static inline void xt_info_rdunlock_bh(void) @@ -505,7 +505,7 @@ static inline void xt_info_rdunlock_bh(void) struct xt_info_lock *lock = &__get_cpu_var(xt_info_locks); if (likely(!--lock->readers)) - spin_unlock(&lock->lock); + write_sequnlock(&lock->lock); local_bh_enable(); } @@ -516,12 +516,12 @@ static inline void xt_info_rdunlock_bh(void) */ static inline void xt_info_wrlock(unsigned int cpu) { - spin_lock(&per_cpu(xt_info_locks, cpu).lock); + write_seqlock(&per_cpu(xt_info_locks, cpu).lock); } static inline void xt_info_wrunlock(unsigned int cpu) { - spin_unlock(&per_cpu(xt_info_locks, cpu).lock); + write_sequnlock(&per_cpu(xt_info_locks, cpu).lock); } /* diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c index 3fac340..e855fff 100644 --- a/net/ipv4/netfilter/arp_tables.c +++ b/net/ipv4/netfilter/arp_tables.c @@ -710,42 +710,25 @@ static void get_counters(const struct xt_table_info *t, struct arpt_entry *iter; unsigned int cpu; unsigned int i; - unsigned int curcpu = get_cpu(); - - /* Instead of clearing (by a previous call to memset()) - * the counters and using adds, we set the counters - * with data used by 'current' CPU - * - * Bottom half has to be disabled to prevent deadlock - * if new softirq were to run and call ipt_do_table - */ - local_bh_disable(); - i = 0; - xt_entry_foreach(iter, t->entries[curcpu], t->size) { - SET_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); - ++i; - } - local_bh_enable(); - /* Processing counters from other cpus, we can let bottom half enabled, - * (preemption is disabled) - */ for_each_possible_cpu(cpu) { - if (cpu == curcpu) - continue; + seqlock_t *lock = &per_cpu(xt_info_locks, cpu).lock; + i = 0; - local_bh_disable(); - xt_info_wrlock(cpu); xt_entry_foreach(iter, t->entries[cpu], t->size) { - ADD_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); + u64 bcnt, pcnt; + unsigned int start; + + do { + start = read_seqbegin(lock); + bcnt = iter->counters.bcnt; + pcnt = iter->counters.pcnt; + } while (read_seqretry(lock, start)); + + ADD_COUNTER(counters[i], bcnt, pcnt); ++i; } - xt_info_wrunlock(cpu); - local_bh_enable(); } - put_cpu(); } static struct xt_counters *alloc_counters(const struct xt_table *table) @@ -759,7 +742,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) * about). 
*/ countersize = sizeof(struct xt_counters) * private->number; - counters = vmalloc(countersize); + counters = vzalloc(countersize); if (counters == NULL) return ERR_PTR(-ENOMEM); @@ -1007,7 +990,7 @@ static int __do_replace(struct net *net, const char *name, struct arpt_entry *iter; ret = 0; - counters = vmalloc(num_counters * sizeof(struct xt_counters)); + counters = vzalloc(num_counters * sizeof(struct xt_counters)); if (!counters) { ret = -ENOMEM; goto out; diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c index a846d63..652efea 100644 --- a/net/ipv4/netfilter/ip_tables.c +++ b/net/ipv4/netfilter/ip_tables.c @@ -884,42 +884,25 @@ get_counters(const struct xt_table_info *t, struct ipt_entry *iter; unsigned int cpu; unsigned int i; - unsigned int curcpu = get_cpu(); - - /* Instead of clearing (by a previous call to memset()) - * the counters and using adds, we set the counters - * with data used by 'current' CPU. - * - * Bottom half has to be disabled to prevent deadlock - * if new softirq were to run and call ipt_do_table - */ - local_bh_disable(); - i = 0; - xt_entry_foreach(iter, t->entries[curcpu], t->size) { - SET_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); - ++i; - } - local_bh_enable(); - /* Processing counters from other cpus, we can let bottom half enabled, - * (preemption is disabled) - */ for_each_possible_cpu(cpu) { - if (cpu == curcpu) - continue; + seqlock_t *lock = &per_cpu(xt_info_locks, cpu).lock; + i = 0; - local_bh_disable(); - xt_info_wrlock(cpu); xt_entry_foreach(iter, t->entries[cpu], t->size) { - ADD_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); + u64 bcnt, pcnt; + unsigned int start; + + do { + start = read_seqbegin(lock); + bcnt = iter->counters.bcnt; + pcnt = iter->counters.pcnt; + } while (read_seqretry(lock, start)); + + ADD_COUNTER(counters[i], bcnt, pcnt); ++i; /* macro does multi eval of i */ } - xt_info_wrunlock(cpu); - local_bh_enable(); } - put_cpu(); } static struct xt_counters *alloc_counters(const struct xt_table *table) @@ -932,7 +915,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) (other than comefrom, which userspace doesn't care about). 
*/ countersize = sizeof(struct xt_counters) * private->number; - counters = vmalloc(countersize); + counters = vzalloc(countersize); if (counters == NULL) return ERR_PTR(-ENOMEM); @@ -1203,7 +1186,7 @@ __do_replace(struct net *net, const char *name, unsigned int valid_hooks, struct ipt_entry *iter; ret = 0; - counters = vmalloc(num_counters * sizeof(struct xt_counters)); + counters = vzalloc(num_counters * sizeof(struct xt_counters)); if (!counters) { ret = -ENOMEM; goto out; diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c index 4555823..7d227c6 100644 --- a/net/ipv6/netfilter/ip6_tables.c +++ b/net/ipv6/netfilter/ip6_tables.c @@ -897,42 +897,25 @@ get_counters(const struct xt_table_info *t, struct ip6t_entry *iter; unsigned int cpu; unsigned int i; - unsigned int curcpu = get_cpu(); - - /* Instead of clearing (by a previous call to memset()) - * the counters and using adds, we set the counters - * with data used by 'current' CPU - * - * Bottom half has to be disabled to prevent deadlock - * if new softirq were to run and call ipt_do_table - */ - local_bh_disable(); - i = 0; - xt_entry_foreach(iter, t->entries[curcpu], t->size) { - SET_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); - ++i; - } - local_bh_enable(); - /* Processing counters from other cpus, we can let bottom half enabled, - * (preemption is disabled) - */ for_each_possible_cpu(cpu) { - if (cpu == curcpu) - continue; + seqlock_t *lock = &per_cpu(xt_info_locks, cpu).lock; + i = 0; - local_bh_disable(); - xt_info_wrlock(cpu); xt_entry_foreach(iter, t->entries[cpu], t->size) { - ADD_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); + u64 bcnt, pcnt; + unsigned int start; + + do { + start = read_seqbegin(lock); + bcnt = iter->counters.bcnt; + pcnt = iter->counters.pcnt; + } while (read_seqretry(lock, start)); + + ADD_COUNTER(counters[i], bcnt, pcnt); ++i; } - xt_info_wrunlock(cpu); - local_bh_enable(); } - put_cpu(); } static struct xt_counters *alloc_counters(const struct xt_table *table) @@ -945,7 +928,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) (other than comefrom, which userspace doesn't care about). */ countersize = sizeof(struct xt_counters) * private->number; - counters = vmalloc(countersize); + counters = vzalloc(countersize); if (counters == NULL) return ERR_PTR(-ENOMEM); @@ -1216,7 +1199,7 @@ __do_replace(struct net *net, const char *name, unsigned int valid_hooks, struct ip6t_entry *iter; ret = 0; - counters = vmalloc(num_counters * sizeof(struct xt_counters)); + counters = vzalloc(num_counters * sizeof(struct xt_counters)); if (!counters) { ret = -ENOMEM; goto out; diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c index 8046350..c942376 100644 --- a/net/netfilter/x_tables.c +++ b/net/netfilter/x_tables.c @@ -1325,7 +1325,8 @@ static int __init xt_init(void) for_each_possible_cpu(i) { struct xt_info_lock *lock = &per_cpu(xt_info_locks, i); - spin_lock_init(&lock->lock); + + seqlock_init(&lock->lock); lock->readers = 0; } ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH v4 net-next-2.6] netfilter: x_tables: dont block BH while reading counters 2011-01-08 16:45 ` Eric Dumazet @ 2011-01-09 21:31 ` Pablo Neira Ayuso 0 siblings, 0 replies; 26+ messages in thread From: Pablo Neira Ayuso @ 2011-01-09 21:31 UTC (permalink / raw) To: Eric Dumazet Cc: Patrick McHardy, David Miller, Jesper Dangaard Brouer, netfilter-devel, netdev, Stephen Hemminger On 08/01/11 17:45, Eric Dumazet wrote: > David, > > I am resending this patch, sent 3 weeks ago, Patrick gave no answer. > > I believe it should be included in linux-2.6.38 and stable kernels. > > Some people found they had to change NIC RX ring sizes in order not > missing frames (from 1024 to 2048), while root cause of the problem was > this. > > Quoting Jesper : "I can now hit the system with a pktgen at 128 bytes, > and see no drops/overruns while running iptables. (This packet load at > 128bytes is 822 kpps and 840Mbit/s) (iptables ruleset is the big chains: > 20929 rules: 81239)." I have enqueued this patch to me tree. http://1984.lsi.us.es/git/?p=net-2.6/.git;a=summary I'll pass it to davem for -stable submission. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Possible regression: Packet drops during iptables calls 2010-12-16 14:04 ` Jesper Dangaard Brouer 2010-12-16 14:12 ` Eric Dumazet @ 2010-12-16 14:13 ` Eric Dumazet 2010-12-16 14:20 ` Steven Rostedt 2 siblings, 0 replies; 26+ messages in thread From: Eric Dumazet @ 2010-12-16 14:13 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Arnaldo Carvalho de Melo, Steven Rostedt, Alexander Duyck, Stephen Hemminger, netfilter-devel, netdev, Peter P Waskiewicz Jr Le jeudi 16 décembre 2010 à 15:04 +0100, Jesper Dangaard Brouer a écrit : > Now I just need to calculate, how large a NIC buffer I need to buffer > 3.125 ms at 1Gbit/s. > > 3.125 ms * 1Gbit/s = 390625 bytes > > Can this be correct? > > How much buffer does each queue have in the 82576 NIC? > (Hope Alexander Duyck can answer this one?) > Worst case is if you receive very small frames, because 3.125 ms is about 5000 frames at 1Gbit/s -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 26+ messages in thread
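The arithmetic behind those two numbers, for anyone re-running it with their own stall time or link speed (the 84-byte minimum wire occupancy is standard Ethernet framing, not something measured in this thread):

#include <stdio.h>

int main(void)
{
    double link_bps = 1e9;       /* 1 Gbit/s */
    double stall_s  = 3.125e-3;  /* per-cpu BH-disabled window, from the trace */
    double bits     = link_bps * stall_s;

    /* a minimum-size frame occupies 64 + 8 (preamble/SFD) + 12 (IFG) = 84 bytes on the wire */
    double wire_bytes_per_min_frame = 84.0;
    double frames = bits / 8.0 / wire_bytes_per_min_frame;

    printf("bytes to buffer  : %.0f\n", bits / 8.0);   /* ~390625 */
    printf("worst-case frames: %.0f\n", frames);       /* ~4650, i.e. "about 5000" */
    return 0;
}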
* Re: Possible regression: Packet drops during iptables calls 2010-12-16 14:04 ` Jesper Dangaard Brouer 2010-12-16 14:12 ` Eric Dumazet 2010-12-16 14:13 ` Possible regression: Packet drops during iptables calls Eric Dumazet @ 2010-12-16 14:20 ` Steven Rostedt 2 siblings, 0 replies; 26+ messages in thread From: Steven Rostedt @ 2010-12-16 14:20 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Arnaldo Carvalho de Melo, Eric Dumazet, Alexander Duyck, Stephen Hemminger, netfilter-devel, netdev, Peter P Waskiewicz Jr On Thu, 2010-12-16 at 15:04 +0100, Jesper Dangaard Brouer wrote: > > To do some further investigation into the unfortunate behavior of the > iptables get_counters() function I started to use "ftrace". This is a > really useful tool (thanks Steven Rostedt). > > # Select the tracer "function_graph" > echo function_graph > /sys/kernel/debug/tracing/current_tracer > > # Limit the number of function we look at: > echo local_bh_\* > /sys/kernel/debug/tracing/set_ftrace_filter > echo get_counters >> /sys/kernel/debug/tracing/set_ftrace_filter > > # Enable tracing while calling iptables > cd /sys/kernel/debug/tracing > echo 0 > trace > echo 1 > tracing_enabled; > taskset 1 iptables -vnL > /dev/null ; > echo 0 > tracing_enabled > cat trace | less Just an fyi, you can do the above much easier with trace-cmd: git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git # trace-cmd record -p function_graph -l 'local_bh_*' -l get_counters taskset 1 iptables -vnL > /dev/null # trace-cmd report -- Steve > > > The reduced output: > > # tracer: function_graph > # > # CPU DURATION FUNCTION CALLS > # | | | | | | | > 2) 2.772 us | local_bh_disable(); > .... > 0) 0.228 us | local_bh_enable(); > 0) | get_counters() { > 0) 0.232 us | local_bh_disable(); > 0) 7.919 us | local_bh_enable(); > 0) ! 109467.1 us | } > 0) 2.344 us | local_bh_disable(); > > > The output show that we spend no less that 100 ms with local BH > disabled. So, no wonder that this causes packet drops in the NIC > (attached to this CPU). > > My iptables rule set in question is also very large, it contains: > Chains: 20929 > Rules: 81239 > > The vmalloc size is approx 19 MB (19.820.544 bytes) (see > /proc/vmallocinfo). Looking through vmallocinfo I realized that > even-though I only have 16 CPUs, there is 32 allocated rulesets > "xt_alloc_table_info" (for the filter table). Thus, I have approx > 634MB iptables filter rules in the kernel, half of which is totally > unused. > > Guess this is because we use: "for_each_possible_cpu" instead of > "for_each_online_cpu". (Feel free to fix this, or point me to some > documentation of this CPU hotplug stuff... I see we are missing > get_cpu() and put_cpu() a lot of places). > > > The GOOD NEWS, is that moving the local BH disable section into the > "for_each_possible_cpu" fixed the problem with packet drops during > iptables calls. > > I wanted to profile with ftrace on the new code, but I cannot get the > measurement I want. Perhaps Steven or Acme can help? > > Now I want to measure the time used between the local_bh_disable() and > local_bh_enable, within the loop. I cannot figure out howto do that? > The new trace looks almost the same as before, just a lot of > local_bh_* inside the get_counters() function call. > > Guess is that the time spend is: 100 ms / 32 = 3.125 ms. > > Now I just need to calculate, how large a NIC buffer I need to buffer > 3.125 ms at 1Gbit/s. > > 3.125 ms * 1Gbit/s = 390625 bytes > > Can this be correct? 
> > How much buffer does each queue have in the 82576 NIC? > (Hope Alexander Duyck can answer this one?) > ^ permalink raw reply [flat|nested] 26+ messages in thread
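On the for_each_possible_cpu observation quoted above, the memory figures also follow directly from the numbers Jesper posted (one ~19.8 MB xt_table_info copy per possible cpu, 32 possible vs 16 present). A tiny sketch, purely illustrative:

#include <stdio.h>

int main(void)
{
    /* numbers quoted from the vmallocinfo observation above */
    unsigned long per_copy = 19820544UL;  /* one filter-table xt_table_info blob */
    unsigned int  possible = 32;          /* copies allocated via for_each_possible_cpu */
    unsigned int  online   = 16;          /* CPUs the box actually has */

    /* decimal MB, matching the ~634 MB figure in the quoted mail */
    printf("allocated: %lu MB\n", per_copy * possible / 1000000);
    printf("used     : %lu MB\n", per_copy * online   / 1000000);
    printf("unused   : %lu MB\n", per_copy * (possible - online) / 1000000);
    return 0;
}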
end of thread, other threads:[~2011-01-09 21:31 UTC | newest]

Thread overview: 26+ messages:
2010-12-14 14:46 Possible regression: Packet drops during iptables calls Jesper Dangaard Brouer
2010-12-14 15:31 ` Eric Dumazet
2010-12-14 16:09 ` Jesper Dangaard Brouer
2010-12-14 16:24 ` Eric Dumazet
2010-12-16 14:04 ` Jesper Dangaard Brouer
2010-12-16 14:12 ` Eric Dumazet
2010-12-16 14:24 ` Jesper Dangaard Brouer
2010-12-16 14:29 ` Eric Dumazet
2010-12-16 15:02 ` Eric Dumazet
2010-12-16 16:07 ` [PATCH net-next-2.6] netfilter: ip_tables: dont block BH while reading counters Eric Dumazet
2010-12-16 16:53 ` [PATCH v2 " Eric Dumazet
2010-12-16 17:31 ` Stephen Hemminger
2010-12-16 17:53 ` [PATCH v3 net-next-2.6] netfilter: x_tables: " Eric Dumazet
2010-12-16 17:57 ` Stephen Hemminger
2010-12-16 19:58 ` Eric Dumazet
2010-12-16 20:12 ` Stephen Hemminger
2010-12-16 20:40 ` Eric Dumazet
2010-12-16 17:57 ` Stephen Hemminger
2010-12-18 4:29 ` [PATCH v4 " Eric Dumazet
2010-12-20 13:42 ` Jesper Dangaard Brouer
2010-12-20 14:45 ` Eric Dumazet
2010-12-21 16:48 ` Jesper Dangaard Brouer
2011-01-08 16:45 ` Eric Dumazet
2011-01-09 21:31 ` Pablo Neira Ayuso
2010-12-16 14:13 ` Possible regression: Packet drops during iptables calls Eric Dumazet
2010-12-16 14:20 ` Steven Rostedt