* Possible regression: Packet drops during iptables calls @ 2010-12-14 14:46 Jesper Dangaard Brouer 2010-12-14 15:31 ` Eric Dumazet 0 siblings, 1 reply; 26+ messages in thread From: Jesper Dangaard Brouer @ 2010-12-14 14:46 UTC (permalink / raw) To: Stephen Hemminger, netfilter-devel; +Cc: netdev I'm experiencing RX packet drops during calls to iptables on my production servers. Further investigation showed that it is only the CPU executing the iptables command that experiences packet drops!? Thus, a quick fix was to force the iptables command to run on one of the idle CPUs (this can be achieved with the "taskset" command). I have a 2x Xeon 5550 CPU system, thus 16 CPUs (with HT enabled). We only use 8 CPUs due to a multiqueue limitation of 8 queues in the 1Gbit/s NICs (82576 chips). CPUs 0 to 7 are assigned to packet processing via smp_affinity. Can someone explain why the packet drops only occur on the CPU executing the iptables command? What can we do to solve this issue? I should note that I have a very large ruleset on this machine, and the production machine is routing around 800 Mbit/s in each direction. The issue occurs on a simple iptables rule listing. I think (untested) the problem is related to this kernel git commit: commit 942e4a2bd680c606af0211e64eb216be2e19bf61 Author: Stephen Hemminger <shemminger@vyatta.com> Date: Tue Apr 28 22:36:33 2009 -0700 netfilter: revised locking for x_tables The x_tables are organized with a table structure and a per-cpu copies of the counters and rules. On older kernels there was a reader/writer lock per table which was a performance bottleneck. In 2.6.30-rc, this was converted to use RCU and the counters/rules which solved the performance problems for do_table but made replacing rules much slower because of the necessary RCU grace period. This version uses a per-cpu set of spinlocks and counters to allow to table processing to proceed without the cache thrashing of a global reader lock and keeps the same performance for table updates. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net> -- Med venlig hilsen / Best regards Jesper Brouer ComX Networks A/S Linux Network Kernel Developer Cand. Scient Datalog / MSc.CS Author of http://adsl-optimizer.dk LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Possible regression: Packet drops during iptables calls 2010-12-14 14:46 Possible regression: Packet drops during iptables calls Jesper Dangaard Brouer @ 2010-12-14 15:31 ` Eric Dumazet 2010-12-14 16:09 ` Jesper Dangaard Brouer 0 siblings, 1 reply; 26+ messages in thread From: Eric Dumazet @ 2010-12-14 15:31 UTC (permalink / raw) To: Jesper Dangaard Brouer; +Cc: Stephen Hemminger, netfilter-devel, netdev Le mardi 14 décembre 2010 à 15:46 +0100, Jesper Dangaard Brouer a écrit : > I'm experiencing RX packet drops during call to iptables, on my > production servers. > > Further investigations showed, that its only the CPU executing the > iptables command that experience packet drops!? Thus, a quick fix was > to force the iptables command to run on one of the idle CPUs (This can > be achieved with the "taskset" command). > > I have a 2x Xeon 5550 CPU system, thus 16 CPUs (with HT enabled). We > only use 8 CPUs due to a multiqueue limitation of 8 queues in the > 1Gbit/s NICs (82576 chips). CPUs 0 to 7 is assigned for packet > processing via smp_affinity. > > Can someone explain why the packet drops only occur on the CPU > executing the iptables command? > > It blocks BH take a look at commits : 24b36f0193467fa727b85b4c004016a8dae999b9 netfilter: {ip,ip6,arp}_tables: dont block bottom half more than necessary 001389b9581c13fe5fc357a0f89234f85af4215d netfilter: {ip,ip6,arp}_tables: avoid lockdep false positive for attempts to let BH fly ... Unfortunately, lockdep rules :( > What can we do to solve this issue? > > > I should note that I have a very large ruleset on this machine, and > the production machine is routing around 800 Mbit/s, in each > direction. The issue occurs on a simple iptables rule listing. > > > I think (untested) the problem is related to kernel git commit: > > commit 942e4a2bd680c606af0211e64eb216be2e19bf61 > Author: Stephen Hemminger <shemminger@vyatta.com> > Date: Tue Apr 28 22:36:33 2009 -0700 > > netfilter: revised locking for x_tables > > The x_tables are organized with a table structure and a per-cpu copies > of the counters and rules. On older kernels there was a reader/writer > lock per table which was a performance bottleneck. In 2.6.30-rc, this > was converted to use RCU and the counters/rules which solved the performance > problems for do_table but made replacing rules much slower because of > the necessary RCU grace period. > > This version uses a per-cpu set of spinlocks and counters to allow to > table processing to proceed without the cache thrashing of a global > reader lock and keeps the same performance for table updates. > > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> > Acked-by: Linus Torvalds <torvalds@linux-foundation.org> > Signed-off-by: David S. Miller <davem@davemloft.net> > -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 26+ messages in thread
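To make the "It blocks BH" point concrete, here is a simplified sketch of roughly how get_counters() looked in the 2.6.35-era kernel Jesper is running (condensed from the code the patches later in this thread rework; not a verbatim copy of the kernel source). Bottom halves stay disabled on the executing CPU while every rule is read for every possible CPU, so the NIC RX softirq on that CPU is starved for the whole walk:

  /* Simplified sketch (not verbatim 2.6.35 source) of the counter snapshot
   * path behind "iptables -L".  counters[] is the user-visible snapshot;
   * e->counters are the live per-cpu counters bumped by ipt_do_table() in
   * softirq context, which is why BH is kept off while reading them.
   */
  static void get_counters(const struct xt_table_info *t,
                           struct xt_counters counters[])
  {
          struct ipt_entry *iter;
          unsigned int cpu, i, curcpu;

          local_bh_disable();                 /* softirqs off on this CPU... */
          curcpu = smp_processor_id();

          i = 0;
          xt_entry_foreach(iter, t->entries[curcpu], t->size) {
                  SET_COUNTER(counters[i], iter->counters.bcnt,
                              iter->counters.pcnt);
                  ++i;
          }

          for_each_possible_cpu(cpu) {        /* ...while every rule is read */
                  if (cpu == curcpu)          /*    for every possible CPU   */
                          continue;
                  i = 0;
                  xt_info_wrlock(cpu);
                  xt_entry_foreach(iter, t->entries[cpu], t->size) {
                          ADD_COUNTER(counters[i], iter->counters.bcnt,
                                      iter->counters.pcnt);
                          ++i;
                  }
                  xt_info_wrunlock(cpu);
          }
          local_bh_enable();                  /* NIC RX softirq may run again */
  }

With a ruleset of ~81,000 rules and 32 possible CPUs this inner loop body runs a couple of million times before local_bh_enable() is reached, which matches the ~100 ms BH-off window measured with ftrace later in the thread.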
* Re: Possible regression: Packet drops during iptables calls 2010-12-14 15:31 ` Eric Dumazet @ 2010-12-14 16:09 ` Jesper Dangaard Brouer 2010-12-14 16:24 ` Eric Dumazet 0 siblings, 1 reply; 26+ messages in thread From: Jesper Dangaard Brouer @ 2010-12-14 16:09 UTC (permalink / raw) To: Eric Dumazet; +Cc: Stephen Hemminger, netfilter-devel, netdev On Tue, 2010-12-14 at 16:31 +0100, Eric Dumazet wrote: > Le mardi 14 décembre 2010 à 15:46 +0100, Jesper Dangaard Brouer a > écrit : > > I'm experiencing RX packet drops during call to iptables, on my > > production servers. > > > > Further investigations showed, that its only the CPU executing the > > iptables command that experience packet drops!? Thus, a quick fix was > > to force the iptables command to run on one of the idle CPUs (This can > > be achieved with the "taskset" command). > > > > I have a 2x Xeon 5550 CPU system, thus 16 CPUs (with HT enabled). We > > only use 8 CPUs due to a multiqueue limitation of 8 queues in the > > 1Gbit/s NICs (82576 chips). CPUs 0 to 7 is assigned for packet > > processing via smp_affinity. > > > > Can someone explain why the packet drops only occur on the CPU > > executing the iptables command? > > > > It blocks BH > > take a look at commits : > > 24b36f0193467fa727b85b4c004016a8dae999b9 > netfilter: {ip,ip6,arp}_tables: dont block bottom half more than > necessary > > 001389b9581c13fe5fc357a0f89234f85af4215d > netfilter: {ip,ip6,arp}_tables: avoid lockdep false positive > > for attempts to let BH fly ... > > Unfortunately, lockdep rules :( Is the lockdep check a false positive? Could I run with 24b36f0193 in production, to fix my problem? I forgot to mention I run kernel 2.6.35.8-comx01+ (based on Greg's stable kernel tree). $ git describe --contains 24b36f019346 v2.6.36-rc1~571^2~46^2~7 $ git describe --contains 001389b9581c1 v2.6.36-rc3~2^2~42 > > What can we do to solve this issue? Any ideas how we can proceed? Looking closer at the two combined code changes, I see that the code path has been improved (a bit), as the local BH is only disabled inside the for_each_possible_cpu(cpu) loop. Before, local_bh was disabled for the whole function. Guess I need to reproduce this in my testlab. Thanks for your 'ninja' input ;-) -- Med venlig hilsen / Best regards Jesper Brouer ComX Networks A/S Linux Network Kernel Developer Cand. Scient Datalog / MSc.CS Author of http://adsl-optimizer.dk LinkedIn: http://www.linkedin.com/in/brouer -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Possible regression: Packet drops during iptables calls 2010-12-14 16:09 ` Jesper Dangaard Brouer @ 2010-12-14 16:24 ` Eric Dumazet 2010-12-16 14:04 ` Jesper Dangaard Brouer 0 siblings, 1 reply; 26+ messages in thread From: Eric Dumazet @ 2010-12-14 16:24 UTC (permalink / raw) To: Jesper Dangaard Brouer; +Cc: Stephen Hemminger, netfilter-devel, netdev Le mardi 14 décembre 2010 à 17:09 +0100, Jesper Dangaard Brouer a écrit : > On Tue, 2010-12-14 at 16:31 +0100, Eric Dumazet wrote: > > Le mardi 14 décembre 2010 à 15:46 +0100, Jesper Dangaard Brouer a > > écrit : > > > I'm experiencing RX packet drops during call to iptables, on my > > > production servers. > > > > > > Further investigations showed, that its only the CPU executing the > > > iptables command that experience packet drops!? Thus, a quick fix was > > > to force the iptables command to run on one of the idle CPUs (This can > > > be achieved with the "taskset" command). > > > > > > I have a 2x Xeon 5550 CPU system, thus 16 CPUs (with HT enabled). We > > > only use 8 CPUs due to a multiqueue limitation of 8 queues in the > > > 1Gbit/s NICs (82576 chips). CPUs 0 to 7 is assigned for packet > > > processing via smp_affinity. > > > > > > Can someone explain why the packet drops only occur on the CPU > > > executing the iptables command? > > > > > > > It blocks BH > > > > take a look at commits : > > > > 24b36f0193467fa727b85b4c004016a8dae999b9 > > netfilter: {ip,ip6,arp}_tables: dont block bottom half more than > > necessary > > > > 001389b9581c13fe5fc357a0f89234f85af4215d > > netfilter: {ip,ip6,arp}_tables: avoid lockdep false positive > > > > for attempts to let BH fly ... > > > > Unfortunately, lockdep rules :( > > Is the lockdep check a false positive? Yes its a false positive. > Could I run with 24b36f0193 in production, to fix my problem? > Yes, but you could also run a kernel with both commits: We now block BH for each cpu we are "summing", instead of blocking BH for the whole 16 possible cpus summation. (so BH should be blocked for smaller amount of time) > I forgot to mention I run kernel 2.6.35.8-comx01+ (based on Greg's stable kernel tree). > > $ git describe --contains 24b36f019346 > v2.6.36-rc1~571^2~46^2~7 > $ git describe --contains 001389b9581c1 > v2.6.36-rc3~2^2~42 > > > > > What can we do to solve this issue? > > Any ideas how we can proceed? > > Looking closer at the two combined code change, I see that the code path > has been improved (a bit), as the local BH is only disabled inside the > for_each_possible_cpu(cpu). Before local_bh was disabled for the hole > function. Guess I need to reproduce this in my testlab. > Yes, so current kernel is a bit better. Note that even with the 'false positive' problem, we had to blocks BH for the current cpu sum, so the max BH latency is probably the same with or without 001389b9581c13fe5 ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Possible regression: Packet drops during iptables calls 2010-12-14 16:24 ` Eric Dumazet @ 2010-12-16 14:04 ` Jesper Dangaard Brouer 2010-12-16 14:12 ` Eric Dumazet ` (2 more replies) 0 siblings, 3 replies; 26+ messages in thread From: Jesper Dangaard Brouer @ 2010-12-16 14:04 UTC (permalink / raw) To: Arnaldo Carvalho de Melo, Steven Rostedt, Eric Dumazet, Alexander Duyck Cc: Stephen Hemminger, netfilter-devel, netdev, Peter P Waskiewicz Jr On Tue, 2010-12-14 at 17:24 +0100, Eric Dumazet wrote: > Le mardi 14 décembre 2010 à 17:09 +0100, Jesper Dangaard Brouer a écrit : > > On Tue, 2010-12-14 at 16:31 +0100, Eric Dumazet wrote: > > > Le mardi 14 décembre 2010 à 15:46 +0100, Jesper Dangaard Brouer a > > > écrit : > > > > I'm experiencing RX packet drops during call to iptables, on my > > > > production servers. > > > > > > > > Further investigations showed, that its only the CPU executing the > > > > iptables command that experience packet drops!? Thus, a quick fix was > > > > to force the iptables command to run on one of the idle CPUs (This can > > > > be achieved with the "taskset" command). > > > > > > > > I have a 2x Xeon 5550 CPU system, thus 16 CPUs (with HT enabled). We > > > > only use 8 CPUs due to a multiqueue limitation of 8 queues in the > > > > 1Gbit/s NICs (82576 chips). CPUs 0 to 7 is assigned for packet > > > > processing via smp_affinity. > > > > > > > > Can someone explain why the packet drops only occur on the CPU > > > > executing the iptables command? > > > > > > > > > > It blocks BH > > > > > > take a look at commits : > > > > > > 24b36f0193467fa727b85b4c004016a8dae999b9 > > > netfilter: {ip,ip6,arp}_tables: dont block bottom half more than > > > necessary > > > > > > 001389b9581c13fe5fc357a0f89234f85af4215d > > > netfilter: {ip,ip6,arp}_tables: avoid lockdep false positiv <... cut ...> > > > > Looking closer at the two combined code change, I see that the code path > > has been improved (a bit), as the local BH is only disabled inside the > > for_each_possible_cpu(cpu). Before local_bh was disabled for the hole > > function. Guess I need to reproduce this in my testlab. To do some further investigation into the unfortunate behavior of the iptables get_counters() function I started to use "ftrace". This is a really useful tool (thanks Steven Rostedt). # Select the tracer "function_graph" echo function_graph > /sys/kernel/debug/tracing/current_tracer # Limit the number of functions we look at: echo local_bh_\* > /sys/kernel/debug/tracing/set_ftrace_filter echo get_counters >> /sys/kernel/debug/tracing/set_ftrace_filter # Enable tracing while calling iptables cd /sys/kernel/debug/tracing echo 0 > trace echo 1 > tracing_enabled; taskset 1 iptables -vnL > /dev/null ; echo 0 > tracing_enabled cat trace | less The reduced output: # tracer: function_graph # # CPU DURATION FUNCTION CALLS # | | | | | | | 2) 2.772 us | local_bh_disable(); .... 0) 0.228 us | local_bh_enable(); 0) | get_counters() { 0) 0.232 us | local_bh_disable(); 0) 7.919 us | local_bh_enable(); 0) ! 109467.1 us | } 0) 2.344 us | local_bh_disable(); The output shows that we spend no less than 100 ms with local BH disabled. So, no wonder that this causes packet drops in the NIC (attached to this CPU). The iptables ruleset in question is also very large; it contains: Chains: 20929 Rules: 81239 The vmalloc size is approx 19 MB (19,820,544 bytes) (see /proc/vmallocinfo). 
Looking through vmallocinfo I realized that even though I only have 16 CPUs, there are 32 allocated rulesets "xt_alloc_table_info" (for the filter table). Thus, I have approx 634 MB of iptables filter rules in the kernel, half of which is totally unused. Guess this is because we use "for_each_possible_cpu" instead of "for_each_online_cpu". (Feel free to fix this, or point me to some documentation of this CPU hotplug stuff... I see we are missing get_cpu() and put_cpu() in a lot of places). The GOOD NEWS is that moving the local BH disable section into the "for_each_possible_cpu" loop fixed the problem with packet drops during iptables calls. I wanted to profile with ftrace on the new code, but I cannot get the measurement I want. Perhaps Steven or Acme can help? Now I want to measure the time used between local_bh_disable() and local_bh_enable(), within the loop. I cannot figure out how to do that. The new trace looks almost the same as before, just with a lot of local_bh_* calls inside the get_counters() function call. My guess is that the time spent is: 100 ms / 32 = 3.125 ms. Now I just need to calculate how large a NIC buffer I need to buffer 3.125 ms at 1 Gbit/s. 3.125 ms * 1 Gbit/s = 390625 bytes Can this be correct? How much buffer does each queue have in the 82576 NIC? (I hope Alexander Duyck can answer this one?) -- Med venlig hilsen / Best regards Jesper Brouer ComX Networks A/S Linux Network Kernel Developer Cand. Scient Datalog / MSc.CS Author of http://adsl-optimizer.dk LinkedIn: http://www.linkedin.com/in/brouer -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 26+ messages in thread
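A quick sanity check on that buffer figure (assuming the link really runs at the full 1 Gbit/s and, for the frame count, full-size 1500-byte frames; both are illustrative assumptions, not measurements from the thread):

  1 Gbit/s = 125,000,000 bytes/s
  125,000,000 bytes/s * 0.003125 s = 390,625 bytes  (~381 KiB)
  390,625 bytes / 1500 bytes/frame = ~260 full-size frames

So 390625 bytes is arithmetically correct: for each 3.125 ms stall the affected RX queue has to absorb on the order of a few hundred full-size frames, and more if the traffic mix contains smaller packets.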
* Re: Possible regression: Packet drops during iptables calls 2010-12-16 14:04 ` Jesper Dangaard Brouer @ 2010-12-16 14:12 ` Eric Dumazet 2010-12-16 14:24 ` Jesper Dangaard Brouer 2010-12-16 14:13 ` Possible regression: Packet drops during iptables calls Eric Dumazet 2010-12-16 14:20 ` Steven Rostedt 2 siblings, 1 reply; 26+ messages in thread From: Eric Dumazet @ 2010-12-16 14:12 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Arnaldo Carvalho de Melo, Steven Rostedt, Alexander Duyck, Stephen Hemminger, netfilter-devel, netdev, Peter P Waskiewicz Jr Le jeudi 16 décembre 2010 à 15:04 +0100, Jesper Dangaard Brouer a écrit : > On Tue, 2010-12-14 at 17:24 +0100, Eric Dumazet wrote: > > Le mardi 14 décembre 2010 à 17:09 +0100, Jesper Dangaard Brouer a écrit : > > > On Tue, 2010-12-14 at 16:31 +0100, Eric Dumazet wrote: > > > > Le mardi 14 décembre 2010 à 15:46 +0100, Jesper Dangaard Brouer a > > > > écrit : > > > > > I'm experiencing RX packet drops during call to iptables, on my > > > > > production servers. > > > > > > > > > > Further investigations showed, that its only the CPU executing the > > > > > iptables command that experience packet drops!? Thus, a quick fix was > > > > > to force the iptables command to run on one of the idle CPUs (This can > > > > > be achieved with the "taskset" command). > > > > > > > > > > I have a 2x Xeon 5550 CPU system, thus 16 CPUs (with HT enabled). We > > > > > only use 8 CPUs due to a multiqueue limitation of 8 queues in the > > > > > 1Gbit/s NICs (82576 chips). CPUs 0 to 7 is assigned for packet > > > > > processing via smp_affinity. > > > > > > > > > > Can someone explain why the packet drops only occur on the CPU > > > > > executing the iptables command? > > > > > > > > > > > > > It blocks BH > > > > > > > > take a look at commits : > > > > > > > > 24b36f0193467fa727b85b4c004016a8dae999b9 > > > > netfilter: {ip,ip6,arp}_tables: dont block bottom half more than > > > > necessary > > > > > > > > 001389b9581c13fe5fc357a0f89234f85af4215d > > > > netfilter: {ip,ip6,arp}_tables: avoid lockdep false positiv > <... cut ...> > > > > > > Looking closer at the two combined code change, I see that the code path > > > has been improved (a bit), as the local BH is only disabled inside the > > > for_each_possible_cpu(cpu). Before local_bh was disabled for the hole > > > function. Guess I need to reproduce this in my testlab. > > > To do some further investigation into the unfortunate behavior of the > iptables get_counters() function I started to use "ftrace". This is a > really useful tool (thanks Steven Rostedt). > > # Select the tracer "function_graph" > echo function_graph > /sys/kernel/debug/tracing/current_tracer > > # Limit the number of function we look at: > echo local_bh_\* > /sys/kernel/debug/tracing/set_ftrace_filter > echo get_counters >> /sys/kernel/debug/tracing/set_ftrace_filter > > # Enable tracing while calling iptables > cd /sys/kernel/debug/tracing > echo 0 > trace > echo 1 > tracing_enabled; > taskset 1 iptables -vnL > /dev/null ; > echo 0 > tracing_enabled > cat trace | less > > > The reduced output: > > # tracer: function_graph > # > # CPU DURATION FUNCTION CALLS > # | | | | | | | > 2) 2.772 us | local_bh_disable(); > .... > 0) 0.228 us | local_bh_enable(); > 0) | get_counters() { > 0) 0.232 us | local_bh_disable(); > 0) 7.919 us | local_bh_enable(); > 0) ! 109467.1 us | } > 0) 2.344 us | local_bh_disable(); > > > The output show that we spend no less that 100 ms with local BH > disabled. 
So, no wonder that this causes packet drops in the NIC > (attached to this CPU). > > My iptables rule set in question is also very large, it contains: > Chains: 20929 > Rules: 81239 > > The vmalloc size is approx 19 MB (19.820.544 bytes) (see > /proc/vmallocinfo). Looking through vmallocinfo I realized that > even-though I only have 16 CPUs, there is 32 allocated rulesets > "xt_alloc_table_info" (for the filter table). Thus, I have approx > 634MB iptables filter rules in the kernel, half of which is totally > unused. Boot your machine with : "maxcpus=16 possible_cpus=16", it will be much better ;) > > Guess this is because we use: "for_each_possible_cpu" instead of > "for_each_online_cpu". (Feel free to fix this, or point me to some > documentation of this CPU hotplug stuff... I see we are missing > get_cpu() and put_cpu() a lot of places). Are you really using cpu hotplug ? If not, the "maxcpus=16 possible_cpus=16" trick should be enough for you. > > > The GOOD NEWS, is that moving the local BH disable section into the > "for_each_possible_cpu" fixed the problem with packet drops during > iptables calls. > > I wanted to profile with ftrace on the new code, but I cannot get the > measurement I want. Perhaps Steven or Acme can help? > > Now I want to measure the time used between the local_bh_disable() and > local_bh_enable, within the loop. I cannot figure out howto do that? > The new trace looks almost the same as before, just a lot of > local_bh_* inside the get_counters() function call. > > Guess is that the time spend is: 100 ms / 32 = 3.125 ms. > yes, approximatly. In order to accelerate, you could eventually pre-fill cpu cache before the local_bh_disable() (just reading the table). So that critical section is short, because mostly in your cpu cache. > Now I just need to calculate, how large a NIC buffer I need to buffer > 3.125 ms at 1Gbit/s. > > 3.125 ms * 1Gbit/s = 390625 bytes > > Can this be correct? > > How much buffer does each queue have in the 82576 NIC? > (Hope Alexander Duyck can answer this one?) > -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Possible regression: Packet drops during iptables calls 2010-12-16 14:12 ` Eric Dumazet @ 2010-12-16 14:24 ` Jesper Dangaard Brouer 2010-12-16 14:29 ` Eric Dumazet 0 siblings, 1 reply; 26+ messages in thread From: Jesper Dangaard Brouer @ 2010-12-16 14:24 UTC (permalink / raw) To: Eric Dumazet Cc: Arnaldo Carvalho de Melo, Steven Rostedt, Alexander Duyck, Stephen Hemminger, netfilter-devel, netdev, Peter P Waskiewicz Jr On Thu, 2010-12-16 at 15:12 +0100, Eric Dumazet wrote: > Le jeudi 16 décembre 2010 à 15:04 +0100, Jesper Dangaard Brouer a > > The vmalloc size is approx 19 MB (19.820.544 bytes) (see > > /proc/vmallocinfo). Looking through vmallocinfo I realized that > > even-though I only have 16 CPUs, there is 32 allocated rulesets > > "xt_alloc_table_info" (for the filter table). Thus, I have approx > > 634MB iptables filter rules in the kernel, half of which is totally > > unused. > > Boot your machine with : "maxcpus=16 possible_cpus=16", it will be much > better ;) Good trick, I'll use that. > > Guess this is because we use: "for_each_possible_cpu" instead of > > "for_each_online_cpu". (Feel free to fix this, or point me to some > > documentation of this CPU hotplug stuff... I see we are missing > > get_cpu() and put_cpu() a lot of places). > > Are you really using cpu hotplug ? If not, the "maxcpus=16 > possible_cpus=16" trick should be enough for you. No, I'm not using hotplug CPUs. It's just a pity that we waste kernel memory on this for everyone who does not know the "maxcpus=16 possible_cpus=16" trick... But as I don't have a hotplug CPU system, I have no chance of testing an eventual code fix/patch. > > > > In order to accelerate, you could eventually pre-fill cpu cache before > the local_bh_disable() (just reading the table). So that critical > section is short, because mostly in your cpu cache. In my case I think this will not help. I'll kill the cache anyway, as the ruleset is 19 MB and my CPU cache is 8 MB. -- Med venlig hilsen / Best regards Jesper Brouer ComX Networks A/S Linux Network Kernel Developer Cand. Scient Datalog / MSc.CS Author of http://adsl-optimizer.dk LinkedIn: http://www.linkedin.com/in/brouer -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Possible regression: Packet drops during iptables calls 2010-12-16 14:24 ` Jesper Dangaard Brouer @ 2010-12-16 14:29 ` Eric Dumazet 2010-12-16 15:02 ` Eric Dumazet 0 siblings, 1 reply; 26+ messages in thread From: Eric Dumazet @ 2010-12-16 14:29 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Arnaldo Carvalho de Melo, Steven Rostedt, Alexander Duyck, Stephen Hemminger, netfilter-devel, netdev, Peter P Waskiewicz Jr Le jeudi 16 décembre 2010 à 15:24 +0100, Jesper Dangaard Brouer a écrit : > In my case I think this will not help. I'll kill the cache anyways, as > the ruleset is 19MB and my CPU cache is 8MB. > > Yep ;) By the way, you speak of a 'possible regression', but we always masked BH while doing get_counters(). Only very recent kernels are masking them for each unit (cpu) of work. There was an attempt to use a lockless read for each counter (using a seqlock), but it was not completed. I guess we could do something to resurrect this idea. -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 26+ messages in thread
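The lockless-read idea Eric refers to, shown in isolation before the actual patches below (a minimal sketch of the generic Linux seqcount pattern; the posted patches differ in detail, and the per-cpu variable name counters_seq is simply the one Eric's first draft uses): the packet path, which is already serialized per-cpu, bumps a sequence count around its counter update, and the dump path retries its unlocked read of the 64-bit pair whenever the sequence changed underneath it, so it never needs to disable BH:

  /* Minimal sketch of the seqcount idea, not the exact posted patch.
   * Writer runs in the packet path (softirq, per-cpu, already serialized);
   * reader runs in process context from the "iptables -L" dump path.
   */
  static DEFINE_PER_CPU(seqcount_t, counters_seq);

  /* writer side, in ipt_do_table() */
  seqcount_t *myseq = &__get_cpu_var(counters_seq);

  write_seqcount_begin(myseq);
  ADD_COUNTER(e->counters, skb->len, 1);     /* 2 x u64 update, not atomic */
  write_seqcount_end(myseq);

  /* reader side, in get_counters(), once per rule and per cpu */
  seqcount_t *seq = &per_cpu(counters_seq, cpu);
  u64 bcnt, pcnt;
  unsigned int start;

  do {
          start = read_seqcount_begin(seq);  /* snapshot the sequence      */
          bcnt  = iter->counters.bcnt;
          pcnt  = iter->counters.pcnt;
  } while (read_seqcount_retry(seq, start)); /* retry if a writer ran      */

  ADD_COUNTER(counters[i], bcnt, pcnt);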
* Re: Possible regression: Packet drops during iptables calls 2010-12-16 14:29 ` Eric Dumazet @ 2010-12-16 15:02 ` Eric Dumazet 2010-12-16 16:07 ` [PATCH net-next-2.6] netfilter: ip_tables: dont block BH while reading counters Eric Dumazet 0 siblings, 1 reply; 26+ messages in thread From: Eric Dumazet @ 2010-12-16 15:02 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Arnaldo Carvalho de Melo, Steven Rostedt, Alexander Duyck, Stephen Hemminger, netfilter-devel, netdev, Peter P Waskiewicz Jr Le jeudi 16 décembre 2010 à 15:29 +0100, Eric Dumazet a écrit : > Le jeudi 16 décembre 2010 à 15:24 +0100, Jesper Dangaard Brouer a > écrit : > > > In my case I think this will not help. I'll kill the cache anyways, as > > the ruleset is 19MB and my CPU cache is 8MB. > > > > > > Yep ;) > > By the way, you speak of a 'possible regression', but we always masked > BH while doing get_counters(). > > Only very recent kernels are masking them for each unit (cpu) of work. > > There was attempt to use a lockless read for each counter (using a > seqlock), but it was not completed. I guess we could do something to > ressurect this idea. > > Something like following patch : net/ipv4/netfilter/ip_tables.c | 51 +++++++++++++------------------ 1 files changed, 22 insertions(+), 29 deletions(-) diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c index a846d63..ed54f80 100644 --- a/net/ipv4/netfilter/ip_tables.c +++ b/net/ipv4/netfilter/ip_tables.c @@ -293,6 +293,8 @@ struct ipt_entry *ipt_next_entry(const struct ipt_entry *entry) return (void *)entry + entry->next_offset; } +static DEFINE_PER_CPU(seqcount_t, counters_seq); + /* Returns one of the generic firewall policies, like NF_ACCEPT. */ unsigned int ipt_do_table(struct sk_buff *skb, @@ -311,6 +313,7 @@ ipt_do_table(struct sk_buff *skb, unsigned int *stackptr, origptr, cpu; const struct xt_table_info *private; struct xt_action_param acpar; + seqcount_t *seq; /* Initialization */ ip = ip_hdr(skb); @@ -364,7 +367,11 @@ ipt_do_table(struct sk_buff *skb, goto no_match; } + seq = &__get_cpu_var(counters_seq); + /* could be faster if we had this_cpu_write_seqcount_begin() */ + write_seqcount_begin(seq); ADD_COUNTER(e->counters, skb->len, 1); + write_seqcount_end(seq); t = ipt_get_target(e); IP_NF_ASSERT(t->u.kernel.target); @@ -877,6 +884,7 @@ translate_table(struct net *net, struct xt_table_info *newinfo, void *entry0, return ret; } + static void get_counters(const struct xt_table_info *t, struct xt_counters counters[]) @@ -884,42 +892,27 @@ get_counters(const struct xt_table_info *t, struct ipt_entry *iter; unsigned int cpu; unsigned int i; - unsigned int curcpu = get_cpu(); - - /* Instead of clearing (by a previous call to memset()) - * the counters and using adds, we set the counters - * with data used by 'current' CPU. 
- * - * Bottom half has to be disabled to prevent deadlock - * if new softirq were to run and call ipt_do_table - */ - local_bh_disable(); - i = 0; - xt_entry_foreach(iter, t->entries[curcpu], t->size) { - SET_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); - ++i; - } - local_bh_enable(); - /* Processing counters from other cpus, we can let bottom half enabled, - * (preemption is disabled) - */ + + memset(counters, 0, sizeof(struct xt_counters) * t->size); for_each_possible_cpu(cpu) { - if (cpu == curcpu) - continue; + seqcount_t *seq = &per_cpu(counters_seq, cpu); + i = 0; - local_bh_disable(); - xt_info_wrlock(cpu); xt_entry_foreach(iter, t->entries[cpu], t->size) { - ADD_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); + u64 bcnt, pcnt; + unsigned int start; + + do { + start = read_seqcount_begin(seq); + bcnt = iter->counters.bcnt; + pcnt = iter->counters.pcnt; + } while (read_seqcount_retry(seq, start)); + + ADD_COUNTER(counters[i], bcnt, pcnt); ++i; /* macro does multi eval of i */ } - xt_info_wrunlock(cpu); - local_bh_enable(); } - put_cpu(); } static struct xt_counters *alloc_counters(const struct xt_table *table) -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH net-next-2.6] netfilter: ip_tables: dont block BH while reading counters 2010-12-16 15:02 ` Eric Dumazet @ 2010-12-16 16:07 ` Eric Dumazet 2010-12-16 16:53 ` [PATCH v2 " Eric Dumazet 0 siblings, 1 reply; 26+ messages in thread From: Eric Dumazet @ 2010-12-16 16:07 UTC (permalink / raw) To: Jesper Dangaard Brouer, Patrick McHardy Cc: Arnaldo Carvalho de Melo, Steven Rostedt, Alexander Duyck, Stephen Hemminger, netfilter-devel, netdev, Peter P Waskiewicz Jr Le jeudi 16 décembre 2010 à 16:02 +0100, Eric Dumazet a écrit : > Le jeudi 16 décembre 2010 à 15:29 +0100, Eric Dumazet a écrit : > > Le jeudi 16 décembre 2010 à 15:24 +0100, Jesper Dangaard Brouer a > > écrit : > > > > > In my case I think this will not help. I'll kill the cache anyways, as > > > the ruleset is 19MB and my CPU cache is 8MB. > > > > > > > > > > Yep ;) > > > > By the way, you speak of a 'possible regression', but we always masked > > BH while doing get_counters(). > > > > Only very recent kernels are masking them for each unit (cpu) of work. > > > > There was attempt to use a lockless read for each counter (using a > > seqlock), but it was not completed. I guess we could do something to > > ressurect this idea. > > > > > > Something like following patch : Here is a tested version : no need for a (buggy in previous patch) memset() if we use vzalloc() Note : We miss a this_cpu_write_seqcount_begin() interface. I'll bug lkml to get it asap. Thanks [PATCH net-next-2.6] netfilter: ip_tables: dont block BH while reading counters Using "iptables -L" with a lot of rules might have a too big BH latency. Jesper mentioned ~6 ms and worried of frame drops. Switch to a per_cpu seqcount scheme, so that taking a snapshot of counters doesnt need to block BH (for this cpu, but also other cpus). This slow down a bit each counter updates, using two extra increments. Note : We miss a this_cpu_write_seqcount_begin() interface, so we are forced to compute the address of our per_cpu seqcount to call write_seqcount_begin(). Once available, overhead will be exactly two "incl %gs:counters_seq" instructions on x86 Reported-by: Jesper Dangaard Brouer <hawk@comx.dk> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> --- net/ipv4/netfilter/ip_tables.c | 52 ++++++++++++------------------- 1 file changed, 21 insertions(+), 31 deletions(-) diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c index a846d63..ae18ead 100644 --- a/net/ipv4/netfilter/ip_tables.c +++ b/net/ipv4/netfilter/ip_tables.c @@ -293,6 +293,8 @@ struct ipt_entry *ipt_next_entry(const struct ipt_entry *entry) return (void *)entry + entry->next_offset; } +static DEFINE_PER_CPU(seqcount_t, counters_seq); + /* Returns one of the generic firewall policies, like NF_ACCEPT. 
*/ unsigned int ipt_do_table(struct sk_buff *skb, @@ -311,6 +313,7 @@ ipt_do_table(struct sk_buff *skb, unsigned int *stackptr, origptr, cpu; const struct xt_table_info *private; struct xt_action_param acpar; + seqcount_t *seq; /* Initialization */ ip = ip_hdr(skb); @@ -364,7 +367,11 @@ ipt_do_table(struct sk_buff *skb, goto no_match; } + seq = &__get_cpu_var(counters_seq); + /* could be faster if we had this_cpu_write_seqcount_begin() */ + write_seqcount_begin(seq); ADD_COUNTER(e->counters, skb->len, 1); + write_seqcount_end(seq); t = ipt_get_target(e); IP_NF_ASSERT(t->u.kernel.target); @@ -884,42 +891,25 @@ get_counters(const struct xt_table_info *t, struct ipt_entry *iter; unsigned int cpu; unsigned int i; - unsigned int curcpu = get_cpu(); - - /* Instead of clearing (by a previous call to memset()) - * the counters and using adds, we set the counters - * with data used by 'current' CPU. - * - * Bottom half has to be disabled to prevent deadlock - * if new softirq were to run and call ipt_do_table - */ - local_bh_disable(); - i = 0; - xt_entry_foreach(iter, t->entries[curcpu], t->size) { - SET_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); - ++i; - } - local_bh_enable(); - /* Processing counters from other cpus, we can let bottom half enabled, - * (preemption is disabled) - */ for_each_possible_cpu(cpu) { - if (cpu == curcpu) - continue; + seqcount_t *seq = &per_cpu(counters_seq, cpu); + i = 0; - local_bh_disable(); - xt_info_wrlock(cpu); xt_entry_foreach(iter, t->entries[cpu], t->size) { - ADD_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); + u64 bcnt, pcnt; + unsigned int start; + + do { + start = read_seqcount_begin(seq); + bcnt = iter->counters.bcnt; + pcnt = iter->counters.pcnt; + } while (read_seqcount_retry(seq, start)); + + ADD_COUNTER(counters[i], bcnt, pcnt); ++i; /* macro does multi eval of i */ } - xt_info_wrunlock(cpu); - local_bh_enable(); } - put_cpu(); } static struct xt_counters *alloc_counters(const struct xt_table *table) @@ -932,7 +922,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) (other than comefrom, which userspace doesn't care about). */ countersize = sizeof(struct xt_counters) * private->number; - counters = vmalloc(countersize); + counters = vzalloc(countersize); if (counters == NULL) return ERR_PTR(-ENOMEM); @@ -1203,7 +1193,7 @@ __do_replace(struct net *net, const char *name, unsigned int valid_hooks, struct ipt_entry *iter; ret = 0; - counters = vmalloc(num_counters * sizeof(struct xt_counters)); + counters = vzalloc(num_counters * sizeof(struct xt_counters)); if (!counters) { ret = -ENOMEM; goto out; -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v2 net-next-2.6] netfilter: ip_tables: dont block BH while reading counters 2010-12-16 16:07 ` [PATCH net-next-2.6] netfilter: ip_tables: dont block BH while reading counters Eric Dumazet @ 2010-12-16 16:53 ` Eric Dumazet 2010-12-16 17:31 ` Stephen Hemminger 0 siblings, 1 reply; 26+ messages in thread From: Eric Dumazet @ 2010-12-16 16:53 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Patrick McHardy, Arnaldo Carvalho de Melo, Steven Rostedt, Alexander Duyck, Stephen Hemminger, netfilter-devel, netdev, Peter P Waskiewicz Jr Le jeudi 16 décembre 2010 à 17:07 +0100, Eric Dumazet a écrit : > Here is a tested version : no need for a (buggy in previous patch) > memset() if we use vzalloc() > > Note : We miss a this_cpu_write_seqcount_begin() interface. > I'll bug lkml to get it asap. Well, we have a faster solution : Add seqcount in "struct xt_info_lock" so that we make the increment pair once per table, not once per rule, and we already have the seq address, so no need for this_cpu_write_seqcount_begin() interface. [PATCH v2 net-next-2.6] netfilter: ip_tables: dont block BH while reading counters Using "iptables -L" with a lot of rules might have a too big BH latency. Jesper mentioned ~6 ms and worried of frame drops. Switch to a per_cpu seqcount scheme, so that taking a snapshot of counters doesnt need to block BH (for this cpu, but also other cpus). Reported-by: Jesper Dangaard Brouer <hawk@comx.dk> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> --- include/linux/netfilter/x_tables.h | 9 ++++- net/ipv4/netfilter/ip_tables.c | 45 ++++++++------------------- 2 files changed, 21 insertions(+), 33 deletions(-) diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h index 742bec0..7027762 100644 --- a/include/linux/netfilter/x_tables.h +++ b/include/linux/netfilter/x_tables.h @@ -473,6 +473,7 @@ extern void xt_free_table_info(struct xt_table_info *info); */ struct xt_info_lock { spinlock_t lock; + seqcount_t seq; unsigned char readers; }; DECLARE_PER_CPU(struct xt_info_lock, xt_info_locks); @@ -496,16 +497,20 @@ static inline void xt_info_rdlock_bh(void) local_bh_disable(); lock = &__get_cpu_var(xt_info_locks); - if (likely(!lock->readers++)) + if (likely(!lock->readers++)) { spin_lock(&lock->lock); + write_seqcount_begin(&lock->seq); + } } static inline void xt_info_rdunlock_bh(void) { struct xt_info_lock *lock = &__get_cpu_var(xt_info_locks); - if (likely(!--lock->readers)) + if (likely(!--lock->readers)) { + write_seqcount_end(&lock->seq); spin_unlock(&lock->lock); + } local_bh_enable(); } diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c index a846d63..7fe3d7c 100644 --- a/net/ipv4/netfilter/ip_tables.c +++ b/net/ipv4/netfilter/ip_tables.c @@ -884,42 +884,25 @@ get_counters(const struct xt_table_info *t, struct ipt_entry *iter; unsigned int cpu; unsigned int i; - unsigned int curcpu = get_cpu(); - - /* Instead of clearing (by a previous call to memset()) - * the counters and using adds, we set the counters - * with data used by 'current' CPU. 
- * - * Bottom half has to be disabled to prevent deadlock - * if new softirq were to run and call ipt_do_table - */ - local_bh_disable(); - i = 0; - xt_entry_foreach(iter, t->entries[curcpu], t->size) { - SET_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); - ++i; - } - local_bh_enable(); - /* Processing counters from other cpus, we can let bottom half enabled, - * (preemption is disabled) - */ for_each_possible_cpu(cpu) { - if (cpu == curcpu) - continue; + seqcount_t *seq = &per_cpu(xt_info_locks, cpu).seq; + i = 0; - local_bh_disable(); - xt_info_wrlock(cpu); xt_entry_foreach(iter, t->entries[cpu], t->size) { - ADD_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); + u64 bcnt, pcnt; + unsigned int start; + + do { + start = read_seqcount_begin(seq); + bcnt = iter->counters.bcnt; + pcnt = iter->counters.pcnt; + } while (read_seqcount_retry(seq, start)); + + ADD_COUNTER(counters[i], bcnt, pcnt); ++i; /* macro does multi eval of i */ } - xt_info_wrunlock(cpu); - local_bh_enable(); } - put_cpu(); } static struct xt_counters *alloc_counters(const struct xt_table *table) @@ -932,7 +915,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) (other than comefrom, which userspace doesn't care about). */ countersize = sizeof(struct xt_counters) * private->number; - counters = vmalloc(countersize); + counters = vzalloc(countersize); if (counters == NULL) return ERR_PTR(-ENOMEM); @@ -1203,7 +1186,7 @@ __do_replace(struct net *net, const char *name, unsigned int valid_hooks, struct ipt_entry *iter; ret = 0; - counters = vmalloc(num_counters * sizeof(struct xt_counters)); + counters = vzalloc(num_counters * sizeof(struct xt_counters)); if (!counters) { ret = -ENOMEM; goto out; ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH v2 net-next-2.6] netfilter: ip_tables: dont block BH while reading counters 2010-12-16 16:53 ` [PATCH v2 " Eric Dumazet @ 2010-12-16 17:31 ` Stephen Hemminger 2010-12-16 17:53 ` [PATCH v3 net-next-2.6] netfilter: x_tables: " Eric Dumazet 0 siblings, 1 reply; 26+ messages in thread From: Stephen Hemminger @ 2010-12-16 17:31 UTC (permalink / raw) To: Eric Dumazet Cc: Jesper Dangaard Brouer, Patrick McHardy, Arnaldo Carvalho de Melo, Steven Rostedt, Alexander Duyck, netfilter-devel, netdev, Peter P Waskiewicz Jr On Thu, 16 Dec 2010 17:53:56 +0100 Eric Dumazet <eric.dumazet@gmail.com> wrote: > spinlock_t lock; > + seqcount_t seq; Since lock and seqcount_t are associated together, why isn't this a seqlock instead? -- ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH v3 net-next-2.6] netfilter: x_tables: dont block BH while reading counters 2010-12-16 17:31 ` Stephen Hemminger @ 2010-12-16 17:53 ` Eric Dumazet 2010-12-16 17:57 ` Stephen Hemminger ` (2 more replies) 0 siblings, 3 replies; 26+ messages in thread From: Eric Dumazet @ 2010-12-16 17:53 UTC (permalink / raw) To: Stephen Hemminger Cc: Jesper Dangaard Brouer, Patrick McHardy, Arnaldo Carvalho de Melo, Steven Rostedt, Alexander Duyck, netfilter-devel, netdev, Peter P Waskiewicz Jr Le jeudi 16 décembre 2010 à 09:31 -0800, Stephen Hemminger a écrit : > On Thu, 16 Dec 2010 17:53:56 +0100 > Eric Dumazet <eric.dumazet@gmail.com> wrote: > > > spinlock_t lock; > > + seqcount_t seq; > > Since lock and seqcount_t are associated together, why isn't this a seqlock instead? > Absolutely, I found this a bit later while preparing final patch. Thanks ! [PATCH v3 net-next-2.6] netfilter: x_tables: dont block BH while reading counters Using "iptables -L" with a lot of rules might have a too big BH latency. Jesper mentioned ~6 ms and worried of frame drops. Switch to a per_cpu seqcount scheme, so that taking a snapshot of counters doesnt need to block BH (for this cpu, but also other cpus). Reported-by: Jesper Dangaard Brouer <hawk@comx.dk> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> --- include/linux/netfilter/x_tables.h | 10 +++--- net/ipv4/netfilter/arp_tables.c | 45 ++++++++------------------- net/ipv4/netfilter/ip_tables.c | 45 ++++++++------------------- net/ipv6/netfilter/ip6_tables.c | 45 ++++++++------------------- 4 files changed, 47 insertions(+), 98 deletions(-) diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h index 742bec0..6712e71 100644 --- a/include/linux/netfilter/x_tables.h +++ b/include/linux/netfilter/x_tables.h @@ -472,7 +472,7 @@ extern void xt_free_table_info(struct xt_table_info *info); * necessary for reading the counters. 
*/ struct xt_info_lock { - spinlock_t lock; + seqlock_t lock; unsigned char readers; }; DECLARE_PER_CPU(struct xt_info_lock, xt_info_locks); @@ -497,7 +497,7 @@ static inline void xt_info_rdlock_bh(void) local_bh_disable(); lock = &__get_cpu_var(xt_info_locks); if (likely(!lock->readers++)) - spin_lock(&lock->lock); + write_seqlock(&lock->lock); } static inline void xt_info_rdunlock_bh(void) @@ -505,7 +505,7 @@ static inline void xt_info_rdunlock_bh(void) struct xt_info_lock *lock = &__get_cpu_var(xt_info_locks); if (likely(!--lock->readers)) - spin_unlock(&lock->lock); + write_sequnlock(&lock->lock); local_bh_enable(); } @@ -516,12 +516,12 @@ static inline void xt_info_rdunlock_bh(void) */ static inline void xt_info_wrlock(unsigned int cpu) { - spin_lock(&per_cpu(xt_info_locks, cpu).lock); + write_seqlock(&per_cpu(xt_info_locks, cpu).lock); } static inline void xt_info_wrunlock(unsigned int cpu) { - spin_unlock(&per_cpu(xt_info_locks, cpu).lock); + write_sequnlock(&per_cpu(xt_info_locks, cpu).lock); } /* diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c index 3fac340..e855fff 100644 --- a/net/ipv4/netfilter/arp_tables.c +++ b/net/ipv4/netfilter/arp_tables.c @@ -710,42 +710,25 @@ static void get_counters(const struct xt_table_info *t, struct arpt_entry *iter; unsigned int cpu; unsigned int i; - unsigned int curcpu = get_cpu(); - - /* Instead of clearing (by a previous call to memset()) - * the counters and using adds, we set the counters - * with data used by 'current' CPU - * - * Bottom half has to be disabled to prevent deadlock - * if new softirq were to run and call ipt_do_table - */ - local_bh_disable(); - i = 0; - xt_entry_foreach(iter, t->entries[curcpu], t->size) { - SET_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); - ++i; - } - local_bh_enable(); - /* Processing counters from other cpus, we can let bottom half enabled, - * (preemption is disabled) - */ for_each_possible_cpu(cpu) { - if (cpu == curcpu) - continue; + seqlock_t *lock = &per_cpu(xt_info_locks, cpu).lock; + i = 0; - local_bh_disable(); - xt_info_wrlock(cpu); xt_entry_foreach(iter, t->entries[cpu], t->size) { - ADD_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); + u64 bcnt, pcnt; + unsigned int start; + + do { + start = read_seqbegin(lock); + bcnt = iter->counters.bcnt; + pcnt = iter->counters.pcnt; + } while (read_seqretry(lock, start)); + + ADD_COUNTER(counters[i], bcnt, pcnt); ++i; } - xt_info_wrunlock(cpu); - local_bh_enable(); } - put_cpu(); } static struct xt_counters *alloc_counters(const struct xt_table *table) @@ -759,7 +742,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) * about). 
*/ countersize = sizeof(struct xt_counters) * private->number; - counters = vmalloc(countersize); + counters = vzalloc(countersize); if (counters == NULL) return ERR_PTR(-ENOMEM); @@ -1007,7 +990,7 @@ static int __do_replace(struct net *net, const char *name, struct arpt_entry *iter; ret = 0; - counters = vmalloc(num_counters * sizeof(struct xt_counters)); + counters = vzalloc(num_counters * sizeof(struct xt_counters)); if (!counters) { ret = -ENOMEM; goto out; diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c index a846d63..652efea 100644 --- a/net/ipv4/netfilter/ip_tables.c +++ b/net/ipv4/netfilter/ip_tables.c @@ -884,42 +884,25 @@ get_counters(const struct xt_table_info *t, struct ipt_entry *iter; unsigned int cpu; unsigned int i; - unsigned int curcpu = get_cpu(); - - /* Instead of clearing (by a previous call to memset()) - * the counters and using adds, we set the counters - * with data used by 'current' CPU. - * - * Bottom half has to be disabled to prevent deadlock - * if new softirq were to run and call ipt_do_table - */ - local_bh_disable(); - i = 0; - xt_entry_foreach(iter, t->entries[curcpu], t->size) { - SET_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); - ++i; - } - local_bh_enable(); - /* Processing counters from other cpus, we can let bottom half enabled, - * (preemption is disabled) - */ for_each_possible_cpu(cpu) { - if (cpu == curcpu) - continue; + seqlock_t *lock = &per_cpu(xt_info_locks, cpu).lock; + i = 0; - local_bh_disable(); - xt_info_wrlock(cpu); xt_entry_foreach(iter, t->entries[cpu], t->size) { - ADD_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); + u64 bcnt, pcnt; + unsigned int start; + + do { + start = read_seqbegin(lock); + bcnt = iter->counters.bcnt; + pcnt = iter->counters.pcnt; + } while (read_seqretry(lock, start)); + + ADD_COUNTER(counters[i], bcnt, pcnt); ++i; /* macro does multi eval of i */ } - xt_info_wrunlock(cpu); - local_bh_enable(); } - put_cpu(); } static struct xt_counters *alloc_counters(const struct xt_table *table) @@ -932,7 +915,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) (other than comefrom, which userspace doesn't care about). 
*/ countersize = sizeof(struct xt_counters) * private->number; - counters = vmalloc(countersize); + counters = vzalloc(countersize); if (counters == NULL) return ERR_PTR(-ENOMEM); @@ -1203,7 +1186,7 @@ __do_replace(struct net *net, const char *name, unsigned int valid_hooks, struct ipt_entry *iter; ret = 0; - counters = vmalloc(num_counters * sizeof(struct xt_counters)); + counters = vzalloc(num_counters * sizeof(struct xt_counters)); if (!counters) { ret = -ENOMEM; goto out; diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c index 4555823..7d227c6 100644 --- a/net/ipv6/netfilter/ip6_tables.c +++ b/net/ipv6/netfilter/ip6_tables.c @@ -897,42 +897,25 @@ get_counters(const struct xt_table_info *t, struct ip6t_entry *iter; unsigned int cpu; unsigned int i; - unsigned int curcpu = get_cpu(); - - /* Instead of clearing (by a previous call to memset()) - * the counters and using adds, we set the counters - * with data used by 'current' CPU - * - * Bottom half has to be disabled to prevent deadlock - * if new softirq were to run and call ipt_do_table - */ - local_bh_disable(); - i = 0; - xt_entry_foreach(iter, t->entries[curcpu], t->size) { - SET_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); - ++i; - } - local_bh_enable(); - /* Processing counters from other cpus, we can let bottom half enabled, - * (preemption is disabled) - */ for_each_possible_cpu(cpu) { - if (cpu == curcpu) - continue; + seqlock_t *lock = &per_cpu(xt_info_locks, cpu).lock; + i = 0; - local_bh_disable(); - xt_info_wrlock(cpu); xt_entry_foreach(iter, t->entries[cpu], t->size) { - ADD_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); + u64 bcnt, pcnt; + unsigned int start; + + do { + start = read_seqbegin(lock); + bcnt = iter->counters.bcnt; + pcnt = iter->counters.pcnt; + } while (read_seqretry(lock, start)); + + ADD_COUNTER(counters[i], bcnt, pcnt); ++i; } - xt_info_wrunlock(cpu); - local_bh_enable(); } - put_cpu(); } static struct xt_counters *alloc_counters(const struct xt_table *table) @@ -945,7 +928,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) (other than comefrom, which userspace doesn't care about). */ countersize = sizeof(struct xt_counters) * private->number; - counters = vmalloc(countersize); + counters = vzalloc(countersize); if (counters == NULL) return ERR_PTR(-ENOMEM); @@ -1216,7 +1199,7 @@ __do_replace(struct net *net, const char *name, unsigned int valid_hooks, struct ip6t_entry *iter; ret = 0; - counters = vmalloc(num_counters * sizeof(struct xt_counters)); + counters = vzalloc(num_counters * sizeof(struct xt_counters)); if (!counters) { ret = -ENOMEM; goto out; -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH v3 net-next-2.6] netfilter: x_tables: dont block BH while reading counters 2010-12-16 17:53 ` [PATCH v3 net-next-2.6] netfilter: x_tables: " Eric Dumazet @ 2010-12-16 17:57 ` Stephen Hemminger 2010-12-16 19:58 ` Eric Dumazet 2010-12-16 17:57 ` Stephen Hemminger 2010-12-18 4:29 ` [PATCH v4 " Eric Dumazet 2 siblings, 1 reply; 26+ messages in thread From: Stephen Hemminger @ 2010-12-16 17:57 UTC (permalink / raw) To: Eric Dumazet Cc: Jesper Dangaard Brouer, Patrick McHardy, Arnaldo Carvalho de Melo, Steven Rostedt, Alexander Duyck, netfilter-devel, netdev, Peter P Waskiewicz Jr On Thu, 16 Dec 2010 18:53:06 +0100 Eric Dumazet <eric.dumazet@gmail.com> wrote: > @@ -759,7 +742,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) > * about). > */ > countersize = sizeof(struct xt_counters) * private->number; > - counters = vmalloc(countersize); > + counters = vzalloc(countersize); > > if (counters == NULL) > return ERR_PTR(-ENOMEM); > @@ -1007,7 +990,7 @@ static int __do_replace(struct net *net, const char *name, > struct arpt_entry *iter; > > ret = 0; > - counters = vmalloc(num_counters * sizeof(struct xt_counters)); > + counters = vzalloc(num_counters * sizeof(struct xt_counters)); > if (!counters) { > ret = -ENOMEM; > goto out; This seems like a different and unrelated change. -- ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 net-next-2.6] netfilter: x_tables: dont block BH while reading counters 2010-12-16 17:57 ` Stephen Hemminger @ 2010-12-16 19:58 ` Eric Dumazet 2010-12-16 20:12 ` Stephen Hemminger 0 siblings, 1 reply; 26+ messages in thread From: Eric Dumazet @ 2010-12-16 19:58 UTC (permalink / raw) To: Stephen Hemminger Cc: Jesper Dangaard Brouer, Patrick McHardy, Arnaldo Carvalho de Melo, Steven Rostedt, Alexander Duyck, netfilter-devel, netdev, Peter P Waskiewicz Jr Le jeudi 16 décembre 2010 à 09:57 -0800, Stephen Hemminger a écrit : > On Thu, 16 Dec 2010 18:53:06 +0100 > Eric Dumazet <eric.dumazet@gmail.com> wrote: > > > @@ -759,7 +742,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) > > * about). > > */ > > countersize = sizeof(struct xt_counters) * private->number; > > - counters = vmalloc(countersize); > > + counters = vzalloc(countersize); > > > > if (counters == NULL) > > return ERR_PTR(-ENOMEM); > > @@ -1007,7 +990,7 @@ static int __do_replace(struct net *net, const char *name, > > struct arpt_entry *iter; > > > > ret = 0; > > - counters = vmalloc(num_counters * sizeof(struct xt_counters)); > > + counters = vzalloc(num_counters * sizeof(struct xt_counters)); > > if (!counters) { > > ret = -ENOMEM; > > goto out; > > This seems like a different and unrelated change. > Since you later Acked the patch, you probably know that we now provide to get_counters() a zeroed area, since we do a sum for each possible cpu, I am answering anyway for other readers ;) Thanks for reviewing ! -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 net-next-2.6] netfilter: x_tables: dont block BH while reading counters 2010-12-16 19:58 ` Eric Dumazet @ 2010-12-16 20:12 ` Stephen Hemminger 2010-12-16 20:40 ` Eric Dumazet 0 siblings, 1 reply; 26+ messages in thread From: Stephen Hemminger @ 2010-12-16 20:12 UTC (permalink / raw) To: Eric Dumazet Cc: Jesper Dangaard Brouer, Patrick McHardy, Arnaldo Carvalho de Melo, Steven Rostedt, Alexander Duyck, netfilter-devel, netdev, Peter P Waskiewicz Jr On Thu, 16 Dec 2010 20:58:39 +0100 Eric Dumazet <eric.dumazet@gmail.com> wrote: > Le jeudi 16 décembre 2010 à 09:57 -0800, Stephen Hemminger a écrit : > > On Thu, 16 Dec 2010 18:53:06 +0100 > > Eric Dumazet <eric.dumazet@gmail.com> wrote: > > > > > @@ -759,7 +742,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) > > > * about). > > > */ > > > countersize = sizeof(struct xt_counters) * private->number; > > > - counters = vmalloc(countersize); > > > + counters = vzalloc(countersize); > > > > > > if (counters == NULL) > > > return ERR_PTR(-ENOMEM); > > > @@ -1007,7 +990,7 @@ static int __do_replace(struct net *net, const char *name, > > > struct arpt_entry *iter; > > > > > > ret = 0; > > > - counters = vmalloc(num_counters * sizeof(struct xt_counters)); > > > + counters = vzalloc(num_counters * sizeof(struct xt_counters)); > > > if (!counters) { > > > ret = -ENOMEM; > > > goto out; > > > > This seems like a different and unrelated change. > > > > Since you later Acked the patch, you probably know that we now provide > to get_counters() a zeroed area, since we do a sum for each possible > cpu, I am answering anyway for other readers ;) > > Thanks for reviewing ! You changed from: get local cpu counters then sum other cpus to: sum all cpu's. This is fine, and will give the same answer. -- ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 net-next-2.6] netfilter: x_tables: dont block BH while reading counters 2010-12-16 20:12 ` Stephen Hemminger @ 2010-12-16 20:40 ` Eric Dumazet 0 siblings, 0 replies; 26+ messages in thread From: Eric Dumazet @ 2010-12-16 20:40 UTC (permalink / raw) To: Stephen Hemminger Cc: Jesper Dangaard Brouer, Patrick McHardy, Arnaldo Carvalho de Melo, Steven Rostedt, Alexander Duyck, netfilter-devel, netdev, Peter P Waskiewicz Jr Le jeudi 16 décembre 2010 à 12:12 -0800, Stephen Hemminger a écrit : > You changed from: > get local cpu counters > then sum other cpus > to: > sum all cpu's. > > This is fine, and will give the same answer. > Yes, I did that because the fetch of individual counters is now more complex. I tried to make code smaller and less complex. Eventually we are now allowed to be preempted and even migrated to another cpu during the process. -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 net-next-2.6] netfilter: x_tables: dont block BH while reading counters 2010-12-16 17:53 ` [PATCH v3 net-next-2.6] netfilter: x_tables: " Eric Dumazet 2010-12-16 17:57 ` Stephen Hemminger @ 2010-12-16 17:57 ` Stephen Hemminger 2010-12-18 4:29 ` [PATCH v4 " Eric Dumazet 2 siblings, 0 replies; 26+ messages in thread From: Stephen Hemminger @ 2010-12-16 17:57 UTC (permalink / raw) To: Eric Dumazet Cc: Jesper Dangaard Brouer, Patrick McHardy, Arnaldo Carvalho de Melo, Steven Rostedt, Alexander Duyck, netfilter-devel, netdev, Peter P Waskiewicz Jr On Thu, 16 Dec 2010 18:53:06 +0100 Eric Dumazet <eric.dumazet@gmail.com> wrote: > [PATCH v3 net-next-2.6] netfilter: x_tables: dont block BH while reading counters > > Using "iptables -L" with a lot of rules might have a too big BH latency. > Jesper mentioned ~6 ms and worried of frame drops. > > Switch to a per_cpu seqcount scheme, so that taking a snapshot of > counters doesnt need to block BH (for this cpu, but also other cpus). > > Reported-by: Jesper Dangaard Brouer <hawk@comx.dk> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> -- ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH v4 net-next-2.6] netfilter: x_tables: dont block BH while reading counters 2010-12-16 17:53 ` [PATCH v3 net-next-2.6] netfilter: x_tables: " Eric Dumazet 2010-12-16 17:57 ` Stephen Hemminger 2010-12-16 17:57 ` Stephen Hemminger @ 2010-12-18 4:29 ` Eric Dumazet 2010-12-20 13:42 ` Jesper Dangaard Brouer 2011-01-08 16:45 ` Eric Dumazet 2 siblings, 2 replies; 26+ messages in thread From: Eric Dumazet @ 2010-12-18 4:29 UTC (permalink / raw) To: Patrick McHardy Cc: Jesper Dangaard Brouer, netfilter-devel, netdev, Stephen Hemminger Using "iptables -L" with a lot of rules have a too big BH latency. Jesper mentioned ~6 ms and worried of frame drops. Switch to a per_cpu seqlock scheme, so that taking a snapshot of counters doesnt need to block BH (for this cpu, but also other cpus). This adds two increments on seqlock sequence per ipt_do_table() call, its a reasonable cost for allowing "iptables -L" not block BH processing. Reported-by: Jesper Dangaard Brouer <hawk@comx.dk> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> --- v4: I forgot net/netfilter/x_tables.c change in v3 include/linux/netfilter/x_tables.h | 10 +++--- net/ipv4/netfilter/arp_tables.c | 45 ++++++++------------------- net/ipv4/netfilter/ip_tables.c | 45 ++++++++------------------- net/ipv6/netfilter/ip6_tables.c | 45 ++++++++------------------- net/netfilter/x_tables.c | 3 + 5 files changed, 49 insertions(+), 99 deletions(-) diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h index 742bec0..6712e71 100644 --- a/include/linux/netfilter/x_tables.h +++ b/include/linux/netfilter/x_tables.h @@ -472,7 +472,7 @@ extern void xt_free_table_info(struct xt_table_info *info); * necessary for reading the counters. 
*/ struct xt_info_lock { - spinlock_t lock; + seqlock_t lock; unsigned char readers; }; DECLARE_PER_CPU(struct xt_info_lock, xt_info_locks); @@ -497,7 +497,7 @@ static inline void xt_info_rdlock_bh(void) local_bh_disable(); lock = &__get_cpu_var(xt_info_locks); if (likely(!lock->readers++)) - spin_lock(&lock->lock); + write_seqlock(&lock->lock); } static inline void xt_info_rdunlock_bh(void) @@ -505,7 +505,7 @@ static inline void xt_info_rdunlock_bh(void) struct xt_info_lock *lock = &__get_cpu_var(xt_info_locks); if (likely(!--lock->readers)) - spin_unlock(&lock->lock); + write_sequnlock(&lock->lock); local_bh_enable(); } @@ -516,12 +516,12 @@ static inline void xt_info_rdunlock_bh(void) */ static inline void xt_info_wrlock(unsigned int cpu) { - spin_lock(&per_cpu(xt_info_locks, cpu).lock); + write_seqlock(&per_cpu(xt_info_locks, cpu).lock); } static inline void xt_info_wrunlock(unsigned int cpu) { - spin_unlock(&per_cpu(xt_info_locks, cpu).lock); + write_sequnlock(&per_cpu(xt_info_locks, cpu).lock); } /* diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c index 3fac340..e855fff 100644 --- a/net/ipv4/netfilter/arp_tables.c +++ b/net/ipv4/netfilter/arp_tables.c @@ -710,42 +710,25 @@ static void get_counters(const struct xt_table_info *t, struct arpt_entry *iter; unsigned int cpu; unsigned int i; - unsigned int curcpu = get_cpu(); - - /* Instead of clearing (by a previous call to memset()) - * the counters and using adds, we set the counters - * with data used by 'current' CPU - * - * Bottom half has to be disabled to prevent deadlock - * if new softirq were to run and call ipt_do_table - */ - local_bh_disable(); - i = 0; - xt_entry_foreach(iter, t->entries[curcpu], t->size) { - SET_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); - ++i; - } - local_bh_enable(); - /* Processing counters from other cpus, we can let bottom half enabled, - * (preemption is disabled) - */ for_each_possible_cpu(cpu) { - if (cpu == curcpu) - continue; + seqlock_t *lock = &per_cpu(xt_info_locks, cpu).lock; + i = 0; - local_bh_disable(); - xt_info_wrlock(cpu); xt_entry_foreach(iter, t->entries[cpu], t->size) { - ADD_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); + u64 bcnt, pcnt; + unsigned int start; + + do { + start = read_seqbegin(lock); + bcnt = iter->counters.bcnt; + pcnt = iter->counters.pcnt; + } while (read_seqretry(lock, start)); + + ADD_COUNTER(counters[i], bcnt, pcnt); ++i; } - xt_info_wrunlock(cpu); - local_bh_enable(); } - put_cpu(); } static struct xt_counters *alloc_counters(const struct xt_table *table) @@ -759,7 +742,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) * about). 
*/ countersize = sizeof(struct xt_counters) * private->number; - counters = vmalloc(countersize); + counters = vzalloc(countersize); if (counters == NULL) return ERR_PTR(-ENOMEM); @@ -1007,7 +990,7 @@ static int __do_replace(struct net *net, const char *name, struct arpt_entry *iter; ret = 0; - counters = vmalloc(num_counters * sizeof(struct xt_counters)); + counters = vzalloc(num_counters * sizeof(struct xt_counters)); if (!counters) { ret = -ENOMEM; goto out; diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c index a846d63..652efea 100644 --- a/net/ipv4/netfilter/ip_tables.c +++ b/net/ipv4/netfilter/ip_tables.c @@ -884,42 +884,25 @@ get_counters(const struct xt_table_info *t, struct ipt_entry *iter; unsigned int cpu; unsigned int i; - unsigned int curcpu = get_cpu(); - - /* Instead of clearing (by a previous call to memset()) - * the counters and using adds, we set the counters - * with data used by 'current' CPU. - * - * Bottom half has to be disabled to prevent deadlock - * if new softirq were to run and call ipt_do_table - */ - local_bh_disable(); - i = 0; - xt_entry_foreach(iter, t->entries[curcpu], t->size) { - SET_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); - ++i; - } - local_bh_enable(); - /* Processing counters from other cpus, we can let bottom half enabled, - * (preemption is disabled) - */ for_each_possible_cpu(cpu) { - if (cpu == curcpu) - continue; + seqlock_t *lock = &per_cpu(xt_info_locks, cpu).lock; + i = 0; - local_bh_disable(); - xt_info_wrlock(cpu); xt_entry_foreach(iter, t->entries[cpu], t->size) { - ADD_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); + u64 bcnt, pcnt; + unsigned int start; + + do { + start = read_seqbegin(lock); + bcnt = iter->counters.bcnt; + pcnt = iter->counters.pcnt; + } while (read_seqretry(lock, start)); + + ADD_COUNTER(counters[i], bcnt, pcnt); ++i; /* macro does multi eval of i */ } - xt_info_wrunlock(cpu); - local_bh_enable(); } - put_cpu(); } static struct xt_counters *alloc_counters(const struct xt_table *table) @@ -932,7 +915,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) (other than comefrom, which userspace doesn't care about). 
*/ countersize = sizeof(struct xt_counters) * private->number; - counters = vmalloc(countersize); + counters = vzalloc(countersize); if (counters == NULL) return ERR_PTR(-ENOMEM); @@ -1203,7 +1186,7 @@ __do_replace(struct net *net, const char *name, unsigned int valid_hooks, struct ipt_entry *iter; ret = 0; - counters = vmalloc(num_counters * sizeof(struct xt_counters)); + counters = vzalloc(num_counters * sizeof(struct xt_counters)); if (!counters) { ret = -ENOMEM; goto out; diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c index 4555823..7d227c6 100644 --- a/net/ipv6/netfilter/ip6_tables.c +++ b/net/ipv6/netfilter/ip6_tables.c @@ -897,42 +897,25 @@ get_counters(const struct xt_table_info *t, struct ip6t_entry *iter; unsigned int cpu; unsigned int i; - unsigned int curcpu = get_cpu(); - - /* Instead of clearing (by a previous call to memset()) - * the counters and using adds, we set the counters - * with data used by 'current' CPU - * - * Bottom half has to be disabled to prevent deadlock - * if new softirq were to run and call ipt_do_table - */ - local_bh_disable(); - i = 0; - xt_entry_foreach(iter, t->entries[curcpu], t->size) { - SET_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); - ++i; - } - local_bh_enable(); - /* Processing counters from other cpus, we can let bottom half enabled, - * (preemption is disabled) - */ for_each_possible_cpu(cpu) { - if (cpu == curcpu) - continue; + seqlock_t *lock = &per_cpu(xt_info_locks, cpu).lock; + i = 0; - local_bh_disable(); - xt_info_wrlock(cpu); xt_entry_foreach(iter, t->entries[cpu], t->size) { - ADD_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); + u64 bcnt, pcnt; + unsigned int start; + + do { + start = read_seqbegin(lock); + bcnt = iter->counters.bcnt; + pcnt = iter->counters.pcnt; + } while (read_seqretry(lock, start)); + + ADD_COUNTER(counters[i], bcnt, pcnt); ++i; } - xt_info_wrunlock(cpu); - local_bh_enable(); } - put_cpu(); } static struct xt_counters *alloc_counters(const struct xt_table *table) @@ -945,7 +928,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) (other than comefrom, which userspace doesn't care about). */ countersize = sizeof(struct xt_counters) * private->number; - counters = vmalloc(countersize); + counters = vzalloc(countersize); if (counters == NULL) return ERR_PTR(-ENOMEM); @@ -1216,7 +1199,7 @@ __do_replace(struct net *net, const char *name, unsigned int valid_hooks, struct ip6t_entry *iter; ret = 0; - counters = vmalloc(num_counters * sizeof(struct xt_counters)); + counters = vzalloc(num_counters * sizeof(struct xt_counters)); if (!counters) { ret = -ENOMEM; goto out; diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c index 8046350..c942376 100644 --- a/net/netfilter/x_tables.c +++ b/net/netfilter/x_tables.c @@ -1325,7 +1325,8 @@ static int __init xt_init(void) for_each_possible_cpu(i) { struct xt_info_lock *lock = &per_cpu(xt_info_locks, i); - spin_lock_init(&lock->lock); + + seqlock_init(&lock->lock); lock->readers = 0; } ^ permalink raw reply related [flat|nested] 26+ messages in thread
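For readers who want to play with the read side outside the kernel, here is a rough single-writer user-space analogue of the read_seqbegin()/read_seqretry() snapshot loop added above. It uses C11 atomics and fences where the kernel uses seqlock_t with smp_wmb()/smp_rmb(), it ignores the nested-softirq "readers" counting, and all names are made up for the sketch:

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

struct seq_counter {
    atomic_uint seq;              /* even: stable, odd: update in progress */
    _Atomic uint64_t bcnt, pcnt;  /* atomics only to keep the sketch race-free */
};

/* packet-path side: exactly one writer per counter (per cpu in the kernel) */
static void counter_add(struct seq_counter *c, uint64_t bytes)
{
    unsigned int s = atomic_load_explicit(&c->seq, memory_order_relaxed);

    atomic_store_explicit(&c->seq, s + 1, memory_order_relaxed);   /* begin */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&c->bcnt,
        atomic_load_explicit(&c->bcnt, memory_order_relaxed) + bytes,
        memory_order_relaxed);
    atomic_store_explicit(&c->pcnt,
        atomic_load_explicit(&c->pcnt, memory_order_relaxed) + 1,
        memory_order_relaxed);
    atomic_store_explicit(&c->seq, s + 2, memory_order_release);   /* end */
}

/* "iptables -L" side: lockless snapshot, retried if a writer raced with us */
static void counter_snapshot(struct seq_counter *c, uint64_t *b, uint64_t *p)
{
    unsigned int start;

    do {
        start = atomic_load_explicit(&c->seq, memory_order_acquire);
        *b = atomic_load_explicit(&c->bcnt, memory_order_relaxed);
        *p = atomic_load_explicit(&c->pcnt, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
    } while (atomic_load_explicit(&c->seq, memory_order_relaxed) != start ||
             (start & 1));
}

int main(void)
{
    struct seq_counter c = { 0 };
    uint64_t b, p;

    counter_add(&c, 1500);
    counter_snapshot(&c, &b, &p);
    printf("%llu bytes, %llu packets\n",
           (unsigned long long)b, (unsigned long long)p);
    return 0;
}

The reader never blocks the writer; it simply retries the copy if the sequence number changed underneath it, which is why the BH-disabled section disappears from the "iptables -L" path.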
* Re: [PATCH v4 net-next-2.6] netfilter: x_tables: dont block BH while reading counters 2010-12-18 4:29 ` [PATCH v4 " Eric Dumazet @ 2010-12-20 13:42 ` Jesper Dangaard Brouer 2010-12-20 14:45 ` Eric Dumazet 2011-01-08 16:45 ` Eric Dumazet 1 sibling, 1 reply; 26+ messages in thread From: Jesper Dangaard Brouer @ 2010-12-20 13:42 UTC (permalink / raw) To: Eric Dumazet; +Cc: Patrick McHardy, netfilter-devel, netdev, Stephen Hemminger I have tested the patch on 2.6.35. Which implies I also needed to cherry-pick 870f67dcf7ef, which implements the vzalloc() call. (Latest net-next has a problem with my HP CCISS driver/controller or the PCI layout, and will not boot) According to the function_graph trace, the execution time of get_counters() has increased (approx) from 109 ms to 120 ms, which is the expected result. The results are not all positive, but I think its related to the debugging options I have enabled. Because I now see packet drops if my 1Gbit/s pktgen script are sending packet with a packet size below 512 bytes, which is "only" approx 230 kpps (this is 1Gbit/s on my 10G labsetup where I have seen 5 Mpps). There is no packet overruns/drops, iif I run "iptables -vnL > /dev/null" without tracing enabled and only 1Gbit/s pktgen at 512 bytes packets. If I enable tracing while calling iptables I see packet drops/overruns. So I guess this is caused by the tracing overhead. I'll try to rerun my test without all the lock debugging options enabled. -- Med venlig hilsen / Best regards Jesper Brouer ComX Networks A/S Linux Network Kernel Developer Cand. Scient Datalog / MSc.CS Author of http://adsl-optimizer.dk LinkedIn: http://www.linkedin.com/in/brouer On Sat, 2010-12-18 at 05:29 +0100, Eric Dumazet wrote: > Using "iptables -L" with a lot of rules have a too big BH latency. > Jesper mentioned ~6 ms and worried of frame drops. > > Switch to a per_cpu seqlock scheme, so that taking a snapshot of > counters doesnt need to block BH (for this cpu, but also other cpus). > > This adds two increments on seqlock sequence per ipt_do_table() call, > its a reasonable cost for allowing "iptables -L" not block BH > processing. > > Reported-by: Jesper Dangaard Brouer <hawk@comx.dk> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> > Acked-by: Stephen Hemminger <shemminger@vyatta.com> > --- ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v4 net-next-2.6] netfilter: x_tables: dont block BH while reading counters 2010-12-20 13:42 ` Jesper Dangaard Brouer @ 2010-12-20 14:45 ` Eric Dumazet 2010-12-21 16:48 ` Jesper Dangaard Brouer 0 siblings, 1 reply; 26+ messages in thread From: Eric Dumazet @ 2010-12-20 14:45 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Patrick McHardy, netfilter-devel, netdev, Stephen Hemminger Le lundi 20 décembre 2010 à 14:42 +0100, Jesper Dangaard Brouer a écrit : > I have tested the patch on 2.6.35. Which implies I also needed to > cherry-pick 870f67dcf7ef, which implements the vzalloc() call. (Latest > net-next has a problem with my HP CCISS driver/controller or the PCI > layout, and will not boot) > Ah wait, you need to switch from cciss to hpsa driver, I hit same problem some weeks ago ;) (and eventually rename your partitions to /dev/sdaX instead of /dev/cciss/c0d0pX) > According to the function_graph trace, the execution time of > get_counters() has increased (approx) from 109 ms to 120 ms, which is > the expected result. > > The results are not all positive, but I think its related to the > debugging options I have enabled. > > Because I now see packet drops if my 1Gbit/s pktgen script are sending > packet with a packet size below 512 bytes, which is "only" approx 230 > kpps (this is 1Gbit/s on my 10G labsetup where I have seen 5 Mpps). > > There is no packet overruns/drops, iif I run "iptables -vnL > > /dev/null" without tracing enabled and only 1Gbit/s pktgen at 512 > bytes packets. If I enable tracing while calling iptables I see > packet drops/overruns. So I guess this is caused by the tracing > overhead. yes, probably :) > > I'll try to rerun my test without all the lock debugging options > enabled. > Thanks ! -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v4 net-next-2.6] netfilter: x_tables: dont block BH while reading counters 2010-12-20 14:45 ` Eric Dumazet @ 2010-12-21 16:48 ` Jesper Dangaard Brouer 0 siblings, 0 replies; 26+ messages in thread From: Jesper Dangaard Brouer @ 2010-12-21 16:48 UTC (permalink / raw) To: Eric Dumazet; +Cc: Patrick McHardy, netfilter-devel, netdev, Stephen Hemminger On Mon, 2010-12-20 at 15:45 +0100, Eric Dumazet wrote: ... > > There is no packet overruns/drops, iif I run "iptables -vnL > > > /dev/null" without tracing enabled and only 1Gbit/s pktgen at 512 > > bytes packets. If I enable tracing while calling iptables I see > > packet drops/overruns. So I guess this is caused by the tracing > > overhead. > > yes, probably :) > > > > > I'll try to rerun my test without all the lock debugging options > > enabled. Results are much better without the kernel debugging options enabled. I took the .config from production and enabled tracer "function_graph". And applied your patches (plus vzalloc) on top of 2.6.36-stable tree. I can now hit the system with a pktgen at 128 bytes, and see no drops/overruns while running iptables. (This packet load at 128bytes is 822 kpps and 840Mbit/s) (iptables ruleset is the big chains: 20929 rules: 81239). If I reduce the ftrace filter to only track get_counters, I can even run a trace without any drops. echo get_counters > /sys/kernel/debug/tracing/set_ftrace_filter Some trace funny stats on get_counters(), under the packet storm. When running iptables on a CPU not processing packets (via taskset), the execution time is increased to 124ms. If I force iptables to run on a CPU processing packets, the execution time is increased to 1308ms, which is large but the expected behavior. Acked-by: Jesper Dangaard Brouer <hawk@comx.dk> -- Med venlig hilsen / Best regards Jesper Brouer ComX Networks A/S Linux Network Kernel Developer Cand. Scient Datalog / MSc.CS Author of http://adsl-optimizer.dk LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v4 net-next-2.6] netfilter: x_tables: dont block BH while reading counters 2010-12-18 4:29 ` [PATCH v4 " Eric Dumazet 2010-12-20 13:42 ` Jesper Dangaard Brouer @ 2011-01-08 16:45 ` Eric Dumazet 2011-01-09 21:31 ` Pablo Neira Ayuso 1 sibling, 1 reply; 26+ messages in thread From: Eric Dumazet @ 2011-01-08 16:45 UTC (permalink / raw) To: Patrick McHardy, David Miller Cc: Jesper Dangaard Brouer, netfilter-devel, netdev, Stephen Hemminger David, I am resending this patch, sent 3 weeks ago, Patrick gave no answer. I believe it should be included in linux-2.6.38 and stable kernels. Some people found they had to change NIC RX ring sizes in order not missing frames (from 1024 to 2048), while root cause of the problem was this. Quoting Jesper : "I can now hit the system with a pktgen at 128 bytes, and see no drops/overruns while running iptables. (This packet load at 128bytes is 822 kpps and 840Mbit/s) (iptables ruleset is the big chains: 20929 rules: 81239)." Thanks [PATCH v4] netfilter: x_tables: dont block BH while reading counters Using "iptables -L" with a lot of rules have a too big BH latency. Jesper mentioned ~6 ms and worried of frame drops. Switch to a per_cpu seqlock scheme, so that taking a snapshot of counters doesnt need to block BH (for this cpu, but also other cpus). This adds two increments on seqlock sequence per ipt_do_table() call, its a reasonable cost for allowing "iptables -L" not block BH processing. Reported-by: Jesper Dangaard Brouer <hawk@comx.dk> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Patrick McHardy <kaber@trash.net> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Acked-by: Jesper Dangaard Brouer <hawk@comx.dk> --- include/linux/netfilter/x_tables.h | 10 +++--- net/ipv4/netfilter/arp_tables.c | 45 ++++++++------------------- net/ipv4/netfilter/ip_tables.c | 45 ++++++++------------------- net/ipv6/netfilter/ip6_tables.c | 45 ++++++++------------------- net/netfilter/x_tables.c | 3 + 5 files changed, 49 insertions(+), 99 deletions(-) diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h index 742bec0..6712e71 100644 --- a/include/linux/netfilter/x_tables.h +++ b/include/linux/netfilter/x_tables.h @@ -472,7 +472,7 @@ extern void xt_free_table_info(struct xt_table_info *info); * necessary for reading the counters. 
*/ struct xt_info_lock { - spinlock_t lock; + seqlock_t lock; unsigned char readers; }; DECLARE_PER_CPU(struct xt_info_lock, xt_info_locks); @@ -497,7 +497,7 @@ static inline void xt_info_rdlock_bh(void) local_bh_disable(); lock = &__get_cpu_var(xt_info_locks); if (likely(!lock->readers++)) - spin_lock(&lock->lock); + write_seqlock(&lock->lock); } static inline void xt_info_rdunlock_bh(void) @@ -505,7 +505,7 @@ static inline void xt_info_rdunlock_bh(void) struct xt_info_lock *lock = &__get_cpu_var(xt_info_locks); if (likely(!--lock->readers)) - spin_unlock(&lock->lock); + write_sequnlock(&lock->lock); local_bh_enable(); } @@ -516,12 +516,12 @@ static inline void xt_info_rdunlock_bh(void) */ static inline void xt_info_wrlock(unsigned int cpu) { - spin_lock(&per_cpu(xt_info_locks, cpu).lock); + write_seqlock(&per_cpu(xt_info_locks, cpu).lock); } static inline void xt_info_wrunlock(unsigned int cpu) { - spin_unlock(&per_cpu(xt_info_locks, cpu).lock); + write_sequnlock(&per_cpu(xt_info_locks, cpu).lock); } /* diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c index 3fac340..e855fff 100644 --- a/net/ipv4/netfilter/arp_tables.c +++ b/net/ipv4/netfilter/arp_tables.c @@ -710,42 +710,25 @@ static void get_counters(const struct xt_table_info *t, struct arpt_entry *iter; unsigned int cpu; unsigned int i; - unsigned int curcpu = get_cpu(); - - /* Instead of clearing (by a previous call to memset()) - * the counters and using adds, we set the counters - * with data used by 'current' CPU - * - * Bottom half has to be disabled to prevent deadlock - * if new softirq were to run and call ipt_do_table - */ - local_bh_disable(); - i = 0; - xt_entry_foreach(iter, t->entries[curcpu], t->size) { - SET_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); - ++i; - } - local_bh_enable(); - /* Processing counters from other cpus, we can let bottom half enabled, - * (preemption is disabled) - */ for_each_possible_cpu(cpu) { - if (cpu == curcpu) - continue; + seqlock_t *lock = &per_cpu(xt_info_locks, cpu).lock; + i = 0; - local_bh_disable(); - xt_info_wrlock(cpu); xt_entry_foreach(iter, t->entries[cpu], t->size) { - ADD_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); + u64 bcnt, pcnt; + unsigned int start; + + do { + start = read_seqbegin(lock); + bcnt = iter->counters.bcnt; + pcnt = iter->counters.pcnt; + } while (read_seqretry(lock, start)); + + ADD_COUNTER(counters[i], bcnt, pcnt); ++i; } - xt_info_wrunlock(cpu); - local_bh_enable(); } - put_cpu(); } static struct xt_counters *alloc_counters(const struct xt_table *table) @@ -759,7 +742,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) * about). 
*/ countersize = sizeof(struct xt_counters) * private->number; - counters = vmalloc(countersize); + counters = vzalloc(countersize); if (counters == NULL) return ERR_PTR(-ENOMEM); @@ -1007,7 +990,7 @@ static int __do_replace(struct net *net, const char *name, struct arpt_entry *iter; ret = 0; - counters = vmalloc(num_counters * sizeof(struct xt_counters)); + counters = vzalloc(num_counters * sizeof(struct xt_counters)); if (!counters) { ret = -ENOMEM; goto out; diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c index a846d63..652efea 100644 --- a/net/ipv4/netfilter/ip_tables.c +++ b/net/ipv4/netfilter/ip_tables.c @@ -884,42 +884,25 @@ get_counters(const struct xt_table_info *t, struct ipt_entry *iter; unsigned int cpu; unsigned int i; - unsigned int curcpu = get_cpu(); - - /* Instead of clearing (by a previous call to memset()) - * the counters and using adds, we set the counters - * with data used by 'current' CPU. - * - * Bottom half has to be disabled to prevent deadlock - * if new softirq were to run and call ipt_do_table - */ - local_bh_disable(); - i = 0; - xt_entry_foreach(iter, t->entries[curcpu], t->size) { - SET_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); - ++i; - } - local_bh_enable(); - /* Processing counters from other cpus, we can let bottom half enabled, - * (preemption is disabled) - */ for_each_possible_cpu(cpu) { - if (cpu == curcpu) - continue; + seqlock_t *lock = &per_cpu(xt_info_locks, cpu).lock; + i = 0; - local_bh_disable(); - xt_info_wrlock(cpu); xt_entry_foreach(iter, t->entries[cpu], t->size) { - ADD_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); + u64 bcnt, pcnt; + unsigned int start; + + do { + start = read_seqbegin(lock); + bcnt = iter->counters.bcnt; + pcnt = iter->counters.pcnt; + } while (read_seqretry(lock, start)); + + ADD_COUNTER(counters[i], bcnt, pcnt); ++i; /* macro does multi eval of i */ } - xt_info_wrunlock(cpu); - local_bh_enable(); } - put_cpu(); } static struct xt_counters *alloc_counters(const struct xt_table *table) @@ -932,7 +915,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) (other than comefrom, which userspace doesn't care about). 
*/ countersize = sizeof(struct xt_counters) * private->number; - counters = vmalloc(countersize); + counters = vzalloc(countersize); if (counters == NULL) return ERR_PTR(-ENOMEM); @@ -1203,7 +1186,7 @@ __do_replace(struct net *net, const char *name, unsigned int valid_hooks, struct ipt_entry *iter; ret = 0; - counters = vmalloc(num_counters * sizeof(struct xt_counters)); + counters = vzalloc(num_counters * sizeof(struct xt_counters)); if (!counters) { ret = -ENOMEM; goto out; diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c index 4555823..7d227c6 100644 --- a/net/ipv6/netfilter/ip6_tables.c +++ b/net/ipv6/netfilter/ip6_tables.c @@ -897,42 +897,25 @@ get_counters(const struct xt_table_info *t, struct ip6t_entry *iter; unsigned int cpu; unsigned int i; - unsigned int curcpu = get_cpu(); - - /* Instead of clearing (by a previous call to memset()) - * the counters and using adds, we set the counters - * with data used by 'current' CPU - * - * Bottom half has to be disabled to prevent deadlock - * if new softirq were to run and call ipt_do_table - */ - local_bh_disable(); - i = 0; - xt_entry_foreach(iter, t->entries[curcpu], t->size) { - SET_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); - ++i; - } - local_bh_enable(); - /* Processing counters from other cpus, we can let bottom half enabled, - * (preemption is disabled) - */ for_each_possible_cpu(cpu) { - if (cpu == curcpu) - continue; + seqlock_t *lock = &per_cpu(xt_info_locks, cpu).lock; + i = 0; - local_bh_disable(); - xt_info_wrlock(cpu); xt_entry_foreach(iter, t->entries[cpu], t->size) { - ADD_COUNTER(counters[i], iter->counters.bcnt, - iter->counters.pcnt); + u64 bcnt, pcnt; + unsigned int start; + + do { + start = read_seqbegin(lock); + bcnt = iter->counters.bcnt; + pcnt = iter->counters.pcnt; + } while (read_seqretry(lock, start)); + + ADD_COUNTER(counters[i], bcnt, pcnt); ++i; } - xt_info_wrunlock(cpu); - local_bh_enable(); } - put_cpu(); } static struct xt_counters *alloc_counters(const struct xt_table *table) @@ -945,7 +928,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table) (other than comefrom, which userspace doesn't care about). */ countersize = sizeof(struct xt_counters) * private->number; - counters = vmalloc(countersize); + counters = vzalloc(countersize); if (counters == NULL) return ERR_PTR(-ENOMEM); @@ -1216,7 +1199,7 @@ __do_replace(struct net *net, const char *name, unsigned int valid_hooks, struct ip6t_entry *iter; ret = 0; - counters = vmalloc(num_counters * sizeof(struct xt_counters)); + counters = vzalloc(num_counters * sizeof(struct xt_counters)); if (!counters) { ret = -ENOMEM; goto out; diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c index 8046350..c942376 100644 --- a/net/netfilter/x_tables.c +++ b/net/netfilter/x_tables.c @@ -1325,7 +1325,8 @@ static int __init xt_init(void) for_each_possible_cpu(i) { struct xt_info_lock *lock = &per_cpu(xt_info_locks, i); - spin_lock_init(&lock->lock); + + seqlock_init(&lock->lock); lock->readers = 0; } ^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH v4 net-next-2.6] netfilter: x_tables: dont block BH while reading counters 2011-01-08 16:45 ` Eric Dumazet @ 2011-01-09 21:31 ` Pablo Neira Ayuso 0 siblings, 0 replies; 26+ messages in thread From: Pablo Neira Ayuso @ 2011-01-09 21:31 UTC (permalink / raw) To: Eric Dumazet Cc: Patrick McHardy, David Miller, Jesper Dangaard Brouer, netfilter-devel, netdev, Stephen Hemminger On 08/01/11 17:45, Eric Dumazet wrote: > David, > > I am resending this patch, sent 3 weeks ago, Patrick gave no answer. > > I believe it should be included in linux-2.6.38 and stable kernels. > > Some people found they had to change NIC RX ring sizes in order not > missing frames (from 1024 to 2048), while root cause of the problem was > this. > > Quoting Jesper : "I can now hit the system with a pktgen at 128 bytes, > and see no drops/overruns while running iptables. (This packet load at > 128bytes is 822 kpps and 840Mbit/s) (iptables ruleset is the big chains: > 20929 rules: 81239)." I have enqueued this patch to me tree. http://1984.lsi.us.es/git/?p=net-2.6/.git;a=summary I'll pass it to davem for -stable submission. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Possible regression: Packet drops during iptables calls 2010-12-16 14:04 ` Jesper Dangaard Brouer 2010-12-16 14:12 ` Eric Dumazet @ 2010-12-16 14:13 ` Eric Dumazet 2010-12-16 14:20 ` Steven Rostedt 2 siblings, 0 replies; 26+ messages in thread From: Eric Dumazet @ 2010-12-16 14:13 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Arnaldo Carvalho de Melo, Steven Rostedt, Alexander Duyck, Stephen Hemminger, netfilter-devel, netdev, Peter P Waskiewicz Jr Le jeudi 16 décembre 2010 à 15:04 +0100, Jesper Dangaard Brouer a écrit : > Now I just need to calculate, how large a NIC buffer I need to buffer > 3.125 ms at 1Gbit/s. > > 3.125 ms * 1Gbit/s = 390625 bytes > > Can this be correct? > > How much buffer does each queue have in the 82576 NIC? > (Hope Alexander Duyck can answer this one?) > Worst case is if you receive very small frames, because 3.125 ms is about 5000 frames at 1Gbit/s -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 26+ messages in thread
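The arithmetic behind those two numbers, for anyone re-running it with their own stall time or link speed (the 84-byte minimum wire occupancy is standard Ethernet framing, not something measured in this thread):

#include <stdio.h>

int main(void)
{
    double link_bps = 1e9;       /* 1 Gbit/s */
    double stall_s  = 3.125e-3;  /* per-cpu BH-disabled window, from the trace */
    double bits     = link_bps * stall_s;

    /* a minimum-size frame occupies 64 + 8 (preamble/SFD) + 12 (IFG) = 84 bytes on the wire */
    double wire_bytes_per_min_frame = 84.0;
    double frames = bits / 8.0 / wire_bytes_per_min_frame;

    printf("bytes to buffer  : %.0f\n", bits / 8.0);   /* ~390625 */
    printf("worst-case frames: %.0f\n", frames);       /* ~4650, i.e. "about 5000" */
    return 0;
}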
* Re: Possible regression: Packet drops during iptables calls 2010-12-16 14:04 ` Jesper Dangaard Brouer 2010-12-16 14:12 ` Eric Dumazet 2010-12-16 14:13 ` Possible regression: Packet drops during iptables calls Eric Dumazet @ 2010-12-16 14:20 ` Steven Rostedt 2 siblings, 0 replies; 26+ messages in thread From: Steven Rostedt @ 2010-12-16 14:20 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: Arnaldo Carvalho de Melo, Eric Dumazet, Alexander Duyck, Stephen Hemminger, netfilter-devel, netdev, Peter P Waskiewicz Jr On Thu, 2010-12-16 at 15:04 +0100, Jesper Dangaard Brouer wrote: > > To do some further investigation into the unfortunate behavior of the > iptables get_counters() function I started to use "ftrace". This is a > really useful tool (thanks Steven Rostedt). > > # Select the tracer "function_graph" > echo function_graph > /sys/kernel/debug/tracing/current_tracer > > # Limit the number of function we look at: > echo local_bh_\* > /sys/kernel/debug/tracing/set_ftrace_filter > echo get_counters >> /sys/kernel/debug/tracing/set_ftrace_filter > > # Enable tracing while calling iptables > cd /sys/kernel/debug/tracing > echo 0 > trace > echo 1 > tracing_enabled; > taskset 1 iptables -vnL > /dev/null ; > echo 0 > tracing_enabled > cat trace | less Just an fyi, you can do the above much easier with trace-cmd: git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git # trace-cmd record -p function_graph -l 'local_bh_*' -l get_counters taskset 1 iptables -vnL > /dev/null # trace-cmd report -- Steve > > > The reduced output: > > # tracer: function_graph > # > # CPU DURATION FUNCTION CALLS > # | | | | | | | > 2) 2.772 us | local_bh_disable(); > .... > 0) 0.228 us | local_bh_enable(); > 0) | get_counters() { > 0) 0.232 us | local_bh_disable(); > 0) 7.919 us | local_bh_enable(); > 0) ! 109467.1 us | } > 0) 2.344 us | local_bh_disable(); > > > The output show that we spend no less that 100 ms with local BH > disabled. So, no wonder that this causes packet drops in the NIC > (attached to this CPU). > > My iptables rule set in question is also very large, it contains: > Chains: 20929 > Rules: 81239 > > The vmalloc size is approx 19 MB (19.820.544 bytes) (see > /proc/vmallocinfo). Looking through vmallocinfo I realized that > even-though I only have 16 CPUs, there is 32 allocated rulesets > "xt_alloc_table_info" (for the filter table). Thus, I have approx > 634MB iptables filter rules in the kernel, half of which is totally > unused. > > Guess this is because we use: "for_each_possible_cpu" instead of > "for_each_online_cpu". (Feel free to fix this, or point me to some > documentation of this CPU hotplug stuff... I see we are missing > get_cpu() and put_cpu() a lot of places). > > > The GOOD NEWS, is that moving the local BH disable section into the > "for_each_possible_cpu" fixed the problem with packet drops during > iptables calls. > > I wanted to profile with ftrace on the new code, but I cannot get the > measurement I want. Perhaps Steven or Acme can help? > > Now I want to measure the time used between the local_bh_disable() and > local_bh_enable, within the loop. I cannot figure out howto do that? > The new trace looks almost the same as before, just a lot of > local_bh_* inside the get_counters() function call. > > Guess is that the time spend is: 100 ms / 32 = 3.125 ms. > > Now I just need to calculate, how large a NIC buffer I need to buffer > 3.125 ms at 1Gbit/s. > > 3.125 ms * 1Gbit/s = 390625 bytes > > Can this be correct? 
> > How much buffer does each queue have in the 82576 NIC? > (Hope Alexander Duyck can answer this one?) > ^ permalink raw reply [flat|nested] 26+ messages in thread
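On the for_each_possible_cpu observation quoted above, the memory figures also follow directly from the numbers Jesper posted (one ~19.8 MB xt_table_info copy per possible cpu, 32 possible vs 16 present). A tiny sketch, purely illustrative:

#include <stdio.h>

int main(void)
{
    /* numbers quoted from the vmallocinfo observation above */
    unsigned long per_copy = 19820544UL;  /* one filter-table xt_table_info blob */
    unsigned int  possible = 32;          /* copies allocated via for_each_possible_cpu */
    unsigned int  online   = 16;          /* CPUs the box actually has */

    /* decimal MB, matching the ~634 MB figure in the quoted mail */
    printf("allocated: %lu MB\n", per_copy * possible / 1000000);
    printf("used     : %lu MB\n", per_copy * online   / 1000000);
    printf("unused   : %lu MB\n", per_copy * (possible - online) / 1000000);
    return 0;
}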
end of thread, other threads:[~2011-01-09 21:31 UTC | newest]

Thread overview: 26+ messages:
2010-12-14 14:46 Possible regression: Packet drops during iptables calls Jesper Dangaard Brouer
2010-12-14 15:31 ` Eric Dumazet
2010-12-14 16:09 ` Jesper Dangaard Brouer
2010-12-14 16:24 ` Eric Dumazet
2010-12-16 14:04 ` Jesper Dangaard Brouer
2010-12-16 14:12 ` Eric Dumazet
2010-12-16 14:24 ` Jesper Dangaard Brouer
2010-12-16 14:29 ` Eric Dumazet
2010-12-16 15:02 ` Eric Dumazet
2010-12-16 16:07 ` [PATCH net-next-2.6] netfilter: ip_tables: dont block BH while reading counters Eric Dumazet
2010-12-16 16:53 ` [PATCH v2 " Eric Dumazet
2010-12-16 17:31 ` Stephen Hemminger
2010-12-16 17:53 ` [PATCH v3 net-next-2.6] netfilter: x_tables: " Eric Dumazet
2010-12-16 17:57 ` Stephen Hemminger
2010-12-16 19:58 ` Eric Dumazet
2010-12-16 20:12 ` Stephen Hemminger
2010-12-16 20:40 ` Eric Dumazet
2010-12-16 17:57 ` Stephen Hemminger
2010-12-18 4:29 ` [PATCH v4 " Eric Dumazet
2010-12-20 13:42 ` Jesper Dangaard Brouer
2010-12-20 14:45 ` Eric Dumazet
2010-12-21 16:48 ` Jesper Dangaard Brouer
2011-01-08 16:45 ` Eric Dumazet
2011-01-09 21:31 ` Pablo Neira Ayuso
2010-12-16 14:13 ` Possible regression: Packet drops during iptables calls Eric Dumazet
2010-12-16 14:20 ` Steven Rostedt