From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
To: Eric Dumazet <dada1@cosmosbay.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>,
David Miller <davem@davemloft.net>,
Jarek Poplawski <jarkao2@gmail.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
Ingo Molnar <mingo@elte.hu>, Paul Mackerras <paulus@samba.org>,
paulmck@linux.vnet.ibm.com, Evgeniy Polyakov <zbr@ioremap.net>,
kaber@trash.net, jeff.chua.linux@gmail.com, laijs@cn.fujitsu.com,
jengelh@medozas.de, r000n@r000n.net,
linux-kernel@vger.kernel.org, netfilter-devel@vger.kernel.org,
netdev@vger.kernel.org, benh@kernel.crashing.org
Subject: Re: [PATCH] netfilter: use per-CPU recursive lock {XV}
Date: Sun, 26 Apr 2009 14:56:46 -0400
Message-ID: <20090426185646.GB29238@Krystal>
In-Reply-To: <49F4A6E3.7080102@cosmosbay.com>
* Eric Dumazet (dada1@cosmosbay.com) wrote:
> From: Stephen Hemminger <shemminger@vyatta.com>
>
> > Epilogue due to master Jarek. Lockdep carest not about the locking
> > doth bestowed. Therefore no keys are needed.
> >
> > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
>
> So far, so good, should be ready for inclusion now, nobody complained :)
>
> I include the final patch, merge of your last two patches.
>
> David, could you please review it once again and apply it if it's OK?
>
> Thanks to all for your help and patience
>
> [PATCH] netfilter: use per-CPU recursive lock {XV}
Hi Eric,
Please... could you rename this patch according to Linus' comments?
A suitable name would probably be:
[PATCH] netfilter: use bh disabling with per-cpu read-write lock
>
> In days of old in 2.6.29, netfilter did locketh using a
> lock of the reader kind when doing its table business, and do
> a writer when with pen in hand like an overworked accountant
> did replace the tables. This sucketh and caused the single
> lock to fly back and forth like a poor errant boy.
>
> But then netfilter was blessed with RCU and the performance
> was divine, but alas there were those that suffered for
> trying to replace their many rules one at a time.
>
> So now RCU must be vanquished from the scene, and better
> chastity belts be placed upon this valuable asset most dear.
> The locks that were but one are now replaced by one per suitor.
>
> The repair was made after much discussion involving
> Eric the wise, and Linus the foul. With flowers springing
> up amid the thorns some peace has finally prevailed and
> all is soothed. This patch and purple prose were penned
> in honor of "Talk like Shakespeare" day.
>
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> ---
> What hath changed over the last two setting suns:
>
> * more words, mostly correct...
>
> * no need to locketh for writeh on current cpu tis
> always so
>
> * the locking of all cpu's on replace is always done as
> part of the get_counters cycle, so the synchronize swipe
> in replace tables is gone with only a comment remaining
>
> * Epilogue due to master Jarek. Lockdep carest not about
> the locking doth bestowed. Therefore no keys are needed.
>
> include/linux/netfilter/x_tables.h | 55 ++++++++++-
> net/ipv4/netfilter/arp_tables.c | 125 +++++++-------------------
> net/ipv4/netfilter/ip_tables.c | 126 +++++++--------------------
> net/ipv6/netfilter/ip6_tables.c | 123 +++++++-------------------
> net/netfilter/x_tables.c | 50 +++++-----
> 5 files changed, 183 insertions(+), 296 deletions(-)
>
> diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
> index 7b1a652..511debb 100644
> --- a/include/linux/netfilter/x_tables.h
> +++ b/include/linux/netfilter/x_tables.h
> @@ -354,9 +354,6 @@ struct xt_table
> /* What hooks you will enter on */
> unsigned int valid_hooks;
>
> - /* Lock for the curtain */
> - struct mutex lock;
> -
> /* Man behind the curtain... */
> struct xt_table_info *private;
>
> @@ -434,8 +431,56 @@ extern void xt_proto_fini(struct net *net, u_int8_t af);
>
> extern struct xt_table_info *xt_alloc_table_info(unsigned int size);
> extern void xt_free_table_info(struct xt_table_info *info);
> -extern void xt_table_entry_swap_rcu(struct xt_table_info *old,
> - struct xt_table_info *new);
> +
> +/*
> + * Per-CPU read/write lock associated with per-cpu table entries.
> + * This is not a general solution but makes reader locking fast since
> + * there is no shared variable to cause cache ping-pong; but adds an
> + * additional write-side penalty since update must lock all
> + * possible CPU's.
> + *
> + * Read lock is used by ip/arp/ip6 tables rule processing which runs per-cpu.
> + * It needs to ensure that the rules are not being changed while packet
> + * is being processed. In some cases, the read lock will be acquired
> + * twice on the same CPU; this is okay because read locks handle nesting.
> + *
> + * Write lock is used in two cases:
> + * 1. reading counter values
> + * all readers need to be stopped and the per-CPU values are summed.
> + *
> + * 2. replacing tables
> + * any readers that are using the old tables have to complete
> + * before freeing the old table. This is handled by reading
> + * as a side effect of reading counters
It would not hurt to state here that the write lock must _always_ be
taken with bh disabled.
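For instance, something like this (untested sketch on my side; in_softirq()
is a coarse test, since it is also true inside softirq handlers, but it does
catch callers that forgot to disable BHs) would turn the rule into a cheap
runtime check:

	static inline void xt_info_wrlock(unsigned int cpu)
	{
		/* Callers must have bottom halves disabled, see get_counters() */
		WARN_ON_ONCE(!in_softirq());
		write_lock(&per_cpu(xt_info_locks, cpu));
	}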
> + */
> +DECLARE_PER_CPU(rwlock_t, xt_info_locks);
> +
> +static inline void xt_info_rdlock_bh(void)
> +{
> + /*
> + * Note: cannot use read_lock_bh(&__get_cpu_var(xt_info_locks))
> + * because we need to ensure that preemption is disabled before
> + * acquiring the per-cpu variable, so do it as a two-step process
> + */
> + local_bh_disable();
> + read_lock(&__get_cpu_var(xt_info_locks));
> +}
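To make the note above concrete: in the single-step variant the per-cpu
address is computed before read_lock_bh() gets a chance to disable BHs, so
on a preemptible kernel the task may migrate in between (illustration only):

	/* Broken single-step variant: */
	read_lock_bh(&__get_cpu_var(xt_info_locks));
	/*
	 * __get_cpu_var() may be evaluated on CPU A; if we migrate to
	 * CPU B before the lock is actually taken, we end up holding
	 * CPU A's lock while running on CPU B.
	 */

Hence the explicit local_bh_disable() first.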
> +
> +static inline void xt_info_rdunlock_bh(void)
> +{
> + read_unlock_bh(&__get_cpu_var(xt_info_locks));
> +}
> +
> +static inline void xt_info_wrlock(unsigned int cpu)
> +{
> + write_lock(&per_cpu(xt_info_locks, cpu));
> +}
> +
> +static inline void xt_info_wrunlock(unsigned int cpu)
> +{
> +
> + write_unlock(&per_cpu(xt_info_locks, cpu));
> +}
>
> /*
> * This helper is performance critical and must be inlined
> diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
> index 5ba533d..831fe18 100644
> --- a/net/ipv4/netfilter/arp_tables.c
> +++ b/net/ipv4/netfilter/arp_tables.c
> @@ -253,9 +253,9 @@ unsigned int arpt_do_table(struct sk_buff *skb,
> indev = in ? in->name : nulldevname;
> outdev = out ? out->name : nulldevname;
>
> - rcu_read_lock_bh();
> - private = rcu_dereference(table->private);
> - table_base = rcu_dereference(private->entries[smp_processor_id()]);
> + xt_info_rdlock_bh();
> + private = table->private;
> + table_base = private->entries[smp_processor_id()];
>
> e = get_entry(table_base, private->hook_entry[hook]);
> back = get_entry(table_base, private->underflow[hook]);
> @@ -273,6 +273,7 @@ unsigned int arpt_do_table(struct sk_buff *skb,
>
> hdr_len = sizeof(*arp) + (2 * sizeof(struct in_addr)) +
> (2 * skb->dev->addr_len);
> +
This is not a whitespace cleanup patch.
> ADD_COUNTER(e->counters, hdr_len, 1);
>
> t = arpt_get_target(e);
> @@ -328,8 +329,7 @@ unsigned int arpt_do_table(struct sk_buff *skb,
> e = (void *)e + e->next_offset;
> }
> } while (!hotdrop);
> -
Whitespace cleanup?
> - rcu_read_unlock_bh();
> + xt_info_rdunlock_bh();
>
> if (hotdrop)
> return NF_DROP;
> @@ -711,9 +711,12 @@ static void get_counters(const struct xt_table_info *t,
> /* Instead of clearing (by a previous call to memset())
> * the counters and using adds, we set the counters
> * with data used by 'current' CPU
> - * We dont care about preemption here.
> + *
> + * Bottom half has to be disabled to prevent deadlock
> + * if new softirq were to run and call ipt_do_table
> */
> - curcpu = raw_smp_processor_id();
> + local_bh_disable();
> + curcpu = smp_processor_id();
>
> i = 0;
> ARPT_ENTRY_ITERATE(t->entries[curcpu],
> @@ -726,73 +729,22 @@ static void get_counters(const struct xt_table_info *t,
> if (cpu == curcpu)
> continue;
> i = 0;
> + xt_info_wrlock(cpu);
> ARPT_ENTRY_ITERATE(t->entries[cpu],
> t->size,
> add_entry_to_counter,
> counters,
> &i);
> + xt_info_wrunlock(cpu);
> }
> -}
> -
> -
> -/* We're lazy, and add to the first CPU; overflow works its fey magic
> - * and everything is OK. */
> -static int
> -add_counter_to_entry(struct arpt_entry *e,
> - const struct xt_counters addme[],
> - unsigned int *i)
> -{
> - ADD_COUNTER(e->counters, addme[*i].bcnt, addme[*i].pcnt);
> -
> - (*i)++;
> - return 0;
> -}
> -
> -/* Take values from counters and add them back onto the current cpu */
> -static void put_counters(struct xt_table_info *t,
> - const struct xt_counters counters[])
> -{
> - unsigned int i, cpu;
> -
> - local_bh_disable();
> - cpu = smp_processor_id();
> - i = 0;
> - ARPT_ENTRY_ITERATE(t->entries[cpu],
> - t->size,
> - add_counter_to_entry,
> - counters,
> - &i);
> local_bh_enable();
> }
Did you really need to move add_counter_to_entry and put_counters in
this patch? This also seems more like a cleanup to me, if it is even
one. It does make the patch harder to follow though.
>
> -static inline int
> -zero_entry_counter(struct arpt_entry *e, void *arg)
> -{
> - e->counters.bcnt = 0;
> - e->counters.pcnt = 0;
> - return 0;
> -}
> -
> -static void
> -clone_counters(struct xt_table_info *newinfo, const struct xt_table_info *info)
> -{
> - unsigned int cpu;
> - const void *loc_cpu_entry = info->entries[raw_smp_processor_id()];
> -
> - memcpy(newinfo, info, offsetof(struct xt_table_info, entries));
> - for_each_possible_cpu(cpu) {
> - memcpy(newinfo->entries[cpu], loc_cpu_entry, info->size);
> - ARPT_ENTRY_ITERATE(newinfo->entries[cpu], newinfo->size,
> - zero_entry_counter, NULL);
> - }
> -}
> -
> static struct xt_counters *alloc_counters(struct xt_table *table)
> {
> unsigned int countersize;
> struct xt_counters *counters;
> struct xt_table_info *private = table->private;
> - struct xt_table_info *info;
>
> /* We need atomic snapshot of counters: rest doesn't change
> * (other than comefrom, which userspace doesn't care
> @@ -802,30 +754,11 @@ static struct xt_counters *alloc_counters(struct xt_table *table)
> counters = vmalloc_node(countersize, numa_node_id());
>
> if (counters == NULL)
> - goto nomem;
> -
> - info = xt_alloc_table_info(private->size);
> - if (!info)
> - goto free_counters;
> -
> - clone_counters(info, private);
> -
> - mutex_lock(&table->lock);
> - xt_table_entry_swap_rcu(private, info);
> - synchronize_net(); /* Wait until smoke has cleared */
> + return ERR_PTR(-ENOMEM);
>
> - get_counters(info, counters);
> - put_counters(private, counters);
> - mutex_unlock(&table->lock);
> -
> - xt_free_table_info(info);
> + get_counters(private, counters);
>
> return counters;
> -
> - free_counters:
> - vfree(counters);
> - nomem:
> - return ERR_PTR(-ENOMEM);
> }
>
> static int copy_entries_to_user(unsigned int total_size,
> @@ -1094,8 +1027,9 @@ static int __do_replace(struct net *net, const char *name,
> (newinfo->number <= oldinfo->initial_entries))
> module_put(t->me);
>
> - /* Get the old counters. */
> + /* Get the old counters, and synchronize with replace */
> get_counters(oldinfo, counters);
> +
> /* Decrease module usage counts and free resource */
> loc_cpu_old_entry = oldinfo->entries[raw_smp_processor_id()];
> ARPT_ENTRY_ITERATE(loc_cpu_old_entry, oldinfo->size, cleanup_entry,
> @@ -1165,10 +1099,23 @@ static int do_replace(struct net *net, void __user *user, unsigned int len)
> return ret;
> }
>
> +/* We're lazy, and add to the first CPU; overflow works its fey magic
> + * and everything is OK. */
> +static int
> +add_counter_to_entry(struct arpt_entry *e,
> + const struct xt_counters addme[],
> + unsigned int *i)
> +{
> + ADD_COUNTER(e->counters, addme[*i].bcnt, addme[*i].pcnt);
> +
> + (*i)++;
> + return 0;
> +}
> +
> static int do_add_counters(struct net *net, void __user *user, unsigned int len,
> int compat)
> {
> - unsigned int i;
> + unsigned int i, curcpu;
> struct xt_counters_info tmp;
> struct xt_counters *paddc;
> unsigned int num_counters;
> @@ -1224,26 +1171,26 @@ static int do_add_counters(struct net *net, void __user *user, unsigned int len,
> goto free;
> }
>
> - mutex_lock(&t->lock);
> + local_bh_disable();
> private = t->private;
> if (private->number != num_counters) {
> ret = -EINVAL;
> goto unlock_up_free;
> }
>
> - preempt_disable();
> i = 0;
> /* Choose the copy that is on our node */
> - loc_cpu_entry = private->entries[smp_processor_id()];
> + curcpu = smp_processor_id();
> + loc_cpu_entry = private->entries[curcpu];
> + xt_info_wrlock(curcpu);
> ARPT_ENTRY_ITERATE(loc_cpu_entry,
> private->size,
> add_counter_to_entry,
> paddc,
> &i);
> - preempt_enable();
> + xt_info_wrunlock(curcpu);
> unlock_up_free:
> - mutex_unlock(&t->lock);
> -
> + local_bh_enable();
> xt_table_unlock(t);
> module_put(t->me);
> free:
> diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
> index 810c0b6..2ec8d72 100644
> --- a/net/ipv4/netfilter/ip_tables.c
> +++ b/net/ipv4/netfilter/ip_tables.c
> @@ -338,10 +338,9 @@ ipt_do_table(struct sk_buff *skb,
> tgpar.hooknum = hook;
>
> IP_NF_ASSERT(table->valid_hooks & (1 << hook));
> -
> - rcu_read_lock_bh();
> - private = rcu_dereference(table->private);
> - table_base = rcu_dereference(private->entries[smp_processor_id()]);
> + xt_info_rdlock_bh();
> + private = table->private;
> + table_base = private->entries[smp_processor_id()];
>
> e = get_entry(table_base, private->hook_entry[hook]);
>
> @@ -436,8 +435,7 @@ ipt_do_table(struct sk_buff *skb,
> e = (void *)e + e->next_offset;
> }
> } while (!hotdrop);
> -
> - rcu_read_unlock_bh();
> + xt_info_rdunlock_bh();
>
> #ifdef DEBUG_ALLOW_ALL
> return NF_ACCEPT;
> @@ -896,10 +894,13 @@ get_counters(const struct xt_table_info *t,
>
> /* Instead of clearing (by a previous call to memset())
> * the counters and using adds, we set the counters
> - * with data used by 'current' CPU
> - * We dont care about preemption here.
> + * with data used by 'current' CPU.
> + *
> + * Bottom half has to be disabled to prevent deadlock
> + * if new softirq were to run and call ipt_do_table
> */
> - curcpu = raw_smp_processor_id();
> + local_bh_disable();
> + curcpu = smp_processor_id();
>
> i = 0;
> IPT_ENTRY_ITERATE(t->entries[curcpu],
> @@ -912,74 +913,22 @@ get_counters(const struct xt_table_info *t,
> if (cpu == curcpu)
> continue;
> i = 0;
> + xt_info_wrlock(cpu);
> IPT_ENTRY_ITERATE(t->entries[cpu],
> t->size,
> add_entry_to_counter,
> counters,
> &i);
> + xt_info_wrunlock(cpu);
> }
> -
> -}
> -
> -/* We're lazy, and add to the first CPU; overflow works its fey magic
> - * and everything is OK. */
> -static int
> -add_counter_to_entry(struct ipt_entry *e,
> - const struct xt_counters addme[],
> - unsigned int *i)
> -{
> - ADD_COUNTER(e->counters, addme[*i].bcnt, addme[*i].pcnt);
> -
> - (*i)++;
> - return 0;
> -}
> -
> -/* Take values from counters and add them back onto the current cpu */
> -static void put_counters(struct xt_table_info *t,
> - const struct xt_counters counters[])
> -{
> - unsigned int i, cpu;
> -
> - local_bh_disable();
> - cpu = smp_processor_id();
> - i = 0;
> - IPT_ENTRY_ITERATE(t->entries[cpu],
> - t->size,
> - add_counter_to_entry,
> - counters,
> - &i);
> local_bh_enable();
> }
>
> -
> -static inline int
> -zero_entry_counter(struct ipt_entry *e, void *arg)
> -{
> - e->counters.bcnt = 0;
> - e->counters.pcnt = 0;
> - return 0;
> -}
> -
> -static void
> -clone_counters(struct xt_table_info *newinfo, const struct xt_table_info *info)
> -{
> - unsigned int cpu;
> - const void *loc_cpu_entry = info->entries[raw_smp_processor_id()];
> -
> - memcpy(newinfo, info, offsetof(struct xt_table_info, entries));
> - for_each_possible_cpu(cpu) {
> - memcpy(newinfo->entries[cpu], loc_cpu_entry, info->size);
> - IPT_ENTRY_ITERATE(newinfo->entries[cpu], newinfo->size,
> - zero_entry_counter, NULL);
> - }
> -}
> -
> static struct xt_counters * alloc_counters(struct xt_table *table)
> {
> unsigned int countersize;
> struct xt_counters *counters;
> struct xt_table_info *private = table->private;
> - struct xt_table_info *info;
>
> /* We need atomic snapshot of counters: rest doesn't change
> (other than comefrom, which userspace doesn't care
> @@ -988,30 +937,11 @@ static struct xt_counters * alloc_counters(struct xt_table *table)
> counters = vmalloc_node(countersize, numa_node_id());
>
> if (counters == NULL)
> - goto nomem;
> + return ERR_PTR(-ENOMEM);
>
> - info = xt_alloc_table_info(private->size);
> - if (!info)
> - goto free_counters;
> -
> - clone_counters(info, private);
> -
> - mutex_lock(&table->lock);
> - xt_table_entry_swap_rcu(private, info);
> - synchronize_net(); /* Wait until smoke has cleared */
> -
> - get_counters(info, counters);
> - put_counters(private, counters);
> - mutex_unlock(&table->lock);
> -
> - xt_free_table_info(info);
> + get_counters(private, counters);
>
> return counters;
> -
> - free_counters:
> - vfree(counters);
> - nomem:
> - return ERR_PTR(-ENOMEM);
> }
>
> static int
> @@ -1306,8 +1236,9 @@ __do_replace(struct net *net, const char *name, unsigned int valid_hooks,
> (newinfo->number <= oldinfo->initial_entries))
> module_put(t->me);
>
> - /* Get the old counters. */
> + /* Get the old counters, and synchronize with replace */
> get_counters(oldinfo, counters);
> +
> /* Decrease module usage counts and free resource */
> loc_cpu_old_entry = oldinfo->entries[raw_smp_processor_id()];
> IPT_ENTRY_ITERATE(loc_cpu_old_entry, oldinfo->size, cleanup_entry,
> @@ -1377,11 +1308,23 @@ do_replace(struct net *net, void __user *user, unsigned int len)
> return ret;
> }
>
> +/* We're lazy, and add to the first CPU; overflow works its fey magic
> + * and everything is OK. */
> +static int
> +add_counter_to_entry(struct ipt_entry *e,
> + const struct xt_counters addme[],
> + unsigned int *i)
> +{
> + ADD_COUNTER(e->counters, addme[*i].bcnt, addme[*i].pcnt);
> +
> + (*i)++;
> + return 0;
> +}
>
> static int
> do_add_counters(struct net *net, void __user *user, unsigned int len, int compat)
> {
> - unsigned int i;
> + unsigned int i, curcpu;
> struct xt_counters_info tmp;
> struct xt_counters *paddc;
> unsigned int num_counters;
> @@ -1437,25 +1380,26 @@ do_add_counters(struct net *net, void __user *user, unsigned int len, int compat
> goto free;
> }
>
> - mutex_lock(&t->lock);
> + local_bh_disable();
> private = t->private;
> if (private->number != num_counters) {
> ret = -EINVAL;
> goto unlock_up_free;
> }
>
> - preempt_disable();
> i = 0;
> /* Choose the copy that is on our node */
> - loc_cpu_entry = private->entries[raw_smp_processor_id()];
> + curcpu = smp_processor_id();
> + loc_cpu_entry = private->entries[curcpu];
> + xt_info_wrlock(curcpu);
> IPT_ENTRY_ITERATE(loc_cpu_entry,
> private->size,
> add_counter_to_entry,
> paddc,
> &i);
> - preempt_enable();
> + xt_info_wrunlock(curcpu);
> unlock_up_free:
> - mutex_unlock(&t->lock);
> + local_bh_enable();
> xt_table_unlock(t);
> module_put(t->me);
> free:
> diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
> index 800ae85..219e165 100644
> --- a/net/ipv6/netfilter/ip6_tables.c
> +++ b/net/ipv6/netfilter/ip6_tables.c
> @@ -365,9 +365,9 @@ ip6t_do_table(struct sk_buff *skb,
>
> IP_NF_ASSERT(table->valid_hooks & (1 << hook));
>
> - rcu_read_lock_bh();
> - private = rcu_dereference(table->private);
> - table_base = rcu_dereference(private->entries[smp_processor_id()]);
> + xt_info_rdlock_bh();
> + private = table->private;
> + table_base = private->entries[smp_processor_id()];
>
> e = get_entry(table_base, private->hook_entry[hook]);
>
> @@ -466,7 +466,7 @@ ip6t_do_table(struct sk_buff *skb,
> #ifdef CONFIG_NETFILTER_DEBUG
> ((struct ip6t_entry *)table_base)->comefrom = NETFILTER_LINK_POISON;
> #endif
> - rcu_read_unlock_bh();
> + xt_info_rdunlock_bh();
>
> #ifdef DEBUG_ALLOW_ALL
> return NF_ACCEPT;
> @@ -926,9 +926,12 @@ get_counters(const struct xt_table_info *t,
> /* Instead of clearing (by a previous call to memset())
> * the counters and using adds, we set the counters
> * with data used by 'current' CPU
> - * We dont care about preemption here.
> + *
> + * Bottom half has to be disabled to prevent deadlock
> + * if new softirq were to run and call ipt_do_table
> */
> - curcpu = raw_smp_processor_id();
> + local_bh_disable();
> + curcpu = smp_processor_id();
>
> i = 0;
> IP6T_ENTRY_ITERATE(t->entries[curcpu],
> @@ -941,72 +944,22 @@ get_counters(const struct xt_table_info *t,
> if (cpu == curcpu)
> continue;
> i = 0;
> + xt_info_wrlock(cpu);
> IP6T_ENTRY_ITERATE(t->entries[cpu],
> t->size,
> add_entry_to_counter,
> counters,
> &i);
> + xt_info_wrunlock(cpu);
> }
> -}
> -
> -/* We're lazy, and add to the first CPU; overflow works its fey magic
> - * and everything is OK. */
> -static int
> -add_counter_to_entry(struct ip6t_entry *e,
> - const struct xt_counters addme[],
> - unsigned int *i)
> -{
> - ADD_COUNTER(e->counters, addme[*i].bcnt, addme[*i].pcnt);
> -
> - (*i)++;
> - return 0;
> -}
> -
> -/* Take values from counters and add them back onto the current cpu */
> -static void put_counters(struct xt_table_info *t,
> - const struct xt_counters counters[])
> -{
> - unsigned int i, cpu;
> -
> - local_bh_disable();
> - cpu = smp_processor_id();
> - i = 0;
> - IP6T_ENTRY_ITERATE(t->entries[cpu],
> - t->size,
> - add_counter_to_entry,
> - counters,
> - &i);
> local_bh_enable();
> }
>
> -static inline int
> -zero_entry_counter(struct ip6t_entry *e, void *arg)
> -{
> - e->counters.bcnt = 0;
> - e->counters.pcnt = 0;
> - return 0;
> -}
> -
> -static void
> -clone_counters(struct xt_table_info *newinfo, const struct xt_table_info *info)
> -{
> - unsigned int cpu;
> - const void *loc_cpu_entry = info->entries[raw_smp_processor_id()];
> -
> - memcpy(newinfo, info, offsetof(struct xt_table_info, entries));
> - for_each_possible_cpu(cpu) {
> - memcpy(newinfo->entries[cpu], loc_cpu_entry, info->size);
> - IP6T_ENTRY_ITERATE(newinfo->entries[cpu], newinfo->size,
> - zero_entry_counter, NULL);
> - }
> -}
> -
> static struct xt_counters *alloc_counters(struct xt_table *table)
> {
> unsigned int countersize;
> struct xt_counters *counters;
> struct xt_table_info *private = table->private;
> - struct xt_table_info *info;
>
> /* We need atomic snapshot of counters: rest doesn't change
> (other than comefrom, which userspace doesn't care
> @@ -1015,30 +968,11 @@ static struct xt_counters *alloc_counters(struct xt_table *table)
> counters = vmalloc_node(countersize, numa_node_id());
>
> if (counters == NULL)
> - goto nomem;
> + return ERR_PTR(-ENOMEM);
>
> - info = xt_alloc_table_info(private->size);
> - if (!info)
> - goto free_counters;
> -
> - clone_counters(info, private);
> -
> - mutex_lock(&table->lock);
> - xt_table_entry_swap_rcu(private, info);
> - synchronize_net(); /* Wait until smoke has cleared */
> -
> - get_counters(info, counters);
> - put_counters(private, counters);
> - mutex_unlock(&table->lock);
> -
> - xt_free_table_info(info);
> + get_counters(private, counters);
>
> return counters;
> -
> - free_counters:
> - vfree(counters);
> - nomem:
> - return ERR_PTR(-ENOMEM);
> }
>
> static int
> @@ -1334,8 +1268,9 @@ __do_replace(struct net *net, const char *name, unsigned int valid_hooks,
> (newinfo->number <= oldinfo->initial_entries))
> module_put(t->me);
>
> - /* Get the old counters. */
> + /* Get the old counters, and synchronize with replace */
> get_counters(oldinfo, counters);
> +
> /* Decrease module usage counts and free resource */
> loc_cpu_old_entry = oldinfo->entries[raw_smp_processor_id()];
> IP6T_ENTRY_ITERATE(loc_cpu_old_entry, oldinfo->size, cleanup_entry,
> @@ -1405,11 +1340,24 @@ do_replace(struct net *net, void __user *user, unsigned int len)
> return ret;
> }
>
> +/* We're lazy, and add to the first CPU; overflow works its fey magic
> + * and everything is OK. */
> +static int
> +add_counter_to_entry(struct ip6t_entry *e,
> + const struct xt_counters addme[],
> + unsigned int *i)
> +{
> + ADD_COUNTER(e->counters, addme[*i].bcnt, addme[*i].pcnt);
> +
> + (*i)++;
> + return 0;
> +}
> +
> static int
> do_add_counters(struct net *net, void __user *user, unsigned int len,
> int compat)
> {
> - unsigned int i;
> + unsigned int i, curcpu;
> struct xt_counters_info tmp;
> struct xt_counters *paddc;
> unsigned int num_counters;
> @@ -1465,25 +1413,28 @@ do_add_counters(struct net *net, void __user *user, unsigned int len,
> goto free;
> }
>
> - mutex_lock(&t->lock);
> +
Stray blank line added.
> + local_bh_disable();
> private = t->private;
> if (private->number != num_counters) {
> ret = -EINVAL;
> goto unlock_up_free;
> }
>
> - preempt_disable();
> i = 0;
> /* Choose the copy that is on our node */
> - loc_cpu_entry = private->entries[raw_smp_processor_id()];
> + curcpu = smp_processor_id();
> + xt_info_wrlock(curcpu);
> + loc_cpu_entry = private->entries[curcpu];
> IP6T_ENTRY_ITERATE(loc_cpu_entry,
> private->size,
> add_counter_to_entry,
> paddc,
> &i);
> - preempt_enable();
> + xt_info_wrunlock(curcpu);
> +
Inconsistent blank line.
> unlock_up_free:
> - mutex_unlock(&t->lock);
> + local_bh_enable();
> xt_table_unlock(t);
> module_put(t->me);
> free:
> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> index 509a956..5807a4d 100644
> --- a/net/netfilter/x_tables.c
> +++ b/net/netfilter/x_tables.c
> @@ -625,20 +625,6 @@ void xt_free_table_info(struct xt_table_info *info)
> }
> EXPORT_SYMBOL(xt_free_table_info);
>
> -void xt_table_entry_swap_rcu(struct xt_table_info *oldinfo,
> - struct xt_table_info *newinfo)
> -{
> - unsigned int cpu;
> -
> - for_each_possible_cpu(cpu) {
> - void *p = oldinfo->entries[cpu];
> - rcu_assign_pointer(oldinfo->entries[cpu], newinfo->entries[cpu]);
> - newinfo->entries[cpu] = p;
> - }
> -
> -}
> -EXPORT_SYMBOL_GPL(xt_table_entry_swap_rcu);
> -
> /* Find table by name, grabs mutex & ref. Returns ERR_PTR() on error. */
> struct xt_table *xt_find_table_lock(struct net *net, u_int8_t af,
> const char *name)
> @@ -676,32 +662,43 @@ void xt_compat_unlock(u_int8_t af)
> EXPORT_SYMBOL_GPL(xt_compat_unlock);
> #endif
>
> +DEFINE_PER_CPU(rwlock_t, xt_info_locks);
> +EXPORT_PER_CPU_SYMBOL_GPL(xt_info_locks);
> +
> +
> struct xt_table_info *
> xt_replace_table(struct xt_table *table,
> unsigned int num_counters,
> struct xt_table_info *newinfo,
> int *error)
> {
> - struct xt_table_info *oldinfo, *private;
> + struct xt_table_info *private;
>
> /* Do the substitution. */
> - mutex_lock(&table->lock);
> + local_bh_disable();
> private = table->private;
> +
> /* Check inside lock: is the old number correct? */
> if (num_counters != private->number) {
> duprintf("num_counters != table->private->number (%u/%u)\n",
> num_counters, private->number);
> - mutex_unlock(&table->lock);
> + local_bh_enable();
> *error = -EAGAIN;
> return NULL;
> }
> - oldinfo = private;
> - rcu_assign_pointer(table->private, newinfo);
> - newinfo->initial_entries = oldinfo->initial_entries;
> - mutex_unlock(&table->lock);
>
Another stray blank line...
Mathieu
> - synchronize_net();
> - return oldinfo;
> + table->private = newinfo;
> + newinfo->initial_entries = private->initial_entries;
> +
> + /*
> + * Even though table entries have now been swapped, other CPU's
> + * may still be using the old entries. This is okay, because
> + * resynchronization happens because of the locking done
> + * during the get_counters() routine.
> + */
> + local_bh_enable();
> +
> + return private;
> }
> EXPORT_SYMBOL_GPL(xt_replace_table);
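A side note on xt_replace_table() for readers trying to follow the new
scheme: the "resynchronization happens because of the locking done during
the get_counters() routine" comment above is the crux of it. Roughly (a
sketch of the combined replace + counters path, not the literal code — the
real get_counters() skips the write lock for the current cpu):

	local_bh_disable();
	old = table->private;
	table->private = newinfo;	/* readers from now on see newinfo */
	local_bh_enable();

	/* get_counters(old, ...) then sweeps every cpu: */
	for_each_possible_cpu(cpu) {
		xt_info_wrlock(cpu);	/* waits for cpu's readers to finish */
		/* ... sum old->entries[cpu] into the counters array ... */
		xt_info_wrunlock(cpu);
	}
	/* no cpu can still reference 'old': safe to free it */

Any reader that entered before the pointer swap still holds its per-cpu
read lock, so the write lock on that cpu cannot be taken until it exits;
any reader entering afterwards sees newinfo. Once the sweep completes,
freeing the old table is safe.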
>
> @@ -734,7 +731,6 @@ struct xt_table *xt_register_table(struct net *net, struct xt_table *table,
>
> /* Simplifies replace_table code. */
> table->private = bootstrap;
> - mutex_init(&table->lock);
>
> if (!xt_replace_table(table, 0, newinfo, &ret))
> goto unlock;
> @@ -1147,7 +1143,11 @@ static struct pernet_operations xt_net_ops = {
>
> static int __init xt_init(void)
> {
> - int i, rv;
> + unsigned int i;
> + int rv;
> +
> + for_each_possible_cpu(i)
> + rwlock_init(&per_cpu(xt_info_locks, i));
>
> xt = kmalloc(sizeof(struct xt_af) * NFPROTO_NUMPROTO, GFP_KERNEL);
> if (!xt)
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68