netfilter-devel.vger.kernel.org archive mirror
* [PATCH 00/14] ipset patches for nf-next, v2
@ 2014-11-30 18:56 Jozsef Kadlecsik
  2014-11-30 18:56 ` [PATCH 01/14] netfilter: ipset: Support updating extensions when the set is full Jozsef Kadlecsik
                   ` (13 more replies)
  0 siblings, 14 replies; 32+ messages in thread
From: Jozsef Kadlecsik @ 2014-11-30 18:56 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Pablo Neira Ayuso

Hi Pablo,

Please consider applying the next series of ipset patches:

- Fix sparse warning "cast to restricted __be32"
- Fix parallel resizing and listing of the same set: when elements were
  added to a set while it was being listed in parallel, the listing could
  start on the original set (before resizing) and continue with the new one.
- Fix style issues reported by checkpatch.pl
- Introduce RCU locking in the hash types. The patch was
  performance tested by Jesper Dangaard Brouer:

  Generator: sending 12.2Mpps (tx:12264083 pps)
  Drop performance in "raw" with ipset: 8Mpps
  Drop performance in "raw" with ipset with RCU-locking: 11.3Mpps
- Introduce RCU locking in the list type
- Introduce RCU locking in the bitmap types
- Introduce RCU locking instead of rwlock per set in the core
- Remove rbtree from hash:net,iface for RCU locking
- Explicitly add padding elements to hash:net,net and hash:net,port,net,
  because the elements must be u32 sized for the used hash function.
- Allocate the proper size of memory when /0 networks are supported
- Simplify cidr handling for hash:*net* types (cleaning it up for RCU)
- Indicate explicitly when /0 networks are supported
- Alignment problem between 64-bit kernel and 32-bit userspace, reported
  by Sven-Haegar Koch, fixed by introducing a new set match revision
- Support updating element extensions when the set is full (fixes
  netfilter bugzilla id 880)

You can pull the changes from

        git://blackhole.kfki.hu/nf-next master

The iptables part of the new set match functionality
can be found in the iptables git tree, in the ipset
branch.

Thanks,
Jozsef
============================================================================
The following changes since commit beacd3e8ef237e077c8707395440813feef16d3f:

  netfilter: nfnetlink_log: Make use of pr_fmt where applicable (2014-11-20 14:09:01 +0100)

are available in the git repository at:

  git://blackhole.kfki.hu/nf-next master

for you to fetch changes up to 482ebc4fe2be7de3b20cff357d76fde1738e4531:

  netfilter: ipset: Fix sparse warning (2014-11-30 19:49:49 +0100)

----------------------------------------------------------------
Jozsef Kadlecsik (14):
      netfilter: ipset: Support updating extensions when the set is full
      netfilter: ipset: Alignment problem between 64bit kernel 32bit userspace
      netfilter: ipset: Indicate when /0 networks are supported
      netfilter: ipset: Simplify cidr handling for hash:*net* types
      netfilter: ipset: Allocate the proper size of memory when /0 networks are supported
      netfilter: ipset: Explicitly add padding elements to hash:net,net and hash:net,port,net
      netfilter: ipset: Remove rbtree from hash:net,iface in order to run under RCU
      netfilter: ipset: Introduce RCU locking instead of rwlock per set in the core
      netfilter: ipset: Introduce RCU locking in the bitmap types
      netfilter: ipset: Introduce RCU locking in the list type
      netfilter: ipset: Introduce RCU locking in the hash types
      netfilter: ipset: styles warned by checkpatch.pl fixed
      netfilter: ipset: Fix parallel resizing and listing of the same set
      netfilter: ipset: Fix sparse warning

 include/linux/netfilter/ipset/ip_set.h         |  25 +-
 include/linux/netfilter/ipset/ip_set_timeout.h |  27 +-
 include/uapi/linux/netfilter/ipset/ip_set.h    |   8 +-
 include/uapi/linux/netfilter/xt_set.h          |  13 +-
 net/netfilter/ipset/ip_set_bitmap_gen.h        |   4 +-
 net/netfilter/ipset/ip_set_core.c              |  72 +--
 net/netfilter/ipset/ip_set_hash_gen.h          | 669 +++++++++++++++----------
 net/netfilter/ipset/ip_set_hash_ipmark.c       |   4 +-
 net/netfilter/ipset/ip_set_hash_ipportnet.c    |   2 +
 net/netfilter/ipset/ip_set_hash_net.c          |   4 +-
 net/netfilter/ipset/ip_set_hash_netiface.c     | 159 +-----
 net/netfilter/ipset/ip_set_hash_netnet.c       |   4 +
 net/netfilter/ipset/ip_set_hash_netport.c      |   2 +
 net/netfilter/ipset/ip_set_hash_netportnet.c   |   4 +
 net/netfilter/ipset/ip_set_list_set.c          | 386 +++++++-------
 net/netfilter/xt_set.c                         |  80 ++-
 net/sched/em_ipset.c                           |   5 +-
 17 files changed, 791 insertions(+), 677 deletions(-)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 01/14] netfilter: ipset: Support updating extensions when the set is full
  2014-11-30 18:56 [PATCH 00/14] ipset patches for nf-next, v2 Jozsef Kadlecsik
@ 2014-11-30 18:56 ` Jozsef Kadlecsik
  2014-12-02 18:46   ` Pablo Neira Ayuso
  2014-11-30 18:56 ` [PATCH 02/14] netfilter: ipset: Alignment problem between 64bit kernel 32bit userspace Jozsef Kadlecsik
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Jozsef Kadlecsik @ 2014-11-30 18:56 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Pablo Neira Ayuso

When the set was full (hash type and maxelem reached), it was not
possible to update the extension part of already existing elements.
The patch removes this limitation. (Fixes netfilter bugzilla id 880.)

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 net/netfilter/ipset/ip_set_hash_gen.h | 40 +++++++++++++++--------------------
 1 file changed, 17 insertions(+), 23 deletions(-)

diff --git a/net/netfilter/ipset/ip_set_hash_gen.h b/net/netfilter/ipset/ip_set_hash_gen.h
index fee7c64e..a12ee04 100644
--- a/net/netfilter/ipset/ip_set_hash_gen.h
+++ b/net/netfilter/ipset/ip_set_hash_gen.h
@@ -633,29 +633,6 @@ mtype_add(struct ip_set *set, void *value, const struct ip_set_ext *ext,
 	bool flag_exist = flags & IPSET_FLAG_EXIST;
 	u32 key, multi = 0;
 
-	if (h->elements >= h->maxelem && SET_WITH_FORCEADD(set)) {
-		rcu_read_lock_bh();
-		t = rcu_dereference_bh(h->table);
-		key = HKEY(value, h->initval, t->htable_bits);
-		n = hbucket(t,key);
-		if (n->pos) {
-			/* Choosing the first entry in the array to replace */
-			j = 0;
-			goto reuse_slot;
-		}
-		rcu_read_unlock_bh();
-	}
-	if (SET_WITH_TIMEOUT(set) && h->elements >= h->maxelem)
-		/* FIXME: when set is full, we slow down here */
-		mtype_expire(set, h, NLEN(set->family), set->dsize);
-
-	if (h->elements >= h->maxelem) {
-		if (net_ratelimit())
-			pr_warn("Set %s is full, maxelem %u reached\n",
-				set->name, h->maxelem);
-		return -IPSET_ERR_HASH_FULL;
-	}
-
 	rcu_read_lock_bh();
 	t = rcu_dereference_bh(h->table);
 	key = HKEY(value, h->initval, t->htable_bits);
@@ -680,6 +657,23 @@ mtype_add(struct ip_set *set, void *value, const struct ip_set_ext *ext,
 		    j != AHASH_MAX(h) + 1)
 			j = i;
 	}
+	if (h->elements >= h->maxelem && SET_WITH_FORCEADD(set) && n->pos) {
+		/* Choosing the first entry in the array to replace */
+		j = 0;
+		goto reuse_slot;
+	}
+	if (SET_WITH_TIMEOUT(set) && h->elements >= h->maxelem)
+		/* FIXME: when set is full, we slow down here */
+		mtype_expire(set, h, NLEN(set->family), set->dsize);
+
+	if (h->elements >= h->maxelem) {
+		if (net_ratelimit())
+			pr_warn("Set %s is full, maxelem %u reached\n",
+				set->name, h->maxelem);
+		ret = -IPSET_ERR_HASH_FULL;
+		goto out;
+	}
+
 reuse_slot:
 	if (j != AHASH_MAX(h) + 1) {
 		/* Fill out reused slot */
-- 
1.8.5.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 02/14] netfilter: ipset: Alignment problem between 64bit kernel 32bit userspace
  2014-11-30 18:56 [PATCH 00/14] ipset patches for nf-next, v2 Jozsef Kadlecsik
  2014-11-30 18:56 ` [PATCH 01/14] netfilter: ipset: Support updating extensions when the set is full Jozsef Kadlecsik
@ 2014-11-30 18:56 ` Jozsef Kadlecsik
  2014-11-30 18:56 ` [PATCH 03/14] netfilter: ipset: Indicate when /0 networks are supported Jozsef Kadlecsik
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Jozsef Kadlecsik @ 2014-11-30 18:56 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Pablo Neira Ayuso

Sven-Haegar Koch reported the issue:

sims:~# iptables -A OUTPUT -m set --match-set testset src -j ACCEPT
iptables: Invalid argument. Run `dmesg' for more information.

In syslog:
x_tables: ip_tables: set.3 match: invalid size 48 (kernel) != (user) 32

which was introduced by the counter extension in ipset.

The patch fixes the alignment issue by introducing a new set match
revision with the fixed underlying 'struct ip_set_counter_match'
structure.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 include/uapi/linux/netfilter/ipset/ip_set.h |  8 +++-
 include/uapi/linux/netfilter/xt_set.h       | 13 ++++-
 net/netfilter/xt_set.c                      | 73 +++++++++++++++++++++++++++--
 3 files changed, 88 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/netfilter/ipset/ip_set.h b/include/uapi/linux/netfilter/ipset/ip_set.h
index ca03119..5ab4e60 100644
--- a/include/uapi/linux/netfilter/ipset/ip_set.h
+++ b/include/uapi/linux/netfilter/ipset/ip_set.h
@@ -256,11 +256,17 @@ enum {
 	IPSET_COUNTER_GT,
 };
 
-struct ip_set_counter_match {
+/* Backward compatibility for set match v3 */
+struct ip_set_counter_match0 {
 	__u8 op;
 	__u64 value;
 };
 
+struct ip_set_counter_match {
+	__aligned_u64 value;
+	__u8 op;
+};
+
 /* Interface to iptables/ip6tables */
 
 #define SO_IP_SET		83
diff --git a/include/uapi/linux/netfilter/xt_set.h b/include/uapi/linux/netfilter/xt_set.h
index d6a1df1..d4e0234 100644
--- a/include/uapi/linux/netfilter/xt_set.h
+++ b/include/uapi/linux/netfilter/xt_set.h
@@ -66,8 +66,8 @@ struct xt_set_info_target_v2 {
 
 struct xt_set_info_match_v3 {
 	struct xt_set_info match_set;
-	struct ip_set_counter_match packets;
-	struct ip_set_counter_match bytes;
+	struct ip_set_counter_match0 packets;
+	struct ip_set_counter_match0 bytes;
 	__u32 flags;
 };
 
@@ -81,4 +81,13 @@ struct xt_set_info_target_v3 {
 	__u32 timeout;
 };
 
+/* Revision 4 match */
+
+struct xt_set_info_match_v4 {
+	struct xt_set_info match_set;
+	struct ip_set_counter_match packets;
+	struct ip_set_counter_match bytes;
+	__u32 flags;
+};
+
 #endif /*_XT_SET_H*/
diff --git a/net/netfilter/xt_set.c b/net/netfilter/xt_set.c
index 5732cd6..0d47afe 100644
--- a/net/netfilter/xt_set.c
+++ b/net/netfilter/xt_set.c
@@ -157,7 +157,7 @@ set_match_v1_destroy(const struct xt_mtdtor_param *par)
 /* Revision 3 match */
 
 static bool
-match_counter(u64 counter, const struct ip_set_counter_match *info)
+match_counter0(u64 counter, const struct ip_set_counter_match0 *info)
 {
 	switch (info->op) {
 	case IPSET_COUNTER_NONE:
@@ -192,14 +192,60 @@ set_match_v3(const struct sk_buff *skb, struct xt_action_param *par)
 	if (!(ret && opt.cmdflags & IPSET_FLAG_MATCH_COUNTERS))
 		return ret;
 
-	if (!match_counter(opt.ext.packets, &info->packets))
+	if (!match_counter0(opt.ext.packets, &info->packets))
 		return 0;
-	return match_counter(opt.ext.bytes, &info->bytes);
+	return match_counter0(opt.ext.bytes, &info->bytes);
 }
 
 #define set_match_v3_checkentry	set_match_v1_checkentry
 #define set_match_v3_destroy	set_match_v1_destroy
 
+/* Revision 4 match */
+
+static bool
+match_counter(u64 counter, const struct ip_set_counter_match *info)
+{
+	switch (info->op) {
+	case IPSET_COUNTER_NONE:
+		return true;
+	case IPSET_COUNTER_EQ:
+		return counter == info->value;
+	case IPSET_COUNTER_NE:
+		return counter != info->value;
+	case IPSET_COUNTER_LT:
+		return counter < info->value;
+	case IPSET_COUNTER_GT:
+		return counter > info->value;
+	}
+	return false;
+}
+
+static bool
+set_match_v4(const struct sk_buff *skb, struct xt_action_param *par)
+{
+	const struct xt_set_info_match_v4 *info = par->matchinfo;
+	ADT_OPT(opt, par->family, info->match_set.dim,
+		info->match_set.flags, info->flags, UINT_MAX);
+	int ret;
+
+	if (info->packets.op != IPSET_COUNTER_NONE ||
+	    info->bytes.op != IPSET_COUNTER_NONE)
+		opt.cmdflags |= IPSET_FLAG_MATCH_COUNTERS;
+
+	ret = match_set(info->match_set.index, skb, par, &opt,
+			info->match_set.flags & IPSET_INV_MATCH);
+
+	if (!(ret && opt.cmdflags & IPSET_FLAG_MATCH_COUNTERS))
+		return ret;
+
+	if (!match_counter(opt.ext.packets, &info->packets))
+		return 0;
+	return match_counter(opt.ext.bytes, &info->bytes);
+}
+
+#define set_match_v4_checkentry	set_match_v1_checkentry
+#define set_match_v4_destroy	set_match_v1_destroy
+
 /* Revision 0 interface: backward compatible with netfilter/iptables */
 
 static unsigned int
@@ -573,6 +619,27 @@ static struct xt_match set_matches[] __read_mostly = {
 		.destroy	= set_match_v3_destroy,
 		.me		= THIS_MODULE
 	},
+	/* new revision for counters support: update, match */
+	{
+		.name		= "set",
+		.family		= NFPROTO_IPV4,
+		.revision	= 4,
+		.match		= set_match_v4,
+		.matchsize	= sizeof(struct xt_set_info_match_v4),
+		.checkentry	= set_match_v4_checkentry,
+		.destroy	= set_match_v4_destroy,
+		.me		= THIS_MODULE
+	},
+	{
+		.name		= "set",
+		.family		= NFPROTO_IPV6,
+		.revision	= 4,
+		.match		= set_match_v4,
+		.matchsize	= sizeof(struct xt_set_info_match_v4),
+		.checkentry	= set_match_v4_checkentry,
+		.destroy	= set_match_v4_destroy,
+		.me		= THIS_MODULE
+	},
 };
 
 static struct xt_target set_targets[] __read_mostly = {
-- 
1.8.5.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 03/14] netfilter: ipset: Indicate when /0 networks are supported
  2014-11-30 18:56 [PATCH 00/14] ipset patches for nf-next, v2 Jozsef Kadlecsik
  2014-11-30 18:56 ` [PATCH 01/14] netfilter: ipset: Support updating extensions when the set is full Jozsef Kadlecsik
  2014-11-30 18:56 ` [PATCH 02/14] netfilter: ipset: Alignment problem between 64bit kernel 32bit userspace Jozsef Kadlecsik
@ 2014-11-30 18:56 ` Jozsef Kadlecsik
  2014-11-30 18:56 ` [PATCH 04/14] netfilter: ipset: Simplify cidr handling for hash:*net* types Jozsef Kadlecsik
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Jozsef Kadlecsik @ 2014-11-30 18:56 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Pablo Neira Ayuso

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 net/netfilter/ipset/ip_set_hash_gen.h      | 2 +-
 net/netfilter/ipset/ip_set_hash_netiface.c | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/ipset/ip_set_hash_gen.h b/net/netfilter/ipset/ip_set_hash_gen.h
index a12ee04..9428fa5 100644
--- a/net/netfilter/ipset/ip_set_hash_gen.h
+++ b/net/netfilter/ipset/ip_set_hash_gen.h
@@ -156,7 +156,7 @@ hbucket_elem_add(struct hbucket *n, u8 ahash_max, size_t dsize)
 
 #define SET_HOST_MASK(family)	(family == AF_INET ? 32 : 128)
 
-#ifdef IP_SET_HASH_WITH_MULTI
+#ifdef IP_SET_HASH_WITH_NET0
 #define NLEN(family)		(SET_HOST_MASK(family) + 1)
 #else
 #define NLEN(family)		SET_HOST_MASK(family)
diff --git a/net/netfilter/ipset/ip_set_hash_netiface.c b/net/netfilter/ipset/ip_set_hash_netiface.c
index 35dd358..758b002 100644
--- a/net/netfilter/ipset/ip_set_hash_netiface.c
+++ b/net/netfilter/ipset/ip_set_hash_netiface.c
@@ -115,6 +115,7 @@ iface_add(struct rb_root *root, const char **iface)
 #define IP_SET_HASH_WITH_NETS
 #define IP_SET_HASH_WITH_RBTREE
 #define IP_SET_HASH_WITH_MULTI
+#define IP_SET_HASH_WITH_NET0
 
 #define STREQ(a, b)	(strcmp(a, b) == 0)
 
-- 
1.8.5.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 04/14] netfilter: ipset: Simplify cidr handling for hash:*net* types
  2014-11-30 18:56 [PATCH 00/14] ipset patches for nf-next, v2 Jozsef Kadlecsik
                   ` (2 preceding siblings ...)
  2014-11-30 18:56 ` [PATCH 03/14] netfilter: ipset: Indicate when /0 networks are supported Jozsef Kadlecsik
@ 2014-11-30 18:56 ` Jozsef Kadlecsik
  2014-11-30 18:56 ` [PATCH 05/14] netfilter: ipset: Allocate the proper size of memory when /0 networks are supported Jozsef Kadlecsik
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Jozsef Kadlecsik @ 2014-11-30 18:56 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Pablo Neira Ayuso

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 net/netfilter/ipset/ip_set_hash_gen.h | 56 +++++++++++++++++------------------
 1 file changed, 28 insertions(+), 28 deletions(-)

diff --git a/net/netfilter/ipset/ip_set_hash_gen.h b/net/netfilter/ipset/ip_set_hash_gen.h
index 9428fa5..8ef9135 100644
--- a/net/netfilter/ipset/ip_set_hash_gen.h
+++ b/net/netfilter/ipset/ip_set_hash_gen.h
@@ -147,11 +147,17 @@ hbucket_elem_add(struct hbucket *n, u8 ahash_max, size_t dsize)
 #else
 #define __CIDR(cidr, i)		(cidr)
 #endif
+
+/* cidr + 1 is stored in net_prefixes to support /0 */
+#define SCIDR(cidr, i)		(__CIDR(cidr, i) + 1)
+
 #ifdef IP_SET_HASH_WITH_NETS_PACKED
-/* When cidr is packed with nomatch, cidr - 1 is stored in the entry */
-#define CIDR(cidr, i)		(__CIDR(cidr, i) + 1)
+/* When cidr is packed with nomatch, cidr - 1 is stored in the data entry */
+#define GCIDR(cidr, i)		(__CIDR(cidr, i) + 1)
+#define NCIDR(cidr)		(cidr)
 #else
-#define CIDR(cidr, i)		(__CIDR(cidr, i))
+#define GCIDR(cidr, i)		(__CIDR(cidr, i))
+#define NCIDR(cidr)		(cidr - 1)
 #endif
 
 #define SET_HOST_MASK(family)	(family == AF_INET ? 32 : 128)
@@ -292,24 +298,22 @@ mtype_add_cidr(struct htype *h, u8 cidr, u8 nets_length, u8 n)
 	int i, j;
 
 	/* Add in increasing prefix order, so larger cidr first */
-	for (i = 0, j = -1; i < nets_length && h->nets[i].nets[n]; i++) {
+	for (i = 0, j = -1; i < nets_length && h->nets[i].cidr[n]; i++) {
 		if (j != -1)
 			continue;
 		else if (h->nets[i].cidr[n] < cidr)
 			j = i;
 		else if (h->nets[i].cidr[n] == cidr) {
-			h->nets[i].nets[n]++;
+			h->nets[cidr - 1].nets[n]++;
 			return;
 		}
 	}
 	if (j != -1) {
-		for (; i > j; i--) {
+		for (; i > j; i--)
 			h->nets[i].cidr[n] = h->nets[i - 1].cidr[n];
-			h->nets[i].nets[n] = h->nets[i - 1].nets[n];
-		}
 	}
 	h->nets[i].cidr[n] = cidr;
-	h->nets[i].nets[n] = 1;
+	h->nets[cidr - 1].nets[n] = 1;
 }
 
 static void
@@ -320,16 +324,12 @@ mtype_del_cidr(struct htype *h, u8 cidr, u8 nets_length, u8 n)
 	for (i = 0; i < nets_length; i++) {
 	        if (h->nets[i].cidr[n] != cidr)
 	                continue;
-                if (h->nets[i].nets[n] > 1 || i == net_end ||
-                    h->nets[i + 1].nets[n] == 0) {
-                        h->nets[i].nets[n]--;
+		h->nets[cidr -1].nets[n]--;
+		if (h->nets[cidr -1].nets[n] > 0)
                         return;
-                }
-                for (j = i; j < net_end && h->nets[j].nets[n]; j++) {
+		for (j = i; j < net_end && h->nets[j].cidr[n]; j++)
 		        h->nets[j].cidr[n] = h->nets[j + 1].cidr[n];
-		        h->nets[j].nets[n] = h->nets[j + 1].nets[n];
-                }
-                h->nets[j].nets[n] = 0;
+		h->nets[j].cidr[n] = 0;
                 return;
 	}
 }
@@ -486,7 +486,7 @@ mtype_expire(struct ip_set *set, struct htype *h, u8 nets_length, size_t dsize)
 				pr_debug("expired %u/%u\n", i, j);
 #ifdef IP_SET_HASH_WITH_NETS
 				for (k = 0; k < IPSET_NET_COUNT; k++)
-					mtype_del_cidr(h, CIDR(data->cidr, k),
+					mtype_del_cidr(h, SCIDR(data->cidr, k),
 						       nets_length, k);
 #endif
 				ip_set_ext_destroy(set, data);
@@ -680,9 +680,9 @@ reuse_slot:
 		data = ahash_data(n, j, set->dsize);
 #ifdef IP_SET_HASH_WITH_NETS
 		for (i = 0; i < IPSET_NET_COUNT; i++) {
-			mtype_del_cidr(h, CIDR(data->cidr, i),
+			mtype_del_cidr(h, SCIDR(data->cidr, i),
 				       NLEN(set->family), i);
-			mtype_add_cidr(h, CIDR(d->cidr, i),
+			mtype_add_cidr(h, SCIDR(d->cidr, i),
 				       NLEN(set->family), i);
 		}
 #endif
@@ -699,7 +699,7 @@ reuse_slot:
 		data = ahash_data(n, n->pos++, set->dsize);
 #ifdef IP_SET_HASH_WITH_NETS
 		for (i = 0; i < IPSET_NET_COUNT; i++)
-			mtype_add_cidr(h, CIDR(d->cidr, i), NLEN(set->family),
+			mtype_add_cidr(h, SCIDR(d->cidr, i), NLEN(set->family),
 				       i);
 #endif
 		h->elements++;
@@ -760,7 +760,7 @@ mtype_del(struct ip_set *set, void *value, const struct ip_set_ext *ext,
 		h->elements--;
 #ifdef IP_SET_HASH_WITH_NETS
 		for (j = 0; j < IPSET_NET_COUNT; j++)
-			mtype_del_cidr(h, CIDR(d->cidr, j), NLEN(set->family),
+			mtype_del_cidr(h, SCIDR(d->cidr, j), NLEN(set->family),
 				       j);
 #endif
 		ip_set_ext_destroy(set, data);
@@ -821,15 +821,15 @@ mtype_test_cidrs(struct ip_set *set, struct mtype_elem *d,
 	u8 nets_length = NLEN(set->family);
 
 	pr_debug("test by nets\n");
-	for (; j < nets_length && h->nets[j].nets[0] && !multi; j++) {
+	for (; j < nets_length && h->nets[j].cidr[0] && !multi; j++) {
 #if IPSET_NET_COUNT == 2
 		mtype_data_reset_elem(d, &orig);
-		mtype_data_netmask(d, h->nets[j].cidr[0], false);
-		for (k = 0; k < nets_length && h->nets[k].nets[1] && !multi;
+		mtype_data_netmask(d, NCIDR(h->nets[j].cidr[0]), false);
+		for (k = 0; k < nets_length && h->nets[k].cidr[1] && !multi;
 		     k++) {
-			mtype_data_netmask(d, h->nets[k].cidr[1], true);
+			mtype_data_netmask(d, NCIDR(h->nets[k].cidr[1]), true);
 #else
-		mtype_data_netmask(d, h->nets[j].cidr[0]);
+		mtype_data_netmask(d, NCIDR(h->nets[j].cidr[0]));
 #endif
 		key = HKEY(d, h->initval, t->htable_bits);
 		n = hbucket(t, key);
@@ -877,7 +877,7 @@ mtype_test(struct ip_set *set, void *value, const struct ip_set_ext *ext,
 	/* If we test an IP address and not a network address,
 	 * try all possible network sizes */
 	for (i = 0; i < IPSET_NET_COUNT; i++)
-		if (CIDR(d->cidr, i) != SET_HOST_MASK(set->family))
+		if (GCIDR(d->cidr, i) != SET_HOST_MASK(set->family))
 			break;
 	if (i == IPSET_NET_COUNT) {
 		ret = mtype_test_cidrs(set, d, ext, mext, flags);
-- 
1.8.5.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 05/14] netfilter: ipset: Allocate the proper size of memory when /0 networks are supported
  2014-11-30 18:56 [PATCH 00/14] ipset patches for nf-next, v2 Jozsef Kadlecsik
                   ` (3 preceding siblings ...)
  2014-11-30 18:56 ` [PATCH 04/14] netfilter: ipset: Simplify cidr handling for hash:*net* types Jozsef Kadlecsik
@ 2014-11-30 18:56 ` Jozsef Kadlecsik
  2014-11-30 18:56 ` [PATCH 06/14] netfilter: ipset: Explicitly add padding elements to hash:net,net and hash:net,port,net Jozsef Kadlecsik
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Jozsef Kadlecsik @ 2014-11-30 18:56 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Pablo Neira Ayuso

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 net/netfilter/ipset/ip_set_hash_gen.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/netfilter/ipset/ip_set_hash_gen.h b/net/netfilter/ipset/ip_set_hash_gen.h
index 8ef9135..974ff38 100644
--- a/net/netfilter/ipset/ip_set_hash_gen.h
+++ b/net/netfilter/ipset/ip_set_hash_gen.h
@@ -1101,8 +1101,7 @@ IPSET_TOKEN(HTYPE, _create)(struct net *net, struct ip_set *set,
 
 	hsize = sizeof(*h);
 #ifdef IP_SET_HASH_WITH_NETS
-	hsize += sizeof(struct net_prefixes) *
-		(set->family == NFPROTO_IPV4 ? 32 : 128);
+	hsize += sizeof(struct net_prefixes) * NLEN(set->family);
 #endif
 	h = kzalloc(hsize, GFP_KERNEL);
 	if (!h)
-- 
1.8.5.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 06/14] netfilter: ipset: Explicitly add padding elements to hash:net,net and hash:net,port,net
  2014-11-30 18:56 [PATCH 00/14] ipset patches for nf-next, v2 Jozsef Kadlecsik
                   ` (4 preceding siblings ...)
  2014-11-30 18:56 ` [PATCH 05/14] netfilter: ipset: Allocate the proper size of memory when /0 networks are supported Jozsef Kadlecsik
@ 2014-11-30 18:56 ` Jozsef Kadlecsik
  2014-11-30 18:56 ` [PATCH 07/14] netfilter: ipset: Remove rbtree from hash:net,iface in order to run under RCU Jozsef Kadlecsik
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Jozsef Kadlecsik @ 2014-11-30 18:56 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Pablo Neira Ayuso

The elements must be u32 sized for the used hash function.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 net/netfilter/ipset/ip_set_hash_netnet.c     | 2 ++
 net/netfilter/ipset/ip_set_hash_netportnet.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/net/netfilter/ipset/ip_set_hash_netnet.c b/net/netfilter/ipset/ip_set_hash_netnet.c
index da00284..ea8772a 100644
--- a/net/netfilter/ipset/ip_set_hash_netnet.c
+++ b/net/netfilter/ipset/ip_set_hash_netnet.c
@@ -46,6 +46,7 @@ struct hash_netnet4_elem {
 		__be64 ipcmp;
 	};
 	u8 nomatch;
+	u8 padding;
 	union {
 		u8 cidr[2];
 		u16 ccmp;
@@ -271,6 +272,7 @@ hash_netnet4_uadt(struct ip_set *set, struct nlattr *tb[],
 struct hash_netnet6_elem {
 	union nf_inet_addr ip[2];
 	u8 nomatch;
+	u8 padding;
 	union {
 		u8 cidr[2];
 		u16 ccmp;
diff --git a/net/netfilter/ipset/ip_set_hash_netportnet.c b/net/netfilter/ipset/ip_set_hash_netportnet.c
index b8053d6..bfaa94c 100644
--- a/net/netfilter/ipset/ip_set_hash_netportnet.c
+++ b/net/netfilter/ipset/ip_set_hash_netportnet.c
@@ -53,6 +53,7 @@ struct hash_netportnet4_elem {
 		u8 cidr[2];
 		u16 ccmp;
 	};
+	u16 padding;
 	u8 nomatch:1;
 	u8 proto;
 };
@@ -324,6 +325,7 @@ struct hash_netportnet6_elem {
 		u8 cidr[2];
 		u16 ccmp;
 	};
+	u16 padding;
 	u8 nomatch:1;
 	u8 proto;
 };
-- 
1.8.5.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 07/14] netfilter: ipset: Remove rbtree from hash:net,iface in order to run under RCU
  2014-11-30 18:56 [PATCH 00/14] ipset patches for nf-next, v2 Jozsef Kadlecsik
                   ` (5 preceding siblings ...)
  2014-11-30 18:56 ` [PATCH 06/14] netfilter: ipset: Explicitly add padding elements to hash:net,net and hash:net,port,net Jozsef Kadlecsik
@ 2014-11-30 18:56 ` Jozsef Kadlecsik
  2014-12-02 18:23   ` Pablo Neira Ayuso
  2014-11-30 18:56 ` [PATCH 08/14] netfilter: ipset: Introduce RCU locking instead of rwlock per set in the core Jozsef Kadlecsik
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Jozsef Kadlecsik @ 2014-11-30 18:56 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Pablo Neira Ayuso

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 net/netfilter/ipset/ip_set_hash_netiface.c | 156 ++++-------------------------
 1 file changed, 17 insertions(+), 139 deletions(-)

diff --git a/net/netfilter/ipset/ip_set_hash_netiface.c b/net/netfilter/ipset/ip_set_hash_netiface.c
index 758b002..d8a53ec 100644
--- a/net/netfilter/ipset/ip_set_hash_netiface.c
+++ b/net/netfilter/ipset/ip_set_hash_netiface.c
@@ -13,7 +13,6 @@
 #include <linux/skbuff.h>
 #include <linux/errno.h>
 #include <linux/random.h>
-#include <linux/rbtree.h>
 #include <net/ip.h>
 #include <net/ipv6.h>
 #include <net/netlink.h>
@@ -36,88 +35,14 @@ MODULE_AUTHOR("Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>");
 IP_SET_MODULE_DESC("hash:net,iface", IPSET_TYPE_REV_MIN, IPSET_TYPE_REV_MAX);
 MODULE_ALIAS("ip_set_hash:net,iface");
 
-/* Interface name rbtree */
-
-struct iface_node {
-	struct rb_node node;
-	char iface[IFNAMSIZ];
-};
-
-#define iface_data(n)	(rb_entry(n, struct iface_node, node)->iface)
-
-static void
-rbtree_destroy(struct rb_root *root)
-{
-	struct iface_node *node, *next;
-
-	rbtree_postorder_for_each_entry_safe(node, next, root, node)
-		kfree(node);
-
-	*root = RB_ROOT;
-}
-
-static int
-iface_test(struct rb_root *root, const char **iface)
-{
-	struct rb_node *n = root->rb_node;
-
-	while (n) {
-		const char *d = iface_data(n);
-		int res = strcmp(*iface, d);
-
-		if (res < 0)
-			n = n->rb_left;
-		else if (res > 0)
-			n = n->rb_right;
-		else {
-			*iface = d;
-			return 1;
-		}
-	}
-	return 0;
-}
-
-static int
-iface_add(struct rb_root *root, const char **iface)
-{
-	struct rb_node **n = &(root->rb_node), *p = NULL;
-	struct iface_node *d;
-
-	while (*n) {
-		char *ifname = iface_data(*n);
-		int res = strcmp(*iface, ifname);
-
-		p = *n;
-		if (res < 0)
-			n = &((*n)->rb_left);
-		else if (res > 0)
-			n = &((*n)->rb_right);
-		else {
-			*iface = ifname;
-			return 0;
-		}
-	}
-
-	d = kzalloc(sizeof(*d), GFP_ATOMIC);
-	if (!d)
-		return -ENOMEM;
-	strcpy(d->iface, *iface);
-
-	rb_link_node(&d->node, p, n);
-	rb_insert_color(&d->node, root);
-
-	*iface = d->iface;
-	return 0;
-}
-
 /* Type specific function prefix */
 #define HTYPE		hash_netiface
 #define IP_SET_HASH_WITH_NETS
-#define IP_SET_HASH_WITH_RBTREE
 #define IP_SET_HASH_WITH_MULTI
 #define IP_SET_HASH_WITH_NET0
 
 #define STREQ(a, b)	(strcmp(a, b) == 0)
+#define IFNAMCPY(a, b)	strncpy(a, b, IFNAMSIZ)
 
 /* IPv4 variant */
 
@@ -136,7 +61,7 @@ struct hash_netiface4_elem {
 	u8 cidr;
 	u8 nomatch;
 	u8 elem;
-	const char *iface;
+	char iface[IFNAMSIZ];
 };
 
 /* Common functions */
@@ -150,7 +75,7 @@ hash_netiface4_data_equal(const struct hash_netiface4_elem *ip1,
 	       ip1->cidr == ip2->cidr &&
 	       (++*multi) &&
 	       ip1->physdev == ip2->physdev &&
-	       ip1->iface == ip2->iface;
+	       STREQ(ip1->iface, ip2->iface);
 }
 
 static inline int
@@ -223,7 +148,6 @@ hash_netiface4_kadt(struct ip_set *set, const struct sk_buff *skb,
 		.elem = 1,
 	};
 	struct ip_set_ext ext = IP_SET_INIT_KEXT(skb, opt, set);
-	int ret;
 
 	if (e.cidr == 0)
 		return -EINVAL;
@@ -233,8 +157,8 @@ hash_netiface4_kadt(struct ip_set *set, const struct sk_buff *skb,
 	ip4addrptr(skb, opt->flags & IPSET_DIM_ONE_SRC, &e.ip);
 	e.ip &= ip_set_netmask(e.cidr);
 
-#define IFACE(dir)	(par->dir ? par->dir->name : NULL)
-#define PHYSDEV(dir)	(nf_bridge->dir ? nf_bridge->dir->name : NULL)
+#define IFACE(dir)	(par->dir ? par->dir->name : "")
+#define PHYSDEV(dir)	(nf_bridge->dir ? nf_bridge->dir->name : "")
 #define SRCDIR		(opt->flags & IPSET_DIM_TWO_SRC)
 
 	if (opt->cmdflags & IPSET_FLAG_PHYSDEV) {
@@ -243,26 +167,15 @@ hash_netiface4_kadt(struct ip_set *set, const struct sk_buff *skb,
 
 		if (!nf_bridge)
 			return -EINVAL;
-		e.iface = SRCDIR ? PHYSDEV(physindev) : PHYSDEV(physoutdev);
+		IFNAMCPY(e.iface,
+			 SRCDIR ? PHYSDEV(physindev) : PHYSDEV(physoutdev));
 		e.physdev = 1;
-#else
-		e.iface = NULL;
 #endif
 	} else
-		e.iface = SRCDIR ? IFACE(in) : IFACE(out);
+		IFNAMCPY(e.iface, SRCDIR ? IFACE(in) : IFACE(out));
 
-	if (!e.iface)
+	if (strlen(e.iface) == 0)
 		return -EINVAL;
-	ret = iface_test(&h->rbtree, &e.iface);
-	if (adt == IPSET_ADD) {
-		if (!ret) {
-			ret = iface_add(&h->rbtree, &e.iface);
-			if (ret)
-				return ret;
-		}
-	} else if (!ret)
-		return ret;
-
 	return adtfn(set, &e, &ext, &opt->ext, opt->cmdflags);
 }
 
@@ -275,7 +188,6 @@ hash_netiface4_uadt(struct ip_set *set, struct nlattr *tb[],
 	struct hash_netiface4_elem e = { .cidr = HOST_MASK, .elem = 1 };
 	struct ip_set_ext ext = IP_SET_INIT_UEXT(set);
 	u32 ip = 0, ip_to = 0, last;
-	char iface[IFNAMSIZ];
 	int ret;
 
 	if (unlikely(!tb[IPSET_ATTR_IP] ||
@@ -302,18 +214,7 @@ hash_netiface4_uadt(struct ip_set *set, struct nlattr *tb[],
 		if (e.cidr > HOST_MASK)
 			return -IPSET_ERR_INVALID_CIDR;
 	}
-
-	strcpy(iface, nla_data(tb[IPSET_ATTR_IFACE]));
-	e.iface = iface;
-	ret = iface_test(&h->rbtree, &e.iface);
-	if (adt == IPSET_ADD) {
-		if (!ret) {
-			ret = iface_add(&h->rbtree, &e.iface);
-			if (ret)
-				return ret;
-		}
-	} else if (!ret)
-		return ret;
+	IFNAMCPY(e.iface, nla_data(tb[IPSET_ATTR_IFACE]));
 
 	if (tb[IPSET_ATTR_CADT_FLAGS]) {
 		u32 cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
@@ -372,7 +273,7 @@ struct hash_netiface6_elem {
 	u8 cidr;
 	u8 nomatch;
 	u8 elem;
-	const char *iface;
+	char iface[IFNAMSIZ];
 };
 
 /* Common functions */
@@ -386,7 +287,7 @@ hash_netiface6_data_equal(const struct hash_netiface6_elem *ip1,
 	       ip1->cidr == ip2->cidr &&
 	       (++*multi) &&
 	       ip1->physdev == ip2->physdev &&
-	       ip1->iface == ip2->iface;
+	       STREQ(ip1->iface, ip2->iface);
 }
 
 static inline int
@@ -464,7 +365,6 @@ hash_netiface6_kadt(struct ip_set *set, const struct sk_buff *skb,
 		.elem = 1,
 	};
 	struct ip_set_ext ext = IP_SET_INIT_KEXT(skb, opt, set);
-	int ret;
 
 	if (e.cidr == 0)
 		return -EINVAL;
@@ -480,25 +380,15 @@ hash_netiface6_kadt(struct ip_set *set, const struct sk_buff *skb,
 
 		if (!nf_bridge)
 			return -EINVAL;
-		e.iface = SRCDIR ? PHYSDEV(physindev) : PHYSDEV(physoutdev);
+		IFNAMCPY(e.iface,
+			 SRCDIR ? PHYSDEV(physindev) : PHYSDEV(physoutdev));
 		e.physdev = 1;
-#else
-		e.iface = NULL;
 #endif
 	} else
-		e.iface = SRCDIR ? IFACE(in) : IFACE(out);
+		IFNAMCPY(e.iface, SRCDIR ? IFACE(in) : IFACE(out));
 
-	if (!e.iface)
+	if (strlen(e.iface) == 0)
 		return -EINVAL;
-	ret = iface_test(&h->rbtree, &e.iface);
-	if (adt == IPSET_ADD) {
-		if (!ret) {
-			ret = iface_add(&h->rbtree, &e.iface);
-			if (ret)
-				return ret;
-		}
-	} else if (!ret)
-		return ret;
 
 	return adtfn(set, &e, &ext, &opt->ext, opt->cmdflags);
 }
@@ -507,11 +397,9 @@ static int
 hash_netiface6_uadt(struct ip_set *set, struct nlattr *tb[],
 		   enum ipset_adt adt, u32 *lineno, u32 flags, bool retried)
 {
-	struct hash_netiface *h = set->data;
 	ipset_adtfn adtfn = set->variant->adt[adt];
 	struct hash_netiface6_elem e = { .cidr = HOST_MASK, .elem = 1 };
 	struct ip_set_ext ext = IP_SET_INIT_UEXT(set);
-	char iface[IFNAMSIZ];
 	int ret;
 
 	if (unlikely(!tb[IPSET_ATTR_IP] ||
@@ -541,17 +429,7 @@ hash_netiface6_uadt(struct ip_set *set, struct nlattr *tb[],
 		return -IPSET_ERR_INVALID_CIDR;
 	ip6_netmask(&e.ip, e.cidr);
 
-	strcpy(iface, nla_data(tb[IPSET_ATTR_IFACE]));
-	e.iface = iface;
-	ret = iface_test(&h->rbtree, &e.iface);
-	if (adt == IPSET_ADD) {
-		if (!ret) {
-			ret = iface_add(&h->rbtree, &e.iface);
-			if (ret)
-				return ret;
-		}
-	} else if (!ret)
-		return ret;
+	IFNAMCPY(e.iface, nla_data(tb[IPSET_ATTR_IFACE]));
 
 	if (tb[IPSET_ATTR_CADT_FLAGS]) {
 		u32 cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
-- 
1.8.5.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 08/14] netfilter: ipset: Introduce RCU locking instead of rwlock per set in the core
  2014-11-30 18:56 [PATCH 00/10] ipset patches for nf-next, v2 Jozsef Kadlecsik
                   ` (6 preceding siblings ...)
  2014-11-30 18:56 ` [PATCH 07/14] netfilter: ipset: Remove rbtree from hash:net,iface in order to run under RCU Jozsef Kadlecsik
@ 2014-11-30 18:56 ` Jozsef Kadlecsik
  2014-12-02 18:25   ` Pablo Neira Ayuso
  2014-11-30 18:57 ` [PATCH 09/14] netfilter: ipset: Introduce RCU locking in the bitmap types Jozsef Kadlecsik
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Jozsef Kadlecsik @ 2014-11-30 18:56 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Pablo Neira Ayuso

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 include/linux/netfilter/ipset/ip_set.h         |  6 ++++-
 include/linux/netfilter/ipset/ip_set_timeout.h | 27 ++++++++------------
 net/netfilter/ipset/ip_set_core.c              | 35 +++++++++++++-------------
 3 files changed, 34 insertions(+), 34 deletions(-)

diff --git a/include/linux/netfilter/ipset/ip_set.h b/include/linux/netfilter/ipset/ip_set.h
index f1606fa..d5d5bcd 100644
--- a/include/linux/netfilter/ipset/ip_set.h
+++ b/include/linux/netfilter/ipset/ip_set.h
@@ -223,7 +223,7 @@ struct ip_set {
 	/* The name of the set */
 	char name[IPSET_MAXNAMELEN];
 	/* Lock protecting the set data */
-	rwlock_t lock;
+	spinlock_t lock;
 	/* References to the set */
 	u32 ref;
 	/* The core set type */
@@ -322,6 +322,10 @@ ip_set_update_counter(struct ip_set_counter *counter,
 	}
 }
 
+#define ip_set_rcu_deref(t)		\
+	rcu_dereference_index_check(t,	\
+		rcu_read_lock_held() || rcu_read_lock_bh_held())
+
 static inline void
 ip_set_get_skbinfo(struct ip_set_skbinfo *skbinfo,
 		      const struct ip_set_ext *ext,
diff --git a/include/linux/netfilter/ipset/ip_set_timeout.h b/include/linux/netfilter/ipset/ip_set_timeout.h
index 83c2f9e..1d6a935 100644
--- a/include/linux/netfilter/ipset/ip_set_timeout.h
+++ b/include/linux/netfilter/ipset/ip_set_timeout.h
@@ -40,38 +40,33 @@ ip_set_timeout_uget(struct nlattr *tb)
 }
 
 static inline bool
-ip_set_timeout_test(unsigned long timeout)
+ip_set_timeout_expired(unsigned long *t)
 {
-	return timeout == IPSET_ELEM_PERMANENT ||
-	       time_is_after_jiffies(timeout);
-}
-
-static inline bool
-ip_set_timeout_expired(unsigned long *timeout)
-{
-	return *timeout != IPSET_ELEM_PERMANENT &&
-	       time_is_before_jiffies(*timeout);
+	return *t != IPSET_ELEM_PERMANENT && time_is_before_jiffies(*t);
 }
 
 static inline void
-ip_set_timeout_set(unsigned long *timeout, u32 t)
+ip_set_timeout_set(unsigned long *timeout, u32 value)
 {
-	if (!t) {
+	unsigned long t;
+
+	if (!value) {
 		*timeout = IPSET_ELEM_PERMANENT;
 		return;
 	}
 
-	*timeout = msecs_to_jiffies(t * 1000) + jiffies;
-	if (*timeout == IPSET_ELEM_PERMANENT)
+	t = msecs_to_jiffies(value * MSEC_PER_SEC) + jiffies;
+	if (t == IPSET_ELEM_PERMANENT)
 		/* Bingo! :-) */
-		(*timeout)--;
+		t--;
+	*timeout = t;
 }
 
 static inline u32
 ip_set_timeout_get(unsigned long *timeout)
 {
 	return *timeout == IPSET_ELEM_PERMANENT ? 0 :
-		jiffies_to_msecs(*timeout - jiffies)/1000;
+		jiffies_to_msecs(*timeout - jiffies)/MSEC_PER_SEC;
 }
 
 #endif	/* __KERNEL__ */
diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
index 912e5a0..9fb2610 100644
--- a/net/netfilter/ipset/ip_set_core.c
+++ b/net/netfilter/ipset/ip_set_core.c
@@ -217,6 +217,7 @@ ip_set_type_register(struct ip_set_type *type)
 		 type->revision_min, type->revision_max);
 unlock:
 	ip_set_type_unlock();
+	synchronize_rcu();
 	return ret;
 }
 EXPORT_SYMBOL_GPL(ip_set_type_register);
@@ -496,16 +497,16 @@ ip_set_test(ip_set_id_t index, const struct sk_buff *skb,
 	    !(opt->family == set->family || set->family == NFPROTO_UNSPEC))
 		return 0;
 
-	read_lock_bh(&set->lock);
+	rcu_read_lock_bh();
 	ret = set->variant->kadt(set, skb, par, IPSET_TEST, opt);
-	read_unlock_bh(&set->lock);
+	rcu_read_unlock_bh();
 
 	if (ret == -EAGAIN) {
 		/* Type requests element to be completed */
 		pr_debug("element must be completed, ADD is triggered\n");
-		write_lock_bh(&set->lock);
+		spin_lock_bh(&set->lock);
 		set->variant->kadt(set, skb, par, IPSET_ADD, opt);
-		write_unlock_bh(&set->lock);
+		spin_unlock_bh(&set->lock);
 		ret = 1;
 	} else {
 		/* --return-nomatch: invert matched element */
@@ -535,9 +536,9 @@ ip_set_add(ip_set_id_t index, const struct sk_buff *skb,
 	    !(opt->family == set->family || set->family == NFPROTO_UNSPEC))
 		return -IPSET_ERR_TYPE_MISMATCH;
 
-	write_lock_bh(&set->lock);
+	spin_lock_bh(&set->lock);
 	ret = set->variant->kadt(set, skb, par, IPSET_ADD, opt);
-	write_unlock_bh(&set->lock);
+	spin_unlock_bh(&set->lock);
 
 	return ret;
 }
@@ -558,9 +559,9 @@ ip_set_del(ip_set_id_t index, const struct sk_buff *skb,
 	    !(opt->family == set->family || set->family == NFPROTO_UNSPEC))
 		return -IPSET_ERR_TYPE_MISMATCH;
 
-	write_lock_bh(&set->lock);
+	spin_lock_bh(&set->lock);
 	ret = set->variant->kadt(set, skb, par, IPSET_DEL, opt);
-	write_unlock_bh(&set->lock);
+	spin_unlock_bh(&set->lock);
 
 	return ret;
 }
@@ -845,7 +846,7 @@ ip_set_create(struct sock *ctnl, struct sk_buff *skb,
 	set = kzalloc(sizeof(struct ip_set), GFP_KERNEL);
 	if (!set)
 		return -ENOMEM;
-	rwlock_init(&set->lock);
+	spin_lock_init(&set->lock);
 	strlcpy(set->name, name, IPSET_MAXNAMELEN);
 	set->family = family;
 	set->revision = revision;
@@ -1024,9 +1025,9 @@ ip_set_flush_set(struct ip_set *set)
 {
 	pr_debug("set: %s\n",  set->name);
 
-	write_lock_bh(&set->lock);
+	spin_lock_bh(&set->lock);
 	set->variant->flush(set);
-	write_unlock_bh(&set->lock);
+	spin_unlock_bh(&set->lock);
 }
 
 static int
@@ -1327,9 +1328,9 @@ dump_last:
 				goto next_set;
 			/* Fall through and add elements */
 		default:
-			read_lock_bh(&set->lock);
+			rcu_read_lock_bh();
 			ret = set->variant->list(set, skb, cb);
-			read_unlock_bh(&set->lock);
+			rcu_read_unlock_bh();
 			if (!cb->args[IPSET_CB_ARG0])
 				/* Set is done, proceed with next one */
 				goto next_set;
@@ -1407,9 +1408,9 @@ call_ad(struct sock *ctnl, struct sk_buff *skb, struct ip_set *set,
 	bool eexist = flags & IPSET_FLAG_EXIST, retried = false;
 
 	do {
-		write_lock_bh(&set->lock);
+		spin_lock_bh(&set->lock);
 		ret = set->variant->uadt(set, tb, adt, &lineno, flags, retried);
-		write_unlock_bh(&set->lock);
+		spin_unlock_bh(&set->lock);
 		retried = true;
 	} while (ret == -EAGAIN &&
 		 set->variant->resize &&
@@ -1589,9 +1590,9 @@ ip_set_utest(struct sock *ctnl, struct sk_buff *skb,
 			     set->type->adt_policy))
 		return -IPSET_ERR_PROTOCOL;
 
-	read_lock_bh(&set->lock);
+	rcu_read_lock_bh();
 	ret = set->variant->uadt(set, tb, IPSET_TEST, NULL, 0, 0);
-	read_unlock_bh(&set->lock);
+	rcu_read_unlock_bh();
 	/* Userspace can't trigger element to be re-added */
 	if (ret == -EAGAIN)
 		ret = 1;
-- 
1.8.5.1




* [PATCH 09/14] netfilter: ipset: Introduce RCU locking in the bitmap types
  2014-11-30 18:56 [PATCH 00/10] ipset patches for nf-next, v2 Jozsef Kadlecsik
                   ` (7 preceding siblings ...)
  2014-11-30 18:56 ` [PATCH 08/14] netfilter: ipset: Introduce RCU locking instead of rwlock per set in the core Jozsef Kadlecsik
@ 2014-11-30 18:57 ` Jozsef Kadlecsik
  2014-11-30 18:57 ` [PATCH 10/14] netfilter: ipset: Introduce RCU locking in the list type Jozsef Kadlecsik
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Jozsef Kadlecsik @ 2014-11-30 18:57 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Pablo Neira Ayuso

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 net/netfilter/ipset/ip_set_bitmap_gen.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/ipset/ip_set_bitmap_gen.h b/net/netfilter/ipset/ip_set_bitmap_gen.h
index 6f024a8..136f20b 100644
--- a/net/netfilter/ipset/ip_set_bitmap_gen.h
+++ b/net/netfilter/ipset/ip_set_bitmap_gen.h
@@ -260,7 +260,7 @@ mtype_gc(unsigned long ul_set)
 
 	/* We run parallel with other readers (test element)
 	 * but adding/deleting new entries is locked out */
-	read_lock_bh(&set->lock);
+	spin_lock_bh(&set->lock);
 	for (id = 0; id < map->elements; id++)
 		if (mtype_gc_test(id, map, set->dsize)) {
 			x = get_ext(set, map, id);
@@ -269,7 +269,7 @@ mtype_gc(unsigned long ul_set)
 				ip_set_ext_destroy(set, x);
 			}
 		}
-	read_unlock_bh(&set->lock);
+	spin_unlock_bh(&set->lock);
 
 	map->gc.expires = jiffies + IPSET_GC_PERIOD(set->timeout) * HZ;
 	add_timer(&map->gc);
-- 
1.8.5.1



* [PATCH 10/14] netfilter: ipset: Introduce RCU locking in the list type
  2014-11-30 18:56 [PATCH 00/10] ipset patches for nf-next, v2 Jozsef Kadlecsik
                   ` (8 preceding siblings ...)
  2014-11-30 18:57 ` [PATCH 09/14] netfilter: ipset: Introduce RCU locking in the bitmap types Jozsef Kadlecsik
@ 2014-11-30 18:57 ` Jozsef Kadlecsik
  2014-12-02 18:35   ` Pablo Neira Ayuso
  2014-11-30 18:57 ` [PATCH 11/14] netfilter: ipset: Introduce RCU locking in the hash types Jozsef Kadlecsik
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Jozsef Kadlecsik @ 2014-11-30 18:57 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Pablo Neira Ayuso

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 net/netfilter/ipset/ip_set_list_set.c | 386 ++++++++++++++++------------------
 1 file changed, 182 insertions(+), 204 deletions(-)

diff --git a/net/netfilter/ipset/ip_set_list_set.c b/net/netfilter/ipset/ip_set_list_set.c
index f8f6828..323115a 100644
--- a/net/netfilter/ipset/ip_set_list_set.c
+++ b/net/netfilter/ipset/ip_set_list_set.c
@@ -9,6 +9,7 @@
 
 #include <linux/module.h>
 #include <linux/ip.h>
+#include <linux/rculist.h>
 #include <linux/skbuff.h>
 #include <linux/errno.h>
 
@@ -27,6 +28,8 @@ MODULE_ALIAS("ip_set_list:set");
 
 /* Member elements  */
 struct set_elem {
+	struct rcu_head rcu;
+	struct list_head list;
 	ip_set_id_t id;
 };
 
@@ -41,12 +44,9 @@ struct list_set {
 	u32 size;		/* size of set list array */
 	struct timer_list gc;	/* garbage collection */
 	struct net *net;	/* namespace */
-	struct set_elem members[0]; /* the set members */
+	struct list_head members; /* the set members */
 };
 
-#define list_set_elem(set, map, id)	\
-	(struct set_elem *)((void *)(map)->members + (id) * (set)->dsize)
-
 static int
 list_set_ktest(struct ip_set *set, const struct sk_buff *skb,
 	       const struct xt_action_param *par,
@@ -54,17 +54,14 @@ list_set_ktest(struct ip_set *set, const struct sk_buff *skb,
 {
 	struct list_set *map = set->data;
 	struct set_elem *e;
-	u32 i, cmdflags = opt->cmdflags;
+	u32 cmdflags = opt->cmdflags;
 	int ret;
 
 	/* Don't lookup sub-counters at all */
 	opt->cmdflags &= ~IPSET_FLAG_MATCH_COUNTERS;
 	if (opt->cmdflags & IPSET_FLAG_SKIP_SUBCOUNTER_UPDATE)
 		opt->cmdflags &= ~IPSET_FLAG_SKIP_COUNTER_UPDATE;
-	for (i = 0; i < map->size; i++) {
-		e = list_set_elem(set, map, i);
-		if (e->id == IPSET_INVALID_ID)
-			return 0;
+	list_for_each_entry_rcu(e, &map->members, list) {
 		if (SET_WITH_TIMEOUT(set) &&
 		    ip_set_timeout_expired(ext_timeout(e, set)))
 			continue;
@@ -91,13 +88,9 @@ list_set_kadd(struct ip_set *set, const struct sk_buff *skb,
 {
 	struct list_set *map = set->data;
 	struct set_elem *e;
-	u32 i;
 	int ret;
 
-	for (i = 0; i < map->size; i++) {
-		e = list_set_elem(set, map, i);
-		if (e->id == IPSET_INVALID_ID)
-			return 0;
+	list_for_each_entry_rcu(e, &map->members, list) {
 		if (SET_WITH_TIMEOUT(set) &&
 		    ip_set_timeout_expired(ext_timeout(e, set)))
 			continue;
@@ -115,13 +108,9 @@ list_set_kdel(struct ip_set *set, const struct sk_buff *skb,
 {
 	struct list_set *map = set->data;
 	struct set_elem *e;
-	u32 i;
 	int ret;
 
-	for (i = 0; i < map->size; i++) {
-		e = list_set_elem(set, map, i);
-		if (e->id == IPSET_INVALID_ID)
-			return 0;
+	list_for_each_entry_rcu(e, &map->members, list) {
 		if (SET_WITH_TIMEOUT(set) &&
 		    ip_set_timeout_expired(ext_timeout(e, set)))
 			continue;
@@ -138,110 +127,65 @@ list_set_kadt(struct ip_set *set, const struct sk_buff *skb,
 	      enum ipset_adt adt, struct ip_set_adt_opt *opt)
 {
 	struct ip_set_ext ext = IP_SET_INIT_KEXT(skb, opt, set);
+	int ret = -EINVAL;
 
+	rcu_read_lock();
 	switch (adt) {
 	case IPSET_TEST:
-		return list_set_ktest(set, skb, par, opt, &ext);
+		ret = list_set_ktest(set, skb, par, opt, &ext);
+		break;
 	case IPSET_ADD:
-		return list_set_kadd(set, skb, par, opt, &ext);
+		ret = list_set_kadd(set, skb, par, opt, &ext);
+		break;
 	case IPSET_DEL:
-		return list_set_kdel(set, skb, par, opt, &ext);
+		ret = list_set_kdel(set, skb, par, opt, &ext);
+		break;
 	default:
 		break;
 	}
-	return -EINVAL;
-}
+	rcu_read_unlock();
 
-static bool
-id_eq(const struct ip_set *set, u32 i, ip_set_id_t id)
-{
-	const struct list_set *map = set->data;
-	const struct set_elem *e;
-
-	if (i >= map->size)
-		return 0;
-
-	e = list_set_elem(set, map, i);
-	return !!(e->id == id &&
-		 !(SET_WITH_TIMEOUT(set) &&
-		   ip_set_timeout_expired(ext_timeout(e, set))));
+	return ret;
 }
 
-static int
-list_set_add(struct ip_set *set, u32 i, struct set_adt_elem *d,
-	     const struct ip_set_ext *ext)
-{
-	struct list_set *map = set->data;
-	struct set_elem *e = list_set_elem(set, map, i);
-
-	if (e->id != IPSET_INVALID_ID) {
-		if (i == map->size - 1) {
-			/* Last element replaced: e.g. add new,before,last */
-			ip_set_put_byindex(map->net, e->id);
-			ip_set_ext_destroy(set, e);
-		} else {
-			struct set_elem *x = list_set_elem(set, map,
-							   map->size - 1);
-
-			/* Last element pushed off */
-			if (x->id != IPSET_INVALID_ID) {
-				ip_set_put_byindex(map->net, x->id);
-				ip_set_ext_destroy(set, x);
-			}
-			memmove(list_set_elem(set, map, i + 1), e,
-				set->dsize * (map->size - (i + 1)));
-			/* Extensions must be initialized to zero */
-			memset(e, 0, set->dsize);
-		}
-	}
-
-	e->id = d->id;
-	if (SET_WITH_TIMEOUT(set))
-		ip_set_timeout_set(ext_timeout(e, set), ext->timeout);
-	if (SET_WITH_COUNTER(set))
-		ip_set_init_counter(ext_counter(e, set), ext);
-	if (SET_WITH_COMMENT(set))
-		ip_set_init_comment(ext_comment(e, set), ext);
-	if (SET_WITH_SKBINFO(set))
-		ip_set_init_skbinfo(ext_skbinfo(e, set), ext);
-	return 0;
-}
+/* Userspace interfaces: we are protected by the nfnl mutex */
 
-static int
-list_set_del(struct ip_set *set, u32 i)
+static void
+__list_set_del(struct ip_set *set, struct set_elem *e)
 {
 	struct list_set *map = set->data;
-	struct set_elem *e = list_set_elem(set, map, i);
 
 	ip_set_put_byindex(map->net, e->id);
+	/* We may call it, because there's no to-be-destroyed
+	 * extension used from the kernel side.
+	 */
 	ip_set_ext_destroy(set, e);
+	kfree_rcu(e, rcu);
+}
 
-	if (i < map->size - 1)
-		memmove(e, list_set_elem(set, map, i + 1),
-			set->dsize * (map->size - (i + 1)));
+static inline void
+list_set_del(struct ip_set *set, struct set_elem *e)
+{
+	list_del_rcu(&e->list);
+	__list_set_del(set, e);
+}
 
-	/* Last element */
-	e = list_set_elem(set, map, map->size - 1);
-	e->id = IPSET_INVALID_ID;
-	return 0;
+static inline void
+list_set_replace(struct ip_set *set, struct set_elem *e, struct set_elem *old)
+{
+	list_replace_rcu(&old->list, &e->list);
+	__list_set_del(set, old);
 }
 
 static void
 set_cleanup_entries(struct ip_set *set)
 {
 	struct list_set *map = set->data;
-	struct set_elem *e;
-	u32 i = 0;
+	struct set_elem *e, *n;
 
-	while (i < map->size) {
-		e = list_set_elem(set, map, i);
-		if (e->id != IPSET_INVALID_ID &&
-		    ip_set_timeout_expired(ext_timeout(e, set)))
-			list_set_del(set, i);
-			/* Check element moved to position i in next loop */
-		else
-			i++;
-	}
+	list_for_each_entry_safe(e, n, &map->members, list)
+		if (ip_set_timeout_expired(ext_timeout(e, set)))
+			list_set_del(set, e);
 }
 
 static int
@@ -250,31 +194,45 @@ list_set_utest(struct ip_set *set, void *value, const struct ip_set_ext *ext,
 {
 	struct list_set *map = set->data;
 	struct set_adt_elem *d = value;
-	struct set_elem *e;
-	u32 i;
+	struct set_elem *e, *next, *prev = NULL;
 	int ret;
 
-	for (i = 0; i < map->size; i++) {
-		e = list_set_elem(set, map, i);
-		if (e->id == IPSET_INVALID_ID)
-			return 0;
-		else if (SET_WITH_TIMEOUT(set) &&
-			 ip_set_timeout_expired(ext_timeout(e, set)))
+	list_for_each_entry(e, &map->members, list) {
+		if (SET_WITH_TIMEOUT(set) &&
+		    ip_set_timeout_expired(ext_timeout(e, set)))
 			continue;
-		else if (e->id != d->id)
+		else if (e->id != d->id) {
+			prev = e;
 			continue;
+		}
 
 		if (d->before == 0)
-			return 1;
-		else if (d->before > 0)
-			ret = id_eq(set, i + 1, d->refid);
-		else
-			ret = i > 0 && id_eq(set, i - 1, d->refid);
+			ret = 1;
+		else if (d->before > 0) {
+			next = list_next_entry(e, list);
+			ret = !list_is_last(&e->list, &map->members) &&
+			      next->id == d->refid;
+		} else
+			ret = prev != NULL && prev->id == d->refid;
 		return ret;
 	}
 	return 0;
 }
 
+static void
+list_set_init_extensions(struct ip_set *set, const struct ip_set_ext *ext,
+			 struct set_elem *e)
+{
+	if (SET_WITH_COUNTER(set))
+		ip_set_init_counter(ext_counter(e, set), ext);
+	if (SET_WITH_COMMENT(set))
+		ip_set_init_comment(ext_comment(e, set), ext);
+	if (SET_WITH_SKBINFO(set))
+		ip_set_init_skbinfo(ext_skbinfo(e, set), ext);
+	/* Update timeout last */
+	if (SET_WITH_TIMEOUT(set))
+		ip_set_timeout_set(ext_timeout(e, set), ext->timeout);
+}
 
 static int
 list_set_uadd(struct ip_set *set, void *value, const struct ip_set_ext *ext,
@@ -282,60 +240,82 @@ list_set_uadd(struct ip_set *set, void *value, const struct ip_set_ext *ext,
 {
 	struct list_set *map = set->data;
 	struct set_adt_elem *d = value;
-	struct set_elem *e;
+	struct set_elem *e, *n, *prev, *next;
 	bool flag_exist = flags & IPSET_FLAG_EXIST;
-	u32 i, ret = 0;
 
 	if (SET_WITH_TIMEOUT(set))
 		set_cleanup_entries(set);
 
-	/* Check already added element */
-	for (i = 0; i < map->size; i++) {
-		e = list_set_elem(set, map, i);
-		if (e->id == IPSET_INVALID_ID)
-			goto insert;
-		else if (e->id != d->id)
+	/* Find where to add the new entry */
+	n = prev = next = NULL;
+	list_for_each_entry(e, &map->members, list) {
+		if (SET_WITH_TIMEOUT(set) &&
+		    ip_set_timeout_expired(ext_timeout(e, set)))
 			continue;
-
-		if ((d->before > 1 && !id_eq(set, i + 1, d->refid)) ||
-		    (d->before < 0 &&
-		     (i == 0 || !id_eq(set, i - 1, d->refid))))
-			/* Before/after doesn't match */
+		else if (d->id == e->id)
+			n = e;
+		else if (d->before == 0 || e->id != d->refid)
+			continue;
+		else if (d->before > 0)
+			next = e;
+		else
+			prev = e;
+	}
+	/* Re-add already existing element */
+	if (n) {
+		if ((d->before > 0 && !next) ||
+		    (d->before < 0 && !prev))
 			return -IPSET_ERR_REF_EXIST;
 		if (!flag_exist)
-			/* Can't re-add */
 			return -IPSET_ERR_EXIST;
 		/* Update extensions */
-		ip_set_ext_destroy(set, e);
+		ip_set_ext_destroy(set, n);
+		list_set_init_extensions(set, ext, n);
 
-		if (SET_WITH_TIMEOUT(set))
-			ip_set_timeout_set(ext_timeout(e, set), ext->timeout);
-		if (SET_WITH_COUNTER(set))
-			ip_set_init_counter(ext_counter(e, set), ext);
-		if (SET_WITH_COMMENT(set))
-			ip_set_init_comment(ext_comment(e, set), ext);
-		if (SET_WITH_SKBINFO(set))
-			ip_set_init_skbinfo(ext_skbinfo(e, set), ext);
 		/* Set is already added to the list */
 		ip_set_put_byindex(map->net, d->id);
 		return 0;
 	}
-insert:
-	ret = -IPSET_ERR_LIST_FULL;
-	for (i = 0; i < map->size && ret == -IPSET_ERR_LIST_FULL; i++) {
-		e = list_set_elem(set, map, i);
-		if (e->id == IPSET_INVALID_ID)
-			ret = d->before != 0 ? -IPSET_ERR_REF_EXIST
-				: list_set_add(set, i, d, ext);
-		else if (e->id != d->refid)
-			continue;
-		else if (d->before > 0)
-			ret = list_set_add(set, i, d, ext);
-		else if (i + 1 < map->size)
-			ret = list_set_add(set, i + 1, d, ext);
+	/* Add new entry */
+	if (d->before == 0) {
+		/* Append  */
+		n = list_empty(&map->members) ? NULL :
+		    list_last_entry(&map->members, struct set_elem, list);
+	} else if (d->before > 0) {
+		/* Insert after next element */
+		if (!list_is_last(&next->list, &map->members))
+			n = list_next_entry(next, list);
+	} else {
+		/* Insert before prev element */
+		if (prev->list.prev != &map->members)
+			n = list_prev_entry(prev, list);
 	}
-
-	return ret;
+	/* Can we replace a timed out entry? */
+	if (n != NULL &&
+	    !(SET_WITH_TIMEOUT(set) &&
+	      ip_set_timeout_expired(ext_timeout(n, set))))
+		n = NULL;
+
+	e = kzalloc(set->dsize, GFP_KERNEL);
+	if (!e)
+		return -ENOMEM;
+	e->id = d->id;
+	INIT_LIST_HEAD(&e->list);
+	list_set_init_extensions(set, ext, e);
+	if (n)
+		list_set_replace(set, e, n);
+	else if (next)
+		list_add_tail_rcu(&e->list, &next->list);
+	else if (prev)
+		list_add_rcu(&e->list, &prev->list);
+	else
+		list_add_tail_rcu(&e->list, &map->members);
+	spin_unlock_bh(&set->lock);
+
+	synchronize_rcu_bh();
+
+	spin_lock_bh(&set->lock);
+	return 0;
 }
 
 static int
@@ -344,32 +324,30 @@ list_set_udel(struct ip_set *set, void *value, const struct ip_set_ext *ext,
 {
 	struct list_set *map = set->data;
 	struct set_adt_elem *d = value;
-	struct set_elem *e;
-	u32 i;
-
-	for (i = 0; i < map->size; i++) {
-		e = list_set_elem(set, map, i);
-		if (e->id == IPSET_INVALID_ID)
-			return d->before != 0 ? -IPSET_ERR_REF_EXIST
-					      : -IPSET_ERR_EXIST;
-		else if (SET_WITH_TIMEOUT(set) &&
-			 ip_set_timeout_expired(ext_timeout(e, set)))
+	struct set_elem *e, *next, *prev = NULL;
+
+	list_for_each_entry(e, &map->members, list) {
+		if (SET_WITH_TIMEOUT(set) &&
+		    ip_set_timeout_expired(ext_timeout(e, set)))
 			continue;
-		else if (e->id != d->id)
+		else if (e->id != d->id) {
+			prev = e;
 			continue;
+		}
 
-		if (d->before == 0)
-			return list_set_del(set, i);
-		else if (d->before > 0) {
-			if (!id_eq(set, i + 1, d->refid))
+		if (d->before > 0) {
+			next = list_next_entry(e, list);
+			if (list_is_last(&e->list, &map->members) ||
+			    next->id != d->refid)
 				return -IPSET_ERR_REF_EXIST;
-			return list_set_del(set, i);
-		} else if (i == 0 || !id_eq(set, i - 1, d->refid))
-			return -IPSET_ERR_REF_EXIST;
-		else
-			return list_set_del(set, i);
+		} else if (d->before < 0) {
+			if (prev == NULL || prev->id != d->refid)
+				return -IPSET_ERR_REF_EXIST;
+		}
+		list_set_del(set, e);
+		return 0;
 	}
-	return -IPSET_ERR_EXIST;
+	return d->before != 0 ? -IPSET_ERR_REF_EXIST : -IPSET_ERR_EXIST;
 }
 
 static int
@@ -410,6 +388,7 @@ list_set_uadt(struct ip_set *set, struct nlattr *tb[],
 
 	if (tb[IPSET_ATTR_CADT_FLAGS]) {
 		u32 f = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
+
 		e.before = f & IPSET_FLAG_BEFORE;
 	}
 
@@ -447,27 +426,26 @@ static void
 list_set_flush(struct ip_set *set)
 {
 	struct list_set *map = set->data;
-	struct set_elem *e;
-	u32 i;
-
-	for (i = 0; i < map->size; i++) {
-		e = list_set_elem(set, map, i);
-		if (e->id != IPSET_INVALID_ID) {
-			ip_set_put_byindex(map->net, e->id);
-			ip_set_ext_destroy(set, e);
-			e->id = IPSET_INVALID_ID;
-		}
-	}
+	struct set_elem *e, *n;
+
+	list_for_each_entry_safe(e, n, &map->members, list)
+		list_set_del(set, e);
 }
 
 static void
 list_set_destroy(struct ip_set *set)
 {
 	struct list_set *map = set->data;
+	struct set_elem *e, *n;
 
 	if (SET_WITH_TIMEOUT(set))
 		del_timer_sync(&map->gc);
-	list_set_flush(set);
+	list_for_each_entry_safe(e, n, &map->members, list) {
+		list_del(&e->list);
+		ip_set_put_byindex(map->net, e->id);
+		ip_set_ext_destroy(set, e);
+		kfree(e);
+	}
 	kfree(map);
 
 	set->data = NULL;
@@ -478,6 +456,11 @@ list_set_head(struct ip_set *set, struct sk_buff *skb)
 {
 	const struct list_set *map = set->data;
 	struct nlattr *nested;
+	struct set_elem *e;
+	u32 n = 0;
+
+	list_for_each_entry(e, &map->members, list)
+		n++;
 
 	nested = ipset_nest_start(skb, IPSET_ATTR_DATA);
 	if (!nested)
@@ -485,7 +468,7 @@ list_set_head(struct ip_set *set, struct sk_buff *skb)
 	if (nla_put_net32(skb, IPSET_ATTR_SIZE, htonl(map->size)) ||
 	    nla_put_net32(skb, IPSET_ATTR_REFERENCES, htonl(set->ref - 1)) ||
 	    nla_put_net32(skb, IPSET_ATTR_MEMSIZE,
-			  htonl(sizeof(*map) + map->size * set->dsize)))
+			  htonl(sizeof(*map) + n * set->dsize)))
 		goto nla_put_failure;
 	if (unlikely(ip_set_put_flags(skb, set)))
 		goto nla_put_failure;
@@ -502,18 +485,20 @@ list_set_list(const struct ip_set *set,
 {
 	const struct list_set *map = set->data;
 	struct nlattr *atd, *nested;
-	u32 i, first = cb->args[IPSET_CB_ARG0];
-	const struct set_elem *e;
+	u32 i = 0, first = cb->args[IPSET_CB_ARG0];
+	struct set_elem *e;
 
 	atd = ipset_nest_start(skb, IPSET_ATTR_ADT);
 	if (!atd)
 		return -EMSGSIZE;
-	for (; cb->args[IPSET_CB_ARG0] < map->size;
-	     cb->args[IPSET_CB_ARG0]++) {
-		i = cb->args[IPSET_CB_ARG0];
-		e = list_set_elem(set, map, i);
-		if (e->id == IPSET_INVALID_ID)
-			goto finish;
+	list_for_each_entry(e, &map->members, list) {
+		if (i == first)
+			break;
+		i++;
+	}
+
+	list_for_each_entry_from(e, &map->members, list) {
+		i++;
 		if (SET_WITH_TIMEOUT(set) &&
 		    ip_set_timeout_expired(ext_timeout(e, set)))
 			continue;
@@ -532,7 +517,7 @@ list_set_list(const struct ip_set *set,
 			goto nla_put_failure;
 		ipset_nest_end(skb, nested);
 	}
-finish:
+
 	ipset_nest_end(skb, atd);
 	/* Set listing finished */
 	cb->args[IPSET_CB_ARG0] = 0;
@@ -544,6 +529,7 @@ nla_put_failure:
 		cb->args[IPSET_CB_ARG0] = 0;
 		return -EMSGSIZE;
 	}
+	cb->args[IPSET_CB_ARG0] = i - 1;
 	ipset_nest_end(skb, atd);
 	return 0;
 }
@@ -580,9 +566,9 @@ list_set_gc(unsigned long ul_set)
 	struct ip_set *set = (struct ip_set *) ul_set;
 	struct list_set *map = set->data;
 
-	write_lock_bh(&set->lock);
+	spin_lock_bh(&set->lock);
 	set_cleanup_entries(set);
-	write_unlock_bh(&set->lock);
+	spin_unlock_bh(&set->lock);
 
 	map->gc.expires = jiffies + IPSET_GC_PERIOD(set->timeout) * HZ;
 	add_timer(&map->gc);
@@ -606,24 +592,16 @@ static bool
 init_list_set(struct net *net, struct ip_set *set, u32 size)
 {
 	struct list_set *map;
-	struct set_elem *e;
-	u32 i;
 
-	map = kzalloc(sizeof(*map) +
-		      min_t(u32, size, IP_SET_LIST_MAX_SIZE) * set->dsize,
-		      GFP_KERNEL);
+	map = kzalloc(sizeof(*map), GFP_KERNEL);
 	if (!map)
 		return false;
 
 	map->size = size;
 	map->net = net;
+	INIT_LIST_HEAD(&map->members);
 	set->data = map;
 
-	for (i = 0; i < size; i++) {
-		e = list_set_elem(set, map, i);
-		e->id = IPSET_INVALID_ID;
-	}
-
 	return true;
 }
 
-- 
1.8.5.1



* [PATCH 11/14] netfilter: ipset: Introduce RCU locking in the hash types
  2014-11-30 18:56 [PATCH 00/10] ipset patches for nf-next, v2 Jozsef Kadlecsik
                   ` (9 preceding siblings ...)
  2014-11-30 18:57 ` [PATCH 10/14] netfilter: ipset: Introduce RCU locking in the list type Jozsef Kadlecsik
@ 2014-11-30 18:57 ` Jozsef Kadlecsik
  2014-12-01  7:59   ` Jesper Dangaard Brouer
  2014-12-02 18:40   ` Pablo Neira Ayuso
  2014-11-30 18:57 ` [PATCH 12/14] netfilter: ipset: styles warned by checkpatch.pl fixed Jozsef Kadlecsik
                   ` (2 subsequent siblings)
  13 siblings, 2 replies; 32+ messages in thread
From: Jozsef Kadlecsik @ 2014-11-30 18:57 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Pablo Neira Ayuso

Performance is tested by Jesper Dangaard Brouer:

Simple drop in FORWARD
~~~~~~~~~~~~~~~~~~~~~~

Dropping via simple iptables net-mask match::

 iptables -t raw -N simple || iptables -t raw -F simple
 iptables -t raw -I simple  -s 198.18.0.0/15 -j DROP
 iptables -t raw -D PREROUTING -j simple
 iptables -t raw -I PREROUTING -j simple

Drop performance in "raw": 11.3Mpps

Generator: sending 12.2Mpps (tx:12264083 pps)

Drop via original ipset in RAW table
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Create a set with lots of elements::

 sudo ./ipset destroy test
 echo "create test hash:ip hashsize 65536" > test.set
 for x in `seq 0 255`; do
    for y in `seq 0 255`; do
        echo "add test 198.18.$x.$y" >> test.set
    done
 done
 sudo ./ipset restore < test.set

Dropping via ipset::

 iptables -t raw -F
 iptables -t raw -N net198 || iptables -t raw -F net198
 iptables -t raw -I net198 -m set --match-set test src -j DROP
 iptables -t raw -I PREROUTING -j net198

Drop performance in "raw" with ipset: 8Mpps

Perf report numbers ipset drop in "raw"::

 +   24.65%  ksoftirqd/1  [ip_set]           [k] ip_set_test
 -   21.42%  ksoftirqd/1  [kernel.kallsyms]  [k] _raw_read_lock_bh
    - _raw_read_lock_bh
       + 99.88% ip_set_test
 -   19.42%  ksoftirqd/1  [kernel.kallsyms]  [k] _raw_read_unlock_bh
    - _raw_read_unlock_bh
       + 99.72% ip_set_test
 +    4.31%  ksoftirqd/1  [ip_set_hash_ip]   [k] hash_ip4_kadt
 +    2.27%  ksoftirqd/1  [ixgbe]            [k] ixgbe_fetch_rx_buffer
 +    2.18%  ksoftirqd/1  [ip_tables]        [k] ipt_do_table
 +    1.81%  ksoftirqd/1  [ip_set_hash_ip]   [k] hash_ip4_test
 +    1.61%  ksoftirqd/1  [kernel.kallsyms]  [k] __netif_receive_skb_core
 +    1.44%  ksoftirqd/1  [kernel.kallsyms]  [k] build_skb
 +    1.42%  ksoftirqd/1  [kernel.kallsyms]  [k] ip_rcv
 +    1.36%  ksoftirqd/1  [kernel.kallsyms]  [k] __local_bh_enable_ip
 +    1.16%  ksoftirqd/1  [kernel.kallsyms]  [k] dev_gro_receive
 +    1.09%  ksoftirqd/1  [kernel.kallsyms]  [k] __rcu_read_unlock
 +    0.96%  ksoftirqd/1  [ixgbe]            [k] ixgbe_clean_rx_irq
 +    0.95%  ksoftirqd/1  [kernel.kallsyms]  [k] __netdev_alloc_frag
 +    0.88%  ksoftirqd/1  [kernel.kallsyms]  [k] kmem_cache_alloc
 +    0.87%  ksoftirqd/1  [xt_set]           [k] set_match_v3
 +    0.85%  ksoftirqd/1  [kernel.kallsyms]  [k] inet_gro_receive
 +    0.83%  ksoftirqd/1  [kernel.kallsyms]  [k] nf_iterate
 +    0.76%  ksoftirqd/1  [kernel.kallsyms]  [k] put_compound_page
 +    0.75%  ksoftirqd/1  [kernel.kallsyms]  [k] __rcu_read_lock

Drop via ipset in RAW table with RCU-locking
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

With RCU locking, the RW-lock is gone.

Drop performance in "raw" with ipset with RCU-locking: 11.3Mpps

Performance-tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 net/netfilter/ipset/ip_set_hash_gen.h | 580 ++++++++++++++++++++--------------
 1 file changed, 344 insertions(+), 236 deletions(-)

diff --git a/net/netfilter/ipset/ip_set_hash_gen.h b/net/netfilter/ipset/ip_set_hash_gen.h
index 974ff38..8f51ba4 100644
--- a/net/netfilter/ipset/ip_set_hash_gen.h
+++ b/net/netfilter/ipset/ip_set_hash_gen.h
@@ -10,19 +10,19 @@
 
 #include <linux/rcupdate.h>
 #include <linux/jhash.h>
+#include <linux/types.h>
 #include <linux/netfilter/ipset/ip_set_timeout.h>
-#ifndef rcu_dereference_bh
-#define rcu_dereference_bh(p)	rcu_dereference(p)
-#endif
+
+#define __ipset_dereference_protected(p, c)	rcu_dereference_protected(p, c)
+#define ipset_dereference_protected(p, set) \
+	__ipset_dereference_protected(p, spin_is_locked(&(set)->lock))
 
 #define rcu_dereference_bh_nfnl(p)	rcu_dereference_bh_check(p, 1)
 
 /* Hashing which uses arrays to resolve clashing. The hash table is resized
  * (doubled) when searching becomes too long.
  * Internally jhash is used with the assumption that the size of the
- * stored data is a multiple of sizeof(u32). If storage supports timeout,
- * the timeout field must be the last one in the data structure - that field
- * is ignored when computing the hash key.
+ * stored data is a multiple of sizeof(u32).
  *
  * Readers and resizing
  *
@@ -36,6 +36,8 @@
 #define AHASH_INIT_SIZE			4
 /* Max number of elements to store in an array block */
 #define AHASH_MAX_SIZE			(3*AHASH_INIT_SIZE)
+/* Max number of elements in the array block when tuned */
+#define AHASH_MAX_TUNED			64
 
 /* Max number of elements can be tuned */
 #ifdef IP_SET_HASH_WITH_MULTI
@@ -53,7 +55,7 @@ tune_ahash_max(u8 curr, u32 multi)
 	/* Currently, at listing one hash bucket must fit into a message.
 	 * Therefore we have a hard limit here.
 	 */
-	return n > curr && n <= 64 ? n : curr;
+	return n > curr && n <= AHASH_MAX_TUNED ? n : curr;
 }
 #define TUNE_AHASH_MAX(h, multi)	\
 	((h)->ahash_max = tune_ahash_max((h)->ahash_max, multi))
@@ -64,18 +66,21 @@ tune_ahash_max(u8 curr, u32 multi)
 
 /* A hash bucket */
 struct hbucket {
-	void *value;		/* the array of the values */
+	struct rcu_head rcu;	/* for call_rcu_bh */
+	/* Which positions are used in the array */
+	DECLARE_BITMAP(used, AHASH_MAX_TUNED);
 	u8 size;		/* size of the array */
 	u8 pos;			/* position of the first free entry */
-};
+	unsigned char value[0];	/* the array of the values */
+} __attribute__ ((aligned));
 
 /* The hash table: the table size stored here in order to make resizing easy */
 struct htable {
 	u8 htable_bits;		/* size of hash table == 2^htable_bits */
-	struct hbucket bucket[0]; /* hashtable buckets */
+	struct hbucket __rcu * bucket[0]; /* hashtable buckets */
 };
 
-#define hbucket(h, i)		(&((h)->bucket[i]))
+#define hbucket(h, i)		((h)->bucket[i])
 
 #ifndef IPSET_NET_COUNT
 #define IPSET_NET_COUNT		1
@@ -83,8 +88,8 @@ struct htable {
 
 /* Book-keeping of the prefixes added to the set */
 struct net_prefixes {
-	u32 nets[IPSET_NET_COUNT]; /* number of elements per cidr */
-	u8 cidr[IPSET_NET_COUNT];  /* the different cidr values in the set */
+	u32 nets[IPSET_NET_COUNT]; /* number of elements for this cidr */
+	u8 cidr[IPSET_NET_COUNT];  /* the cidr value */
 };
 
 /* Compute the hash table size */
@@ -97,11 +102,11 @@ htable_size(u8 hbits)
 	if (hbits > 31)
 		return 0;
 	hsize = jhash_size(hbits);
-	if ((((size_t)-1) - sizeof(struct htable))/sizeof(struct hbucket)
+	if ((((size_t)-1) - sizeof(struct htable))/sizeof(struct hbucket *)
 	    < hsize)
 		return 0;
 
-	return hsize * sizeof(struct hbucket) + sizeof(struct htable);
+	return hsize * sizeof(struct hbucket *) + sizeof(struct htable);
 }
 
 /* Compute htable_bits from the user input parameter hashsize */
@@ -110,6 +115,7 @@ htable_bits(u32 hashsize)
 {
 	/* Assume that hashsize == 2^htable_bits */
 	u8 bits = fls(hashsize - 1);
+
 	if (jhash_size(bits) != hashsize)
 		/* Round up to the first 2^n value */
 		bits = fls(hashsize);
@@ -117,30 +123,6 @@ htable_bits(u32 hashsize)
 	return bits;
 }
 
-static int
-hbucket_elem_add(struct hbucket *n, u8 ahash_max, size_t dsize)
-{
-	if (n->pos >= n->size) {
-		void *tmp;
-
-		if (n->size >= ahash_max)
-			/* Trigger rehashing */
-			return -EAGAIN;
-
-		tmp = kzalloc((n->size + AHASH_INIT_SIZE) * dsize,
-			      GFP_ATOMIC);
-		if (!tmp)
-			return -ENOMEM;
-		if (n->size) {
-			memcpy(tmp, n->value, n->size * dsize);
-			kfree(n->value);
-		}
-		n->value = tmp;
-		n->size += AHASH_INIT_SIZE;
-	}
-	return 0;
-}
-
 #ifdef IP_SET_HASH_WITH_NETS
 #if IPSET_NET_COUNT > 1
 #define __CIDR(cidr, i)		(cidr[i])
@@ -280,9 +262,6 @@ struct htype {
 #ifdef IP_SET_HASH_WITH_NETMASK
 	u8 netmask;		/* netmask value for subnets to store */
 #endif
-#ifdef IP_SET_HASH_WITH_RBTREE
-	struct rb_root rbtree;
-#endif
 #ifdef IP_SET_HASH_WITH_NETS
 	struct net_prefixes nets[0]; /* book-keeping of prefixes */
 #endif
@@ -324,8 +303,8 @@ mtype_del_cidr(struct htype *h, u8 cidr, u8 nets_length, u8 n)
 	for (i = 0; i < nets_length; i++) {
 	        if (h->nets[i].cidr[n] != cidr)
 	                continue;
-		h->nets[cidr -1].nets[n]--;
-		if (h->nets[cidr -1].nets[n] > 0)
+		h->nets[cidr - 1].nets[n]--;
+		if (h->nets[cidr - 1].nets[n] > 0)
                         return;
 		for (j = i; j < net_end && h->nets[j].cidr[n]; j++)
 		        h->nets[j].cidr[n] = h->nets[j + 1].cidr[n];
@@ -341,15 +320,18 @@ mtype_ahash_memsize(const struct htype *h, const struct htable *t,
 		    u8 nets_length, size_t dsize)
 {
 	u32 i;
-	size_t memsize = sizeof(*h)
-			 + sizeof(*t)
+	struct hbucket *n;
+	size_t memsize = sizeof(*h) + sizeof(*t);
+
 #ifdef IP_SET_HASH_WITH_NETS
-			 + sizeof(struct net_prefixes) * nets_length
+	memsize += sizeof(struct net_prefixes) * nets_length;
 #endif
-			 + jhash_size(t->htable_bits) * sizeof(struct hbucket);
-
-	for (i = 0; i < jhash_size(t->htable_bits); i++)
-		memsize += t->bucket[i].size * dsize;
+	for (i = 0; i < jhash_size(t->htable_bits); i++) {
+		n = rcu_dereference_bh(hbucket(t, i));
+		if (n == NULL)
+			continue;
+		memsize += sizeof(struct hbucket) + n->size * dsize;
+	}
 
 	return memsize;
 }
@@ -364,7 +346,8 @@ mtype_ext_cleanup(struct ip_set *set, struct hbucket *n)
 	int i;
 
 	for (i = 0; i < n->pos; i++)
-		ip_set_ext_destroy(set, ahash_data(n, i, set->dsize));
+		if (test_bit(i, n->used))
+			ip_set_ext_destroy(set, ahash_data(n, i, set->dsize));
 }
 
 /* Flush a hash type of set: destroy all elements */
@@ -376,16 +359,16 @@ mtype_flush(struct ip_set *set)
 	struct hbucket *n;
 	u32 i;
 
-	t = rcu_dereference_bh_nfnl(h->table);
+	t = ipset_dereference_protected(h->table, set);
 	for (i = 0; i < jhash_size(t->htable_bits); i++) {
-		n = hbucket(t, i);
-		if (n->size) {
-			if (set->extensions & IPSET_EXT_DESTROY)
-				mtype_ext_cleanup(set, n);
-			n->size = n->pos = 0;
-			/* FIXME: use slab cache */
-			kfree(n->value);
-		}
+		n = __ipset_dereference_protected(hbucket(t, i), 1);
+		if (n == NULL)
+			continue;
+		if (set->extensions & IPSET_EXT_DESTROY)
+			mtype_ext_cleanup(set, n);
+		/* FIXME: use slab cache */
+		rcu_assign_pointer(hbucket(t, i), NULL);
+		kfree_rcu(n, rcu);
 	}
 #ifdef IP_SET_HASH_WITH_NETS
 	memset(h->nets, 0, sizeof(struct net_prefixes) * NLEN(set->family));
@@ -401,13 +384,13 @@ mtype_ahash_destroy(struct ip_set *set, struct htable *t, bool ext_destroy)
 	u32 i;
 
 	for (i = 0; i < jhash_size(t->htable_bits); i++) {
-		n = hbucket(t, i);
-		if (n->size) {
-			if (set->extensions & IPSET_EXT_DESTROY && ext_destroy)
-				mtype_ext_cleanup(set, n);
-			/* FIXME: use slab cache */
-			kfree(n->value);
-		}
+		n = __ipset_dereference_protected(hbucket(t, i), 1);
+		if (n == NULL)
+			continue;
+		if (set->extensions & IPSET_EXT_DESTROY && ext_destroy)
+			mtype_ext_cleanup(set, n);
+		/* FIXME: use slab cache */
+		kfree(n);
 	}
 
 	ip_set_free(t);
@@ -422,10 +405,8 @@ mtype_destroy(struct ip_set *set)
 	if (set->extensions & IPSET_EXT_TIMEOUT)
 		del_timer_sync(&h->gc);
 
-	mtype_ahash_destroy(set, rcu_dereference_bh_nfnl(h->table), true);
-#ifdef IP_SET_HASH_WITH_RBTREE
-	rbtree_destroy(&h->rbtree);
-#endif
+	mtype_ahash_destroy(set,
+		__ipset_dereference_protected(h->table, 1), true);
 	kfree(h);
 
 	set->data = NULL;
@@ -470,17 +451,21 @@ mtype_expire(struct ip_set *set, struct htype *h, u8 nets_length, size_t dsize)
 	struct htable *t;
 	struct hbucket *n;
 	struct mtype_elem *data;
-	u32 i;
-	int j;
+	u32 i, j, d;
 #ifdef IP_SET_HASH_WITH_NETS
 	u8 k;
 #endif
 
-	rcu_read_lock_bh();
-	t = rcu_dereference_bh(h->table);
+	t = ipset_dereference_protected(h->table, set);
 	for (i = 0; i < jhash_size(t->htable_bits); i++) {
-		n = hbucket(t, i);
-		for (j = 0; j < n->pos; j++) {
+		n = __ipset_dereference_protected(hbucket(t, i), 1);
+		if (n == NULL)
+			continue;
+		for (j = 0, d = 0; j < n->pos; j++) {
+			if (!test_bit(j, n->used)) {
+				d++;
+				continue;
+			}
 			data = ahash_data(n, j, dsize);
 			if (ip_set_timeout_expired(ext_timeout(data, set))) {
 				pr_debug("expired %u/%u\n", i, j);
@@ -490,29 +475,32 @@ mtype_expire(struct ip_set *set, struct htype *h, u8 nets_length, size_t dsize)
 						       nets_length, k);
 #endif
 				ip_set_ext_destroy(set, data);
-				if (j != n->pos - 1)
-					/* Not last one */
-					memcpy(data,
-					       ahash_data(n, n->pos - 1, dsize),
-					       dsize);
-				n->pos--;
+				clear_bit(j, n->used);
 				h->elements--;
+				d++;
 			}
 		}
-		if (n->pos + AHASH_INIT_SIZE < n->size) {
-			void *tmp = kzalloc((n->size - AHASH_INIT_SIZE)
-					    * dsize,
-					    GFP_ATOMIC);
+		if (d >= AHASH_INIT_SIZE) {
+			struct hbucket *tmp = kzalloc(sizeof(struct hbucket) +
+					(n->size - AHASH_INIT_SIZE) * dsize,
+					GFP_ATOMIC);
 			if (!tmp)
 				/* Still try to delete expired elements */
 				continue;
-			n->size -= AHASH_INIT_SIZE;
-			memcpy(tmp, n->value, n->size * dsize);
-			kfree(n->value);
-			n->value = tmp;
+			tmp->size = n->size - AHASH_INIT_SIZE;
+			for (j = 0, d = 0; j < n->pos; j++) {
+				if (!test_bit(j, n->used))
+					continue;
+				data = ahash_data(n, j, dsize);
+				memcpy(tmp->value + d * dsize, data, dsize);
+				set_bit(d, tmp->used);
+				d++;
+			}
+			tmp->pos = d;
+			rcu_assign_pointer(hbucket(t, i), tmp);
+			kfree_rcu(n, rcu);
 		}
 	}
-	rcu_read_unlock_bh();
 }
 
 static void
@@ -522,9 +510,9 @@ mtype_gc(unsigned long ul_set)
 	struct htype *h = set->data;
 
 	pr_debug("called\n");
-	write_lock_bh(&set->lock);
+	spin_lock_bh(&set->lock);
 	mtype_expire(set, h, NLEN(set->family), set->dsize);
-	write_unlock_bh(&set->lock);
+	spin_unlock_bh(&set->lock);
 
 	h->gc.expires = jiffies + IPSET_GC_PERIOD(set->timeout) * HZ;
 	add_timer(&h->gc);
@@ -537,75 +525,115 @@ static int
 mtype_resize(struct ip_set *set, bool retried)
 {
 	struct htype *h = set->data;
-	struct htable *t, *orig = rcu_dereference_bh_nfnl(h->table);
-	u8 htable_bits = orig->htable_bits;
+	struct htable *t, *orig;
+	u8 htable_bits;
+	size_t dsize = set->dsize;
 #ifdef IP_SET_HASH_WITH_NETS
 	u8 flags;
+	struct mtype_elem *tmp;
 #endif
 	struct mtype_elem *data;
 	struct mtype_elem *d;
 	struct hbucket *n, *m;
-	u32 i, j;
+	u32 i, j, key;
 	int ret;
 
-	/* Try to cleanup once */
-	if (SET_WITH_TIMEOUT(set) && !retried) {
-		i = h->elements;
-		write_lock_bh(&set->lock);
-		mtype_expire(set, set->data, NLEN(set->family), set->dsize);
-		write_unlock_bh(&set->lock);
-		if (h->elements < i)
-			return 0;
-	}
+#ifdef IP_SET_HASH_WITH_NETS
+	tmp = kmalloc(dsize, GFP_KERNEL);
+	if (!tmp)
+		return -ENOMEM;
+#endif
+	rcu_read_lock_bh();
+	orig = rcu_dereference_bh_nfnl(h->table);
+	htable_bits = orig->htable_bits;
+	rcu_read_unlock_bh();
 
 retry:
 	ret = 0;
 	htable_bits++;
-	pr_debug("attempt to resize set %s from %u to %u, t %p\n",
-		 set->name, orig->htable_bits, htable_bits, orig);
 	if (!htable_bits) {
 		/* In case we have plenty of memory :-) */
 		pr_warn("Cannot increase the hashsize of set %s further\n",
 			set->name);
-		return -IPSET_ERR_HASH_FULL;
+		ret = -IPSET_ERR_HASH_FULL;
+		goto out;
+	}
+	t = ip_set_alloc(htable_size(htable_bits));
+	if (!t) {
+		ret = -ENOMEM;
+		goto out;
 	}
-	t = ip_set_alloc(sizeof(*t)
-			 + jhash_size(htable_bits) * sizeof(struct hbucket));
-	if (!t)
-		return -ENOMEM;
 	t->htable_bits = htable_bits;
 
-	read_lock_bh(&set->lock);
+	spin_lock_bh(&set->lock);
+	orig = __ipset_dereference_protected(h->table, 1);
+	pr_debug("attempt to resize set %s from %u to %u, t %p\n",
+		 set->name, orig->htable_bits, htable_bits, orig);
 	for (i = 0; i < jhash_size(orig->htable_bits); i++) {
-		n = hbucket(orig, i);
+		n = __ipset_dereference_protected(hbucket(orig, i), 1);
+		if (n == NULL)
+			continue;
 		for (j = 0; j < n->pos; j++) {
-			data = ahash_data(n, j, set->dsize);
+			if (!test_bit(j, n->used))
+				continue;
+			data = ahash_data(n, j, dsize);
 #ifdef IP_SET_HASH_WITH_NETS
+			/* We have readers running parallel with us,
+			 * so the live data cannot be modified.
+			 */
 			flags = 0;
+			memcpy(tmp, data, dsize);
+			data = tmp;
 			mtype_data_reset_flags(data, &flags);
 #endif
-			m = hbucket(t, HKEY(data, h->initval, htable_bits));
-			ret = hbucket_elem_add(m, AHASH_MAX(h), set->dsize);
-			if (ret < 0) {
-#ifdef IP_SET_HASH_WITH_NETS
-				mtype_data_reset_flags(data, &flags);
-#endif
-				read_unlock_bh(&set->lock);
-				mtype_ahash_destroy(set, t, false);
-				if (ret == -EAGAIN)
-					goto retry;
-				return ret;
+			key = HKEY(data, h->initval, htable_bits);
+			m = __ipset_dereference_protected(hbucket(t, key), 1);
+			if (!m) {
+				m = kzalloc(sizeof(struct hbucket) +
+					    AHASH_INIT_SIZE * dsize,
+					    GFP_ATOMIC);
+				if (!m) {
+					ret = -ENOMEM;
+					spin_unlock_bh(&set->lock);
+					mtype_ahash_destroy(set, t, false);
+					goto out;
+				}
+				m->size = AHASH_INIT_SIZE;
+				RCU_INIT_POINTER(hbucket(t, key), m);
+			} else if (m->pos >= m->size) {
+				struct hbucket *ht;
+
+				if (m->size >= AHASH_MAX(h))
+					ret = -EAGAIN;
+				else {
+					ht = kzalloc(sizeof(struct hbucket) +
+						(m->size + AHASH_INIT_SIZE)
+						* dsize,
+						GFP_ATOMIC);
+					if (!ht)
+						ret = -ENOMEM;
+				}
+				if (ret < 0) {
+					spin_unlock_bh(&set->lock);
+					mtype_ahash_destroy(set, t, false);
+					if (ret == -EAGAIN)
+						goto retry;
+					goto out;
+				}
+				memcpy(ht, m, sizeof(struct hbucket) +
+					      m->size * dsize);
+				ht->size = m->size + AHASH_INIT_SIZE;
+				kfree(m);
+				m = ht;
+				RCU_INIT_POINTER(hbucket(t, key), ht);
 			}
-			d = ahash_data(m, m->pos++, set->dsize);
-			memcpy(d, data, set->dsize);
+			d = ahash_data(m, m->pos, dsize);
+			memcpy(d, data, dsize);
+			set_bit(m->pos++, m->used);
 #ifdef IP_SET_HASH_WITH_NETS
 			mtype_data_reset_flags(d, &flags);
 #endif
 		}
 	}
-
 	rcu_assign_pointer(h->table, t);
-	read_unlock_bh(&set->lock);
+
+	spin_unlock_bh(&set->lock);
 
 	/* Give time to other readers of the set */
 	synchronize_rcu_bh();
@@ -614,7 +642,12 @@ retry:
 		 orig->htable_bits, orig, t->htable_bits, t);
 	mtype_ahash_destroy(set, orig, false);
 
-	return 0;
+out:
+#ifdef IP_SET_HASH_WITH_NETS
+	kfree(tmp);
+#endif
+	return ret;
+
 }
 
 /* Add an element to a hash and update the internal counters when succeeded,
@@ -627,17 +660,49 @@ mtype_add(struct ip_set *set, void *value, const struct ip_set_ext *ext,
 	struct htable *t;
 	const struct mtype_elem *d = value;
 	struct mtype_elem *data;
-	struct hbucket *n;
-	int i, ret = 0;
-	int j = AHASH_MAX(h) + 1;
+	struct hbucket *n, *old = ERR_PTR(-ENOENT);
+	int i, j = -1;
 	bool flag_exist = flags & IPSET_FLAG_EXIST;
+	bool deleted = false, forceadd = false, reuse = false;
 	u32 key, multi = 0;
 
-	rcu_read_lock_bh();
-	t = rcu_dereference_bh(h->table);
+	if (h->elements >= h->maxelem) {
+		if (SET_WITH_TIMEOUT(set))
+			/* FIXME: when set is full, we slow down here */
+			mtype_expire(set, h, NLEN(set->family), set->dsize);
+		if (h->elements >= h->maxelem && SET_WITH_FORCEADD(set))
+			forceadd = true;
+	}
+
+	t = ipset_dereference_protected(h->table, set);
 	key = HKEY(value, h->initval, t->htable_bits);
-	n = hbucket(t, key);
+	n = __ipset_dereference_protected(hbucket(t, key), 1);
+	if (n == NULL) {
+		if (forceadd) {
+			if (net_ratelimit())
+				pr_warn("Set %s is full, maxelem %u reached\n",
+					set->name, h->maxelem);
+			return -IPSET_ERR_HASH_FULL;
+		} else if (h->elements >= h->maxelem)
+			goto set_full;
+		old = NULL;
+		n = kzalloc(sizeof(struct hbucket) +
+			    AHASH_INIT_SIZE * set->dsize,
+			    GFP_ATOMIC);
+		if (n == NULL)
+			return -ENOMEM;
+		n->size = AHASH_INIT_SIZE;
+		goto copy_elem;
+	}
 	for (i = 0; i < n->pos; i++) {
+		if (!test_bit(i, n->used)) {
+			/* Reuse first deleted entry */
+			if (j == -1) {
+				deleted = reuse = true;
+				j = i;
+			}
+			continue;
+		}
 		data = ahash_data(n, i, set->dsize);
 		if (mtype_data_equal(data, d, &multi)) {
 			if (flag_exist ||
@@ -645,85 +710,92 @@ mtype_add(struct ip_set *set, void *value, const struct ip_set_ext *ext,
 			     ip_set_timeout_expired(ext_timeout(data, set)))) {
 				/* Just the extensions could be overwritten */
 				j = i;
-				goto reuse_slot;
-			} else {
-				ret = -IPSET_ERR_EXIST;
-				goto out;
-			}
+				goto overwrite_extensions;
+			} else
+				return -IPSET_ERR_EXIST;
 		}
 		/* Reuse first timed out entry */
 		if (SET_WITH_TIMEOUT(set) &&
 		    ip_set_timeout_expired(ext_timeout(data, set)) &&
-		    j != AHASH_MAX(h) + 1)
+		    j == -1) {
 			j = i;
+			reuse = true;
+		}
 	}
-	if (h->elements >= h->maxelem && SET_WITH_FORCEADD(set) && n->pos) {
-		/* Choosing the first entry in the array to replace */
-		j = 0;
-		goto reuse_slot;
-	}
-	if (SET_WITH_TIMEOUT(set) && h->elements >= h->maxelem)
-		/* FIXME: when set is full, we slow down here */
-		mtype_expire(set, h, NLEN(set->family), set->dsize);
-
-	if (h->elements >= h->maxelem) {
-		if (net_ratelimit())
-			pr_warn("Set %s is full, maxelem %u reached\n",
-				set->name, h->maxelem);
-		ret = -IPSET_ERR_HASH_FULL;
-		goto out;
-	}
-
-reuse_slot:
-	if (j != AHASH_MAX(h) + 1) {
-		/* Fill out reused slot */
+	if (reuse || forceadd) {
 		data = ahash_data(n, j, set->dsize);
+		if (!deleted) {
 #ifdef IP_SET_HASH_WITH_NETS
-		for (i = 0; i < IPSET_NET_COUNT; i++) {
-			mtype_del_cidr(h, SCIDR(data->cidr, i),
-				       NLEN(set->family), i);
-			mtype_add_cidr(h, SCIDR(d->cidr, i),
-				       NLEN(set->family), i);
-		}
+			for (i = 0; i < IPSET_NET_COUNT; i++)
+				mtype_del_cidr(h, SCIDR(data->cidr, i),
+					       NLEN(set->family), i);
 #endif
-		ip_set_ext_destroy(set, data);
-	} else {
-		/* Use/create a new slot */
+			ip_set_ext_destroy(set, data);
+			h->elements--;
+		}
+		goto copy_data;
+	}
+	if (h->elements >= h->maxelem)
+		goto set_full;
+	/* Create a new slot */
+	if (n->pos >= n->size) {
 		TUNE_AHASH_MAX(h, multi);
-		ret = hbucket_elem_add(n, AHASH_MAX(h), set->dsize);
-		if (ret != 0) {
-			if (ret == -EAGAIN)
-				mtype_data_next(&h->next, d);
-			goto out;
+		if (n->size >= AHASH_MAX(h)) {
+			/* Trigger rehashing */
+			mtype_data_next(&h->next, d);
+			return -EAGAIN;
 		}
-		data = ahash_data(n, n->pos++, set->dsize);
+		old = n;
+		n = kzalloc(sizeof(struct hbucket) +
+			    (old->size + AHASH_INIT_SIZE) * set->dsize,
+			    GFP_ATOMIC);
+		if (!n)
+			return -ENOMEM;
+		memcpy(n, old, sizeof(struct hbucket) +
+		       old->size * set->dsize);
+		n->size = old->size + AHASH_INIT_SIZE;
+	}
+
+copy_elem:
+	j = n->pos++;
+	data = ahash_data(n, j, set->dsize);
+copy_data:
+	h->elements++;
 #ifdef IP_SET_HASH_WITH_NETS
-		for (i = 0; i < IPSET_NET_COUNT; i++)
-			mtype_add_cidr(h, SCIDR(d->cidr, i), NLEN(set->family),
-				       i);
+	for (i = 0; i < IPSET_NET_COUNT; i++)
+		mtype_add_cidr(h, SCIDR(d->cidr, i),
+			       NLEN(set->family), i);
 #endif
-		h->elements++;
-	}
 	memcpy(data, d, sizeof(struct mtype_elem));
+overwrite_extensions:
 #ifdef IP_SET_HASH_WITH_NETS
 	mtype_data_set_flags(data, flags);
 #endif
-	if (SET_WITH_TIMEOUT(set))
-		ip_set_timeout_set(ext_timeout(data, set), ext->timeout);
 	if (SET_WITH_COUNTER(set))
 		ip_set_init_counter(ext_counter(data, set), ext);
 	if (SET_WITH_COMMENT(set))
 		ip_set_init_comment(ext_comment(data, set), ext);
 	if (SET_WITH_SKBINFO(set))
 		ip_set_init_skbinfo(ext_skbinfo(data, set), ext);
+	/* Must come last */
+	if (SET_WITH_TIMEOUT(set))
+		ip_set_timeout_set(ext_timeout(data, set), ext->timeout);
+	set_bit(j, n->used);
+	if (old != ERR_PTR(-ENOENT)) {
+		rcu_assign_pointer(hbucket(t, key), n);
+		if (old)
+			kfree_rcu(old, rcu);
+	}
 
-out:
-	rcu_read_unlock_bh();
-	return ret;
+	return 0;
+set_full:
+	if (net_ratelimit())
+		pr_warn("Set %s is full, maxelem %u reached\n",
+			set->name, h->maxelem);
+	return -IPSET_ERR_HASH_FULL;
 }
 
-/* Delete an element from the hash: swap it with the last element
- * and free up space if possible.
+/* Delete an element from the hash and free up space if possible.
  */
 static int
 mtype_del(struct ip_set *set, void *value, const struct ip_set_ext *ext,
@@ -734,29 +806,31 @@ mtype_del(struct ip_set *set, void *value, const struct ip_set_ext *ext,
 	const struct mtype_elem *d = value;
 	struct mtype_elem *data;
 	struct hbucket *n;
-	int i, ret = -IPSET_ERR_EXIST;
-#ifdef IP_SET_HASH_WITH_NETS
-	u8 j;
-#endif
+	int i, j, k, ret = -IPSET_ERR_EXIST;
 	u32 key, multi = 0;
+	size_t dsize = set->dsize;
 
-	rcu_read_lock_bh();
-	t = rcu_dereference_bh(h->table);
+	t = ipset_dereference_protected(h->table, set);
 	key = HKEY(value, h->initval, t->htable_bits);
-	n = hbucket(t, key);
-	for (i = 0; i < n->pos; i++) {
-		data = ahash_data(n, i, set->dsize);
+	n = __ipset_dereference_protected(hbucket(t, key), 1);
+	if (!n)
+		goto out;
+	for (i = 0, k = 0; i < n->pos; i++) {
+		if (!test_bit(i, n->used)) {
+			k++;
+			continue;
+		}
+		data = ahash_data(n, i, dsize);
 		if (!mtype_data_equal(data, d, &multi))
 			continue;
 		if (SET_WITH_TIMEOUT(set) &&
 		    ip_set_timeout_expired(ext_timeout(data, set)))
 			goto out;
-		if (i != n->pos - 1)
-			/* Not last one */
-			memcpy(data, ahash_data(n, n->pos - 1, set->dsize),
-			       set->dsize);
 
-		n->pos--;
+		ret = 0;
+		clear_bit(i, n->used);
+		if (i + 1 == n->pos)
+			n->pos--;
 		h->elements--;
 #ifdef IP_SET_HASH_WITH_NETS
 		for (j = 0; j < IPSET_NET_COUNT; j++)
@@ -764,25 +838,37 @@ mtype_del(struct ip_set *set, void *value, const struct ip_set_ext *ext,
 				       j);
 #endif
 		ip_set_ext_destroy(set, data);
-		if (n->pos + AHASH_INIT_SIZE < n->size) {
-			void *tmp = kzalloc((n->size - AHASH_INIT_SIZE)
-					    * set->dsize,
-					    GFP_ATOMIC);
-			if (!tmp) {
-				ret = 0;
+
+		for (; i < n->pos; i++) {
+			if (!test_bit(i, n->used))
+				k++;
+		}
+		if (n->pos == 0 && k == 0) {
+			rcu_assign_pointer(hbucket(t, key), NULL);
+			kfree_rcu(n, rcu);
+		} else if (k >= AHASH_INIT_SIZE) {
+			struct hbucket *tmp = kzalloc(sizeof(struct hbucket) +
+					(n->size - AHASH_INIT_SIZE) * dsize,
+					GFP_ATOMIC);
+			if (!tmp)
 				goto out;
+			tmp->size = n->size - AHASH_INIT_SIZE;
+			for (j = 0, k = 0; j < n->pos; j++) {
+				if (!test_bit(j, n->used))
+					continue;
+				data = ahash_data(n, j, dsize);
+				memcpy(tmp->value + k * dsize, data, dsize);
+				set_bit(k, tmp->used);
+				k++;
 			}
-			n->size -= AHASH_INIT_SIZE;
-			memcpy(tmp, n->value, n->size * set->dsize);
-			kfree(n->value);
-			n->value = tmp;
+			tmp->pos = k;
+			rcu_assign_pointer(hbucket(t, key), tmp);
+			kfree_rcu(n, rcu);
 		}
-		ret = 0;
 		goto out;
 	}
 
 out:
-	rcu_read_unlock_bh();
 	return ret;
 }
 
@@ -832,8 +918,12 @@ mtype_test_cidrs(struct ip_set *set, struct mtype_elem *d,
 		mtype_data_netmask(d, NCIDR(h->nets[j].cidr[0]));
 #endif
 		key = HKEY(d, h->initval, t->htable_bits);
-		n = hbucket(t, key);
+		n = rcu_dereference_bh(hbucket(t, key));
+		if (n == NULL)
+			continue;
 		for (i = 0; i < n->pos; i++) {
+			if (!test_bit(i, n->used))
+				continue;
 			data = ahash_data(n, i, set->dsize);
 			if (!mtype_data_equal(data, d, &multi))
 				continue;
@@ -871,7 +961,6 @@ mtype_test(struct ip_set *set, void *value, const struct ip_set_ext *ext,
 	int i, ret = 0;
 	u32 key, multi = 0;
 
-	rcu_read_lock_bh();
 	t = rcu_dereference_bh(h->table);
 #ifdef IP_SET_HASH_WITH_NETS
 	/* If we test an IP address and not a network address,
@@ -886,8 +975,14 @@ mtype_test(struct ip_set *set, void *value, const struct ip_set_ext *ext,
 #endif
 
 	key = HKEY(d, h->initval, t->htable_bits);
-	n = hbucket(t, key);
+	n = rcu_dereference_bh(hbucket(t, key));
+	if (n == NULL) {
+		ret = 0;
+		goto out;
+	}
 	for (i = 0; i < n->pos; i++) {
+		if (!test_bit(i, n->used))
+			continue;
 		data = ahash_data(n, i, set->dsize);
 		if (mtype_data_equal(data, d, &multi) &&
 		    !(SET_WITH_TIMEOUT(set) &&
@@ -897,7 +992,6 @@ mtype_test(struct ip_set *set, void *value, const struct ip_set_ext *ext,
 		}
 	}
 out:
-	rcu_read_unlock_bh();
 	return ret;
 }
 
@@ -909,15 +1003,19 @@ mtype_head(struct ip_set *set, struct sk_buff *skb)
 	const struct htable *t;
 	struct nlattr *nested;
 	size_t memsize;
+	u8 htable_bits;
 
+	rcu_read_lock_bh();
 	t = rcu_dereference_bh_nfnl(h->table);
 	memsize = mtype_ahash_memsize(h, t, NLEN(set->family), set->dsize);
+	htable_bits = t->htable_bits;
+	rcu_read_unlock_bh();
 
 	nested = ipset_nest_start(skb, IPSET_ATTR_DATA);
 	if (!nested)
 		goto nla_put_failure;
 	if (nla_put_net32(skb, IPSET_ATTR_HASHSIZE,
-			  htonl(jhash_size(t->htable_bits))) ||
+			  htonl(jhash_size(htable_bits))) ||
 	    nla_put_net32(skb, IPSET_ATTR_MAXELEM, htonl(h->maxelem)))
 		goto nla_put_failure;
 #ifdef IP_SET_HASH_WITH_NETMASK
@@ -947,26 +1045,32 @@ mtype_list(const struct ip_set *set,
 	   struct sk_buff *skb, struct netlink_callback *cb)
 {
 	const struct htype *h = set->data;
-	const struct htable *t = rcu_dereference_bh_nfnl(h->table);
+	const struct htable *t;
 	struct nlattr *atd, *nested;
 	const struct hbucket *n;
 	const struct mtype_elem *e;
 	u32 first = cb->args[IPSET_CB_ARG0];
 	/* We assume that one hash bucket fills into one page */
 	void *incomplete;
-	int i;
+	int i, ret = 0;
 
 	atd = ipset_nest_start(skb, IPSET_ATTR_ADT);
 	if (!atd)
 		return -EMSGSIZE;
+
 	pr_debug("list hash set %s\n", set->name);
+	t = rcu_dereference_bh_nfnl(h->table);
 	for (; cb->args[IPSET_CB_ARG0] < jhash_size(t->htable_bits);
 	     cb->args[IPSET_CB_ARG0]++) {
 		incomplete = skb_tail_pointer(skb);
-		n = hbucket(t, cb->args[IPSET_CB_ARG0]);
+		n = rcu_dereference(hbucket(t, cb->args[IPSET_CB_ARG0]));
 		pr_debug("cb->arg bucket: %lu, t %p n %p\n",
 			 cb->args[IPSET_CB_ARG0], t, n);
+		if (n == NULL)
+			continue;
 		for (i = 0; i < n->pos; i++) {
+			if (!test_bit(i, n->used))
+				continue;
 			e = ahash_data(n, i, set->dsize);
 			if (SET_WITH_TIMEOUT(set) &&
 			    ip_set_timeout_expired(ext_timeout(e, set)))
@@ -977,7 +1081,8 @@ mtype_list(const struct ip_set *set,
 			if (!nested) {
 				if (cb->args[IPSET_CB_ARG0] == first) {
 					nla_nest_cancel(skb, atd);
-					return -EMSGSIZE;
+					ret = -EMSGSIZE;
+					goto out;
 				} else
 					goto nla_put_failure;
 			}
@@ -992,7 +1097,7 @@ mtype_list(const struct ip_set *set,
 	/* Set listing finished */
 	cb->args[IPSET_CB_ARG0] = 0;
 
-	return 0;
+	goto out;
 
 nla_put_failure:
 	nlmsg_trim(skb, incomplete);
@@ -1000,10 +1105,11 @@ nla_put_failure:
 		pr_warn("Can't list set %s: one bucket does not fit into a message. Please report it!\n",
 			set->name);
 		cb->args[IPSET_CB_ARG0] = 0;
-		return -EMSGSIZE;
-	}
-	ipset_nest_end(skb, atd);
-	return 0;
+		ret = -EMSGSIZE;
+	} else
+		ipset_nest_end(skb, atd);
+out:
+	return ret;
 }
 
 static int
@@ -1064,12 +1170,14 @@ IPSET_TOKEN(HTYPE, _create)(struct net *net, struct ip_set *set,
 
 	if (unlikely(!ip_set_optattr_netorder(tb, IPSET_ATTR_HASHSIZE) ||
 		     !ip_set_optattr_netorder(tb, IPSET_ATTR_MAXELEM) ||
-#ifdef IP_SET_HASH_WITH_MARKMASK
-		     !ip_set_optattr_netorder(tb, IPSET_ATTR_MARKMASK) ||
-#endif
 		     !ip_set_optattr_netorder(tb, IPSET_ATTR_TIMEOUT) ||
 		     !ip_set_optattr_netorder(tb, IPSET_ATTR_CADT_FLAGS)))
 		return -IPSET_ERR_PROTOCOL;
+#ifdef IP_SET_HASH_WITH_MARKMASK
+	/* Separated condition in order to avoid directive in argument list */
+	if (unlikely(!ip_set_optattr_netorder(tb, IPSET_ATTR_MARKMASK)))
+		return -IPSET_ERR_PROTOCOL;
+#endif
 
 	if (tb[IPSET_ATTR_HASHSIZE]) {
 		hashsize = ip_set_get_h32(tb[IPSET_ATTR_HASHSIZE]);
@@ -1092,7 +1200,7 @@ IPSET_TOKEN(HTYPE, _create)(struct net *net, struct ip_set *set,
 #endif
 #ifdef IP_SET_HASH_WITH_MARKMASK
 	if (tb[IPSET_ATTR_MARKMASK]) {
-		markmask = ntohl(nla_get_u32(tb[IPSET_ATTR_MARKMASK]));
+		markmask = ntohl(nla_get_be32(tb[IPSET_ATTR_MARKMASK]));
 
 		if (markmask == 0)
 			return -IPSET_ERR_INVALID_MARKMASK;
-- 
1.8.5.1



* [PATCH 12/14] netfilter: ipset: styles warned by checkpatch.pl fixed
  2014-11-30 18:56 [PATCH 00/10] ipset patches for nf-next, v2 Jozsef Kadlecsik
                   ` (10 preceding siblings ...)
  2014-11-30 18:57 ` [PATCH 11/14] netfilter: ipset: Introduce RCU locking in the hash types Jozsef Kadlecsik
@ 2014-11-30 18:57 ` Jozsef Kadlecsik
  2014-12-02 18:43   ` Pablo Neira Ayuso
  2014-11-30 18:57 ` [PATCH 13/14] netfilter: ipset: Fix parallel resizing and listing of the same set Jozsef Kadlecsik
  2014-11-30 18:57 ` [PATCH 14/14] netfilter: ipset: Fix sparse warning Jozsef Kadlecsik
  13 siblings, 1 reply; 32+ messages in thread
From: Jozsef Kadlecsik @ 2014-11-30 18:57 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Pablo Neira Ayuso

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 include/linux/netfilter/ipset/ip_set.h       |  6 +++---
 net/netfilter/ipset/ip_set_core.c            | 11 ++++++++---
 net/netfilter/ipset/ip_set_hash_ipportnet.c  |  2 ++
 net/netfilter/ipset/ip_set_hash_net.c        |  4 +++-
 net/netfilter/ipset/ip_set_hash_netiface.c   |  2 ++
 net/netfilter/ipset/ip_set_hash_netnet.c     |  2 ++
 net/netfilter/ipset/ip_set_hash_netport.c    |  2 ++
 net/netfilter/ipset/ip_set_hash_netportnet.c |  2 ++
 net/netfilter/xt_set.c                       |  7 +++++++
 net/sched/em_ipset.c                         |  5 ++++-
 10 files changed, 35 insertions(+), 8 deletions(-)

diff --git a/include/linux/netfilter/ipset/ip_set.h b/include/linux/netfilter/ipset/ip_set.h
index d5d5bcd..f7fcb6e 100644
--- a/include/linux/netfilter/ipset/ip_set.h
+++ b/include/linux/netfilter/ipset/ip_set.h
@@ -345,10 +345,10 @@ ip_set_put_skbinfo(struct sk_buff *skb, struct ip_set_skbinfo *skbinfo)
 			      cpu_to_be64((u64)skbinfo->skbmark << 32 |
 					  skbinfo->skbmarkmask))) ||
 	       (skbinfo->skbprio &&
-	        nla_put_net32(skb, IPSET_ATTR_SKBPRIO,
+		nla_put_net32(skb, IPSET_ATTR_SKBPRIO,
 			      cpu_to_be32(skbinfo->skbprio))) ||
 	       (skbinfo->skbqueue &&
-	        nla_put_net16(skb, IPSET_ATTR_SKBQUEUE,
+		nla_put_net16(skb, IPSET_ATTR_SKBQUEUE,
 			     cpu_to_be16(skbinfo->skbqueue)));
 
 }
@@ -547,7 +547,7 @@ ip_set_put_extensions(struct sk_buff *skb, const struct ip_set *set,
 
 		if (nla_put_net32(skb, IPSET_ATTR_TIMEOUT,
 			htonl(active ? ip_set_timeout_get(timeout)
-				: *timeout)))
+			      : *timeout)))
 			return -EMSGSIZE;
 	}
 	if (SET_WITH_COUNTER(set) &&
diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
index 9fb2610..957ec91 100644
--- a/net/netfilter/ipset/ip_set_core.c
+++ b/net/netfilter/ipset/ip_set_core.c
@@ -390,6 +390,7 @@ ip_set_get_extensions(struct ip_set *set, struct nlattr *tb[],
 		      struct ip_set_ext *ext)
 {
 	u64 fullmark;
+
 	if (tb[IPSET_ATTR_TIMEOUT]) {
 		if (!(set->extensions & IPSET_EXT_TIMEOUT))
 			return -IPSET_ERR_TIMEOUT;
@@ -903,7 +904,7 @@ ip_set_create(struct sock *ctnl, struct sk_buff *skb,
 			/* Wraparound */
 			goto cleanup;
 
-		list = kzalloc(sizeof(struct ip_set *) * i, GFP_KERNEL);
+		list = kcalloc(i, sizeof(struct ip_set *), GFP_KERNEL);
 		if (!list)
 			goto cleanup;
 		/* nfnl mutex is held, both lists are valid */
@@ -1179,6 +1180,7 @@ static int
 ip_set_dump_done(struct netlink_callback *cb)
 {
 	struct ip_set_net *inst = (struct ip_set_net *)cb->args[IPSET_CB_NET];
+
 	if (cb->args[IPSET_CB_ARG0]) {
 		pr_debug("release set %s\n",
 			 ip_set(inst, cb->args[IPSET_CB_INDEX])->name);
@@ -1216,7 +1218,7 @@ dump_init(struct netlink_callback *cb, struct ip_set_net *inst)
 
 	/* cb->args[IPSET_CB_NET]:	net namespace
 	 *         [IPSET_CB_DUMP]:	dump single set/all sets
-	 *         [IPSET_CB_INDEX]: 	set index
+	 *         [IPSET_CB_INDEX]:	set index
 	 *         [IPSET_CB_ARG0]:	type specific
 	 */
 
@@ -1235,6 +1237,7 @@ dump_init(struct netlink_callback *cb, struct ip_set_net *inst)
 
 	if (cda[IPSET_ATTR_FLAGS]) {
 		u32 f = ip_set_get_h32(cda[IPSET_ATTR_FLAGS]);
+
 		dump_type |= (f << 16);
 	}
 	cb->args[IPSET_CB_NET] = (unsigned long)inst;
@@ -1864,6 +1867,7 @@ ip_set_sockfn_get(struct sock *sk, int optval, void __user *user, int *len)
 	if (*op < IP_SET_OP_VERSION) {
 		/* Check the version at the beginning of operations */
 		struct ip_set_req_version *req_version = data;
+
 		if (req_version->version != IPSET_PROTOCOL) {
 			ret = -EPROTO;
 			goto done;
@@ -1965,7 +1969,7 @@ ip_set_net_init(struct net *net)
 	if (inst->ip_set_max >= IPSET_INVALID_ID)
 		inst->ip_set_max = IPSET_INVALID_ID - 1;
 
-	list = kzalloc(sizeof(struct ip_set *) * inst->ip_set_max, GFP_KERNEL);
+	list = kcalloc(inst->ip_set_max, sizeof(struct ip_set *), GFP_KERNEL);
 	if (!list)
 		return -ENOMEM;
 	inst->is_deleted = 0;
@@ -2003,6 +2007,7 @@ static int __init
 ip_set_init(void)
 {
 	int ret = nfnetlink_subsys_register(&ip_set_netlink_subsys);
+
 	if (ret != 0) {
 		pr_err("ip_set: cannot register with nfnetlink.\n");
 		return ret;
diff --git a/net/netfilter/ipset/ip_set_hash_ipportnet.c b/net/netfilter/ipset/ip_set_hash_ipportnet.c
index b6012ad..3d07f93 100644
--- a/net/netfilter/ipset/ip_set_hash_ipportnet.c
+++ b/net/netfilter/ipset/ip_set_hash_ipportnet.c
@@ -224,6 +224,7 @@ hash_ipportnet4_uadt(struct ip_set *set, struct nlattr *tb[],
 
 	if (tb[IPSET_ATTR_CADT_FLAGS]) {
 		u32 cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
+
 		if (cadt_flags & IPSET_FLAG_NOMATCH)
 			flags |= (IPSET_FLAG_NOMATCH << 16);
 	}
@@ -485,6 +486,7 @@ hash_ipportnet6_uadt(struct ip_set *set, struct nlattr *tb[],
 
 	if (tb[IPSET_ATTR_CADT_FLAGS]) {
 		u32 cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
+
 		if (cadt_flags & IPSET_FLAG_NOMATCH)
 			flags |= (IPSET_FLAG_NOMATCH << 16);
 	}
diff --git a/net/netfilter/ipset/ip_set_hash_net.c b/net/netfilter/ipset/ip_set_hash_net.c
index 6b3ac10..cb95064 100644
--- a/net/netfilter/ipset/ip_set_hash_net.c
+++ b/net/netfilter/ipset/ip_set_hash_net.c
@@ -173,6 +173,7 @@ hash_net4_uadt(struct ip_set *set, struct nlattr *tb[],
 
 	if (tb[IPSET_ATTR_CADT_FLAGS]) {
 		u32 cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
+
 		if (cadt_flags & IPSET_FLAG_NOMATCH)
 			flags |= (IPSET_FLAG_NOMATCH << 16);
 	}
@@ -180,7 +181,7 @@ hash_net4_uadt(struct ip_set *set, struct nlattr *tb[],
 	if (adt == IPSET_TEST || !tb[IPSET_ATTR_IP_TO]) {
 		e.ip = htonl(ip & ip_set_hostmask(e.cidr));
 		ret = adtfn(set, &e, &ext, &ext, flags);
-		return ip_set_enomatch(ret, flags, adt, set) ? -ret:
+		return ip_set_enomatch(ret, flags, adt, set) ? -ret :
 		       ip_set_eexist(ret, flags) ? 0 : ret;
 	}
 
@@ -348,6 +349,7 @@ hash_net6_uadt(struct ip_set *set, struct nlattr *tb[],
 
 	if (tb[IPSET_ATTR_CADT_FLAGS]) {
 		u32 cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
+
 		if (cadt_flags & IPSET_FLAG_NOMATCH)
 			flags |= (IPSET_FLAG_NOMATCH << 16);
 	}
diff --git a/net/netfilter/ipset/ip_set_hash_netiface.c b/net/netfilter/ipset/ip_set_hash_netiface.c
index d8a53ec..d7bdb06 100644
--- a/net/netfilter/ipset/ip_set_hash_netiface.c
+++ b/net/netfilter/ipset/ip_set_hash_netiface.c
@@ -218,6 +218,7 @@ hash_netiface4_uadt(struct ip_set *set, struct nlattr *tb[],
 
 	if (tb[IPSET_ATTR_CADT_FLAGS]) {
 		u32 cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
+
 		if (cadt_flags & IPSET_FLAG_PHYSDEV)
 			e.physdev = 1;
 		if (cadt_flags & IPSET_FLAG_NOMATCH)
@@ -433,6 +434,7 @@ hash_netiface6_uadt(struct ip_set *set, struct nlattr *tb[],
 
 	if (tb[IPSET_ATTR_CADT_FLAGS]) {
 		u32 cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
+
 		if (cadt_flags & IPSET_FLAG_PHYSDEV)
 			e.physdev = 1;
 		if (cadt_flags & IPSET_FLAG_NOMATCH)
diff --git a/net/netfilter/ipset/ip_set_hash_netnet.c b/net/netfilter/ipset/ip_set_hash_netnet.c
index ea8772a..ea4d06b 100644
--- a/net/netfilter/ipset/ip_set_hash_netnet.c
+++ b/net/netfilter/ipset/ip_set_hash_netnet.c
@@ -204,6 +204,7 @@ hash_netnet4_uadt(struct ip_set *set, struct nlattr *tb[],
 
 	if (tb[IPSET_ATTR_CADT_FLAGS]) {
 		u32 cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
+
 		if (cadt_flags & IPSET_FLAG_NOMATCH)
 			flags |= (IPSET_FLAG_NOMATCH << 16);
 	}
@@ -432,6 +433,7 @@ hash_netnet6_uadt(struct ip_set *set, struct nlattr *tb[],
 
 	if (tb[IPSET_ATTR_CADT_FLAGS]) {
 		u32 cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
+
 		if (cadt_flags & IPSET_FLAG_NOMATCH)
 			flags |= (IPSET_FLAG_NOMATCH << 16);
 	}
diff --git a/net/netfilter/ipset/ip_set_hash_netport.c b/net/netfilter/ipset/ip_set_hash_netport.c
index c0ddb58..bc7a1c2 100644
--- a/net/netfilter/ipset/ip_set_hash_netport.c
+++ b/net/netfilter/ipset/ip_set_hash_netport.c
@@ -215,6 +215,7 @@ hash_netport4_uadt(struct ip_set *set, struct nlattr *tb[],
 
 	if (tb[IPSET_ATTR_CADT_FLAGS]) {
 		u32 cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
+
 		if (cadt_flags & IPSET_FLAG_NOMATCH)
 			flags |= (IPSET_FLAG_NOMATCH << 16);
 	}
@@ -436,6 +437,7 @@ hash_netport6_uadt(struct ip_set *set, struct nlattr *tb[],
 
 	if (tb[IPSET_ATTR_CADT_FLAGS]) {
 		u32 cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
+
 		if (cadt_flags & IPSET_FLAG_NOMATCH)
 			flags |= (IPSET_FLAG_NOMATCH << 16);
 	}
diff --git a/net/netfilter/ipset/ip_set_hash_netportnet.c b/net/netfilter/ipset/ip_set_hash_netportnet.c
index bfaa94c..f98d3f6 100644
--- a/net/netfilter/ipset/ip_set_hash_netportnet.c
+++ b/net/netfilter/ipset/ip_set_hash_netportnet.c
@@ -239,6 +239,7 @@ hash_netportnet4_uadt(struct ip_set *set, struct nlattr *tb[],
 
 	if (tb[IPSET_ATTR_CADT_FLAGS]) {
 		u32 cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
+
 		if (cadt_flags & IPSET_FLAG_NOMATCH)
 			flags |= (IPSET_FLAG_NOMATCH << 16);
 	}
@@ -515,6 +516,7 @@ hash_netportnet6_uadt(struct ip_set *set, struct nlattr *tb[],
 
 	if (tb[IPSET_ATTR_CADT_FLAGS]) {
 		u32 cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
+
 		if (cadt_flags & IPSET_FLAG_NOMATCH)
 			flags |= (IPSET_FLAG_NOMATCH << 16);
 	}
diff --git a/net/netfilter/xt_set.c b/net/netfilter/xt_set.c
index 0d47afe..bec8a51 100644
--- a/net/netfilter/xt_set.c
+++ b/net/netfilter/xt_set.c
@@ -52,6 +52,7 @@ static bool
 set_match_v0(const struct sk_buff *skb, struct xt_action_param *par)
 {
 	const struct xt_set_info_match_v0 *info = par->matchinfo;
+
 	ADT_OPT(opt, par->family, info->match_set.u.compat.dim,
 		info->match_set.u.compat.flags, 0, UINT_MAX);
 
@@ -114,6 +115,7 @@ static bool
 set_match_v1(const struct sk_buff *skb, struct xt_action_param *par)
 {
 	const struct xt_set_info_match_v1 *info = par->matchinfo;
+
 	ADT_OPT(opt, par->family, info->match_set.dim,
 		info->match_set.flags, 0, UINT_MAX);
 
@@ -178,6 +180,7 @@ static bool
 set_match_v3(const struct sk_buff *skb, struct xt_action_param *par)
 {
 	const struct xt_set_info_match_v3 *info = par->matchinfo;
+
 	ADT_OPT(opt, par->family, info->match_set.dim,
 		info->match_set.flags, info->flags, UINT_MAX);
 	int ret;
@@ -252,6 +255,7 @@ static unsigned int
 set_target_v0(struct sk_buff *skb, const struct xt_action_param *par)
 {
 	const struct xt_set_info_target_v0 *info = par->targinfo;
+
 	ADT_OPT(add_opt, par->family, info->add_set.u.compat.dim,
 		info->add_set.u.compat.flags, 0, UINT_MAX);
 	ADT_OPT(del_opt, par->family, info->del_set.u.compat.dim,
@@ -324,6 +328,7 @@ static unsigned int
 set_target_v1(struct sk_buff *skb, const struct xt_action_param *par)
 {
 	const struct xt_set_info_target_v1 *info = par->targinfo;
+
 	ADT_OPT(add_opt, par->family, info->add_set.dim,
 		info->add_set.flags, 0, UINT_MAX);
 	ADT_OPT(del_opt, par->family, info->del_set.dim,
@@ -392,6 +397,7 @@ static unsigned int
 set_target_v2(struct sk_buff *skb, const struct xt_action_param *par)
 {
 	const struct xt_set_info_target_v2 *info = par->targinfo;
+
 	ADT_OPT(add_opt, par->family, info->add_set.dim,
 		info->add_set.flags, info->flags, info->timeout);
 	ADT_OPT(del_opt, par->family, info->del_set.dim,
@@ -418,6 +424,7 @@ static unsigned int
 set_target_v3(struct sk_buff *skb, const struct xt_action_param *par)
 {
 	const struct xt_set_info_target_v3 *info = par->targinfo;
+
 	ADT_OPT(add_opt, par->family, info->add_set.dim,
 		info->add_set.flags, info->flags, info->timeout);
 	ADT_OPT(del_opt, par->family, info->del_set.dim,
diff --git a/net/sched/em_ipset.c b/net/sched/em_ipset.c
index 5b4a4ef..c3dba5e 100644
--- a/net/sched/em_ipset.c
+++ b/net/sched/em_ipset.c
@@ -44,6 +44,7 @@ static int em_ipset_change(struct net *net, void *data, int data_len,
 static void em_ipset_destroy(struct tcf_ematch *em)
 {
 	const struct xt_set_info *set = (const void *) em->data;
+
 	if (set) {
 		ip_set_nfnl_put(em->net, set->index);
 		kfree((void *) em->data);
@@ -70,7 +71,9 @@ static int em_ipset_match(struct sk_buff *skb, struct tcf_ematch *em,
 		acpar.family = NFPROTO_IPV6;
 		if (!pskb_network_may_pull(skb, sizeof(struct ipv6hdr)))
 			return 0;
-		/* doesn't call ipv6_find_hdr() because ipset doesn't use thoff, yet */
+		/* doesn't call ipv6_find_hdr() because ipset doesn't use
+		 * thoff, yet
+		 */
 		acpar.thoff = sizeof(struct ipv6hdr);
 		break;
 	default:
-- 
1.8.5.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 13/14] netfilter: ipset: Fix parallel resizing and listing of the same set
  2014-11-30 18:56 [PATCH 00/10] ipset patches for nf-next, v2 Jozsef Kadlecsik
                   ` (11 preceding siblings ...)
  2014-11-30 18:57 ` [PATCH 12/14] netfilter: ipset: styles warned by checkpatch.pl fixed Jozsef Kadlecsik
@ 2014-11-30 18:57 ` Jozsef Kadlecsik
  2014-11-30 18:57 ` [PATCH 14/14] netfilter: ipset: Fix sparse warning Jozsef Kadlecsik
  13 siblings, 0 replies; 32+ messages in thread
From: Jozsef Kadlecsik @ 2014-11-30 18:57 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Pablo Neira Ayuso

When elements are added to a hash:* type of set and a resize is
triggered, a parallel listing could start listing the original set
(before resizing) and "continue" with listing the new one. Fix it by
reference counting the hash table and keeping the listing on the
original table. As a consequence, destroying the original hash table
may happen from either the resizing or the listing function, whichever
drops the last reference.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 include/linux/netfilter/ipset/ip_set.h | 13 ++++++----
 net/netfilter/ipset/ip_set_core.c      | 30 +++++++++++++----------
 net/netfilter/ipset/ip_set_hash_gen.h  | 44 ++++++++++++++++++++++++++++++----
 3 files changed, 65 insertions(+), 22 deletions(-)

diff --git a/include/linux/netfilter/ipset/ip_set.h b/include/linux/netfilter/ipset/ip_set.h
index f7fcb6e..517362b 100644
--- a/include/linux/netfilter/ipset/ip_set.h
+++ b/include/linux/netfilter/ipset/ip_set.h
@@ -176,6 +176,9 @@ struct ip_set_type_variant {
 	/* List elements */
 	int (*list)(const struct ip_set *set, struct sk_buff *skb,
 		    struct netlink_callback *cb);
+	/* Keep listing private when resizing runs parallel */
+	void (*uref)(struct ip_set *set, struct netlink_callback *cb,
+		     bool start);
 
 	/* Return true if "b" set is the same as "a"
 	 * according to the create set parameters */
@@ -384,12 +387,12 @@ ip_set_init_counter(struct ip_set_counter *counter,
 
 /* Netlink CB args */
 enum {
-	IPSET_CB_NET = 0,
-	IPSET_CB_DUMP,
-	IPSET_CB_INDEX,
-	IPSET_CB_ARG0,
+	IPSET_CB_NET = 0,	/* net namespace */
+	IPSET_CB_DUMP,		/* dump single set/all sets */
+	IPSET_CB_INDEX,		/* set index */
+	IPSET_CB_PRIVATE,	/* set private data */
+	IPSET_CB_ARG0,		/* type specific */
 	IPSET_CB_ARG1,
-	IPSET_CB_ARG2,
 };
 
 /* register and unregister set references */
diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
index 957ec91..5dafbb7 100644
--- a/net/netfilter/ipset/ip_set_core.c
+++ b/net/netfilter/ipset/ip_set_core.c
@@ -1179,13 +1179,16 @@ ip_set_swap(struct sock *ctnl, struct sk_buff *skb,
 static int
 ip_set_dump_done(struct netlink_callback *cb)
 {
-	struct ip_set_net *inst = (struct ip_set_net *)cb->args[IPSET_CB_NET];
-
 	if (cb->args[IPSET_CB_ARG0]) {
-		pr_debug("release set %s\n",
-			 ip_set(inst, cb->args[IPSET_CB_INDEX])->name);
-		__ip_set_put_byindex(inst,
-			(ip_set_id_t) cb->args[IPSET_CB_INDEX]);
+		struct ip_set_net *inst =
+			(struct ip_set_net *)cb->args[IPSET_CB_NET];
+		ip_set_id_t index = (ip_set_id_t) cb->args[IPSET_CB_INDEX];
+		struct ip_set *set = ip_set(inst, index);
+
+		if (set->variant->uref)
+			set->variant->uref(set, cb, false);
+		pr_debug("release set %s\n", set->name);
+		__ip_set_put_byindex(inst, index);
 	}
 	return 0;
 }
@@ -1216,12 +1219,6 @@ dump_init(struct netlink_callback *cb, struct ip_set_net *inst)
 	nla_parse(cda, IPSET_ATTR_CMD_MAX,
 		  attr, nlh->nlmsg_len - min_len, ip_set_setname_policy);
 
-	/* cb->args[IPSET_CB_NET]:	net namespace
-	 *         [IPSET_CB_DUMP]:	dump single set/all sets
-	 *         [IPSET_CB_INDEX]:	set index
-	 *         [IPSET_CB_ARG0]:	type specific
-	 */
-
 	if (cda[IPSET_ATTR_SETNAME]) {
 		struct ip_set *set;
 
@@ -1329,6 +1326,8 @@ dump_last:
 				goto release_refcount;
 			if (dump_flags & IPSET_FLAG_LIST_HEADER)
 				goto next_set;
+			if (set->variant->uref)
+				set->variant->uref(set, cb, true);
 			/* Fall through and add elements */
 		default:
 			rcu_read_lock_bh();
@@ -1345,6 +1344,8 @@ dump_last:
 		dump_type = DUMP_LAST;
 		cb->args[IPSET_CB_DUMP] = dump_type | (dump_flags << 16);
 		cb->args[IPSET_CB_INDEX] = 0;
+		if (set && set->variant->uref)
+			set->variant->uref(set, cb, false);
 		goto dump_last;
 	}
 	goto out;
@@ -1359,7 +1360,10 @@ next_set:
 release_refcount:
 	/* If there was an error or set is done, release set */
 	if (ret || !cb->args[IPSET_CB_ARG0]) {
-		pr_debug("release set %s\n", ip_set(inst, index)->name);
+		set = ip_set(inst, index);
+		if (set->variant->uref)
+			set->variant->uref(set, cb, false);
+		pr_debug("release set %s\n", set->name);
 		__ip_set_put_byindex(inst, index);
 		cb->args[IPSET_CB_ARG0] = 0;
 	}
diff --git a/net/netfilter/ipset/ip_set_hash_gen.h b/net/netfilter/ipset/ip_set_hash_gen.h
index 8f51ba4..91e5327 100644
--- a/net/netfilter/ipset/ip_set_hash_gen.h
+++ b/net/netfilter/ipset/ip_set_hash_gen.h
@@ -76,6 +76,8 @@ struct hbucket {
 
 /* The hash table: the table size stored here in order to make resizing easy */
 struct htable {
+	atomic_t ref;		/* References for resizing */
+	atomic_t uref;		/* References for dumping */
 	u8 htable_bits;		/* size of hash table == 2^htable_bits */
 	struct hbucket __rcu * bucket[0]; /* hashtable buckets */
 };
@@ -185,6 +187,7 @@ htable_bits(u32 hashsize)
 #undef mtype_del
 #undef mtype_test_cidrs
 #undef mtype_test
+#undef mtype_uref
 #undef mtype_expire
 #undef mtype_resize
 #undef mtype_head
@@ -226,6 +229,7 @@ htable_bits(u32 hashsize)
 #define mtype_del		IPSET_TOKEN(MTYPE, _del)
 #define mtype_test_cidrs	IPSET_TOKEN(MTYPE, _test_cidrs)
 #define mtype_test		IPSET_TOKEN(MTYPE, _test)
+#define mtype_uref		IPSET_TOKEN(MTYPE, _uref)
 #define mtype_expire		IPSET_TOKEN(MTYPE, _expire)
 #define mtype_resize		IPSET_TOKEN(MTYPE, _resize)
 #define mtype_head		IPSET_TOKEN(MTYPE, _head)
@@ -567,6 +571,8 @@ retry:
 
 	spin_lock_bh(&set->lock);
 	orig = __ipset_dereference_protected(h->table, 1);
+	atomic_set(&orig->ref, 1);
+	atomic_inc(&orig->uref);
 	pr_debug("attempt to resize set %s from %u to %u, t %p\n",
 		 set->name, orig->htable_bits, htable_bits, orig);
 	for (i = 0; i < jhash_size(orig->htable_bits); i++) {
@@ -610,6 +616,8 @@ retry:
 						ret = -ENOMEM;
 				}
 				if (ret < 0) {
+					atomic_set(&orig->ref, 0);
+					atomic_dec(&orig->uref);
 					spin_unlock_bh(&set->lock);
 					mtype_ahash_destroy(set, t, false);
 					if (ret == -EAGAIN)
@@ -640,7 +648,11 @@ retry:
 
 	pr_debug("set %s resized from %u (%p) to %u (%p)\n", set->name,
 		 orig->htable_bits, orig, t->htable_bits, t);
-	mtype_ahash_destroy(set, orig, false);
+	/* If there's nobody else dumping the table, destroy it */
+	if (atomic_dec_and_test(&orig->uref)) {
+		pr_debug("Table destroy by resize %p\n", orig);
+		mtype_ahash_destroy(set, orig, false);
+	}
 
 out:
 #ifdef IP_SET_HASH_WITH_NETS
@@ -1039,12 +1051,35 @@ nla_put_failure:
 	return -EMSGSIZE;
 }
 
+/* Make possible to run dumping parallel with resizing */
+static void
+mtype_uref(struct ip_set *set, struct netlink_callback *cb, bool start)
+{
+	struct htype *h = set->data;
+	struct htable *t;
+
+	if (start) {
+		rcu_read_lock_bh();
+		t = rcu_dereference_bh_nfnl(h->table);
+		atomic_inc(&t->uref);
+		cb->args[IPSET_CB_PRIVATE] = (unsigned long) t;
+		rcu_read_unlock_bh();
+	} else if (cb->args[IPSET_CB_PRIVATE]) {
+		t = (struct htable *) cb->args[IPSET_CB_PRIVATE];
+		if (atomic_dec_and_test(&t->uref) && atomic_read(&t->ref)) {
+			/* Resizing didn't destroy the hash table */
+			pr_debug("Table destroy by dump: %p\n", t);
+			mtype_ahash_destroy(set, t, false);
+		}
+		cb->args[IPSET_CB_PRIVATE] = 0;
+	}
+}
+
 /* Reply a LIST/SAVE request: dump the elements of the specified set */
 static int
 mtype_list(const struct ip_set *set,
 	   struct sk_buff *skb, struct netlink_callback *cb)
 {
-	const struct htype *h = set->data;
 	const struct htable *t;
 	struct nlattr *atd, *nested;
 	const struct hbucket *n;
@@ -1059,11 +1094,11 @@ mtype_list(const struct ip_set *set,
 		return -EMSGSIZE;
 
 	pr_debug("list hash set %s\n", set->name);
-	t = rcu_dereference_bh_nfnl(h->table);
+	t = (const struct htable *) cb->args[IPSET_CB_PRIVATE];
 	for (; cb->args[IPSET_CB_ARG0] < jhash_size(t->htable_bits);
 	     cb->args[IPSET_CB_ARG0]++) {
 		incomplete = skb_tail_pointer(skb);
-		n = rcu_dereference(hbucket(t, cb->args[IPSET_CB_ARG0]));
+		n = hbucket(t, cb->args[IPSET_CB_ARG0]);
 		pr_debug("cb->arg bucket: %lu, t %p n %p\n",
 			 cb->args[IPSET_CB_ARG0], t, n);
 		if (n == NULL)
@@ -1133,6 +1168,7 @@ static const struct ip_set_type_variant mtype_variant = {
 	.flush	= mtype_flush,
 	.head	= mtype_head,
 	.list	= mtype_list,
+	.uref	= mtype_uref,
 	.resize	= mtype_resize,
 	.same_set = mtype_same_set,
 };
-- 
1.8.5.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 14/14] netfilter: ipset: Fix sparse warning
  2014-11-30 18:56 [PATCH 00/10] ipset patches for nf-next, v2 Jozsef Kadlecsik
                   ` (12 preceding siblings ...)
  2014-11-30 18:57 ` [PATCH 13/14] netfilter: ipset: Fix parallel resizing and listing of the same set Jozsef Kadlecsik
@ 2014-11-30 18:57 ` Jozsef Kadlecsik
  13 siblings, 0 replies; 32+ messages in thread
From: Jozsef Kadlecsik @ 2014-11-30 18:57 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Pablo Neira Ayuso

"warning: cast to restricted __be32" warnings are fixed

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
---
 net/netfilter/ipset/ip_set_hash_ipmark.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/ipset/ip_set_hash_ipmark.c b/net/netfilter/ipset/ip_set_hash_ipmark.c
index 7abf978..2ec4ac5 100644
--- a/net/netfilter/ipset/ip_set_hash_ipmark.c
+++ b/net/netfilter/ipset/ip_set_hash_ipmark.c
@@ -128,7 +128,7 @@ hash_ipmark4_uadt(struct ip_set *set, struct nlattr *tb[],
 	if (ret)
 		return ret;
 
-	e.mark = ntohl(nla_get_u32(tb[IPSET_ATTR_MARK]));
+	e.mark = ntohl(nla_get_be32(tb[IPSET_ATTR_MARK]));
 	e.mark &= h->markmask;
 
 	if (adt == IPSET_TEST ||
@@ -263,7 +263,7 @@ hash_ipmark6_uadt(struct ip_set *set, struct nlattr *tb[],
 	if (ret)
 		return ret;
 
-	e.mark = ntohl(nla_get_u32(tb[IPSET_ATTR_MARK]));
+	e.mark = ntohl(nla_get_be32(tb[IPSET_ATTR_MARK]));
 	e.mark &= h->markmask;
 
 	if (adt == IPSET_TEST) {
-- 
1.8.5.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 11/14] netfilter: ipset: Introduce RCU locking in the hash types
  2014-11-30 18:57 ` [PATCH 11/14] netfilter: ipset: Introduce RCU locking in the hash types Jozsef Kadlecsik
@ 2014-12-01  7:59   ` Jesper Dangaard Brouer
  2014-12-02 18:40   ` Pablo Neira Ayuso
  1 sibling, 0 replies; 32+ messages in thread
From: Jesper Dangaard Brouer @ 2014-12-01  7:59 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: netfilter-devel, Pablo Neira Ayuso

On Sun, 30 Nov 2014 19:57:02 +0100
Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> wrote:

> Performance is tested by Jesper Dangaard Brouer:
> 
> Simple drop in FORWARD
> ~~~~~~~~~~~~~~~~~~~~
> 
> Dropping via simple iptables net-mask match::
> 
>  iptables -t raw -N simple || iptables -t raw -F simple
>  iptables -t raw -I simple  -s 198.18.0.0/15 -j DROP
>  iptables -t raw -D PREROUTING -j simple
>  iptables -t raw -I PREROUTING -j simple
> 
> Drop performance in "raw": 11.3Mpps

This test uses multiple CPUs receiving, with IRQs correctly aligned
to the CPUs, on an Intel ixgbe 10G NIC.  The 11.3Mpps seems to be a
hardware limit related to the NIC or CPU.
Tuning according to:
 http://netoptimizer.blogspot.dk/2014/04/basic-tuning-for-network-overload.html


> Generator: sending 12.2Mpps (tx:12264083 pps)

The generator is based on trafgen, with random src 198.18.1.x; for a
description see:
 http://netoptimizer.blogspot.dk/2014/04/trafgen-fast-packet-generator.html

> Drop via original ipset in RAW table
> ~~~~~~~~~~~~~~~~~~~~~~~~~
[...]
> Drop performance in "raw" with ipset: 8Mpps
> 
> Perf report numbers ipset drop in "raw"::
> 
>  +   24.65%  ksoftirqd/1  [ip_set]           [k] ip_set_test
>  -   21.42%  ksoftirqd/1  [kernel.kallsyms]  [k] _raw_read_lock_bh
>     - _raw_read_lock_bh
>        + 99.88% ip_set_test
>  -   19.42%  ksoftirqd/1  [kernel.kallsyms]  [k] _raw_read_unlock_bh
>     - _raw_read_unlock_bh
>        + 99.72% ip_set_test
[...]

The read-side lock was clearly a limiting factor in this extreme
network overload scenario.  This is a valid use-case when using ipset
for DDoS protection/mitigation.
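The gain comes from the shape of the read path: under the rwlock every
packet writes the shared lock cacheline twice, while an RCU-style
reader only follows a published pointer. A minimal userspace sketch of
that read/publish pattern (simplified and single-threaded, with no
grace period; all names are hypothetical, not the ipset code):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Toy "hash table" whose current version is published via a pointer */
struct htable {
	int nslots;
	int slots[8];
};

static _Atomic(struct htable *) current_table;

/* Lock-free read side: one acquire load, no shared-cacheline writes */
static int lookup(int key)
{
	struct htable *t = atomic_load_explicit(&current_table,
						memory_order_acquire);
	return t->slots[key % t->nslots];
}

/* Update side: publish the new table and hand back the old one; a
 * real RCU implementation waits for a grace period before freeing */
static struct htable *swap_table(struct htable *newt)
{
	return atomic_exchange_explicit(&current_table, newt,
				       memory_order_acq_rel);
}
```

Readers that loaded the old pointer keep using a consistent table
until reclamation, which is exactly what the dump/resize interaction
above relies on.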

> Drop via ipset in RAW table with RCU-locking
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> With RCU locking, the RW-lock is gone.
> 
> Drop performance in "raw" with ipset with RCU-locking: 11.3Mpps
> 
> Performance-tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
> ---

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 07/14] netfilter: ipset: Remove rbtree from hash:net,iface in order to run under RCU
  2014-11-30 18:56 ` [PATCH 07/14] netfilter: ipset: Remove rbtree from hash:net,iface in order to run under RCU Jozsef Kadlecsik
@ 2014-12-02 18:23   ` Pablo Neira Ayuso
  2014-12-03 10:54     ` Jozsef Kadlecsik
  0 siblings, 1 reply; 32+ messages in thread
From: Pablo Neira Ayuso @ 2014-12-02 18:23 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: netfilter-devel

On Sun, Nov 30, 2014 at 07:56:58PM +0100, Jozsef Kadlecsik wrote:
> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
> ---
>  net/netfilter/ipset/ip_set_hash_netiface.c | 156 ++++-------------------------
>  1 file changed, 17 insertions(+), 139 deletions(-)
> 
> diff --git a/net/netfilter/ipset/ip_set_hash_netiface.c b/net/netfilter/ipset/ip_set_hash_netiface.c
> index 758b002..d8a53ec 100644
> --- a/net/netfilter/ipset/ip_set_hash_netiface.c
> +++ b/net/netfilter/ipset/ip_set_hash_netiface.c
> @@ -13,7 +13,6 @@
>  #include <linux/skbuff.h>
>  #include <linux/errno.h>
>  #include <linux/random.h>
> -#include <linux/rbtree.h>
>  #include <net/ip.h>
>  #include <net/ipv6.h>
>  #include <net/netlink.h>
> @@ -36,88 +35,14 @@ MODULE_AUTHOR("Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>");
>  IP_SET_MODULE_DESC("hash:net,iface", IPSET_TYPE_REV_MIN, IPSET_TYPE_REV_MAX);
>  MODULE_ALIAS("ip_set_hash:net,iface");
>  
> -/* Interface name rbtree */
> -
> -struct iface_node {
> -	struct rb_node node;
> -	char iface[IFNAMSIZ];
> -};
> -
> -#define iface_data(n)	(rb_entry(n, struct iface_node, node)->iface)
> -
> -static void
> -rbtree_destroy(struct rb_root *root)
> -{
> -	struct iface_node *node, *next;
> -
> -	rbtree_postorder_for_each_entry_safe(node, next, root, node)
> -		kfree(node);
> -
> -	*root = RB_ROOT;
> -}
> -
> -static int
> -iface_test(struct rb_root *root, const char **iface)
> -{
> -	struct rb_node *n = root->rb_node;
> -
> -	while (n) {
> -		const char *d = iface_data(n);
> -		int res = strcmp(*iface, d);
> -
> -		if (res < 0)
> -			n = n->rb_left;
> -		else if (res > 0)
> -			n = n->rb_right;
> -		else {
> -			*iface = d;
> -			return 1;
> -		}
> -	}
> -	return 0;
> -}
> -
> -static int
> -iface_add(struct rb_root *root, const char **iface)
> -{
> -	struct rb_node **n = &(root->rb_node), *p = NULL;
> -	struct iface_node *d;
> -
> -	while (*n) {
> -		char *ifname = iface_data(*n);
> -		int res = strcmp(*iface, ifname);
> -
> -		p = *n;
> -		if (res < 0)
> -			n = &((*n)->rb_left);
> -		else if (res > 0)
> -			n = &((*n)->rb_right);
> -		else {
> -			*iface = ifname;
> -			return 0;
> -		}
> -	}
> -
> -	d = kzalloc(sizeof(*d), GFP_ATOMIC);
> -	if (!d)
> -		return -ENOMEM;
> -	strcpy(d->iface, *iface);
> -
> -	rb_link_node(&d->node, p, n);
> -	rb_insert_color(&d->node, root);
> -
> -	*iface = d->iface;
> -	return 0;
> -}
> -
>  /* Type specific function prefix */
>  #define HTYPE		hash_netiface
>  #define IP_SET_HASH_WITH_NETS
> -#define IP_SET_HASH_WITH_RBTREE
>  #define IP_SET_HASH_WITH_MULTI
>  #define IP_SET_HASH_WITH_NET0
>  
>  #define STREQ(a, b)	(strcmp(a, b) == 0)
> +#define IFNAMCPY(a, b)	strncpy(a, b, IFNAMSIZ)
>  
>  /* IPv4 variant */
>  
> @@ -136,7 +61,7 @@ struct hash_netiface4_elem {
>  	u8 cidr;
>  	u8 nomatch;
>  	u8 elem;
> -	const char *iface;
> +	char iface[IFNAMSIZ];
>  };
>  
>  /* Common functions */
> @@ -150,7 +75,7 @@ hash_netiface4_data_equal(const struct hash_netiface4_elem *ip1,
>  	       ip1->cidr == ip2->cidr &&
>  	       (++*multi) &&
>  	       ip1->physdev == ip2->physdev &&
> -	       ip1->iface == ip2->iface;
> +	       STREQ(ip1->iface, ip2->iface);

I'd really prefer strcmp(a, b) == 0 instead here. The macro makes the
code less readable and saves us nothing: you have to look at the
non-standard STREQ() definition to understand what the code does,
which means scrolling up.

I would really like to see these kind of macro usage reduced in ipset,
same thing with IFNAMCPY().

> @@ -223,7 +148,6 @@ hash_netiface4_kadt(struct ip_set *set, const struct sk_buff *skb,
>  		.elem = 1,
>  	};
>  	struct ip_set_ext ext = IP_SET_INIT_KEXT(skb, opt, set);
> -	int ret;
>  
>  	if (e.cidr == 0)
>  		return -EINVAL;
> @@ -233,8 +157,8 @@ hash_netiface4_kadt(struct ip_set *set, const struct sk_buff *skb,
>  	ip4addrptr(skb, opt->flags & IPSET_DIM_ONE_SRC, &e.ip);
>  	e.ip &= ip_set_netmask(e.cidr);
>  
> -#define IFACE(dir)	(par->dir ? par->dir->name : NULL)
> -#define PHYSDEV(dir)	(nf_bridge->dir ? nf_bridge->dir->name : NULL)
> +#define IFACE(dir)	(par->dir ? par->dir->name : "")
> +#define PHYSDEV(dir)	(nf_bridge->dir ? nf_bridge->dir->name : "")
>  #define SRCDIR		(opt->flags & IPSET_DIM_TWO_SRC)
>  
>  	if (opt->cmdflags & IPSET_FLAG_PHYSDEV) {
> @@ -243,26 +167,15 @@ hash_netiface4_kadt(struct ip_set *set, const struct sk_buff *skb,
>  
>  		if (!nf_bridge)
>  			return -EINVAL;
> -		e.iface = SRCDIR ? PHYSDEV(physindev) : PHYSDEV(physoutdev);
> +		IFNAMCPY(e.iface,
> +			 SRCDIR ? PHYSDEV(physindev) : PHYSDEV(physoutdev));
>  		e.physdev = 1;
> -#else
> -		e.iface = NULL;
>  #endif
>  	} else
> -		e.iface = SRCDIR ? IFACE(in) : IFACE(out);
> +		IFNAMCPY(e.iface, SRCDIR ? IFACE(in) : IFACE(out));
>  
> -	if (!e.iface)
> +	if (strlen(e.iface) == 0)
>  		return -EINVAL;
> -	ret = iface_test(&h->rbtree, &e.iface);
> -	if (adt == IPSET_ADD) {
> -		if (!ret) {
> -			ret = iface_add(&h->rbtree, &e.iface);
> -			if (ret)
> -				return ret;
> -		}
> -	} else if (!ret)
> -		return ret;
> -
>  	return adtfn(set, &e, &ext, &opt->ext, opt->cmdflags);
>  }
>  
> @@ -275,7 +188,6 @@ hash_netiface4_uadt(struct ip_set *set, struct nlattr *tb[],
>  	struct hash_netiface4_elem e = { .cidr = HOST_MASK, .elem = 1 };
>  	struct ip_set_ext ext = IP_SET_INIT_UEXT(set);
>  	u32 ip = 0, ip_to = 0, last;
> -	char iface[IFNAMSIZ];
>  	int ret;
>  
>  	if (unlikely(!tb[IPSET_ATTR_IP] ||
> @@ -302,18 +214,7 @@ hash_netiface4_uadt(struct ip_set *set, struct nlattr *tb[],
>  		if (e.cidr > HOST_MASK)
>  			return -IPSET_ERR_INVALID_CIDR;
>  	}
> -
> -	strcpy(iface, nla_data(tb[IPSET_ATTR_IFACE]));
> -	e.iface = iface;
> -	ret = iface_test(&h->rbtree, &e.iface);
> -	if (adt == IPSET_ADD) {
> -		if (!ret) {
> -			ret = iface_add(&h->rbtree, &e.iface);
> -			if (ret)
> -				return ret;
> -		}
> -	} else if (!ret)
> -		return ret;
> +	IFNAMCPY(e.iface, nla_data(tb[IPSET_ATTR_IFACE]));

nla_strlcpy() ?

>  	if (tb[IPSET_ATTR_CADT_FLAGS]) {
>  		u32 cadt_flags = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
> @@ -372,7 +273,7 @@ struct hash_netiface6_elem {
>  	u8 cidr;
>  	u8 nomatch;
>  	u8 elem;
> -	const char *iface;
> +	char iface[IFNAMSIZ];
>  };
>  
>  /* Common functions */
> @@ -386,7 +287,7 @@ hash_netiface6_data_equal(const struct hash_netiface6_elem *ip1,
>  	       ip1->cidr == ip2->cidr &&
>  	       (++*multi) &&
>  	       ip1->physdev == ip2->physdev &&
> -	       ip1->iface == ip2->iface;
> +	       STREQ(ip1->iface, ip2->iface);
>  }
>  
>  static inline int
> @@ -464,7 +365,6 @@ hash_netiface6_kadt(struct ip_set *set, const struct sk_buff *skb,
>  		.elem = 1,
>  	};
>  	struct ip_set_ext ext = IP_SET_INIT_KEXT(skb, opt, set);
> -	int ret;
>  
>  	if (e.cidr == 0)
>  		return -EINVAL;
> @@ -480,25 +380,15 @@ hash_netiface6_kadt(struct ip_set *set, const struct sk_buff *skb,
>  
>  		if (!nf_bridge)
>  			return -EINVAL;
> -		e.iface = SRCDIR ? PHYSDEV(physindev) : PHYSDEV(physoutdev);
> +		IFNAMCPY(e.iface,
> +			 SRCDIR ? PHYSDEV(physindev) : PHYSDEV(physoutdev));
>  		e.physdev = 1;
> -#else
> -		e.iface = NULL;
>  #endif
>  	} else
> -		e.iface = SRCDIR ? IFACE(in) : IFACE(out);
> +		IFNAMCPY(e.iface, SRCDIR ? IFACE(in) : IFACE(out));
>  
> -	if (!e.iface)
> +	if (strlen(e.iface) == 0)
>  		return -EINVAL;
> -	ret = iface_test(&h->rbtree, &e.iface);
> -	if (adt == IPSET_ADD) {
> -		if (!ret) {
> -			ret = iface_add(&h->rbtree, &e.iface);
> -			if (ret)
> -				return ret;
> -		}
> -	} else if (!ret)
> -		return ret;
>  
>  	return adtfn(set, &e, &ext, &opt->ext, opt->cmdflags);
>  }
> @@ -507,11 +397,9 @@ static int
>  hash_netiface6_uadt(struct ip_set *set, struct nlattr *tb[],
>  		   enum ipset_adt adt, u32 *lineno, u32 flags, bool retried)
>  {
> -	struct hash_netiface *h = set->data;
>  	ipset_adtfn adtfn = set->variant->adt[adt];
>  	struct hash_netiface6_elem e = { .cidr = HOST_MASK, .elem = 1 };
>  	struct ip_set_ext ext = IP_SET_INIT_UEXT(set);
> -	char iface[IFNAMSIZ];
>  	int ret;
>  
>  	if (unlikely(!tb[IPSET_ATTR_IP] ||
> @@ -541,17 +429,7 @@ hash_netiface6_uadt(struct ip_set *set, struct nlattr *tb[],
>  		return -IPSET_ERR_INVALID_CIDR;
>  	ip6_netmask(&e.ip, e.cidr);
>  
> -	strcpy(iface, nla_data(tb[IPSET_ATTR_IFACE]));
> -	e.iface = iface;
> -	ret = iface_test(&h->rbtree, &e.iface);
> -	if (adt == IPSET_ADD) {
> -		if (!ret) {
> -			ret = iface_add(&h->rbtree, &e.iface);
> -			if (ret)
> -				return ret;
> -		}
> -	} else if (!ret)
> -		return ret;
> +	IFNAMCPY(e.iface, nla_data(tb[IPSET_ATTR_IFACE]));

nla_strlcpy()? That would also protect against an unterminated
attribute payload.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 08/14] netfilter: ipset: Introduce RCU locking instead of rwlock per set in the core
  2014-11-30 18:56 ` [PATCH 08/14] netfilter: ipset: Introduce RCU locking instead of rwlock per set in the core Jozsef Kadlecsik
@ 2014-12-02 18:25   ` Pablo Neira Ayuso
  2014-12-03 11:01     ` Jozsef Kadlecsik
  0 siblings, 1 reply; 32+ messages in thread
From: Pablo Neira Ayuso @ 2014-12-02 18:25 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: netfilter-devel

On Sun, Nov 30, 2014 at 07:56:59PM +0100, Jozsef Kadlecsik wrote:
> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
> ---
>  include/linux/netfilter/ipset/ip_set.h         |  6 ++++-
>  include/linux/netfilter/ipset/ip_set_timeout.h | 27 ++++++++------------
>  net/netfilter/ipset/ip_set_core.c              | 35 +++++++++++++-------------
>  3 files changed, 34 insertions(+), 34 deletions(-)
> 
> diff --git a/include/linux/netfilter/ipset/ip_set.h b/include/linux/netfilter/ipset/ip_set.h
> index f1606fa..d5d5bcd 100644
> --- a/include/linux/netfilter/ipset/ip_set.h
> +++ b/include/linux/netfilter/ipset/ip_set.h
> @@ -223,7 +223,7 @@ struct ip_set {
>  	/* The name of the set */
>  	char name[IPSET_MAXNAMELEN];
>  	/* Lock protecting the set data */
> -	rwlock_t lock;
> +	spinlock_t lock;
>  	/* References to the set */
>  	u32 ref;
>  	/* The core set type */
> @@ -322,6 +322,10 @@ ip_set_update_counter(struct ip_set_counter *counter,
>  	}
>  }
>  
> +#define ip_set_rcu_deref(t)		\
> +	rcu_dereference_index_check(t,	\
> +		rcu_read_lock_held() || rcu_read_lock_bh_held())
> +

This isn't used by this patch itself?

>  static inline void
>  ip_set_get_skbinfo(struct ip_set_skbinfo *skbinfo,
>  		      const struct ip_set_ext *ext,
> diff --git a/include/linux/netfilter/ipset/ip_set_timeout.h b/include/linux/netfilter/ipset/ip_set_timeout.h
> index 83c2f9e..1d6a935 100644
> --- a/include/linux/netfilter/ipset/ip_set_timeout.h
> +++ b/include/linux/netfilter/ipset/ip_set_timeout.h
> @@ -40,38 +40,33 @@ ip_set_timeout_uget(struct nlattr *tb)
>  }
>  
>  static inline bool
> -ip_set_timeout_test(unsigned long timeout)
> +ip_set_timeout_expired(unsigned long *t)
>  {
> -	return timeout == IPSET_ELEM_PERMANENT ||
> -	       time_is_after_jiffies(timeout);
> -}
> -
> -static inline bool
> -ip_set_timeout_expired(unsigned long *timeout)
> -{
> -	return *timeout != IPSET_ELEM_PERMANENT &&
> -	       time_is_before_jiffies(*timeout);
> +	return *t != IPSET_ELEM_PERMANENT && time_is_before_jiffies(*t);
>  }
>  
>  static inline void
> -ip_set_timeout_set(unsigned long *timeout, u32 t)
> +ip_set_timeout_set(unsigned long *timeout, u32 value)
>  {
> -	if (!t) {
> +	unsigned long t;
> +
> +	if (!value) {
>  		*timeout = IPSET_ELEM_PERMANENT;
>  		return;
>  	}
>  
> -	*timeout = msecs_to_jiffies(t * 1000) + jiffies;
> -	if (*timeout == IPSET_ELEM_PERMANENT)
> +	t = msecs_to_jiffies(value * MSEC_PER_SEC) + jiffies;
> +	if (t == IPSET_ELEM_PERMANENT)
>  		/* Bingo! :-) */
> -		(*timeout)--;
> +		t--;
> +	*timeout = t;
>  }
>  
>  static inline u32
>  ip_set_timeout_get(unsigned long *timeout)
>  {
>  	return *timeout == IPSET_ELEM_PERMANENT ? 0 :
> -		jiffies_to_msecs(*timeout - jiffies)/1000;
> +		jiffies_to_msecs(*timeout - jiffies)/MSEC_PER_SEC;
>  }
>  
>  #endif	/* __KERNEL__ */
> diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
> index 912e5a0..9fb2610 100644
> --- a/net/netfilter/ipset/ip_set_core.c
> +++ b/net/netfilter/ipset/ip_set_core.c
> @@ -217,6 +217,7 @@ ip_set_type_register(struct ip_set_type *type)
>  		 type->revision_min, type->revision_max);
>  unlock:
>  	ip_set_type_unlock();
> +	synchronize_rcu();

Why this synchronize_rcu()?

ip_set_type_register() didn't publish any new object in the unlock
path.
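For context, synchronize_rcu() is only needed when the writer has unpublished an object that readers might still be traversing; merely publishing a new entry needs no grace period. A rough sketch of where the call belongs (hypothetical foo_unregister(), not ipset code):

```c
/* Sketch: synchronize_rcu() belongs after *removing* an object from
 * an RCU-visible structure, before the writer frees or reuses it.
 * Publishing a new object requires no grace period at all.
 */
void foo_unregister(struct foo_type *type)
{
	mutex_lock(&foo_type_mutex);
	list_del_rcu(&type->list);	/* unpublish */
	mutex_unlock(&foo_type_mutex);

	synchronize_rcu();	/* wait for readers still walking the list */
	/* now it is safe for the caller to free @type */
}
```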


* Re: [PATCH 10/14] netfilter: ipset: Introduce RCU locking in the list type
  2014-11-30 18:57 ` [PATCH 10/14] netfilter: ipset: Introduce RCU locking in the list type Jozsef Kadlecsik
@ 2014-12-02 18:35   ` Pablo Neira Ayuso
  2014-12-02 18:52     ` Pablo Neira Ayuso
  2014-12-03 11:17     ` Jozsef Kadlecsik
  0 siblings, 2 replies; 32+ messages in thread
From: Pablo Neira Ayuso @ 2014-12-02 18:35 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: netfilter-devel

On Sun, Nov 30, 2014 at 07:57:01PM +0100, Jozsef Kadlecsik wrote:
> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
> ---
>  net/netfilter/ipset/ip_set_list_set.c | 386 ++++++++++++++++------------------
>  1 file changed, 182 insertions(+), 204 deletions(-)
> 
> diff --git a/net/netfilter/ipset/ip_set_list_set.c b/net/netfilter/ipset/ip_set_list_set.c
> index f8f6828..323115a 100644
> --- a/net/netfilter/ipset/ip_set_list_set.c
> +++ b/net/netfilter/ipset/ip_set_list_set.c
> @@ -9,6 +9,7 @@
>  
>  #include <linux/module.h>
>  #include <linux/ip.h>
> +#include <linux/rculist.h>
>  #include <linux/skbuff.h>
>  #include <linux/errno.h>
>  
> @@ -27,6 +28,8 @@ MODULE_ALIAS("ip_set_list:set");
>  
>  /* Member elements  */
>  struct set_elem {
> +	struct rcu_head rcu;
> +	struct list_head list;

I think an rcu_barrier() is missing from the module removal path, to
make sure all pending call_rcu() callbacks have completed before the
module is gone.
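For reference, the usual idiom in a module exit path looks roughly like this (a sketch of the pattern, not the actual ipset teardown code):

```c
/* Sketch: before a module that frees objects via call_rcu() or
 * kfree_rcu() is unloaded, all pending callbacks must have run,
 * otherwise they would execute in code that is no longer mapped.
 */
static void __exit list_set_fini(void)
{
	ip_set_type_unregister(&list_set_type);
	rcu_barrier();	/* flush outstanding kfree_rcu()/call_rcu() callbacks */
}
```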

>  	ip_set_id_t id;
>  };
>  
> @@ -41,12 +44,9 @@ struct list_set {
>  	u32 size;		/* size of set list array */
>  	struct timer_list gc;	/* garbage collection */
>  	struct net *net;	/* namespace */
> -	struct set_elem members[0]; /* the set members */
> +	struct list_head members; /* the set members */
>  };
>  
> -#define list_set_elem(set, map, id)	\
> -	(struct set_elem *)((void *)(map)->members + (id) * (set)->dsize)
> -
>  static int
>  list_set_ktest(struct ip_set *set, const struct sk_buff *skb,
>  	       const struct xt_action_param *par,
> @@ -54,17 +54,14 @@ list_set_ktest(struct ip_set *set, const struct sk_buff *skb,
>  {
>  	struct list_set *map = set->data;
>  	struct set_elem *e;
> -	u32 i, cmdflags = opt->cmdflags;
> +	u32 cmdflags = opt->cmdflags;
>  	int ret;
>  
>  	/* Don't lookup sub-counters at all */
>  	opt->cmdflags &= ~IPSET_FLAG_MATCH_COUNTERS;
>  	if (opt->cmdflags & IPSET_FLAG_SKIP_SUBCOUNTER_UPDATE)
>  		opt->cmdflags &= ~IPSET_FLAG_SKIP_COUNTER_UPDATE;
> -	for (i = 0; i < map->size; i++) {
> -		e = list_set_elem(set, map, i);
> -		if (e->id == IPSET_INVALID_ID)
> -			return 0;
> +	list_for_each_entry_rcu(e, &map->members, list) {
>  		if (SET_WITH_TIMEOUT(set) &&
>  		    ip_set_timeout_expired(ext_timeout(e, set)))
>  			continue;
> @@ -91,13 +88,9 @@ list_set_kadd(struct ip_set *set, const struct sk_buff *skb,
>  {
>  	struct list_set *map = set->data;
>  	struct set_elem *e;
> -	u32 i;
>  	int ret;
>  
> -	for (i = 0; i < map->size; i++) {
> -		e = list_set_elem(set, map, i);
> -		if (e->id == IPSET_INVALID_ID)
> -			return 0;
> +	list_for_each_entry_rcu(e, &map->members, list) {

From net/netfilter/ipset/ip_set_core.c I can see that this kadd() will
be called under spin_lock_bh(), so you can just use
list_for_each_entry(). The _rcu() variant protects the reader side,
but this code is only invoked from the writer side (where the lock
guarantees no concurrent changes).
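In other words, assuming the locking described above, the writer-side traversal can simply be (sketch):

```c
/* Sketch: on the writer side, where set->lock is already held, the
 * plain iterator is sufficient; list_for_each_entry_rcu() only adds
 * the read-side memory-ordering guarantees that lockless readers
 * need, and costs a (small) extra barrier on some architectures.
 */
spin_lock_bh(&set->lock);
list_for_each_entry(e, &map->members, list) {
	/* ... inspect or update e under the lock ... */
}
spin_unlock_bh(&set->lock);
```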

>  		if (SET_WITH_TIMEOUT(set) &&
>  		    ip_set_timeout_expired(ext_timeout(e, set)))
>  			continue;
> @@ -115,13 +108,9 @@ list_set_kdel(struct ip_set *set, const struct sk_buff *skb,
>  {
>  	struct list_set *map = set->data;
>  	struct set_elem *e;
> -	u32 i;
>  	int ret;
>  
> -	for (i = 0; i < map->size; i++) {
> -		e = list_set_elem(set, map, i);
> -		if (e->id == IPSET_INVALID_ID)
> -			return 0;
> +	list_for_each_entry_rcu(e, &map->members, list) {
>  		if (SET_WITH_TIMEOUT(set) &&
>  		    ip_set_timeout_expired(ext_timeout(e, set)))
>  			continue;
> @@ -138,110 +127,65 @@ list_set_kadt(struct ip_set *set, const struct sk_buff *skb,
>  	      enum ipset_adt adt, struct ip_set_adt_opt *opt)
>  {
>  	struct ip_set_ext ext = IP_SET_INIT_KEXT(skb, opt, set);
> +	int ret = -EINVAL;
>  
> +	rcu_read_lock();
>  	switch (adt) {
>  	case IPSET_TEST:
> -		return list_set_ktest(set, skb, par, opt, &ext);
> +		ret = list_set_ktest(set, skb, par, opt, &ext);
> +		break;
>  	case IPSET_ADD:
> -		return list_set_kadd(set, skb, par, opt, &ext);
> +		ret = list_set_kadd(set, skb, par, opt, &ext);
> +		break;
>  	case IPSET_DEL:
> -		return list_set_kdel(set, skb, par, opt, &ext);
> +		ret = list_set_kdel(set, skb, par, opt, &ext);
> +		break;
>  	default:
>  		break;
>  	}
> -	return -EINVAL;
> -}
> +	rcu_read_unlock();
>  
> -static bool
> -id_eq(const struct ip_set *set, u32 i, ip_set_id_t id)
> -{
> -	const struct list_set *map = set->data;
> -	const struct set_elem *e;
> -
> -	if (i >= map->size)
> -		return 0;
> -
> -	e = list_set_elem(set, map, i);
> -	return !!(e->id == id &&
> -		 !(SET_WITH_TIMEOUT(set) &&
> -		   ip_set_timeout_expired(ext_timeout(e, set))));
> +	return ret;
>  }
>  
> -static int
> -list_set_add(struct ip_set *set, u32 i, struct set_adt_elem *d,
> -	     const struct ip_set_ext *ext)
> -{
> -	struct list_set *map = set->data;
> -	struct set_elem *e = list_set_elem(set, map, i);
> -
> -	if (e->id != IPSET_INVALID_ID) {
> -		if (i == map->size - 1) {
> -			/* Last element replaced: e.g. add new,before,last */
> -			ip_set_put_byindex(map->net, e->id);
> -			ip_set_ext_destroy(set, e);
> -		} else {
> -			struct set_elem *x = list_set_elem(set, map,
> -							   map->size - 1);
> -
> -			/* Last element pushed off */
> -			if (x->id != IPSET_INVALID_ID) {
> -				ip_set_put_byindex(map->net, x->id);
> -				ip_set_ext_destroy(set, x);
> -			}
> -			memmove(list_set_elem(set, map, i + 1), e,
> -				set->dsize * (map->size - (i + 1)));
> -			/* Extensions must be initialized to zero */
> -			memset(e, 0, set->dsize);
> -		}
> -	}
> -
> -	e->id = d->id;
> -	if (SET_WITH_TIMEOUT(set))
> -		ip_set_timeout_set(ext_timeout(e, set), ext->timeout);
> -	if (SET_WITH_COUNTER(set))
> -		ip_set_init_counter(ext_counter(e, set), ext);
> -	if (SET_WITH_COMMENT(set))
> -		ip_set_init_comment(ext_comment(e, set), ext);
> -	if (SET_WITH_SKBINFO(set))
> -		ip_set_init_skbinfo(ext_skbinfo(e, set), ext);
> -	return 0;
> -}
> +/* Userspace interfaces: we are protected by the nfnl mutex */
>  
> -static int
> -list_set_del(struct ip_set *set, u32 i)
> +static void
> +__list_set_del(struct ip_set *set, struct set_elem *e)
>  {
>  	struct list_set *map = set->data;
> -	struct set_elem *e = list_set_elem(set, map, i);
>  
>  	ip_set_put_byindex(map->net, e->id);
> +	/* We may call it, because we don't have a to be destroyed
> +	 * extension which is used by the kernel.
> +	 */
>  	ip_set_ext_destroy(set, e);
> +	kfree_rcu(e, rcu);
> +}
>  
> -	if (i < map->size - 1)
> -		memmove(e, list_set_elem(set, map, i + 1),
> -			set->dsize * (map->size - (i + 1)));
> +static inline void
> +list_set_del(struct ip_set *set, struct set_elem *e)
> +{
> +	list_del_rcu(&e->list);
> +	__list_set_del(set, e);
> +}
>  
> -	/* Last element */
> -	e = list_set_elem(set, map, map->size - 1);
> -	e->id = IPSET_INVALID_ID;
> -	return 0;
> +static inline void
> +list_set_replace(struct ip_set *set, struct set_elem *e, struct set_elem *old)
> +{
> +	list_replace_rcu(&old->list, &e->list);
> +	__list_set_del(set, old);
>  }
>  
>  static void
>  set_cleanup_entries(struct ip_set *set)
>  {
>  	struct list_set *map = set->data;
> -	struct set_elem *e;
> -	u32 i = 0;
> +	struct set_elem *e, *n;
>  
> -	while (i < map->size) {
> -		e = list_set_elem(set, map, i);
> -		if (e->id != IPSET_INVALID_ID &&
> -		    ip_set_timeout_expired(ext_timeout(e, set)))
> -			list_set_del(set, i);
> -			/* Check element moved to position i in next loop */
> -		else
> -			i++;
> -	}
> +	list_for_each_entry_safe(e, n, &map->members, list)
> +		if (ip_set_timeout_expired(ext_timeout(e, set)))
> +			list_set_del(set, e);
>  }
>  
>  static int
> @@ -250,31 +194,45 @@ list_set_utest(struct ip_set *set, void *value, const struct ip_set_ext *ext,
>  {
>  	struct list_set *map = set->data;
>  	struct set_adt_elem *d = value;
> -	struct set_elem *e;
> -	u32 i;
> +	struct set_elem *e, *next, *prev = NULL;
>  	int ret;
>  
> -	for (i = 0; i < map->size; i++) {
> -		e = list_set_elem(set, map, i);
> -		if (e->id == IPSET_INVALID_ID)
> -			return 0;
> -		else if (SET_WITH_TIMEOUT(set) &&
> -			 ip_set_timeout_expired(ext_timeout(e, set)))
> +	list_for_each_entry(e, &map->members, list) {
> +		if (SET_WITH_TIMEOUT(set) &&
> +		    ip_set_timeout_expired(ext_timeout(e, set)))
>  			continue;
> -		else if (e->id != d->id)
> +		else if (e->id != d->id) {
> +			prev = e;
>  			continue;
> +		}
>  
>  		if (d->before == 0)
> -			return 1;
> -		else if (d->before > 0)
> -			ret = id_eq(set, i + 1, d->refid);
> -		else
> -			ret = i > 0 && id_eq(set, i - 1, d->refid);
> +			ret = 1;
> +		else if (d->before > 0) {
> +			next = list_next_entry(e, list);
> +			ret = !list_is_last(&e->list, &map->members) &&
> +			      next->id == d->refid;
> +		} else
> +			ret = prev != NULL && prev->id == d->refid;
>  		return ret;
>  	}
>  	return 0;
>  }
>  
> +static void
> +list_set_init_extensions(struct ip_set *set, const struct ip_set_ext *ext,
> +			 struct set_elem *e)
> +{
> +	if (SET_WITH_COUNTER(set))
> +		ip_set_init_counter(ext_counter(e, set), ext);
> +	if (SET_WITH_COMMENT(set))
> +		ip_set_init_comment(ext_comment(e, set), ext);
> +	if (SET_WITH_SKBINFO(set))
> +		ip_set_init_skbinfo(ext_skbinfo(e, set), ext);
> +	/* Update timeout last */
> +	if (SET_WITH_TIMEOUT(set))
> +		ip_set_timeout_set(ext_timeout(e, set), ext->timeout);
> +}
>  
>  static int
>  list_set_uadd(struct ip_set *set, void *value, const struct ip_set_ext *ext,
> @@ -282,60 +240,82 @@ list_set_uadd(struct ip_set *set, void *value, const struct ip_set_ext *ext,
>  {
>  	struct list_set *map = set->data;
>  	struct set_adt_elem *d = value;
> -	struct set_elem *e;
> +	struct set_elem *e, *n, *prev, *next;
>  	bool flag_exist = flags & IPSET_FLAG_EXIST;
> -	u32 i, ret = 0;
>  
>  	if (SET_WITH_TIMEOUT(set))
>  		set_cleanup_entries(set);
>  
> -	/* Check already added element */
> -	for (i = 0; i < map->size; i++) {
> -		e = list_set_elem(set, map, i);
> -		if (e->id == IPSET_INVALID_ID)
> -			goto insert;
> -		else if (e->id != d->id)
> +	/* Find where to add the new entry */
> +	n = prev = next = NULL;
> +	list_for_each_entry(e, &map->members, list) {
> +		if (SET_WITH_TIMEOUT(set) &&
> +		    ip_set_timeout_expired(ext_timeout(e, set)))
>  			continue;
> -
> -		if ((d->before > 1 && !id_eq(set, i + 1, d->refid)) ||
> -		    (d->before < 0 &&
> -		     (i == 0 || !id_eq(set, i - 1, d->refid))))
> -			/* Before/after doesn't match */
> +		else if (d->id == e->id)
> +			n = e;
> +		else if (d->before == 0 || e->id != d->refid)
> +			continue;
> +		else if (d->before > 0)
> +			next = e;
> +		else
> +			prev = e;
> +	}
> +	/* Re-add already existing element */
> +	if (n) {
> +		if ((d->before > 0 && !next) ||
> +		    (d->before < 0 && !prev))
>  			return -IPSET_ERR_REF_EXIST;
>  		if (!flag_exist)
> -			/* Can't re-add */
>  			return -IPSET_ERR_EXIST;
>  		/* Update extensions */
> -		ip_set_ext_destroy(set, e);
> +		ip_set_ext_destroy(set, n);
> +		list_set_init_extensions(set, ext, n);
>  
> -		if (SET_WITH_TIMEOUT(set))
> -			ip_set_timeout_set(ext_timeout(e, set), ext->timeout);
> -		if (SET_WITH_COUNTER(set))
> -			ip_set_init_counter(ext_counter(e, set), ext);
> -		if (SET_WITH_COMMENT(set))
> -			ip_set_init_comment(ext_comment(e, set), ext);
> -		if (SET_WITH_SKBINFO(set))
> -			ip_set_init_skbinfo(ext_skbinfo(e, set), ext);
>  		/* Set is already added to the list */
>  		ip_set_put_byindex(map->net, d->id);
>  		return 0;
>  	}
> -insert:
> -	ret = -IPSET_ERR_LIST_FULL;
> -	for (i = 0; i < map->size && ret == -IPSET_ERR_LIST_FULL; i++) {
> -		e = list_set_elem(set, map, i);
> -		if (e->id == IPSET_INVALID_ID)
> -			ret = d->before != 0 ? -IPSET_ERR_REF_EXIST
> -				: list_set_add(set, i, d, ext);
> -		else if (e->id != d->refid)
> -			continue;
> -		else if (d->before > 0)
> -			ret = list_set_add(set, i, d, ext);
> -		else if (i + 1 < map->size)
> -			ret = list_set_add(set, i + 1, d, ext);
> +	/* Add new entry */
> +	if (d->before == 0) {
> +		/* Append  */
> +		n = list_empty(&map->members) ? NULL :
> +		    list_last_entry(&map->members, struct set_elem, list);
> +	} else if (d->before > 0) {
> +		/* Insert after next element */
> +		if (!list_is_last(&next->list, &map->members))
> +			n = list_next_entry(next, list);
> +	} else {
> +		/* Insert before prev element */
> +		if (prev->list.prev != &map->members)
> +			n = list_prev_entry(prev, list);
>  	}
> -
> -	return ret;
> +	/* Can we replace a timed out entry? */
> +	if (n != NULL &&
> +	    !(SET_WITH_TIMEOUT(set) &&
> +	      ip_set_timeout_expired(ext_timeout(n, set))))
> +		n =  NULL;
> +
> +	e = kzalloc(set->dsize, GFP_KERNEL);
> +	if (!e)
> +		return -ENOMEM;
> +	e->id = d->id;
> +	INIT_LIST_HEAD(&e->list);
> +	list_set_init_extensions(set, ext, e);
> +	if (n)
> +		list_set_replace(set, e, n);
> +	else if (next)
> +		list_add_tail_rcu(&e->list, &next->list);
> +	else if (prev)
> +		list_add_rcu(&e->list, &prev->list);
> +	else
> +		list_add_tail_rcu(&e->list, &map->members);
> +	spin_unlock_bh(&set->lock);
> +
> +	synchronize_rcu_bh();

I suspect you don't need this. What is your intention here?
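If the intent was to keep readers from seeing the replaced element after it is freed, that should already be covered: kfree_rcu() defers the actual free past a grace period, so the writer never has to wait synchronously, and dropping and retaking the spinlock around synchronize_rcu_bh() should be unnecessary. A sketch of the sufficient pattern:

```c
/* Sketch: replacing an element under the set spinlock.  The old
 * entry is unlinked with list_replace_rcu() and handed to
 * kfree_rcu(), which frees it only after all current RCU readers
 * are done -- no synchronize_rcu_bh() and no lock dropping needed.
 */
list_replace_rcu(&old->list, &e->list);
kfree_rcu(old, rcu);	/* deferred free, safe for concurrent readers */
```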

> +
> +	spin_lock_bh(&set->lock);
> +	return 0;
>  }
>  
>  static int
> @@ -344,32 +324,30 @@ list_set_udel(struct ip_set *set, void *value, const struct ip_set_ext *ext,
>  {
>  	struct list_set *map = set->data;
>  	struct set_adt_elem *d = value;
> -	struct set_elem *e;
> -	u32 i;
> -
> -	for (i = 0; i < map->size; i++) {
> -		e = list_set_elem(set, map, i);
> -		if (e->id == IPSET_INVALID_ID)
> -			return d->before != 0 ? -IPSET_ERR_REF_EXIST
> -					      : -IPSET_ERR_EXIST;
> -		else if (SET_WITH_TIMEOUT(set) &&
> -			 ip_set_timeout_expired(ext_timeout(e, set)))
> +	struct set_elem *e, *next, *prev = NULL;
> +
> +	list_for_each_entry(e, &map->members, list) {
> +		if (SET_WITH_TIMEOUT(set) &&
> +		    ip_set_timeout_expired(ext_timeout(e, set)))
>  			continue;
> -		else if (e->id != d->id)
> +		else if (e->id != d->id) {
> +			prev = e;
>  			continue;
> +		}
>  
> -		if (d->before == 0)
> -			return list_set_del(set, i);
> -		else if (d->before > 0) {
> -			if (!id_eq(set, i + 1, d->refid))
> +		if (d->before > 0) {
> +			next = list_next_entry(e, list);
> +			if (list_is_last(&e->list, &map->members) ||
> +			    next->id != d->refid)
>  				return -IPSET_ERR_REF_EXIST;
> -			return list_set_del(set, i);
> -		} else if (i == 0 || !id_eq(set, i - 1, d->refid))
> -			return -IPSET_ERR_REF_EXIST;
> -		else
> -			return list_set_del(set, i);
> +		} else if (d->before < 0) {
> +			if (prev == NULL || prev->id != d->refid)
> +				return -IPSET_ERR_REF_EXIST;
> +		}
> +		list_set_del(set, e);
> +		return 0;
>  	}
> -	return -IPSET_ERR_EXIST;
> +	return d->before != 0 ? -IPSET_ERR_REF_EXIST : -IPSET_ERR_EXIST;
>  }
>  
>  static int
> @@ -410,6 +388,7 @@ list_set_uadt(struct ip_set *set, struct nlattr *tb[],
>  
>  	if (tb[IPSET_ATTR_CADT_FLAGS]) {
>  		u32 f = ip_set_get_h32(tb[IPSET_ATTR_CADT_FLAGS]);
> +
>  		e.before = f & IPSET_FLAG_BEFORE;
>  	}
>  
> @@ -447,27 +426,26 @@ static void
>  list_set_flush(struct ip_set *set)
>  {
>  	struct list_set *map = set->data;
> -	struct set_elem *e;
> -	u32 i;
> -
> -	for (i = 0; i < map->size; i++) {
> -		e = list_set_elem(set, map, i);
> -		if (e->id != IPSET_INVALID_ID) {
> -			ip_set_put_byindex(map->net, e->id);
> -			ip_set_ext_destroy(set, e);
> -			e->id = IPSET_INVALID_ID;
> -		}
> -	}
> +	struct set_elem *e, *n;
> +
> +	list_for_each_entry_safe(e, n, &map->members, list)
> +		list_set_del(set, e);
>  }
>  
>  static void
>  list_set_destroy(struct ip_set *set)
>  {
>  	struct list_set *map = set->data;
> +	struct set_elem *e, *n;
>  
>  	if (SET_WITH_TIMEOUT(set))
>  		del_timer_sync(&map->gc);
> -	list_set_flush(set);
> +	list_for_each_entry_safe(e, n, &map->members, list) {
> +		list_del(&e->list);
> +		ip_set_put_byindex(map->net, e->id);
> +		ip_set_ext_destroy(set, e);
> +		kfree(e);
> +	}
>  	kfree(map);
>  
>  	set->data = NULL;
> @@ -478,6 +456,11 @@ list_set_head(struct ip_set *set, struct sk_buff *skb)
>  {
>  	const struct list_set *map = set->data;
>  	struct nlattr *nested;
> +	struct set_elem *e;
> +	u32 n = 0;
> +
> +	list_for_each_entry(e, &map->members, list)
> +		n++;
>  
>  	nested = ipset_nest_start(skb, IPSET_ATTR_DATA);
>  	if (!nested)
> @@ -485,7 +468,7 @@ list_set_head(struct ip_set *set, struct sk_buff *skb)
>  	if (nla_put_net32(skb, IPSET_ATTR_SIZE, htonl(map->size)) ||
>  	    nla_put_net32(skb, IPSET_ATTR_REFERENCES, htonl(set->ref - 1)) ||
>  	    nla_put_net32(skb, IPSET_ATTR_MEMSIZE,
> -			  htonl(sizeof(*map) + map->size * set->dsize)))
> +			  htonl(sizeof(*map) + n * set->dsize)))
>  		goto nla_put_failure;
>  	if (unlikely(ip_set_put_flags(skb, set)))
>  		goto nla_put_failure;
> @@ -502,18 +485,20 @@ list_set_list(const struct ip_set *set,
>  {
>  	const struct list_set *map = set->data;
>  	struct nlattr *atd, *nested;
> -	u32 i, first = cb->args[IPSET_CB_ARG0];
> -	const struct set_elem *e;
> +	u32 i = 0, first = cb->args[IPSET_CB_ARG0];
> +	struct set_elem *e;
>  
>  	atd = ipset_nest_start(skb, IPSET_ATTR_ADT);
>  	if (!atd)
>  		return -EMSGSIZE;
> -	for (; cb->args[IPSET_CB_ARG0] < map->size;
> -	     cb->args[IPSET_CB_ARG0]++) {
> -		i = cb->args[IPSET_CB_ARG0];
> -		e = list_set_elem(set, map, i);
> -		if (e->id == IPSET_INVALID_ID)
> -			goto finish;
> +	list_for_each_entry(e, &map->members, list) {
> +		if (i == first)
> +			break;
> +		i++;
> +	}
> +
> +	list_for_each_entry_from(e, &map->members, list) {
> +		i++;
>  		if (SET_WITH_TIMEOUT(set) &&
>  		    ip_set_timeout_expired(ext_timeout(e, set)))
>  			continue;
> @@ -532,7 +517,7 @@ list_set_list(const struct ip_set *set,
>  			goto nla_put_failure;
>  		ipset_nest_end(skb, nested);
>  	}
> -finish:
> +
>  	ipset_nest_end(skb, atd);
>  	/* Set listing finished */
>  	cb->args[IPSET_CB_ARG0] = 0;
> @@ -544,6 +529,7 @@ nla_put_failure:
>  		cb->args[IPSET_CB_ARG0] = 0;
>  		return -EMSGSIZE;
>  	}
> +	cb->args[IPSET_CB_ARG0] = i - 1;
>  	ipset_nest_end(skb, atd);
>  	return 0;
>  }
> @@ -580,9 +566,9 @@ list_set_gc(unsigned long ul_set)
>  	struct ip_set *set = (struct ip_set *) ul_set;
>  	struct list_set *map = set->data;
>  
> -	write_lock_bh(&set->lock);
> +	spin_lock_bh(&set->lock);
>  	set_cleanup_entries(set);
> -	write_unlock_bh(&set->lock);
> +	spin_unlock_bh(&set->lock);
>  
>  	map->gc.expires = jiffies + IPSET_GC_PERIOD(set->timeout) * HZ;
>  	add_timer(&map->gc);
> @@ -606,24 +592,16 @@ static bool
>  init_list_set(struct net *net, struct ip_set *set, u32 size)
>  {
>  	struct list_set *map;
> -	struct set_elem *e;
> -	u32 i;
>  
> -	map = kzalloc(sizeof(*map) +
> -		      min_t(u32, size, IP_SET_LIST_MAX_SIZE) * set->dsize,
> -		      GFP_KERNEL);
> +	map = kzalloc(sizeof(*map), GFP_KERNEL);
>  	if (!map)
>  		return false;
>  
>  	map->size = size;
>  	map->net = net;
> +	INIT_LIST_HEAD(&map->members);
>  	set->data = map;
>  
> -	for (i = 0; i < size; i++) {
> -		e = list_set_elem(set, map, i);
> -		e->id = IPSET_INVALID_ID;
> -	}
> -
>  	return true;
>  }
>  
> -- 
> 1.8.5.1
> 


* Re: [PATCH 11/14] netfilter: ipset: Introduce RCU locking in the hash types
  2014-11-30 18:57 ` [PATCH 11/14] netfilter: ipset: Introduce RCU locking in the hash types Jozsef Kadlecsik
  2014-12-01  7:59   ` Jesper Dangaard Brouer
@ 2014-12-02 18:40   ` Pablo Neira Ayuso
  2014-12-03 11:23     ` Jozsef Kadlecsik
  1 sibling, 1 reply; 32+ messages in thread
From: Pablo Neira Ayuso @ 2014-12-02 18:40 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: netfilter-devel

On Sun, Nov 30, 2014 at 07:57:02PM +0100, Jozsef Kadlecsik wrote:
> Performance is tested by Jesper Dangaard Brouer:
> 
> Simple drop in FORWARD
> ~~~~~~~~~~~~~~~~~~~~
> 
> Dropping via simple iptables net-mask match::
> 
>  iptables -t raw -N simple || iptables -t raw -F simple
>  iptables -t raw -I simple  -s 198.18.0.0/15 -j DROP
>  iptables -t raw -D PREROUTING -j simple
>  iptables -t raw -I PREROUTING -j simple
> 
> Drop performance in "raw": 11.3Mpps
> 
> Generator: sending 12.2Mpps (tx:12264083 pps)
> 
> Drop via original ipset in RAW table
> ~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> Create a set with lots of elements::
>  sudo ./ipset destroy test
>  echo "create test hash:ip hashsize 65536" > test.set
>  for x in `seq 0 255`; do
>     for y in `seq 0 255`; do
>         echo "add test 198.18.$x.$y" >> test.set
>     done
>  done
>  sudo ./ipset restore < test.set
> 
> Dropping via ipset::
> 
>  iptables -t raw -F
>  iptables -t raw -N net198 || iptables -t raw -F net198
>  iptables -t raw -I net198 -m set --match-set test src -j DROP
>  iptables -t raw -I PREROUTING -j net198
> 
> Drop performance in "raw" with ipset: 8Mpps
> 
> Perf report numbers ipset drop in "raw"::
> 
>  +   24.65%  ksoftirqd/1  [ip_set]           [k] ip_set_test
>  -   21.42%  ksoftirqd/1  [kernel.kallsyms]  [k] _raw_read_lock_bh
>     - _raw_read_lock_bh
>        + 99.88% ip_set_test
>  -   19.42%  ksoftirqd/1  [kernel.kallsyms]  [k] _raw_read_unlock_bh
>     - _raw_read_unlock_bh
>        + 99.72% ip_set_test
>  +    4.31%  ksoftirqd/1  [ip_set_hash_ip]   [k] hash_ip4_kadt
>  +    2.27%  ksoftirqd/1  [ixgbe]            [k] ixgbe_fetch_rx_buffer
>  +    2.18%  ksoftirqd/1  [ip_tables]        [k] ipt_do_table
>  +    1.81%  ksoftirqd/1  [ip_set_hash_ip]   [k] hash_ip4_test
>  +    1.61%  ksoftirqd/1  [kernel.kallsyms]  [k] __netif_receive_skb_core
>  +    1.44%  ksoftirqd/1  [kernel.kallsyms]  [k] build_skb
>  +    1.42%  ksoftirqd/1  [kernel.kallsyms]  [k] ip_rcv
>  +    1.36%  ksoftirqd/1  [kernel.kallsyms]  [k] __local_bh_enable_ip
>  +    1.16%  ksoftirqd/1  [kernel.kallsyms]  [k] dev_gro_receive
>  +    1.09%  ksoftirqd/1  [kernel.kallsyms]  [k] __rcu_read_unlock
>  +    0.96%  ksoftirqd/1  [ixgbe]            [k] ixgbe_clean_rx_irq
>  +    0.95%  ksoftirqd/1  [kernel.kallsyms]  [k] __netdev_alloc_frag
>  +    0.88%  ksoftirqd/1  [kernel.kallsyms]  [k] kmem_cache_alloc
>  +    0.87%  ksoftirqd/1  [xt_set]           [k] set_match_v3
>  +    0.85%  ksoftirqd/1  [kernel.kallsyms]  [k] inet_gro_receive
>  +    0.83%  ksoftirqd/1  [kernel.kallsyms]  [k] nf_iterate
>  +    0.76%  ksoftirqd/1  [kernel.kallsyms]  [k] put_compound_page
>  +    0.75%  ksoftirqd/1  [kernel.kallsyms]  [k] __rcu_read_lock
> 
> Drop via ipset in RAW table with RCU-locking
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> With RCU locking, the RW-lock is gone.
> 
> Drop performance in "raw" with ipset with RCU-locking: 11.3Mpps
> 
> Performance-tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
> ---
>  net/netfilter/ipset/ip_set_hash_gen.h | 580 ++++++++++++++++++++--------------
>  1 file changed, 344 insertions(+), 236 deletions(-)
> 
> diff --git a/net/netfilter/ipset/ip_set_hash_gen.h b/net/netfilter/ipset/ip_set_hash_gen.h
> index 974ff38..8f51ba4 100644
> --- a/net/netfilter/ipset/ip_set_hash_gen.h
> +++ b/net/netfilter/ipset/ip_set_hash_gen.h
> @@ -10,19 +10,19 @@
>  
>  #include <linux/rcupdate.h>
>  #include <linux/jhash.h>
> +#include <linux/types.h>
>  #include <linux/netfilter/ipset/ip_set_timeout.h>
> -#ifndef rcu_dereference_bh
> -#define rcu_dereference_bh(p)	rcu_dereference(p)
> -#endif
> +
> +#define __ipset_dereference_protected(p, c)	rcu_dereference_protected(p, c)
> +#define ipset_dereference_protected(p, set) \
> +	__ipset_dereference_protected(p, spin_is_locked(&(set)->lock))
>  
>  #define rcu_dereference_bh_nfnl(p)	rcu_dereference_bh_check(p, 1)
>  
[...]
>  /* Flush a hash type of set: destroy all elements */
> @@ -376,16 +359,16 @@ mtype_flush(struct ip_set *set)
>  	struct hbucket *n;
>  	u32 i;
>  
> -	t = rcu_dereference_bh_nfnl(h->table);
> +	t = ipset_dereference_protected(h->table, set);
>  	for (i = 0; i < jhash_size(t->htable_bits); i++) {
> -		n = hbucket(t, i);
> -		if (n->size) {
> -			if (set->extensions & IPSET_EXT_DESTROY)
> -				mtype_ext_cleanup(set, n);
> -			n->size = n->pos = 0;
> -			/* FIXME: use slab cache */
> -			kfree(n->value);
> -		}
> +		n = __ipset_dereference_protected(hbucket(t, i), 1);

What is your intention with these macros?
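For comparison, rcu_dereference_protected() is normally given an explicit condition, typically lockdep_is_held(), proving that the writer holds the lock protecting the pointer; passing a literal 1 (as __ipset_dereference_protected(..., 1) does above) disables that checking entirely. A sketch of the conventional form (the macro name here is illustrative):

```c
/* Sketch: a writer-side dereference that names the lock protecting
 * the pointer, so lockdep can verify the caller actually holds it.
 */
#define ipset_dereference(p, set) \
	rcu_dereference_protected(p, lockdep_is_held(&(set)->lock))
```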


* Re: [PATCH 12/14] netfilter: ipset: styles warned by checkpatch.pl fixed
  2014-11-30 18:57 ` [PATCH 12/14] netfilter: ipset: styles warned by checkpatch.pl fixed Jozsef Kadlecsik
@ 2014-12-02 18:43   ` Pablo Neira Ayuso
  2014-12-03 11:25     ` Jozsef Kadlecsik
  0 siblings, 1 reply; 32+ messages in thread
From: Pablo Neira Ayuso @ 2014-12-02 18:43 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: netfilter-devel

On Sun, Nov 30, 2014 at 07:57:03PM +0100, Jozsef Kadlecsik wrote:
> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
> ---
>  include/linux/netfilter/ipset/ip_set.h       |  6 +++---
>  net/netfilter/ipset/ip_set_core.c            | 11 ++++++++---
>  net/netfilter/ipset/ip_set_hash_ipportnet.c  |  2 ++
>  net/netfilter/ipset/ip_set_hash_net.c        |  4 +++-
>  net/netfilter/ipset/ip_set_hash_netiface.c   |  2 ++
>  net/netfilter/ipset/ip_set_hash_netnet.c     |  2 ++
>  net/netfilter/ipset/ip_set_hash_netport.c    |  2 ++
>  net/netfilter/ipset/ip_set_hash_netportnet.c |  2 ++
>  net/netfilter/xt_set.c                       |  7 +++++++
>  net/sched/em_ipset.c                         |  5 ++++-
>  10 files changed, 35 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/netfilter/ipset/ip_set.h b/include/linux/netfilter/ipset/ip_set.h
> index d5d5bcd..f7fcb6e 100644
> --- a/include/linux/netfilter/ipset/ip_set.h
> +++ b/include/linux/netfilter/ipset/ip_set.h
> @@ -345,10 +345,10 @@ ip_set_put_skbinfo(struct sk_buff *skb, struct ip_set_skbinfo *skbinfo)
>  			      cpu_to_be64((u64)skbinfo->skbmark << 32 |
>  					  skbinfo->skbmarkmask))) ||
>  	       (skbinfo->skbprio &&
> -	        nla_put_net32(skb, IPSET_ATTR_SKBPRIO,
> +		nla_put_net32(skb, IPSET_ATTR_SKBPRIO,
>  			      cpu_to_be32(skbinfo->skbprio))) ||
>  	       (skbinfo->skbqueue &&
> -	        nla_put_net16(skb, IPSET_ATTR_SKBQUEUE,
> +		nla_put_net16(skb, IPSET_ATTR_SKBQUEUE,
>  			     cpu_to_be16(skbinfo->skbqueue)));
>  
>  }
> @@ -547,7 +547,7 @@ ip_set_put_extensions(struct sk_buff *skb, const struct ip_set *set,
>  
>  		if (nla_put_net32(skb, IPSET_ATTR_TIMEOUT,
>  			htonl(active ? ip_set_timeout_get(timeout)
> -				: *timeout)))
> +			      : *timeout)))

This seems to fit within the 80-character limit here, please recheck.


* Re: [PATCH 01/14] netfilter: ipset: Support updating extensions when the set is full
  2014-11-30 18:56 ` [PATCH 01/14] netfilter: ipset: Support updating extensions when the set is full Jozsef Kadlecsik
@ 2014-12-02 18:46   ` Pablo Neira Ayuso
  2014-12-02 18:50     ` Pablo Neira Ayuso
  0 siblings, 1 reply; 32+ messages in thread
From: Pablo Neira Ayuso @ 2014-12-02 18:46 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: netfilter-devel

On Sun, Nov 30, 2014 at 07:56:52PM +0100, Jozsef Kadlecsik wrote:
> When the set was full (hash type and maxelem reached), it was not
> possible to update the extension part of already existing elements.
> The patch removes this limitation. (Fixes netfilter bugzilla id 880.)

Could you please add this:

https://bugzilla.netfilter.org/show_bug.cgi?id=880

for quick browsing. Thanks.

* Re: [PATCH 01/14] netfilter: ipset: Support updating extensions when the set is full
  2014-12-02 18:46   ` Pablo Neira Ayuso
@ 2014-12-02 18:50     ` Pablo Neira Ayuso
  2014-12-03 11:26       ` Jozsef Kadlecsik
  0 siblings, 1 reply; 32+ messages in thread
From: Pablo Neira Ayuso @ 2014-12-02 18:50 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: netfilter-devel

On Tue, Dec 02, 2014 at 07:46:44PM +0100, Pablo Neira Ayuso wrote:
> On Sun, Nov 30, 2014 at 07:56:52PM +0100, Jozsef Kadlecsik wrote:
> > When the set was full (hash type and maxelem reached), it was not
> > possible to update the extension part of already existing elements.
> > The patch removes this limitation. (Fixes netfilter bugzilla id 880.)
> 
> Could you please add this:
> 
> https://bugzilla.netfilter.org/show_bug.cgi?id=880
> 
> for quick browsing. Thanks.

I can fix this here.

Actually, I can manually apply from 1 to 6 in the next batch I'm going
to send to David.

I would like to make sure at least those get to him in time, -rc7 is
already out, so the merge window may close by this weekend or the
beginning of next week (just predicting).

* Re: [PATCH 10/14] netfilter: ipset: Introduce RCU locking in the list type
  2014-12-02 18:35   ` Pablo Neira Ayuso
@ 2014-12-02 18:52     ` Pablo Neira Ayuso
  2014-12-03 11:17     ` Jozsef Kadlecsik
  1 sibling, 0 replies; 32+ messages in thread
From: Pablo Neira Ayuso @ 2014-12-02 18:52 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: netfilter-devel

On Tue, Dec 02, 2014 at 07:35:39PM +0100, Pablo Neira Ayuso wrote:
> On Sun, Nov 30, 2014 at 07:57:01PM +0100, Jozsef Kadlecsik wrote:
> > Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
> > ---
> >  net/netfilter/ipset/ip_set_list_set.c | 386 ++++++++++++++++------------------
> >  1 file changed, 182 insertions(+), 204 deletions(-)
> > 
> > diff --git a/net/netfilter/ipset/ip_set_list_set.c b/net/netfilter/ipset/ip_set_list_set.c
> > index f8f6828..323115a 100644
> > --- a/net/netfilter/ipset/ip_set_list_set.c
> > +++ b/net/netfilter/ipset/ip_set_list_set.c
> > @@ -9,6 +9,7 @@
> >  
> >  #include <linux/module.h>
> >  #include <linux/ip.h>
> > +#include <linux/rculist.h>
> >  #include <linux/skbuff.h>
> >  #include <linux/errno.h>
> >  
> > @@ -27,6 +28,8 @@ MODULE_ALIAS("ip_set_list:set");
> >  
> >  /* Member elements  */
> >  struct set_elem {
> > +	struct rcu_head rcu;
> > +	struct list_head list;
> 
> I think rcu_barrier() in the module removal path is missing to make
> sure call_rcu() is called before the module is gone.

I mean, we must make sure the RCU softirq runs and releases these
objects before the module is gone.

* Re: [PATCH 07/14] netfilter: ipset: Remove rbtree from hash:net,iface in order to run under RCU
  2014-12-02 18:23   ` Pablo Neira Ayuso
@ 2014-12-03 10:54     ` Jozsef Kadlecsik
  0 siblings, 0 replies; 32+ messages in thread
From: Jozsef Kadlecsik @ 2014-12-03 10:54 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

On Tue, 2 Dec 2014, Pablo Neira Ayuso wrote:

> On Sun, Nov 30, 2014 at 07:56:58PM +0100, Jozsef Kadlecsik wrote:
> > Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
> > ---
> >  net/netfilter/ipset/ip_set_hash_netiface.c | 156 ++++-------------------------
> >  1 file changed, 17 insertions(+), 139 deletions(-)
> > 
> > diff --git a/net/netfilter/ipset/ip_set_hash_netiface.c b/net/netfilter/ipset/ip_set_hash_netiface.c
> > index 758b002..d8a53ec 100644
> > --- a/net/netfilter/ipset/ip_set_hash_netiface.c
> > +++ b/net/netfilter/ipset/ip_set_hash_netiface.c
> > @@ -13,7 +13,6 @@
> >  #include <linux/skbuff.h>
> >  #include <linux/errno.h>
> >  #include <linux/random.h>
> > -#include <linux/rbtree.h>
> >  #include <net/ip.h>
> >  #include <net/ipv6.h>
> >  #include <net/netlink.h>
> > @@ -36,88 +35,14 @@ MODULE_AUTHOR("Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>");
> >  IP_SET_MODULE_DESC("hash:net,iface", IPSET_TYPE_REV_MIN, IPSET_TYPE_REV_MAX);
> >  MODULE_ALIAS("ip_set_hash:net,iface");
> >  
> > -/* Interface name rbtree */
> > -
> > -struct iface_node {
> > -	struct rb_node node;
> > -	char iface[IFNAMSIZ];
> > -};
> > -
> > -#define iface_data(n)	(rb_entry(n, struct iface_node, node)->iface)
> > -
> > -static void
> > -rbtree_destroy(struct rb_root *root)
> > -{
> > -	struct iface_node *node, *next;
> > -
> > -	rbtree_postorder_for_each_entry_safe(node, next, root, node)
> > -		kfree(node);
> > -
> > -	*root = RB_ROOT;
> > -}
> > -
> > -static int
> > -iface_test(struct rb_root *root, const char **iface)
> > -{
> > -	struct rb_node *n = root->rb_node;
> > -
> > -	while (n) {
> > -		const char *d = iface_data(n);
> > -		int res = strcmp(*iface, d);
> > -
> > -		if (res < 0)
> > -			n = n->rb_left;
> > -		else if (res > 0)
> > -			n = n->rb_right;
> > -		else {
> > -			*iface = d;
> > -			return 1;
> > -		}
> > -	}
> > -	return 0;
> > -}
> > -
> > -static int
> > -iface_add(struct rb_root *root, const char **iface)
> > -{
> > -	struct rb_node **n = &(root->rb_node), *p = NULL;
> > -	struct iface_node *d;
> > -
> > -	while (*n) {
> > -		char *ifname = iface_data(*n);
> > -		int res = strcmp(*iface, ifname);
> > -
> > -		p = *n;
> > -		if (res < 0)
> > -			n = &((*n)->rb_left);
> > -		else if (res > 0)
> > -			n = &((*n)->rb_right);
> > -		else {
> > -			*iface = ifname;
> > -			return 0;
> > -		}
> > -	}
> > -
> > -	d = kzalloc(sizeof(*d), GFP_ATOMIC);
> > -	if (!d)
> > -		return -ENOMEM;
> > -	strcpy(d->iface, *iface);
> > -
> > -	rb_link_node(&d->node, p, n);
> > -	rb_insert_color(&d->node, root);
> > -
> > -	*iface = d->iface;
> > -	return 0;
> > -}
> > -
> >  /* Type specific function prefix */
> >  #define HTYPE		hash_netiface
> >  #define IP_SET_HASH_WITH_NETS
> > -#define IP_SET_HASH_WITH_RBTREE
> >  #define IP_SET_HASH_WITH_MULTI
> >  #define IP_SET_HASH_WITH_NET0
> >  
> >  #define STREQ(a, b)	(strcmp(a, b) == 0)
> > +#define IFNAMCPY(a, b)	strncpy(a, b, IFNAMSIZ)
> >  
> >  /* IPv4 variant */
> >  
> > @@ -136,7 +61,7 @@ struct hash_netiface4_elem {
> >  	u8 cidr;
> >  	u8 nomatch;
> >  	u8 elem;
> > -	const char *iface;
> > +	char iface[IFNAMSIZ];
> >  };
> >  
> >  /* Common functions */
> > @@ -150,7 +75,7 @@ hash_netiface4_data_equal(const struct hash_netiface4_elem *ip1,
> >  	       ip1->cidr == ip2->cidr &&
> >  	       (++*multi) &&
> >  	       ip1->physdev == ip2->physdev &&
> > -	       ip1->iface == ip2->iface;
> > +	       STREQ(ip1->iface, ip2->iface);
> 
> I'd really prefer if you use strcmp(a, b) == 0 instead here. This
> makes the code less readable and we save nothing. You have to look to
> the macro to understand what it does, which means scrolling up to see
> what the non-standard STREQ() does.
> 
> I would really like to see these kind of macro usage reduced in ipset,
> same thing with IFNAMCPY().

Right, I'll remove these macros.

Best regards,
Jozsef 
-
E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences
          H-1525 Budapest 114, POB. 49, Hungary

* Re: [PATCH 08/14] netfilter: ipset: Introduce RCU locking instead of rwlock per set in the core
  2014-12-02 18:25   ` Pablo Neira Ayuso
@ 2014-12-03 11:01     ` Jozsef Kadlecsik
  0 siblings, 0 replies; 32+ messages in thread
From: Jozsef Kadlecsik @ 2014-12-03 11:01 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

On Tue, 2 Dec 2014, Pablo Neira Ayuso wrote:

> On Sun, Nov 30, 2014 at 07:56:59PM +0100, Jozsef Kadlecsik wrote:
> > Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
> > ---
> >  include/linux/netfilter/ipset/ip_set.h         |  6 ++++-
> >  include/linux/netfilter/ipset/ip_set_timeout.h | 27 ++++++++------------
> >  net/netfilter/ipset/ip_set_core.c              | 35 +++++++++++++-------------
> >  3 files changed, 34 insertions(+), 34 deletions(-)
> > 
> > diff --git a/include/linux/netfilter/ipset/ip_set.h b/include/linux/netfilter/ipset/ip_set.h
> > index f1606fa..d5d5bcd 100644
> > --- a/include/linux/netfilter/ipset/ip_set.h
> > +++ b/include/linux/netfilter/ipset/ip_set.h
> > @@ -223,7 +223,7 @@ struct ip_set {
> >  	/* The name of the set */
> >  	char name[IPSET_MAXNAMELEN];
> >  	/* Lock protecting the set data */
> > -	rwlock_t lock;
> > +	spinlock_t lock;
> >  	/* References to the set */
> >  	u32 ref;
> >  	/* The core set type */
> > @@ -322,6 +322,10 @@ ip_set_update_counter(struct ip_set_counter *counter,
> >  	}
> >  }
> >  
> > +#define ip_set_rcu_deref(t)		\
> > +	rcu_dereference_index_check(t,	\
> > +		rcu_read_lock_held() || rcu_read_lock_bh_held())
> > +
> 
> This is not used from this patch itself?

That is something leftover from an earlier phase, I remove it.
 
> >  static inline void
> >  ip_set_get_skbinfo(struct ip_set_skbinfo *skbinfo,
> >  		      const struct ip_set_ext *ext,
> > diff --git a/include/linux/netfilter/ipset/ip_set_timeout.h b/include/linux/netfilter/ipset/ip_set_timeout.h
> > index 83c2f9e..1d6a935 100644
> > --- a/include/linux/netfilter/ipset/ip_set_timeout.h
> > +++ b/include/linux/netfilter/ipset/ip_set_timeout.h
> > @@ -40,38 +40,33 @@ ip_set_timeout_uget(struct nlattr *tb)
> >  }
> >  
> >  static inline bool
> > -ip_set_timeout_test(unsigned long timeout)
> > +ip_set_timeout_expired(unsigned long *t)
> >  {
> > -	return timeout == IPSET_ELEM_PERMANENT ||
> > -	       time_is_after_jiffies(timeout);
> > -}
> > -
> > -static inline bool
> > -ip_set_timeout_expired(unsigned long *timeout)
> > -{
> > -	return *timeout != IPSET_ELEM_PERMANENT &&
> > -	       time_is_before_jiffies(*timeout);
> > +	return *t != IPSET_ELEM_PERMANENT && time_is_before_jiffies(*t);
> >  }
> >  
> >  static inline void
> > -ip_set_timeout_set(unsigned long *timeout, u32 t)
> > +ip_set_timeout_set(unsigned long *timeout, u32 value)
> >  {
> > -	if (!t) {
> > +	unsigned long t;
> > +
> > +	if (!value) {
> >  		*timeout = IPSET_ELEM_PERMANENT;
> >  		return;
> >  	}
> >  
> > -	*timeout = msecs_to_jiffies(t * 1000) + jiffies;
> > -	if (*timeout == IPSET_ELEM_PERMANENT)
> > +	t = msecs_to_jiffies(value * MSEC_PER_SEC) + jiffies;
> > +	if (t == IPSET_ELEM_PERMANENT)
> >  		/* Bingo! :-) */
> > -		(*timeout)--;
> > +		t--;
> > +	*timeout = t;
> >  }
> >  
> >  static inline u32
> >  ip_set_timeout_get(unsigned long *timeout)
> >  {
> >  	return *timeout == IPSET_ELEM_PERMANENT ? 0 :
> > -		jiffies_to_msecs(*timeout - jiffies)/1000;
> > +		jiffies_to_msecs(*timeout - jiffies)/MSEC_PER_SEC;
> >  }
> >  
> >  #endif	/* __KERNEL__ */
> > diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
> > index 912e5a0..9fb2610 100644
> > --- a/net/netfilter/ipset/ip_set_core.c
> > +++ b/net/netfilter/ipset/ip_set_core.c
> > @@ -217,6 +217,7 @@ ip_set_type_register(struct ip_set_type *type)
> >  		 type->revision_min, type->revision_max);
> >  unlock:
> >  	ip_set_type_unlock();
> > +	synchronize_rcu();
> 
> Why this synchronize_rcu()?
> 
> ip_set_type_register() didn't publish any new object in the unlock
> path.

That's there just to save a few lines of code :-). I'll reorganize it so the
unlock path won't call synchronize_rcu() unnecessarily.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences
          H-1525 Budapest 114, POB. 49, Hungary

* Re: [PATCH 10/14] netfilter: ipset: Introduce RCU locking in the list type
  2014-12-02 18:35   ` Pablo Neira Ayuso
  2014-12-02 18:52     ` Pablo Neira Ayuso
@ 2014-12-03 11:17     ` Jozsef Kadlecsik
  2014-12-03 11:36       ` Pablo Neira Ayuso
  1 sibling, 1 reply; 32+ messages in thread
From: Jozsef Kadlecsik @ 2014-12-03 11:17 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

On Tue, 2 Dec 2014, Pablo Neira Ayuso wrote:

> On Sun, Nov 30, 2014 at 07:57:01PM +0100, Jozsef Kadlecsik wrote:
> > Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
> > ---
> >  net/netfilter/ipset/ip_set_list_set.c | 386 ++++++++++++++++------------------
> >  1 file changed, 182 insertions(+), 204 deletions(-)
> > 
> > diff --git a/net/netfilter/ipset/ip_set_list_set.c b/net/netfilter/ipset/ip_set_list_set.c
> > index f8f6828..323115a 100644
> > --- a/net/netfilter/ipset/ip_set_list_set.c
> > +++ b/net/netfilter/ipset/ip_set_list_set.c
> > @@ -9,6 +9,7 @@
> >  
> >  #include <linux/module.h>
> >  #include <linux/ip.h>
> > +#include <linux/rculist.h>
> >  #include <linux/skbuff.h>
> >  #include <linux/errno.h>
> >  
> > @@ -27,6 +28,8 @@ MODULE_ALIAS("ip_set_list:set");
> >  
> >  /* Member elements  */
> >  struct set_elem {
> > +	struct rcu_head rcu;
> > +	struct list_head list;
> 
> I think rcu_barrier() in the module removal path is missing to make
> sure call_rcu() is called before the module is gone.

The module can be removed only when there isn't a single set of the given 
type. That means there are no elements to be removed by kfree_rcu(). 
Therefore I think rcu_barrier() is not required in the module removal 
path.

> > @@ -91,13 +88,9 @@ list_set_kadd(struct ip_set *set, const struct sk_buff *skb,
> >  {
> >  	struct list_set *map = set->data;
> >  	struct set_elem *e;
> > -	u32 i;
> >  	int ret;
> >  
> > -	for (i = 0; i < map->size; i++) {
> > -		e = list_set_elem(set, map, i);
> > -		if (e->id == IPSET_INVALID_ID)
> > -			return 0;
> > +	list_for_each_entry_rcu(e, &map->members, list) {
> 
> From net/netfilter/ipset/ip_set_core.c I can see this kadd() will be
> called under spin_lock_bh(), so you can just use
> list_for_each_entry(). The _rcu() variant protects the reader side,
> but this code is only invoked from the writer side (no changes are
> guaranteed to happen there).

Yes, you are right! I'll correct it.

> >  static int
> >  list_set_uadd(struct ip_set *set, void *value, const struct ip_set_ext *ext,
> > @@ -282,60 +240,82 @@ list_set_uadd(struct ip_set *set, void *value, const struct ip_set_ext *ext,
> >  {
> >  	struct list_set *map = set->data;
> >  	struct set_adt_elem *d = value;
> > -	struct set_elem *e;
> > +	struct set_elem *e, *n, *prev, *next;
> >  	bool flag_exist = flags & IPSET_FLAG_EXIST;
> > -	u32 i, ret = 0;
> >  
> >  	if (SET_WITH_TIMEOUT(set))
> >  		set_cleanup_entries(set);
> >  
> > -	/* Check already added element */
> > -	for (i = 0; i < map->size; i++) {
> > -		e = list_set_elem(set, map, i);
> > -		if (e->id == IPSET_INVALID_ID)
> > -			goto insert;
> > -		else if (e->id != d->id)
> > +	/* Find where to add the new entry */
> > +	n = prev = next = NULL;
> > +	list_for_each_entry(e, &map->members, list) {
> > +		if (SET_WITH_TIMEOUT(set) &&
> > +		    ip_set_timeout_expired(ext_timeout(e, set)))
> >  			continue;
> > -
> > -		if ((d->before > 1 && !id_eq(set, i + 1, d->refid)) ||
> > -		    (d->before < 0 &&
> > -		     (i == 0 || !id_eq(set, i - 1, d->refid))))
> > -			/* Before/after doesn't match */
> > +		else if (d->id == e->id)
> > +			n = e;
> > +		else if (d->before == 0 || e->id != d->refid)
> > +			continue;
> > +		else if (d->before > 0)
> > +			next = e;
> > +		else
> > +			prev = e;
> > +	}
> > +	/* Re-add already existing element */
> > +	if (n) {
> > +		if ((d->before > 0 && !next) ||
> > +		    (d->before < 0 && !prev))
> >  			return -IPSET_ERR_REF_EXIST;
> >  		if (!flag_exist)
> > -			/* Can't re-add */
> >  			return -IPSET_ERR_EXIST;
> >  		/* Update extensions */
> > -		ip_set_ext_destroy(set, e);
> > +		ip_set_ext_destroy(set, n);
> > +		list_set_init_extensions(set, ext, n);
> >  
> > -		if (SET_WITH_TIMEOUT(set))
> > -			ip_set_timeout_set(ext_timeout(e, set), ext->timeout);
> > -		if (SET_WITH_COUNTER(set))
> > -			ip_set_init_counter(ext_counter(e, set), ext);
> > -		if (SET_WITH_COMMENT(set))
> > -			ip_set_init_comment(ext_comment(e, set), ext);
> > -		if (SET_WITH_SKBINFO(set))
> > -			ip_set_init_skbinfo(ext_skbinfo(e, set), ext);
> >  		/* Set is already added to the list */
> >  		ip_set_put_byindex(map->net, d->id);
> >  		return 0;
> >  	}
> > -insert:
> > -	ret = -IPSET_ERR_LIST_FULL;
> > -	for (i = 0; i < map->size && ret == -IPSET_ERR_LIST_FULL; i++) {
> > -		e = list_set_elem(set, map, i);
> > -		if (e->id == IPSET_INVALID_ID)
> > -			ret = d->before != 0 ? -IPSET_ERR_REF_EXIST
> > -				: list_set_add(set, i, d, ext);
> > -		else if (e->id != d->refid)
> > -			continue;
> > -		else if (d->before > 0)
> > -			ret = list_set_add(set, i, d, ext);
> > -		else if (i + 1 < map->size)
> > -			ret = list_set_add(set, i + 1, d, ext);
> > +	/* Add new entry */
> > +	if (d->before == 0) {
> > +		/* Append  */
> > +		n = list_empty(&map->members) ? NULL :
> > +		    list_last_entry(&map->members, struct set_elem, list);
> > +	} else if (d->before > 0) {
> > +		/* Insert after next element */
> > +		if (!list_is_last(&next->list, &map->members))
> > +			n = list_next_entry(next, list);
> > +	} else {
> > +		/* Insert before prev element */
> > +		if (prev->list.prev != &map->members)
> > +			n = list_prev_entry(prev, list);
> >  	}
> > -
> > -	return ret;
> > +	/* Can we replace a timed out entry? */
> > +	if (n != NULL &&
> > +	    !(SET_WITH_TIMEOUT(set) &&
> > +	      ip_set_timeout_expired(ext_timeout(n, set))))
> > +		n =  NULL;
> > +
> > +	e = kzalloc(set->dsize, GFP_KERNEL);
> > +	if (!e)
> > +		return -ENOMEM;
> > +	e->id = d->id;
> > +	INIT_LIST_HEAD(&e->list);
> > +	list_set_init_extensions(set, ext, e);
> > +	if (n)
> > +		list_set_replace(set, e, n);
> > +	else if (next)
> > +		list_add_tail_rcu(&e->list, &next->list);
> > +	else if (prev)
> > +		list_add_rcu(&e->list, &prev->list);
> > +	else
> > +		list_add_tail_rcu(&e->list, &map->members);
> > +	spin_unlock_bh(&set->lock);
> > +
> > +	synchronize_rcu_bh();
> 
> I suspect you don't need this. What is your intention here?

Here userspace adds/deletes/replaces an element in the list type of set
while the kernel module may be traversing the same linked list
concurrently. In the replace case we remove and free the old entry,
hence the call to synchronize_rcu_bh(). The call could be made
conditional then, to make that case explicit.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences
          H-1525 Budapest 114, POB. 49, Hungary

* Re: [PATCH 11/14] netfilter: ipset: Introduce RCU locking in the hash types
  2014-12-02 18:40   ` Pablo Neira Ayuso
@ 2014-12-03 11:23     ` Jozsef Kadlecsik
  0 siblings, 0 replies; 32+ messages in thread
From: Jozsef Kadlecsik @ 2014-12-03 11:23 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

On Tue, 2 Dec 2014, Pablo Neira Ayuso wrote:

> > diff --git a/net/netfilter/ipset/ip_set_hash_gen.h b/net/netfilter/ipset/ip_set_hash_gen.h
> > index 974ff38..8f51ba4 100644
> > --- a/net/netfilter/ipset/ip_set_hash_gen.h
> > +++ b/net/netfilter/ipset/ip_set_hash_gen.h
> > @@ -10,19 +10,19 @@
> >  
> >  #include <linux/rcupdate.h>
> >  #include <linux/jhash.h>
> > +#include <linux/types.h>
> >  #include <linux/netfilter/ipset/ip_set_timeout.h>
> > -#ifndef rcu_dereference_bh
> > -#define rcu_dereference_bh(p)	rcu_dereference(p)
> > -#endif
> > +
> > +#define __ipset_dereference_protected(p, c)	rcu_dereference_protected(p, c)
> > +#define ipset_dereference_protected(p, set) \
> > +	__ipset_dereference_protected(p, spin_is_locked(&(set)->lock))
> >  
> >  #define rcu_dereference_bh_nfnl(p)	rcu_dereference_bh_check(p, 1)
> >  
> [...]
> >  /* Flush a hash type of set: destroy all elements */
> > @@ -376,16 +359,16 @@ mtype_flush(struct ip_set *set)
> >  	struct hbucket *n;
> >  	u32 i;
> >  
> > -	t = rcu_dereference_bh_nfnl(h->table);
> > +	t = ipset_dereference_protected(h->table, set);
> >  	for (i = 0; i < jhash_size(t->htable_bits); i++) {
> > -		n = hbucket(t, i);
> > -		if (n->size) {
> > -			if (set->extensions & IPSET_EXT_DESTROY)
> > -				mtype_ext_cleanup(set, n);
> > -			n->size = n->pos = 0;
> > -			/* FIXME: use slab cache */
> > -			kfree(n->value);
> > -		}
> > +		n = __ipset_dereference_protected(hbucket(t, i), 1);
> 
> What is your intention with these macros?

The macros are there to silence sparse checking warnings. Without them one
gets:

/usr/src/git/ipset/ipset/kernel/net/netfilter/ipset/ip_set_hash_gen.h:369:19: 
warning: incorrect type in assignment (different address spaces)
/usr/src/git/ipset/ipset/kernel/net/netfilter/ipset/ip_set_hash_gen.h:369:19:    
expected struct hbucket *n
/usr/src/git/ipset/ipset/kernel/net/netfilter/ipset/ip_set_hash_gen.h:369:19:    
got struct hbucket [noderef] <asn:4>*<noident>

The real condition (the spinlock being held) is checked once before the
loop, in ipset_dereference_protected(); inside the loop it isn't
re-checked over and over.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences
          H-1525 Budapest 114, POB. 49, Hungary

* Re: [PATCH 12/14] netfilter: ipset: styles warned by checkpatch.pl fixed
  2014-12-02 18:43   ` Pablo Neira Ayuso
@ 2014-12-03 11:25     ` Jozsef Kadlecsik
  0 siblings, 0 replies; 32+ messages in thread
From: Jozsef Kadlecsik @ 2014-12-03 11:25 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

On Tue, 2 Dec 2014, Pablo Neira Ayuso wrote:

> > @@ -547,7 +547,7 @@ ip_set_put_extensions(struct sk_buff *skb, const struct ip_set *set,
> >  
> >  		if (nla_put_net32(skb, IPSET_ATTR_TIMEOUT,
> >  			htonl(active ? ip_set_timeout_get(timeout)
> > -				: *timeout)))
> > +			      : *timeout)))
> 
> This seems to fit within the 80-character line limit here, please recheck.

Yes, I'll do.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences
          H-1525 Budapest 114, POB. 49, Hungary

* Re: [PATCH 01/14] netfilter: ipset: Support updating extensions when the set is full
  2014-12-02 18:50     ` Pablo Neira Ayuso
@ 2014-12-03 11:26       ` Jozsef Kadlecsik
  2014-12-03 11:56         ` Pablo Neira Ayuso
  0 siblings, 1 reply; 32+ messages in thread
From: Jozsef Kadlecsik @ 2014-12-03 11:26 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

On Tue, 2 Dec 2014, Pablo Neira Ayuso wrote:

> On Tue, Dec 02, 2014 at 07:46:44PM +0100, Pablo Neira Ayuso wrote:
> > On Sun, Nov 30, 2014 at 07:56:52PM +0100, Jozsef Kadlecsik wrote:
> > > When the set was full (hash type and maxelem reached), it was not
> > > possible to update the extension part of already existing elements.
> > > The patch removes this limitation. (Fixes netfilter bugzilla id 880.)
> > 
> > Could you please add this:
> > 
> > https://bugzilla.netfilter.org/show_bug.cgi?id=880
> > 
> > for quick browsing. Thanks.
> 
> I can fix this here.
> 
> Actually, I can manually apply from 1 to 6 in the next batch I'm going
> to send to David.
> 
> I would like to make sure at least those get to him in time, -rc7 is
> already out, so the merge window may close by this weekend or the
> beginning of next week (just predicting).

Thanks indeed, Pablo! Then I'll focus on the second part of the patches.

Best regards,
jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences
          H-1525 Budapest 114, POB. 49, Hungary

* Re: [PATCH 10/14] netfilter: ipset: Introduce RCU locking in the list type
  2014-12-03 11:17     ` Jozsef Kadlecsik
@ 2014-12-03 11:36       ` Pablo Neira Ayuso
  0 siblings, 0 replies; 32+ messages in thread
From: Pablo Neira Ayuso @ 2014-12-03 11:36 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: netfilter-devel

On Wed, Dec 03, 2014 at 12:17:36PM +0100, Jozsef Kadlecsik wrote:
> On Tue, 2 Dec 2014, Pablo Neira Ayuso wrote:
> 
> > On Sun, Nov 30, 2014 at 07:57:01PM +0100, Jozsef Kadlecsik wrote:
> > > Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
> > > ---
> > >  net/netfilter/ipset/ip_set_list_set.c | 386 ++++++++++++++++------------------
> > >  1 file changed, 182 insertions(+), 204 deletions(-)
> > > 
> > > diff --git a/net/netfilter/ipset/ip_set_list_set.c b/net/netfilter/ipset/ip_set_list_set.c
> > > index f8f6828..323115a 100644
> > > --- a/net/netfilter/ipset/ip_set_list_set.c
> > > +++ b/net/netfilter/ipset/ip_set_list_set.c
> > > @@ -9,6 +9,7 @@
> > >  
> > >  #include <linux/module.h>
> > >  #include <linux/ip.h>
> > > +#include <linux/rculist.h>
> > >  #include <linux/skbuff.h>
> > >  #include <linux/errno.h>
> > >  
> > > @@ -27,6 +28,8 @@ MODULE_ALIAS("ip_set_list:set");
> > >  
> > >  /* Member elements  */
> > >  struct set_elem {
> > > +	struct rcu_head rcu;
> > > +	struct list_head list;
> > 
> > I think rcu_barrier() in the module removal path is missing to make
> > sure call_rcu() is called before the module is gone.
> 
> The module can be removed only when there isn't a single set of the given 
> type. That means there are no elements to be removed by kfree_rcu(). 
> Therefore I think rcu_barrier() is not required in the module removal 
> path.

I think this can race with the rcu callback execution. See this:

https://www.kernel.org/doc/Documentation/RCU/rcubarrier.txt

specifically: Unloading Modules That Use call_rcu()

[...]
> > > +	/* Can we replace a timed out entry? */
> > > +	if (n != NULL &&
> > > +	    !(SET_WITH_TIMEOUT(set) &&
> > > +	      ip_set_timeout_expired(ext_timeout(n, set))))
> > > +		n =  NULL;
> > > +
> > > +	e = kzalloc(set->dsize, GFP_KERNEL);
> > > +	if (!e)
> > > +		return -ENOMEM;
> > > +	e->id = d->id;
> > > +	INIT_LIST_HEAD(&e->list);
> > > +	list_set_init_extensions(set, ext, e);
> > > +	if (n)
> > > +		list_set_replace(set, e, n);
> > > +	else if (next)
> > > +		list_add_tail_rcu(&e->list, &next->list);
> > > +	else if (prev)
> > > +		list_add_rcu(&e->list, &prev->list);
> > > +	else
> > > +		list_add_tail_rcu(&e->list, &map->members);
> > > +	spin_unlock_bh(&set->lock);
> > > +
> > > +	synchronize_rcu_bh();
> > 
> > I suspect you don't need this. What is your intention here?
> 
> Here the userspace adds/deletes/replaces an element in the list type of 
> set and in the meantime the kernel module can traverse the same linked 
> list. In the replace case we remove and delete the old entry, therefore 
> the call to synchronize_rcu_bh(). That could be called from a condition 
> then, to express the case.

But you're releasing objects via kfree_rcu(), right? Then you don't
need to wait for the RCU grace period.

* Re: [PATCH 01/14] netfilter: ipset: Support updating extensions when the set is full
  2014-12-03 11:26       ` Jozsef Kadlecsik
@ 2014-12-03 11:56         ` Pablo Neira Ayuso
  0 siblings, 0 replies; 32+ messages in thread
From: Pablo Neira Ayuso @ 2014-12-03 11:56 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: netfilter-devel

On Wed, Dec 03, 2014 at 12:26:52PM +0100, Jozsef Kadlecsik wrote:
> On Tue, 2 Dec 2014, Pablo Neira Ayuso wrote:
> 
> > On Tue, Dec 02, 2014 at 07:46:44PM +0100, Pablo Neira Ayuso wrote:
> > > On Sun, Nov 30, 2014 at 07:56:52PM +0100, Jozsef Kadlecsik wrote:
> > > > When the set was full (hash type and maxelem reached), it was not
> > > > possible to update the extension part of already existing elements.
> > > > The patch removes this limitation. (Fixes netfilter bugzilla id 880.)
> > > 
> > > Could you please add this:
> > > 
> > > https://bugzilla.netfilter.org/show_bug.cgi?id=880
> > > 
> > > for quick browsing. Thanks.
> > 
> > I can fix this here.
> > 
> > Actually, I can manually apply from 1 to 6 in the next batch I'm going
> > to send to David.
> > 
> > I would like to make sure at least those get to him in time, -rc7 is
> > already out, so the merge window may close by this weekend or the
> > beginning of next week (just predicting).
> 
> Thanks indeed, Pablo! Then I'll focus on the second part of the patches.

Thanks Jozsef. Applied from 1-6.

I'm going to prepare a batch for David. Please, focus on your rcu
patches, I'll do my best to get this in this merge window.

end of thread, other threads:[~2014-12-03 11:54 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-30 18:56 [PATCH 00/10] ipset patches for nf-next, v2 Jozsef Kadlecsik
2014-11-30 18:56 ` [PATCH 01/14] netfilter: ipset: Support updating extensions when the set is full Jozsef Kadlecsik
2014-12-02 18:46   ` Pablo Neira Ayuso
2014-12-02 18:50     ` Pablo Neira Ayuso
2014-12-03 11:26       ` Jozsef Kadlecsik
2014-12-03 11:56         ` Pablo Neira Ayuso
2014-11-30 18:56 ` [PATCH 02/14] netfilter: ipset: Alignment problem between 64bit kernel 32bit userspace Jozsef Kadlecsik
2014-11-30 18:56 ` [PATCH 03/14] netfilter: ipset: Indicate when /0 networks are supported Jozsef Kadlecsik
2014-11-30 18:56 ` [PATCH 04/14] netfilter: ipset: Simplify cidr handling for hash:*net* types Jozsef Kadlecsik
2014-11-30 18:56 ` [PATCH 05/14] netfilter: ipset: Allocate the proper size of memory when /0 networks are supported Jozsef Kadlecsik
2014-11-30 18:56 ` [PATCH 06/14] netfilter: ipset: Explicitly add padding elements to hash:net,net and hash:net,port,net Jozsef Kadlecsik
2014-11-30 18:56 ` [PATCH 07/14] netfilter: ipset: Remove rbtree from hash:net,iface in order to run under RCU Jozsef Kadlecsik
2014-12-02 18:23   ` Pablo Neira Ayuso
2014-12-03 10:54     ` Jozsef Kadlecsik
2014-11-30 18:56 ` [PATCH 08/14] netfilter: ipset: Introduce RCU locking instead of rwlock per set in the core Jozsef Kadlecsik
2014-12-02 18:25   ` Pablo Neira Ayuso
2014-12-03 11:01     ` Jozsef Kadlecsik
2014-11-30 18:57 ` [PATCH 09/14] netfilter: ipset: Introduce RCU locking in the bitmap types Jozsef Kadlecsik
2014-11-30 18:57 ` [PATCH 10/14] netfilter: ipset: Introduce RCU locking in the list type Jozsef Kadlecsik
2014-12-02 18:35   ` Pablo Neira Ayuso
2014-12-02 18:52     ` Pablo Neira Ayuso
2014-12-03 11:17     ` Jozsef Kadlecsik
2014-12-03 11:36       ` Pablo Neira Ayuso
2014-11-30 18:57 ` [PATCH 11/14] netfilter: ipset: Introduce RCU locking in the hash types Jozsef Kadlecsik
2014-12-01  7:59   ` Jesper Dangaard Brouer
2014-12-02 18:40   ` Pablo Neira Ayuso
2014-12-03 11:23     ` Jozsef Kadlecsik
2014-11-30 18:57 ` [PATCH 12/14] netfilter: ipset: styles warned by checkpatch.pl fixed Jozsef Kadlecsik
2014-12-02 18:43   ` Pablo Neira Ayuso
2014-12-03 11:25     ` Jozsef Kadlecsik
2014-11-30 18:57 ` [PATCH 13/14] netfilter: ipset: Fix parallel resizing and listing of the same set Jozsef Kadlecsik
2014-11-30 18:57 ` [PATCH 14/14] netfilter: ipset: Fix sparse warning Jozsef Kadlecsik
