Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH 0/6] netfilter updates for net-next
From: pablo @ 2012-12-04 17:31 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

From: Pablo Neira Ayuso <pablo@netfilter.org>

Hi David,

If time allows, please pull the following 6 updates for your
net-next tree:

* Remove limitation in the maximum number of supported sets in ipset.
  Now ipset automagically increments the number of slots in the array
  of sets by 64 new spare slots, from Jozsef Kadlecsik.

* Partially remove the generic queue infrastructure now that ip_queue
  is gone. Its only client is nfnetlink_queue now, from Florian
  Westphal.

* Add missing attribute policy checkings in ctnetlink, from Florian
  Westphal.

* Automagically kill conntrack entries that use the wrong output
  interface for the masquerading case in case of routing changes,
  from Jozsef Kadlecsik.

* Two patches two improve ct object traceability. Now ct objects are
  always placed in any of the existing lists. This allows us to dump
  the content of unconfirmed and dying conntracks via ctnetlink as
  a way to provide more instrumentation in case you suspect leaks,
  from myself.

You can pull these changes from:

git://1984.lsi.us.es/nf-next master

Thanks!

Florian Westphal (2):
  netfilter: kill support for per-af queue backends
  netfilter: ctnetlink: nla_policy updates

Jozsef Kadlecsik (2):
  netfilter: ipset: Increase the number of maximal sets automatically
  netfilter: nf_nat: Handle routing changes in MASQUERADE target

Pablo Neira Ayuso (2):
  netfilter: nf_conntrack: improve nf_conn object traceability
  netfilter: ctnetlink: dump entries from the dying and unconfirmed lists

 include/net/netfilter/nf_conntrack.h               |    2 +-
 include/net/netfilter/nf_nat.h                     |   15 ++
 include/net/netfilter/nf_queue.h                   |    8 +-
 include/uapi/linux/netfilter/nfnetlink_conntrack.h |    2 +
 net/ipv4/netfilter/iptable_nat.c                   |    4 +
 net/ipv6/netfilter/ip6table_nat.c                  |    4 +
 net/netfilter/core.c                               |    2 -
 net/netfilter/ipset/ip_set_core.c                  |  243 +++++++++++++-------
 net/netfilter/nf_conntrack_core.c                  |   25 +-
 net/netfilter/nf_conntrack_netlink.c               |  118 +++++++++-
 net/netfilter/nf_conntrack_proto_tcp.c             |    2 +
 net/netfilter/nf_queue.c                           |  152 ++----------
 net/netfilter/nfnetlink_queue_core.c               |   14 +-
 13 files changed, 332 insertions(+), 259 deletions(-)

--
1.7.10.4

^ permalink raw reply

* [PATCH 2/6] netfilter: nf_conntrack: improve nf_conn object traceability
From: pablo @ 2012-12-04 17:31 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1354642293-4114-1-git-send-email-pablo@netfilter.org>

From: Pablo Neira Ayuso <pablo@netfilter.org>

This patch modifies the conntrack subsystem so that all existing
allocated conntrack objects can be found in any of the following
places:

* the hash table, this is the typical place for alive conntrack objects.
* the unconfirmed list, this is the place for newly created conntrack objects
  that are still traversing the stack.
* the dying list, this is where you can find conntrack objects that are dying
  or that should die anytime soon (eg. once the destroy event is delivered to
  the conntrackd daemon).

Thus, we make sure that we follow the track for all existing conntrack
objects. This patch, together with some extension of the ctnetlink interface
to dump the content of the dying and unconfirmed lists, will help in case
to debug suspected nf_conn object leaks.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_conntrack.h |    2 +-
 net/netfilter/nf_conntrack_core.c    |   25 +++++++++----------------
 net/netfilter/nf_conntrack_netlink.c |    2 +-
 3 files changed, 11 insertions(+), 18 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index f1494fe..caca0c4 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -182,7 +182,7 @@ __nf_conntrack_find(struct net *net, u16 zone,
 
 extern int nf_conntrack_hash_check_insert(struct nf_conn *ct);
 extern void nf_ct_delete_from_lists(struct nf_conn *ct);
-extern void nf_ct_insert_dying_list(struct nf_conn *ct);
+extern void nf_ct_dying_timeout(struct nf_conn *ct);
 
 extern void nf_conntrack_flush_report(struct net *net, u32 pid, int report);
 
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 0f241be..af17516 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -221,11 +221,9 @@ destroy_conntrack(struct nf_conntrack *nfct)
 	 * too. */
 	nf_ct_remove_expectations(ct);
 
-	/* We overload first tuple to link into unconfirmed list. */
-	if (!nf_ct_is_confirmed(ct)) {
-		BUG_ON(hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode));
-		hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
-	}
+	/* We overload first tuple to link into unconfirmed or dying list.*/
+	BUG_ON(hlist_nulls_unhashed(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode));
+	hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
 
 	NF_CT_STAT_INC(net, delete);
 	spin_unlock_bh(&nf_conntrack_lock);
@@ -247,6 +245,9 @@ void nf_ct_delete_from_lists(struct nf_conn *ct)
 	 * Otherwise we can get spurious warnings. */
 	NF_CT_STAT_INC(net, delete_list);
 	clean_from_lists(ct);
+	/* add this conntrack to the dying list */
+	hlist_nulls_add_head(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode,
+			     &net->ct.dying);
 	spin_unlock_bh(&nf_conntrack_lock);
 }
 EXPORT_SYMBOL_GPL(nf_ct_delete_from_lists);
@@ -268,31 +269,23 @@ static void death_by_event(unsigned long ul_conntrack)
 	}
 	/* we've got the event delivered, now it's dying */
 	set_bit(IPS_DYING_BIT, &ct->status);
-	spin_lock(&nf_conntrack_lock);
-	hlist_nulls_del(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);
-	spin_unlock(&nf_conntrack_lock);
 	nf_ct_put(ct);
 }
 
-void nf_ct_insert_dying_list(struct nf_conn *ct)
+void nf_ct_dying_timeout(struct nf_conn *ct)
 {
 	struct net *net = nf_ct_net(ct);
 	struct nf_conntrack_ecache *ecache = nf_ct_ecache_find(ct);
 
 	BUG_ON(ecache == NULL);
 
-	/* add this conntrack to the dying list */
-	spin_lock_bh(&nf_conntrack_lock);
-	hlist_nulls_add_head(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode,
-			     &net->ct.dying);
-	spin_unlock_bh(&nf_conntrack_lock);
 	/* set a new timer to retry event delivery */
 	setup_timer(&ecache->timeout, death_by_event, (unsigned long)ct);
 	ecache->timeout.expires = jiffies +
 		(random32() % net->ct.sysctl_events_retry_timeout);
 	add_timer(&ecache->timeout);
 }
-EXPORT_SYMBOL_GPL(nf_ct_insert_dying_list);
+EXPORT_SYMBOL_GPL(nf_ct_dying_timeout);
 
 static void death_by_timeout(unsigned long ul_conntrack)
 {
@@ -307,7 +300,7 @@ static void death_by_timeout(unsigned long ul_conntrack)
 	    unlikely(nf_conntrack_event(IPCT_DESTROY, ct) < 0)) {
 		/* destroy event was not delivered */
 		nf_ct_delete_from_lists(ct);
-		nf_ct_insert_dying_list(ct);
+		nf_ct_dying_timeout(ct);
 		return;
 	}
 	set_bit(IPS_DYING_BIT, &ct->status);
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 7bbfb3d..34370a9 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -989,7 +989,7 @@ ctnetlink_del_conntrack(struct sock *ctnl, struct sk_buff *skb,
 					      nlmsg_report(nlh)) < 0) {
 			nf_ct_delete_from_lists(ct);
 			/* we failed to report the event, try later */
-			nf_ct_insert_dying_list(ct);
+			nf_ct_dying_timeout(ct);
 			nf_ct_put(ct);
 			return 0;
 		}
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 1/6] netfilter: ipset: Increase the number of maximal sets automatically
From: pablo @ 2012-12-04 17:31 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1354642293-4114-1-git-send-email-pablo@netfilter.org>

From: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>

The max number of sets was hardcoded at kernel cofiguration time and
could only be modified via a module parameter. The patch adds the support
of increasing the max number of sets automatically, as needed.

The array of sets is incremented by 64 new slots if we run out of
empty slots. The absolute limit for the maximal number of sets
is limited by 65534.

Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/ipset/ip_set_core.c |  243 ++++++++++++++++++++++++-------------
 1 file changed, 160 insertions(+), 83 deletions(-)

diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
index fed899f..6d6d8f2 100644
--- a/net/netfilter/ipset/ip_set_core.c
+++ b/net/netfilter/ipset/ip_set_core.c
@@ -28,9 +28,10 @@ static LIST_HEAD(ip_set_type_list);		/* all registered set types */
 static DEFINE_MUTEX(ip_set_type_mutex);		/* protects ip_set_type_list */
 static DEFINE_RWLOCK(ip_set_ref_lock);		/* protects the set refs */
 
-static struct ip_set **ip_set_list;		/* all individual sets */
+static struct ip_set * __rcu *ip_set_list;	/* all individual sets */
 static ip_set_id_t ip_set_max = CONFIG_IP_SET_MAX; /* max number of sets */
 
+#define IP_SET_INC	64
 #define STREQ(a, b)	(strncmp(a, b, IPSET_MAXNAMELEN) == 0)
 
 static unsigned int max_sets;
@@ -42,6 +43,12 @@ MODULE_AUTHOR("Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>");
 MODULE_DESCRIPTION("core IP set support");
 MODULE_ALIAS_NFNL_SUBSYS(NFNL_SUBSYS_IPSET);
 
+/* When the nfnl mutex is held: */
+#define nfnl_dereference(p)		\
+	rcu_dereference_protected(p, 1)
+#define nfnl_set(id)			\
+	nfnl_dereference(ip_set_list)[id]
+
 /*
  * The set types are implemented in modules and registered set types
  * can be found in ip_set_type_list. Adding/deleting types is
@@ -321,19 +328,19 @@ EXPORT_SYMBOL_GPL(ip_set_get_ipaddr6);
  */
 
 static inline void
-__ip_set_get(ip_set_id_t index)
+__ip_set_get(struct ip_set *set)
 {
 	write_lock_bh(&ip_set_ref_lock);
-	ip_set_list[index]->ref++;
+	set->ref++;
 	write_unlock_bh(&ip_set_ref_lock);
 }
 
 static inline void
-__ip_set_put(ip_set_id_t index)
+__ip_set_put(struct ip_set *set)
 {
 	write_lock_bh(&ip_set_ref_lock);
-	BUG_ON(ip_set_list[index]->ref == 0);
-	ip_set_list[index]->ref--;
+	BUG_ON(set->ref == 0);
+	set->ref--;
 	write_unlock_bh(&ip_set_ref_lock);
 }
 
@@ -344,12 +351,25 @@ __ip_set_put(ip_set_id_t index)
  * so it can't be destroyed (or changed) under our foot.
  */
 
+static inline struct ip_set *
+ip_set_rcu_get(ip_set_id_t index)
+{
+	struct ip_set *set;
+
+	rcu_read_lock();
+	/* ip_set_list itself needs to be protected */
+	set = rcu_dereference(ip_set_list)[index];
+	rcu_read_unlock();
+
+	return set;
+}
+
 int
 ip_set_test(ip_set_id_t index, const struct sk_buff *skb,
 	    const struct xt_action_param *par,
 	    const struct ip_set_adt_opt *opt)
 {
-	struct ip_set *set = ip_set_list[index];
+	struct ip_set *set = ip_set_rcu_get(index);
 	int ret = 0;
 
 	BUG_ON(set == NULL);
@@ -388,7 +408,7 @@ ip_set_add(ip_set_id_t index, const struct sk_buff *skb,
 	   const struct xt_action_param *par,
 	   const struct ip_set_adt_opt *opt)
 {
-	struct ip_set *set = ip_set_list[index];
+	struct ip_set *set = ip_set_rcu_get(index);
 	int ret;
 
 	BUG_ON(set == NULL);
@@ -411,7 +431,7 @@ ip_set_del(ip_set_id_t index, const struct sk_buff *skb,
 	   const struct xt_action_param *par,
 	   const struct ip_set_adt_opt *opt)
 {
-	struct ip_set *set = ip_set_list[index];
+	struct ip_set *set = ip_set_rcu_get(index);
 	int ret = 0;
 
 	BUG_ON(set == NULL);
@@ -440,14 +460,17 @@ ip_set_get_byname(const char *name, struct ip_set **set)
 	ip_set_id_t i, index = IPSET_INVALID_ID;
 	struct ip_set *s;
 
+	rcu_read_lock();
 	for (i = 0; i < ip_set_max; i++) {
-		s = ip_set_list[i];
+		s = rcu_dereference(ip_set_list)[i];
 		if (s != NULL && STREQ(s->name, name)) {
-			__ip_set_get(i);
+			__ip_set_get(s);
 			index = i;
 			*set = s;
+			break;
 		}
 	}
+	rcu_read_unlock();
 
 	return index;
 }
@@ -462,8 +485,13 @@ EXPORT_SYMBOL_GPL(ip_set_get_byname);
 void
 ip_set_put_byindex(ip_set_id_t index)
 {
-	if (ip_set_list[index] != NULL)
-		__ip_set_put(index);
+	struct ip_set *set;
+
+	rcu_read_lock();
+	set = rcu_dereference(ip_set_list)[index];
+	if (set != NULL)
+		__ip_set_put(set);
+	rcu_read_unlock();
 }
 EXPORT_SYMBOL_GPL(ip_set_put_byindex);
 
@@ -477,7 +505,7 @@ EXPORT_SYMBOL_GPL(ip_set_put_byindex);
 const char *
 ip_set_name_byindex(ip_set_id_t index)
 {
-	const struct ip_set *set = ip_set_list[index];
+	const struct ip_set *set = ip_set_rcu_get(index);
 
 	BUG_ON(set == NULL);
 	BUG_ON(set->ref == 0);
@@ -501,11 +529,18 @@ EXPORT_SYMBOL_GPL(ip_set_name_byindex);
 ip_set_id_t
 ip_set_nfnl_get(const char *name)
 {
+	ip_set_id_t i, index = IPSET_INVALID_ID;
 	struct ip_set *s;
-	ip_set_id_t index;
 
 	nfnl_lock();
-	index = ip_set_get_byname(name, &s);
+	for (i = 0; i < ip_set_max; i++) {
+		s = nfnl_set(i);
+		if (s != NULL && STREQ(s->name, name)) {
+			__ip_set_get(s);
+			index = i;
+			break;
+		}
+	}
 	nfnl_unlock();
 
 	return index;
@@ -521,12 +556,15 @@ EXPORT_SYMBOL_GPL(ip_set_nfnl_get);
 ip_set_id_t
 ip_set_nfnl_get_byindex(ip_set_id_t index)
 {
+	struct ip_set *set;
+
 	if (index > ip_set_max)
 		return IPSET_INVALID_ID;
 
 	nfnl_lock();
-	if (ip_set_list[index])
-		__ip_set_get(index);
+	set = nfnl_set(index);
+	if (set)
+		__ip_set_get(set);
 	else
 		index = IPSET_INVALID_ID;
 	nfnl_unlock();
@@ -545,8 +583,11 @@ EXPORT_SYMBOL_GPL(ip_set_nfnl_get_byindex);
 void
 ip_set_nfnl_put(ip_set_id_t index)
 {
+	struct ip_set *set;
 	nfnl_lock();
-	ip_set_put_byindex(index);
+	set = nfnl_set(index);
+	if (set != NULL)
+		__ip_set_put(set);
 	nfnl_unlock();
 }
 EXPORT_SYMBOL_GPL(ip_set_nfnl_put);
@@ -603,41 +644,46 @@ static const struct nla_policy ip_set_create_policy[IPSET_ATTR_CMD_MAX + 1] = {
 	[IPSET_ATTR_DATA]	= { .type = NLA_NESTED },
 };
 
-static ip_set_id_t
-find_set_id(const char *name)
+static struct ip_set *
+find_set_and_id(const char *name, ip_set_id_t *id)
 {
-	ip_set_id_t i, index = IPSET_INVALID_ID;
-	const struct ip_set *set;
+	struct ip_set *set = NULL;
+	ip_set_id_t i;
 
-	for (i = 0; index == IPSET_INVALID_ID && i < ip_set_max; i++) {
-		set = ip_set_list[i];
-		if (set != NULL && STREQ(set->name, name))
-			index = i;
+	*id = IPSET_INVALID_ID;
+	for (i = 0; i < ip_set_max; i++) {
+		set = nfnl_set(i);
+		if (set != NULL && STREQ(set->name, name)) {
+			*id = i;
+			break;
+		}
 	}
-	return index;
+	return (*id == IPSET_INVALID_ID ? NULL : set);
 }
 
 static inline struct ip_set *
 find_set(const char *name)
 {
-	ip_set_id_t index = find_set_id(name);
+	ip_set_id_t id;
 
-	return index == IPSET_INVALID_ID ? NULL : ip_set_list[index];
+	return find_set_and_id(name, &id);
 }
 
 static int
 find_free_id(const char *name, ip_set_id_t *index, struct ip_set **set)
 {
+	struct ip_set *s;
 	ip_set_id_t i;
 
 	*index = IPSET_INVALID_ID;
 	for (i = 0;  i < ip_set_max; i++) {
-		if (ip_set_list[i] == NULL) {
+		s = nfnl_set(i);
+		if (s == NULL) {
 			if (*index == IPSET_INVALID_ID)
 				*index = i;
-		} else if (STREQ(name, ip_set_list[i]->name)) {
+		} else if (STREQ(name, s->name)) {
 			/* Name clash */
-			*set = ip_set_list[i];
+			*set = s;
 			return -EEXIST;
 		}
 	}
@@ -730,10 +776,9 @@ ip_set_create(struct sock *ctnl, struct sk_buff *skb,
 	 * and check clashing.
 	 */
 	ret = find_free_id(set->name, &index, &clash);
-	if (ret != 0) {
+	if (ret == -EEXIST) {
 		/* If this is the same set and requested, ignore error */
-		if (ret == -EEXIST &&
-		    (flags & IPSET_FLAG_EXIST) &&
+		if ((flags & IPSET_FLAG_EXIST) &&
 		    STREQ(set->type->name, clash->type->name) &&
 		    set->type->family == clash->type->family &&
 		    set->type->revision_min == clash->type->revision_min &&
@@ -741,13 +786,36 @@ ip_set_create(struct sock *ctnl, struct sk_buff *skb,
 		    set->variant->same_set(set, clash))
 			ret = 0;
 		goto cleanup;
-	}
+	} else if (ret == -IPSET_ERR_MAX_SETS) {
+		struct ip_set **list, **tmp;
+		ip_set_id_t i = ip_set_max + IP_SET_INC;
+
+		if (i < ip_set_max || i == IPSET_INVALID_ID)
+			/* Wraparound */
+			goto cleanup;
+
+		list = kzalloc(sizeof(struct ip_set *) * i, GFP_KERNEL);
+		if (!list)
+			goto cleanup;
+		/* nfnl mutex is held, both lists are valid */
+		tmp = nfnl_dereference(ip_set_list);
+		memcpy(list, tmp, sizeof(struct ip_set *) * ip_set_max);
+		rcu_assign_pointer(ip_set_list, list);
+		/* Make sure all current packets have passed through */
+		synchronize_net();
+		/* Use new list */
+		index = ip_set_max;
+		ip_set_max = i;
+		kfree(tmp);
+		ret = 0;
+	} else if (ret)
+		goto cleanup;
 
 	/*
 	 * Finally! Add our shiny new set to the list, and be done.
 	 */
 	pr_debug("create: '%s' created with index %u!\n", set->name, index);
-	ip_set_list[index] = set;
+	nfnl_set(index) = set;
 
 	return ret;
 
@@ -772,10 +840,10 @@ ip_set_setname_policy[IPSET_ATTR_CMD_MAX + 1] = {
 static void
 ip_set_destroy_set(ip_set_id_t index)
 {
-	struct ip_set *set = ip_set_list[index];
+	struct ip_set *set = nfnl_set(index);
 
 	pr_debug("set: %s\n",  set->name);
-	ip_set_list[index] = NULL;
+	nfnl_set(index) = NULL;
 
 	/* Must call it without holding any lock */
 	set->variant->destroy(set);
@@ -788,6 +856,7 @@ ip_set_destroy(struct sock *ctnl, struct sk_buff *skb,
 	       const struct nlmsghdr *nlh,
 	       const struct nlattr * const attr[])
 {
+	struct ip_set *s;
 	ip_set_id_t i;
 	int ret = 0;
 
@@ -807,22 +876,24 @@ ip_set_destroy(struct sock *ctnl, struct sk_buff *skb,
 	read_lock_bh(&ip_set_ref_lock);
 	if (!attr[IPSET_ATTR_SETNAME]) {
 		for (i = 0; i < ip_set_max; i++) {
-			if (ip_set_list[i] != NULL && ip_set_list[i]->ref) {
+			s = nfnl_set(i);
+			if (s != NULL && s->ref) {
 				ret = -IPSET_ERR_BUSY;
 				goto out;
 			}
 		}
 		read_unlock_bh(&ip_set_ref_lock);
 		for (i = 0; i < ip_set_max; i++) {
-			if (ip_set_list[i] != NULL)
+			s = nfnl_set(i);
+			if (s != NULL)
 				ip_set_destroy_set(i);
 		}
 	} else {
-		i = find_set_id(nla_data(attr[IPSET_ATTR_SETNAME]));
-		if (i == IPSET_INVALID_ID) {
+		s = find_set_and_id(nla_data(attr[IPSET_ATTR_SETNAME]), &i);
+		if (s == NULL) {
 			ret = -ENOENT;
 			goto out;
-		} else if (ip_set_list[i]->ref) {
+		} else if (s->ref) {
 			ret = -IPSET_ERR_BUSY;
 			goto out;
 		}
@@ -853,21 +924,24 @@ ip_set_flush(struct sock *ctnl, struct sk_buff *skb,
 	     const struct nlmsghdr *nlh,
 	     const struct nlattr * const attr[])
 {
+	struct ip_set *s;
 	ip_set_id_t i;
 
 	if (unlikely(protocol_failed(attr)))
 		return -IPSET_ERR_PROTOCOL;
 
 	if (!attr[IPSET_ATTR_SETNAME]) {
-		for (i = 0; i < ip_set_max; i++)
-			if (ip_set_list[i] != NULL)
-				ip_set_flush_set(ip_set_list[i]);
+		for (i = 0; i < ip_set_max; i++) {
+			s = nfnl_set(i);
+			if (s != NULL)
+				ip_set_flush_set(s);
+		}
 	} else {
-		i = find_set_id(nla_data(attr[IPSET_ATTR_SETNAME]));
-		if (i == IPSET_INVALID_ID)
+		s = find_set(nla_data(attr[IPSET_ATTR_SETNAME]));
+		if (s == NULL)
 			return -ENOENT;
 
-		ip_set_flush_set(ip_set_list[i]);
+		ip_set_flush_set(s);
 	}
 
 	return 0;
@@ -889,7 +963,7 @@ ip_set_rename(struct sock *ctnl, struct sk_buff *skb,
 	      const struct nlmsghdr *nlh,
 	      const struct nlattr * const attr[])
 {
-	struct ip_set *set;
+	struct ip_set *set, *s;
 	const char *name2;
 	ip_set_id_t i;
 	int ret = 0;
@@ -911,8 +985,8 @@ ip_set_rename(struct sock *ctnl, struct sk_buff *skb,
 
 	name2 = nla_data(attr[IPSET_ATTR_SETNAME2]);
 	for (i = 0; i < ip_set_max; i++) {
-		if (ip_set_list[i] != NULL &&
-		    STREQ(ip_set_list[i]->name, name2)) {
+		s = nfnl_set(i);
+		if (s != NULL && STREQ(s->name, name2)) {
 			ret = -IPSET_ERR_EXIST_SETNAME2;
 			goto out;
 		}
@@ -947,17 +1021,14 @@ ip_set_swap(struct sock *ctnl, struct sk_buff *skb,
 		     attr[IPSET_ATTR_SETNAME2] == NULL))
 		return -IPSET_ERR_PROTOCOL;
 
-	from_id = find_set_id(nla_data(attr[IPSET_ATTR_SETNAME]));
-	if (from_id == IPSET_INVALID_ID)
+	from = find_set_and_id(nla_data(attr[IPSET_ATTR_SETNAME]), &from_id);
+	if (from == NULL)
 		return -ENOENT;
 
-	to_id = find_set_id(nla_data(attr[IPSET_ATTR_SETNAME2]));
-	if (to_id == IPSET_INVALID_ID)
+	to = find_set_and_id(nla_data(attr[IPSET_ATTR_SETNAME2]), &to_id);
+	if (to == NULL)
 		return -IPSET_ERR_EXIST_SETNAME2;
 
-	from = ip_set_list[from_id];
-	to = ip_set_list[to_id];
-
 	/* Features must not change.
 	 * Not an artificial restriction anymore, as we must prevent
 	 * possible loops created by swapping in setlist type of sets. */
@@ -971,8 +1042,8 @@ ip_set_swap(struct sock *ctnl, struct sk_buff *skb,
 
 	write_lock_bh(&ip_set_ref_lock);
 	swap(from->ref, to->ref);
-	ip_set_list[from_id] = to;
-	ip_set_list[to_id] = from;
+	nfnl_set(from_id) = to;
+	nfnl_set(to_id) = from;
 	write_unlock_bh(&ip_set_ref_lock);
 
 	return 0;
@@ -992,7 +1063,7 @@ static int
 ip_set_dump_done(struct netlink_callback *cb)
 {
 	if (cb->args[2]) {
-		pr_debug("release set %s\n", ip_set_list[cb->args[1]]->name);
+		pr_debug("release set %s\n", nfnl_set(cb->args[1])->name);
 		ip_set_put_byindex((ip_set_id_t) cb->args[1]);
 	}
 	return 0;
@@ -1030,8 +1101,11 @@ dump_init(struct netlink_callback *cb)
 	 */
 
 	if (cda[IPSET_ATTR_SETNAME]) {
-		index = find_set_id(nla_data(cda[IPSET_ATTR_SETNAME]));
-		if (index == IPSET_INVALID_ID)
+		struct ip_set *set;
+
+		set = find_set_and_id(nla_data(cda[IPSET_ATTR_SETNAME]),
+				      &index);
+		if (set == NULL)
 			return -ENOENT;
 
 		dump_type = DUMP_ONE;
@@ -1081,7 +1155,7 @@ dump_last:
 		 dump_type, dump_flags, cb->args[1]);
 	for (; cb->args[1] < max; cb->args[1]++) {
 		index = (ip_set_id_t) cb->args[1];
-		set = ip_set_list[index];
+		set = nfnl_set(index);
 		if (set == NULL) {
 			if (dump_type == DUMP_ONE) {
 				ret = -ENOENT;
@@ -1100,7 +1174,7 @@ dump_last:
 		if (!cb->args[2]) {
 			/* Start listing: make sure set won't be destroyed */
 			pr_debug("reference set\n");
-			__ip_set_get(index);
+			__ip_set_get(set);
 		}
 		nlh = start_msg(skb, NETLINK_CB(cb->skb).portid,
 				cb->nlh->nlmsg_seq, flags,
@@ -1159,7 +1233,7 @@ next_set:
 release_refcount:
 	/* If there was an error or set is done, release set */
 	if (ret || !cb->args[2]) {
-		pr_debug("release set %s\n", ip_set_list[index]->name);
+		pr_debug("release set %s\n", nfnl_set(index)->name);
 		ip_set_put_byindex(index);
 		cb->args[2] = 0;
 	}
@@ -1409,17 +1483,15 @@ ip_set_header(struct sock *ctnl, struct sk_buff *skb,
 	const struct ip_set *set;
 	struct sk_buff *skb2;
 	struct nlmsghdr *nlh2;
-	ip_set_id_t index;
 	int ret = 0;
 
 	if (unlikely(protocol_failed(attr) ||
 		     attr[IPSET_ATTR_SETNAME] == NULL))
 		return -IPSET_ERR_PROTOCOL;
 
-	index = find_set_id(nla_data(attr[IPSET_ATTR_SETNAME]));
-	if (index == IPSET_INVALID_ID)
+	set = find_set(nla_data(attr[IPSET_ATTR_SETNAME]));
+	if (set == NULL)
 		return -ENOENT;
-	set = ip_set_list[index];
 
 	skb2 = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
 	if (skb2 == NULL)
@@ -1684,6 +1756,7 @@ ip_set_sockfn_get(struct sock *sk, int optval, void __user *user, int *len)
 	}
 	case IP_SET_OP_GET_BYNAME: {
 		struct ip_set_req_get_set *req_get = data;
+		ip_set_id_t id;
 
 		if (*len != sizeof(struct ip_set_req_get_set)) {
 			ret = -EINVAL;
@@ -1691,12 +1764,14 @@ ip_set_sockfn_get(struct sock *sk, int optval, void __user *user, int *len)
 		}
 		req_get->set.name[IPSET_MAXNAMELEN - 1] = '\0';
 		nfnl_lock();
-		req_get->set.index = find_set_id(req_get->set.name);
+		find_set_and_id(req_get->set.name, &id);
+		req_get->set.index = id;
 		nfnl_unlock();
 		goto copy;
 	}
 	case IP_SET_OP_GET_BYINDEX: {
 		struct ip_set_req_get_set *req_get = data;
+		struct ip_set *set;
 
 		if (*len != sizeof(struct ip_set_req_get_set) ||
 		    req_get->set.index >= ip_set_max) {
@@ -1704,9 +1779,8 @@ ip_set_sockfn_get(struct sock *sk, int optval, void __user *user, int *len)
 			goto done;
 		}
 		nfnl_lock();
-		strncpy(req_get->set.name,
-			ip_set_list[req_get->set.index]
-				? ip_set_list[req_get->set.index]->name : "",
+		set = nfnl_set(req_get->set.index);
+		strncpy(req_get->set.name, set ? set->name : "",
 			IPSET_MAXNAMELEN);
 		nfnl_unlock();
 		goto copy;
@@ -1737,6 +1811,7 @@ static struct nf_sockopt_ops so_set __read_mostly = {
 static int __init
 ip_set_init(void)
 {
+	struct ip_set **list;
 	int ret;
 
 	if (max_sets)
@@ -1744,22 +1819,22 @@ ip_set_init(void)
 	if (ip_set_max >= IPSET_INVALID_ID)
 		ip_set_max = IPSET_INVALID_ID - 1;
 
-	ip_set_list = kzalloc(sizeof(struct ip_set *) * ip_set_max,
-			      GFP_KERNEL);
-	if (!ip_set_list)
+	list = kzalloc(sizeof(struct ip_set *) * ip_set_max, GFP_KERNEL);
+	if (!list)
 		return -ENOMEM;
 
+	rcu_assign_pointer(ip_set_list, list);
 	ret = nfnetlink_subsys_register(&ip_set_netlink_subsys);
 	if (ret != 0) {
 		pr_err("ip_set: cannot register with nfnetlink.\n");
-		kfree(ip_set_list);
+		kfree(list);
 		return ret;
 	}
 	ret = nf_register_sockopt(&so_set);
 	if (ret != 0) {
 		pr_err("SO_SET registry failed: %d\n", ret);
 		nfnetlink_subsys_unregister(&ip_set_netlink_subsys);
-		kfree(ip_set_list);
+		kfree(list);
 		return ret;
 	}
 
@@ -1770,10 +1845,12 @@ ip_set_init(void)
 static void __exit
 ip_set_fini(void)
 {
+	struct ip_set **list = rcu_dereference_protected(ip_set_list, 1);
+
 	/* There can't be any existing set */
 	nf_unregister_sockopt(&so_set);
 	nfnetlink_subsys_unregister(&ip_set_netlink_subsys);
-	kfree(ip_set_list);
+	kfree(list);
 	pr_debug("these are the famous last words\n");
 }
 
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 3/6] netfilter: ctnetlink: dump entries from the dying and unconfirmed lists
From: pablo @ 2012-12-04 17:31 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1354642293-4114-1-git-send-email-pablo@netfilter.org>

From: Pablo Neira Ayuso <pablo@netfilter.org>

This patch adds a new operation to dump the content of the dying and
unconfirmed lists.

Under some situations, the global conntrack counter can be inconsistent
with the number of entries that we can dump from the conntrack table.
The way to resolve this is to allow dumping the content of the unconfirmed
and dying lists, so far it was not possible to look at its content.

This provides some extra instrumentation to resolve problematic situations
in which anyone suspects memory leaks.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/uapi/linux/netfilter/nfnetlink_conntrack.h |    2 +
 net/netfilter/nf_conntrack_netlink.c               |  108 ++++++++++++++++++++
 2 files changed, 110 insertions(+)

diff --git a/include/uapi/linux/netfilter/nfnetlink_conntrack.h b/include/uapi/linux/netfilter/nfnetlink_conntrack.h
index 43bfe3e..86e930c 100644
--- a/include/uapi/linux/netfilter/nfnetlink_conntrack.h
+++ b/include/uapi/linux/netfilter/nfnetlink_conntrack.h
@@ -9,6 +9,8 @@ enum cntl_msg_types {
 	IPCTNL_MSG_CT_GET_CTRZERO,
 	IPCTNL_MSG_CT_GET_STATS_CPU,
 	IPCTNL_MSG_CT_GET_STATS,
+	IPCTNL_MSG_CT_GET_DYING,
+	IPCTNL_MSG_CT_GET_UNCONFIRMED,
 
 	IPCTNL_MSG_MAX
 };
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 34370a9..c24a00a 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -1089,6 +1089,112 @@ out:
 	return err == -EAGAIN ? -ENOBUFS : err;
 }
 
+static int ctnetlink_done_list(struct netlink_callback *cb)
+{
+	if (cb->args[1])
+		nf_ct_put((struct nf_conn *)cb->args[1]);
+	return 0;
+}
+
+static int
+ctnetlink_dump_list(struct sk_buff *skb, struct netlink_callback *cb,
+		    struct hlist_nulls_head *list)
+{
+	struct nf_conn *ct, *last;
+	struct nf_conntrack_tuple_hash *h;
+	struct hlist_nulls_node *n;
+	struct nfgenmsg *nfmsg = nlmsg_data(cb->nlh);
+	u_int8_t l3proto = nfmsg->nfgen_family;
+	int res;
+
+	if (cb->args[2])
+		return 0;
+
+	spin_lock_bh(&nf_conntrack_lock);
+	last = (struct nf_conn *)cb->args[1];
+restart:
+	hlist_nulls_for_each_entry(h, n, list, hnnode) {
+		ct = nf_ct_tuplehash_to_ctrack(h);
+		if (l3proto && nf_ct_l3num(ct) != l3proto)
+			continue;
+		if (cb->args[1]) {
+			if (ct != last)
+				continue;
+			cb->args[1] = 0;
+		}
+		rcu_read_lock();
+		res = ctnetlink_fill_info(skb, NETLINK_CB(cb->skb).portid,
+					  cb->nlh->nlmsg_seq,
+					  NFNL_MSG_TYPE(cb->nlh->nlmsg_type),
+					  ct);
+		rcu_read_unlock();
+		if (res < 0) {
+			nf_conntrack_get(&ct->ct_general);
+			cb->args[1] = (unsigned long)ct;
+			goto out;
+		}
+	}
+	if (cb->args[1]) {
+		cb->args[1] = 0;
+		goto restart;
+	} else
+		cb->args[2] = 1;
+out:
+	spin_unlock_bh(&nf_conntrack_lock);
+	if (last)
+		nf_ct_put(last);
+
+	return skb->len;
+}
+
+static int
+ctnetlink_dump_dying(struct sk_buff *skb, struct netlink_callback *cb)
+{
+	struct net *net = sock_net(skb->sk);
+
+	return ctnetlink_dump_list(skb, cb, &net->ct.dying);
+}
+
+static int
+ctnetlink_get_ct_dying(struct sock *ctnl, struct sk_buff *skb,
+		       const struct nlmsghdr *nlh,
+		       const struct nlattr * const cda[])
+{
+	if (nlh->nlmsg_flags & NLM_F_DUMP) {
+		struct netlink_dump_control c = {
+			.dump = ctnetlink_dump_dying,
+			.done = ctnetlink_done_list,
+		};
+		return netlink_dump_start(ctnl, skb, nlh, &c);
+	}
+
+	return -EOPNOTSUPP;
+}
+
+static int
+ctnetlink_dump_unconfirmed(struct sk_buff *skb, struct netlink_callback *cb)
+{
+	struct net *net = sock_net(skb->sk);
+
+	return ctnetlink_dump_list(skb, cb, &net->ct.unconfirmed);
+}
+
+static int
+ctnetlink_get_ct_unconfirmed(struct sock *ctnl, struct sk_buff *skb,
+			     const struct nlmsghdr *nlh,
+			     const struct nlattr * const cda[])
+{
+	if (nlh->nlmsg_flags & NLM_F_DUMP) {
+		struct netlink_dump_control c = {
+			.dump = ctnetlink_dump_unconfirmed,
+			.done = ctnetlink_done_list,
+		};
+		return netlink_dump_start(ctnl, skb, nlh, &c);
+	}
+
+	return -EOPNOTSUPP;
+}
+
 #ifdef CONFIG_NF_NAT_NEEDED
 static int
 ctnetlink_parse_nat_setup(struct nf_conn *ct,
@@ -2712,6 +2818,8 @@ static const struct nfnl_callback ctnl_cb[IPCTNL_MSG_MAX] = {
 					    .policy = ct_nla_policy },
 	[IPCTNL_MSG_CT_GET_STATS_CPU]	= { .call = ctnetlink_stat_ct_cpu },
 	[IPCTNL_MSG_CT_GET_STATS]	= { .call = ctnetlink_stat_ct },
+	[IPCTNL_MSG_CT_GET_DYING]	= { .call = ctnetlink_get_ct_dying },
+	[IPCTNL_MSG_CT_GET_UNCONFIRMED]	= { .call = ctnetlink_get_ct_unconfirmed },
 };
 
 static const struct nfnl_callback ctnl_exp_cb[IPCTNL_MSG_EXP_MAX] = {
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 4/6] netfilter: kill support for per-af queue backends
From: pablo @ 2012-12-04 17:31 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1354642293-4114-1-git-send-email-pablo@netfilter.org>

From: Florian Westphal <fw@strlen.de>

We used to have several queueing backends, but nowadays only
nfnetlink_queue remains.

In light of this there doesn't seem to be a good reason to
support per-af registering -- just hook up nfnetlink_queue on module
load and remove it on unload.

This means that the userspace BIND/UNBIND_PF commands are now obsolete;
the kernel will ignore them.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/net/netfilter/nf_queue.h     |    8 +-
 net/netfilter/core.c                 |    2 -
 net/netfilter/nf_queue.c             |  152 +++-------------------------------
 net/netfilter/nfnetlink_queue_core.c |   14 ++--
 4 files changed, 20 insertions(+), 156 deletions(-)

diff --git a/include/net/netfilter/nf_queue.h b/include/net/netfilter/nf_queue.h
index 252fd10..fb1c0be 100644
--- a/include/net/netfilter/nf_queue.h
+++ b/include/net/netfilter/nf_queue.h
@@ -21,14 +21,10 @@ struct nf_queue_entry {
 struct nf_queue_handler {
 	int			(*outfn)(struct nf_queue_entry *entry,
 					 unsigned int queuenum);
-	char			*name;
 };
 
-extern int nf_register_queue_handler(u_int8_t pf,
-				     const struct nf_queue_handler *qh);
-extern int nf_unregister_queue_handler(u_int8_t pf,
-				       const struct nf_queue_handler *qh);
-extern void nf_unregister_queue_handlers(const struct nf_queue_handler *qh);
+void nf_register_queue_handler(const struct nf_queue_handler *qh);
+void nf_unregister_queue_handler(void);
 extern void nf_reinject(struct nf_queue_entry *entry, unsigned int verdict);
 
 #endif /* _NF_QUEUE_H */
diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index 68912dad..a9c488b 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -295,8 +295,6 @@ void __init netfilter_init(void)
 		panic("cannot create netfilter proc entry");
 #endif
 
-	if (netfilter_queue_init() < 0)
-		panic("cannot initialize nf_queue");
 	if (netfilter_log_init() < 0)
 		panic("cannot initialize nf_log");
 }
diff --git a/net/netfilter/nf_queue.c b/net/netfilter/nf_queue.c
index 8d2cf9e..d812c12 100644
--- a/net/netfilter/nf_queue.c
+++ b/net/netfilter/nf_queue.c
@@ -14,84 +14,32 @@
 #include "nf_internals.h"
 
 /*
- * A queue handler may be registered for each protocol.  Each is protected by
- * long term mutex.  The handler must provide an an outfn() to accept packets
- * for queueing and must reinject all packets it receives, no matter what.
+ * Hook for nfnetlink_queue to register its queue handler.
+ * We do this so that most of the NFQUEUE code can be modular.
+ *
+ * Once the queue is registered it must reinject all packets it
+ * receives, no matter what.
  */
-static const struct nf_queue_handler __rcu *queue_handler[NFPROTO_NUMPROTO] __read_mostly;
-
-static DEFINE_MUTEX(queue_handler_mutex);
+static const struct nf_queue_handler __rcu *queue_handler __read_mostly;
 
 /* return EBUSY when somebody else is registered, return EEXIST if the
  * same handler is registered, return 0 in case of success. */
-int nf_register_queue_handler(u_int8_t pf, const struct nf_queue_handler *qh)
+void nf_register_queue_handler(const struct nf_queue_handler *qh)
 {
-	int ret;
-	const struct nf_queue_handler *old;
-
-	if (pf >= ARRAY_SIZE(queue_handler))
-		return -EINVAL;
-
-	mutex_lock(&queue_handler_mutex);
-	old = rcu_dereference_protected(queue_handler[pf],
-					lockdep_is_held(&queue_handler_mutex));
-	if (old == qh)
-		ret = -EEXIST;
-	else if (old)
-		ret = -EBUSY;
-	else {
-		rcu_assign_pointer(queue_handler[pf], qh);
-		ret = 0;
-	}
-	mutex_unlock(&queue_handler_mutex);
-
-	return ret;
+	/* should never happen, we only have one queueing backend in kernel */
+	WARN_ON(rcu_access_pointer(queue_handler));
+	rcu_assign_pointer(queue_handler, qh);
 }
 EXPORT_SYMBOL(nf_register_queue_handler);
 
 /* The caller must flush their queue before this */
-int nf_unregister_queue_handler(u_int8_t pf, const struct nf_queue_handler *qh)
+void nf_unregister_queue_handler(void)
 {
-	const struct nf_queue_handler *old;
-
-	if (pf >= ARRAY_SIZE(queue_handler))
-		return -EINVAL;
-
-	mutex_lock(&queue_handler_mutex);
-	old = rcu_dereference_protected(queue_handler[pf],
-					lockdep_is_held(&queue_handler_mutex));
-	if (old && old != qh) {
-		mutex_unlock(&queue_handler_mutex);
-		return -EINVAL;
-	}
-
-	RCU_INIT_POINTER(queue_handler[pf], NULL);
-	mutex_unlock(&queue_handler_mutex);
-
+	RCU_INIT_POINTER(queue_handler, NULL);
 	synchronize_rcu();
-
-	return 0;
 }
 EXPORT_SYMBOL(nf_unregister_queue_handler);
 
-void nf_unregister_queue_handlers(const struct nf_queue_handler *qh)
-{
-	u_int8_t pf;
-
-	mutex_lock(&queue_handler_mutex);
-	for (pf = 0; pf < ARRAY_SIZE(queue_handler); pf++)  {
-		if (rcu_dereference_protected(
-				queue_handler[pf],
-				lockdep_is_held(&queue_handler_mutex)
-				) == qh)
-			RCU_INIT_POINTER(queue_handler[pf], NULL);
-	}
-	mutex_unlock(&queue_handler_mutex);
-
-	synchronize_rcu();
-}
-EXPORT_SYMBOL_GPL(nf_unregister_queue_handlers);
-
 static void nf_queue_entry_release_refs(struct nf_queue_entry *entry)
 {
 	/* Release those devices we held, or Alexey will kill me. */
@@ -137,7 +85,7 @@ static int __nf_queue(struct sk_buff *skb,
 	/* QUEUE == DROP if no one is waiting, to be safe. */
 	rcu_read_lock();
 
-	qh = rcu_dereference(queue_handler[pf]);
+	qh = rcu_dereference(queue_handler);
 	if (!qh) {
 		status = -ESRCH;
 		goto err_unlock;
@@ -344,77 +292,3 @@ void nf_reinject(struct nf_queue_entry *entry, unsigned int verdict)
 	kfree(entry);
 }
 EXPORT_SYMBOL(nf_reinject);
-
-#ifdef CONFIG_PROC_FS
-static void *seq_start(struct seq_file *seq, loff_t *pos)
-{
-	if (*pos >= ARRAY_SIZE(queue_handler))
-		return NULL;
-
-	return pos;
-}
-
-static void *seq_next(struct seq_file *s, void *v, loff_t *pos)
-{
-	(*pos)++;
-
-	if (*pos >= ARRAY_SIZE(queue_handler))
-		return NULL;
-
-	return pos;
-}
-
-static void seq_stop(struct seq_file *s, void *v)
-{
-
-}
-
-static int seq_show(struct seq_file *s, void *v)
-{
-	int ret;
-	loff_t *pos = v;
-	const struct nf_queue_handler *qh;
-
-	rcu_read_lock();
-	qh = rcu_dereference(queue_handler[*pos]);
-	if (!qh)
-		ret = seq_printf(s, "%2lld NONE\n", *pos);
-	else
-		ret = seq_printf(s, "%2lld %s\n", *pos, qh->name);
-	rcu_read_unlock();
-
-	return ret;
-}
-
-static const struct seq_operations nfqueue_seq_ops = {
-	.start	= seq_start,
-	.next	= seq_next,
-	.stop	= seq_stop,
-	.show	= seq_show,
-};
-
-static int nfqueue_open(struct inode *inode, struct file *file)
-{
-	return seq_open(file, &nfqueue_seq_ops);
-}
-
-static const struct file_operations nfqueue_file_ops = {
-	.owner	 = THIS_MODULE,
-	.open	 = nfqueue_open,
-	.read	 = seq_read,
-	.llseek	 = seq_lseek,
-	.release = seq_release,
-};
-#endif /* PROC_FS */
-
-
-int __init netfilter_queue_init(void)
-{
-#ifdef CONFIG_PROC_FS
-	if (!proc_create("nf_queue", S_IRUGO,
-			 proc_net_netfilter, &nfqueue_file_ops))
-		return -1;
-#endif
-	return 0;
-}
-
diff --git a/net/netfilter/nfnetlink_queue_core.c b/net/netfilter/nfnetlink_queue_core.c
index e12d44e..3158d87 100644
--- a/net/netfilter/nfnetlink_queue_core.c
+++ b/net/netfilter/nfnetlink_queue_core.c
@@ -809,7 +809,6 @@ static const struct nla_policy nfqa_cfg_policy[NFQA_CFG_MAX+1] = {
 };
 
 static const struct nf_queue_handler nfqh = {
-	.name 	= "nf_queue",
 	.outfn	= &nfqnl_enqueue_packet,
 };
 
@@ -827,14 +826,10 @@ nfqnl_recv_config(struct sock *ctnl, struct sk_buff *skb,
 	if (nfqa[NFQA_CFG_CMD]) {
 		cmd = nla_data(nfqa[NFQA_CFG_CMD]);
 
-		/* Commands without queue context - might sleep */
+		/* Obsolete commands without queue context */
 		switch (cmd->command) {
-		case NFQNL_CFG_CMD_PF_BIND:
-			return nf_register_queue_handler(ntohs(cmd->pf),
-							 &nfqh);
-		case NFQNL_CFG_CMD_PF_UNBIND:
-			return nf_unregister_queue_handler(ntohs(cmd->pf),
-							   &nfqh);
+		case NFQNL_CFG_CMD_PF_BIND: return 0;
+		case NFQNL_CFG_CMD_PF_UNBIND: return 0;
 		}
 	}
 
@@ -1074,6 +1069,7 @@ static int __init nfnetlink_queue_init(void)
 #endif
 
 	register_netdevice_notifier(&nfqnl_dev_notifier);
+	nf_register_queue_handler(&nfqh);
 	return status;
 
 #ifdef CONFIG_PROC_FS
@@ -1087,7 +1083,7 @@ cleanup_netlink_notifier:
 
 static void __exit nfnetlink_queue_fini(void)
 {
-	nf_unregister_queue_handlers(&nfqh);
+	nf_unregister_queue_handler();
 	unregister_netdevice_notifier(&nfqnl_dev_notifier);
 #ifdef CONFIG_PROC_FS
 	remove_proc_entry("nfnetlink_queue", proc_net_netfilter);
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 5/6] netfilter: ctnetlink: nla_policy updates
From: pablo @ 2012-12-04 17:31 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1354642293-4114-1-git-send-email-pablo@netfilter.org>

From: Florian Westphal <fw@strlen.de>

Add stricter checking for a few attributes.
Note that these changes don't fix any bug in the current code base.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/netfilter/nf_conntrack_netlink.c   |    8 ++++++--
 net/netfilter/nf_conntrack_proto_tcp.c |    2 ++
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index c24a00a..4e078cd 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -898,7 +898,8 @@ ctnetlink_parse_zone(const struct nlattr *attr, u16 *zone)
 }
 
 static const struct nla_policy help_nla_policy[CTA_HELP_MAX+1] = {
-	[CTA_HELP_NAME]		= { .type = NLA_NUL_STRING },
+	[CTA_HELP_NAME]		= { .type = NLA_NUL_STRING,
+				    .len = NF_CT_HELPER_NAME_LEN - 1 },
 };
 
 static inline int
@@ -932,6 +933,8 @@ static const struct nla_policy ct_nla_policy[CTA_MAX+1] = {
 	[CTA_ID]		= { .type = NLA_U32 },
 	[CTA_NAT_DST]		= { .type = NLA_NESTED },
 	[CTA_TUPLE_MASTER]	= { .type = NLA_NESTED },
+	[CTA_NAT_SEQ_ADJ_ORIG]  = { .type = NLA_NESTED },
+	[CTA_NAT_SEQ_ADJ_REPLY] = { .type = NLA_NESTED },
 	[CTA_ZONE]		= { .type = NLA_U16 },
 	[CTA_MARK_MASK]		= { .type = NLA_U32 },
 };
@@ -2322,7 +2325,8 @@ static const struct nla_policy exp_nla_policy[CTA_EXPECT_MAX+1] = {
 	[CTA_EXPECT_MASK]	= { .type = NLA_NESTED },
 	[CTA_EXPECT_TIMEOUT]	= { .type = NLA_U32 },
 	[CTA_EXPECT_ID]		= { .type = NLA_U32 },
-	[CTA_EXPECT_HELP_NAME]	= { .type = NLA_NUL_STRING },
+	[CTA_EXPECT_HELP_NAME]	= { .type = NLA_NUL_STRING,
+				    .len = NF_CT_HELPER_NAME_LEN - 1 },
 	[CTA_EXPECT_ZONE]	= { .type = NLA_U16 },
 	[CTA_EXPECT_FLAGS]	= { .type = NLA_U32 },
 	[CTA_EXPECT_CLASS]	= { .type = NLA_U32 },
diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
index 61f9285..83876e9 100644
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -1353,6 +1353,8 @@ static const struct nla_policy tcp_timeout_nla_policy[CTA_TIMEOUT_TCP_MAX+1] = {
 	[CTA_TIMEOUT_TCP_TIME_WAIT]	= { .type = NLA_U32 },
 	[CTA_TIMEOUT_TCP_CLOSE]		= { .type = NLA_U32 },
 	[CTA_TIMEOUT_TCP_SYN_SENT2]	= { .type = NLA_U32 },
+	[CTA_TIMEOUT_TCP_RETRANS]	= { .type = NLA_U32 },
+	[CTA_TIMEOUT_TCP_UNACK]		= { .type = NLA_U32 },
 };
 #endif /* CONFIG_NF_CT_NETLINK_TIMEOUT */
 
-- 
1.7.10.4

^ permalink raw reply related

* Re: [RFC PATCH 2/2] tun: fix LSM/SELinux labeling of tun/tap devices
From: Michael S. Tsirkin @ 2012-12-04 17:36 UTC (permalink / raw)
  To: Paul Moore; +Cc: Jason Wang, netdev, linux-security-module, selinux
In-Reply-To: <8577392.82G063LYx2@sifl>

On Tue, Dec 04, 2012 at 11:18:57AM -0500, Paul Moore wrote:
> On Tuesday, December 04, 2012 09:24:43 PM Jason Wang wrote:
> > On Monday, December 03, 2012 11:22:29 AM Paul Moore wrote:
> > > It may be that I'm misunderstanding TUNSETQUEUE and/or TUNSETIFF.  Can you
> > > elaborate as to why they should be different?
> > 
> > If I understand correctly, before multiqueue patchset, TUNSETIFF is used to:
> > 
> > 1) Create the tun/tap network device
> > 2) For persistent device, re-attach the fd to the network device / socket.
> > In this case, we call selinux_tun_dev_attch() to relabel the socket sid (in
> > fact also the device's since the socket were persistent also) to the sid of
> > process that calls TUNSETIFF.
> > 
> > So, after the changes of multiqueue, we need try to preserve those policy.
> > The interesting part is the introducing of TUNSETQUEUE, it's used to attach
> > more file descriptors/sockets to a tun/tap device after at least one file
> > descriptor were attached to the tun/tap device through TUNSETIFF. So I
> > think maybe we need differ those two ioctls. This patch looks fine for
> > TUNSETQUEUE, but for TUNSETIFF, we need relabel the tunsec to the process
> > that calling TUNSETIFF for persistent device?
> 
> Okay, based on your explanation of TUNSETQUEUE, the steps below are what I 
> believe we need to do ... if you disagree speak up quickly please.
> 
> A. TUNSETIFF (new, non-persistent device)
> 
> [Allocate and initialize the tun_struct LSM state based on the calling 
> process, use this state to label the TUN socket.]
> 
> 1. Call security_tun_dev_create() which authorizes the action.
> 2. Call security_tun_dev_alloc_security() which allocates the tun_struct LSM 
> blob and SELinux sets some internal blob state to record the label of the 
> calling process.
> 3. Call security_tun_dev_attach() which sets the label of the TUN socket to 
> match the label stored in the tun_struct LSM blob during A2.  No authorization 
> is done at this point since the socket is new/unlabeled.

> B. TUNSETIFF (existing, persistent device)
> 
> [Relabel the existing tun_struct LSM state based on the calling process, use
> this state to label the TUN socket.]
> 
> 1. Attempt to relabel/reset the tun_struct LSM blob from the currently stored 
> value, set during A2, to the label of the current calling process. *** THIS IS 
> NOT CURRENTLY DONE IN THE RFC PATCH ***
> 2. Call security_tun_dev_attach() which sets the label of the TUN socket to
> match the label stored in the tun_struct LSM blob during B1. No authorization 
> is done at this point since the socket is new/unlabeled.
> 
> C. TUNSETQUEUE
> 
> [Use the existing tun_struct LSM state to label the new TUN socket.]
> 
> 1. Call security_tun_dev_attach() which sets the label of the TUN socket to 
> match the label stored in the tun_struct LSM blob set during either A2 or B1.  
> No authorization is done at this point since the socket is new/unlabeled.


Here's what bothers me. libvirt currently opens tun and passes
fd to qemu.
What would prevent qemu from attaching fd using TUNSETQUEUE
to another device it does not own?


> > btw. Current code does allow calling TUNSETQUEUE to a persistent tun/tap
> > device with no file attached. It should be a bug and need to be fixed.
> 
> Since you wrote that code will you be submitting a patch to fix that problem?
> 
> > > One thing that I think we probably should change is the relabelto/from
> > > permissions in the function above (selinux_tun_dev_attach()); in the case
> > > where the socket does not yet have a label, e.g. 'sksec->sid == 0', we
> > > should probably skip the relabel permissions since we want to assign the
> > > TUN device label regardless in this case.
> > 
> > I'm not familiar with the selinux, have a quick glance of the code, looks
> > like the label has been initialized to SECINITSID_KERNEL in
> > selinux_socket_post_create().
> 
> Unless I've missed something in your changes, the multiqueue code never calls 
> any socket code which ends up calling {security,selinux}_socket_post_create(); 
> I believe you only call sk_alloc() which ends up calling 
> {security,selinux}_sk_alloc() which sets SECINITSID_UNLABELED (I mistakenly 
> wrote 0 instead in my earlier email which is techincally SECSID_NULL).  Either 
> way, I still think the logic I originally described above is correct.
> 
> -- 
> paul moore
> security and virtualization @ redhat

^ permalink raw reply

* Re: [net-next PATCH V3-evictor] net: frag evictor, avoid killing warm frag queues
From: Jesper Dangaard Brouer @ 2012-12-04 17:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Florian Westphal, netdev, Thomas Graf,
	Paul E. McKenney, Cong Wang, Herbert Xu
In-Reply-To: <1354632447.1388.150.camel@edumazet-glaptop>

On Tue, 2012-12-04 at 06:47 -0800, Eric Dumazet wrote:
> On Tue, 2012-12-04 at 14:30 +0100, Jesper Dangaard Brouer wrote:
> > The fragmentation evictor system have a very unfortunate eviction
> > system for killing fragment, when the system is put under pressure.
> > 
> > If packets are coming in too fast, the evictor code kills "warm"
> > fragments too quickly.  Resulting in close to zero throughput, as
> > fragments are killed before they have a chance to complete
> > 
> > This is related to the bad interaction with the LRU (Least Recently
> > Used) list.  Under load the LRU list sort-of changes meaning/behavior.
> > When the LRU head is very new/warm, then the head is most likely the
> > one with most fragments and the tail (latest used or added element)
> > with least.
> > 
> > Solved by, introducing a creation "jiffie" timestamp (creation_ts).
> > If the element is tried evicted in same jiffie, then perform tail drop
> > on the LRU list instead.
> > 
> > Signed-off-by: Jesper Dangaard Brouer <jbrouer@redhat.com>

First of all, this patch is not the perfect thing, its a starting point
of a discussion to find a better solution.


> This would only 'work' if a reassembled packet can be done/completed
> under one jiffie.

True, and I'm not happy with this resolution.  It's only purpose is to
help me detect when the LRU list is reversing it functionality. 

This is the *only* message I'm trying to convey:

    **The LRU list is misbehaving** (in this situation)


Perhaps the best option is to implement something else than a LRU... I
just haven't found the correct replacement/idea yet.


> For 64KB packets, this means 100Mb link wont be able to deliver a
> reassembled packet under IP frags load if HZ=1000

True, the 1 jiffie check should be increased, but that's not the point.
(Also I make no promise of fairness, I hope we can address this fairness
issues in a later patch, perhaps in combination with replacing the LRU).


(Notice: I have run tests with higher high_thresh/low_thresh values, the
results are the same)


> LRU goal is to be able to select the oldest inet_frag_queue, because in
> typical networks, packet losses are really happening and this is why
> some packets wont complete their reassembly. They naturally will be
> found on LRU head, and they probably are very fat (for example a single
> packet was lost for the inet_frag_queue)

Look at what is happening in inet_frag_evictor(), when we are under
load.  We will quickly delete all the oldest inet_frag_queue, you are
talking about.  After which the LRU list will be filled with what? Only
new fragments.  

Think about that is the order of this list, now?  Remember it only
contains incomplete inet_frag_queue's.

My theory, prove me wrong, is when the LRU head is very new/warm, then
the head is most likely the one with most fragments and the tail (latest
used or added element) with the least fragments.


> Choosing the most recent inet_frag_queue is exactly the opposite
> strategy. We pay the huge cost of maintaining a central LRU, and we
> exactly misuse it.

Then the LRU list is perhaps is the wrong choice?

> As long as an inet_frag_queue receives new fragments and is moved to the
> LRU tail, its a candidate for being kept, not a candidate for being
> evicted.

Remember I have shown/proven that all inet_frag_queue's in the list
have been touched within 1 jiffie.  Which one do you choose for removal?

(Also remember if an inet_frag_queue looses one frame, on the network
layer, it will not complete, and after 1 jiffie it will be killed by the
evictor.  So, this function still "works")


> Only when an inet_frag_queue is the oldest one, it becomes a candidate
> for eviction.
> 
> I think you are trying to solve a configuration/tuning problem by
> changing a valid strategy.
> 
> Whats wrong with admitting high_thresh/low_thresh default values should
> be updated, now some people apparently want to use IP fragments in
> production ?

I'm not against increasing the high_thresh/low_thresh default values.
I have tested with your 4MB/3MB settings (and 40/39, and 400/399).  The
results are (almost) the same, its not the problem!  I have shown you
several test results already (added some extra tests below)
And yes, the high_thresh/low_thresh default values should be increased,
I just don't want to discuss how much.

I want to discuss the correctness of the evictor and LRU.  You are
trying to avoid calling the evictor code; you cannot, assuming a queing
system, where packets are arriving at a higher rate than you can
process.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


p.s. I'm working on getting a better interleaving of the fragments. I'm
depending on running netperf generators on different CPUs.  I tried
adding a SFQ qdisc, but no egress queueing occurred, so it didn't have
any effect.

RAW tests, with different high_thresh/low_thresh:
-------------------------------------------------
I'm extracting the "FRAG: inuse X memory YYYYYY" with the command:
 [root@dragon ~]# for pid in `ps aux | grep [n]etserver | awk '{print $2}' | tr '\n' ' '`; do echo -e "\nNetserver PID:$pid"; egrep -e 'UDP|FRAG' /proc/$pid/net/sockstat ; done

Default net-next kernel with out patches:

[root@dragon ~]# uname -a
Linux dragon 3.7.0-rc6-net-next+ #47 SMP Thu Nov 22 00:06:12 CET 2012 x86_64 x86_64 x86_64 GNU/Linux

----------------------
[root@dragon ~]# grep . /proc/sys/net/ipv4/ipfrag_*_thresh
/proc/sys/net/ipv4/ipfrag_high_thresh:262144
/proc/sys/net/ipv4/ipfrag_low_thresh:196608

FRAG: inuse 4 memory 245152

[jbrouer@firesoul ~]$ netperf -H 192.168.51.2 -T0,0 -t UDP_STREAM -l 20 & netperf -p 1337 -H 192.168.31.2 -T7,7 -t UDP_STREAM -l 20
[1] 10580
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.31.2 (192.168.31.2) port 0 AF_INET : cpu bind
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.51.2 (192.168.51.2) port 0 AF_INET : cpu bind
Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992   65507   20.00      353279      0    9256.89
212992           20.00       10768            282.15

Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992   65507   20.00      350801      0    9191.95
212992           20.00        7283            190.83

------------------------


[root@dragon ~]# sysctl -w net/ipv4/ipfrag_high_thresh=$(((1024**2*4)))
net.ipv4.ipfrag_high_thresh = 4194304
[root@dragon ~]# sysctl -w net/ipv4/ipfrag_low_thresh=$(((1024**2*3)))
net.ipv4.ipfrag_low_thresh = 3145728

FRAG: inuse 41 memory 3867784

[jbrouer@firesoul ~]$ netperf -H 192.168.51.2 -T0,0 -t UDP_STREAM -l 20 & netperf -p 1337 -H 192.168.31.2 -T7,7 -t UDP_STREAM -l 20
[1] 10882
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.31.2 (192.168.31.2) port 0 AF_INET : cpu bind
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.51.2 (192.168.51.2) port 0 AF_INET : cpu bind
Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992   65507   20.00      353379      0    9259.50
212992           20.00       48986           1283.57

Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992   65507   20.00      350488      0    9183.75
212992           20.00       33336            873.50

-------------------------

[root@dragon ~]# sysctl -w net/ipv4/ipfrag_high_thresh=$(((1024**2*40)))
net.ipv4.ipfrag_high_thresh = 41943040
[root@dragon ~]# sysctl -w net/ipv4/ipfrag_low_thresh=$(((1024**2*39)))
net.ipv4.ipfrag_low_thresh = 40894464

FRAG: inuse 442 memory 41693008

[jbrouer@firesoul ~]$ netperf -H 192.168.51.2 -T0,0 -t UDP_STREAM -l 20 & netperf -p 1337 -H 192.168.31.2 -T7,7 -t UDP_STREAM -l 20
[1] 10899
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.31.2 (192.168.31.2) port 0 AF_INET : cpu bind
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.51.2 (192.168.51.2) port 0 AF_INET : cpu bind
Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992   65507   20.00      353097      0    9252.10
212992           20.00       38281           1003.07

Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992   65507   20.00      351708      0    9215.72
212992           20.00       23602            618.44


---------------------------

[root@dragon ~]# sysctl -w net/ipv4/ipfrag_high_thresh=$(((1024**2*400)))
net.ipv4.ipfrag_high_thresh = 419430400
[root@dragon ~]# sysctl -w net/ipv4/ipfrag_low_thresh=$(((1024**2*399)))
net.ipv4.ipfrag_low_thresh = 418381824

FRAG: inuse 4665 memory 418760600

[jbrouer@firesoul ~]$ netperf -H 192.168.51.2 -T0,0 -t UDP_STREAM -l 20 & netperf -p 1337 -H 192.168.31.2 -T7,7 -t UDP_STREAM -l 20
[2] 10918
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.31.2 (192.168.31.2) port 0 AF_INET : cpu bind
UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.51.2 (192.168.51.2) port 0 AF_INET : cpu bind
Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992   65507   20.00      352255      0    9230.05
212992           20.00       28048            734.94

Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992   65507   20.00      349842      0    9166.83
212992           20.00       20979            549.71

^ permalink raw reply

* Re: [PATCH 0/6] netfilter updates for net-next
From: David Miller @ 2012-12-04 18:02 UTC (permalink / raw)
  To: pablo; +Cc: netfilter-devel, netdev
In-Reply-To: <1354642293-4114-1-git-send-email-pablo@netfilter.org>

From: pablo@netfilter.org
Date: Tue,  4 Dec 2012 18:31:27 +0100

> You can pull these changes from:
> 
> git://1984.lsi.us.es/nf-next master

Pulled, thanks.

^ permalink raw reply

* Re: [PATCH 0/7] qlcnic: refactor 82xx adapter driver
From: David Miller @ 2012-12-04 18:02 UTC (permalink / raw)
  To: sony.chacko; +Cc: netdev, Dept_NX_Linux_NIC_Driver
In-Reply-To: <1354628038-2234-1-git-send-email-sony.chacko@qlogic.com>

From: Sony Chacko <sony.chacko@qlogic.com>
Date: Tue,  4 Dec 2012 08:33:51 -0500

> From: Sony Chacko <sony.chacko@qlogic.com>
> 
> Please apply the refactoring patches to net-next.

Series applied.

^ permalink raw reply

* Re: [PATCH 1/5 net-next v2] tg3: Fix inconsistent locking for tg3_netif_start().
From: David Miller @ 2012-12-04 18:02 UTC (permalink / raw)
  To: mchan; +Cc: netdev, nsujir, richardcochran
In-Reply-To: <1354599420-3589-1-git-send-email-mchan@broadcom.com>

All 5 patches applied, thanks.

^ permalink raw reply

* Re: [Patch net-next] xfrm: add missing xfrm message types to selinux perm table
From: David Miller @ 2012-12-04 18:02 UTC (permalink / raw)
  To: amwang; +Cc: netdev, steffen.klassert, herbert
In-Reply-To: <1354599571-6181-1-git-send-email-amwang@redhat.com>

From: Cong Wang <amwang@redhat.com>
Date: Tue,  4 Dec 2012 13:39:31 +0800

> Cc: Steffen Klassert <steffen.klassert@secunet.com>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: "David S. Miller" <davem@davemloft.net>
> Signed-off-by: Cong Wang <amwang@redhat.com>

Steffen will you take this into your IPSEC tree?

Thanks.

^ permalink raw reply

* Re: [PATCH net-next 0/7] Allow to monitor multicast cache event via rtnetlink
From: David Miller @ 2012-12-04 18:09 UTC (permalink / raw)
  To: nicolas.dichtel; +Cc: netdev
In-Reply-To: <1354619621-16016-1-git-send-email-nicolas.dichtel@6wind.com>

From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Date: Tue,  4 Dec 2012 12:13:34 +0100

> The goal of this serie is to be able to monitor multicast activities via
> rtnetlink.
> 
> The main changes are:
>  - when user dumps mfc entries it now get all entries, included the unresolved
>    cache.
>  - kernel sends rtnetlink when it adds/deletes mfc entries.
> 
> As usual, the patch against iproute2 will be sent once the patches are included and
> net-next merged. I can send it on demand.

This looks good, applied, thanks Nicolas.

The one thing I worry about are those 64-bit statistics.  I fear that they
not be 64-bit aligned in the final netlink message.  This matters on cpus
that trap on unaligned loads/stores, such as sparc and MIPS.

Can you validate this?

^ permalink raw reply

* Re: [PATCH net-next] openvswitch: Use eth_mac_addr() instead of duplicating it
From: Jesse Gross @ 2012-12-04 18:11 UTC (permalink / raw)
  To: Thomas Graf
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <20121203221732.GA14494-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>

On Mon, Dec 3, 2012 at 2:17 PM, Thomas Graf <tgraf-G/eBtMaohhA@public.gmane.org> wrote:
>
> bonus: if we ever are to use IFF_LIVE_ADDR_CHANGE for
> anything further than to check availability in eth_mac_addr(),
> Open vSwitch will be ready for that.
>
> Signed-off-by: Thomas Graf <tgraf-G/eBtMaohhA@public.gmane.org>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] openvswitch: Avoid useless holes in struct vport
From: Jesse Gross @ 2012-12-04 18:11 UTC (permalink / raw)
  To: Thomas Graf
  Cc: dev-yBygre7rU0TnMu66kgdUjQ, netdev-u79uwXL29TY76Z2rM5mHXA,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q
In-Reply-To: <20121203222432.GB14494-FZi0V3Vbi30CUdFEqe4BF2D2FQJk+8+b@public.gmane.org>

On Mon, Dec 3, 2012 at 2:24 PM, Thomas Graf <tgraf-G/eBtMaohhA@public.gmane.org> wrote:
>
> Having the 16bit port_no in between a set of pointers creates
> an unwanted and useless hole in the struct.
>
> Signed-off-by: Thomas Graf <tgraf-G/eBtMaohhA@public.gmane.org>

Applied.

^ permalink raw reply

* Re: [PATCH net-next 3/7] ipv6: improve ipv6_find_hdr() to skip empty routing headers
From: Jesse Gross @ 2012-12-04 18:15 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: David Miller, netdev, dev, Ansis Atteka
In-Reply-To: <20121203180611.GA20305@1984>

On Mon, Dec 3, 2012 at 10:06 AM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Mon, Dec 03, 2012 at 09:28:55AM -0800, Jesse Gross wrote:
>> On Mon, Dec 3, 2012 at 6:04 AM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>> > On Thu, Nov 29, 2012 at 10:35:45AM -0800, Jesse Gross wrote:
>> >> @@ -159,9 +162,10 @@ int ipv6_find_hdr(const struct sk_buff *skb, unsigned int *offset,
>> >>       }
>> >>       len = skb->len - start;
>> >>
>> >> -     while (nexthdr != target) {
>> >
>> > If the offset is set as parameter via ipv6_find_hdr, we now are always
>> > entering the loop even if we found the target header we're looking
>> > for, before that didn't happen.
>> >
>> > Something seems wrong here to me.
>>
>> If the target header is a routing header then you might still need to
>> continue searching because the first one that you see could be empty.
>
> OK, but if it's not a routing header what we're searching for (which
> seems to be the case of netfilter/IPVS) we waste way more cycles on
> copying the IPv6 header again and with way more things that are
> completely useless.

We could add a check to short circuit this but it seems like a
premature optimization to me.

Ansis, can you comment?

^ permalink raw reply

* Re: [RFC PATCH 2/2] tun: fix LSM/SELinux labeling of tun/tap devices
From: Paul Moore @ 2012-12-04 18:17 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Jason Wang, netdev, linux-security-module, selinux
In-Reply-To: <20121204173625.GA13993@redhat.com>

On Tuesday, December 04, 2012 07:36:26 PM Michael S. Tsirkin wrote:
> On Tue, Dec 04, 2012 at 11:18:57AM -0500, Paul Moore wrote:
> > Okay, based on your explanation of TUNSETQUEUE, the steps below are what I
> > believe we need to do ... if you disagree speak up quickly please.
> > 
> > A. TUNSETIFF (new, non-persistent device)
> > 
> > [Allocate and initialize the tun_struct LSM state based on the calling
> > process, use this state to label the TUN socket.]
> > 
> > 1. Call security_tun_dev_create() which authorizes the action.
> > 2. Call security_tun_dev_alloc_security() which allocates the tun_struct
> > LSM blob and SELinux sets some internal blob state to record the label of
> > the calling process.
> > 3. Call security_tun_dev_attach() which sets the label of the TUN socket
> > to match the label stored in the tun_struct LSM blob during A2.  No
> > authorization is done at this point since the socket is new/unlabeled.
> > 
> > B. TUNSETIFF (existing, persistent device)
> > 
> > [Relabel the existing tun_struct LSM state based on the calling process,
> > use this state to label the TUN socket.]
> > 
> > 1. Attempt to relabel/reset the tun_struct LSM blob from the currently
> > stored value, set during A2, to the label of the current calling process.
> > *** THIS IS NOT CURRENTLY DONE IN THE RFC PATCH ***
> > 2. Call security_tun_dev_attach() which sets the label of the TUN socket
> > to match the label stored in the tun_struct LSM blob during B1. No
> > authorization is done at this point since the socket is new/unlabeled.
> > 
> > C. TUNSETQUEUE
> > 
> > [Use the existing tun_struct LSM state to label the new TUN socket.]
> > 
> > 1. Call security_tun_dev_attach() which sets the label of the TUN socket
> > to match the label stored in the tun_struct LSM blob set during either A2
> > or B1. No authorization is done at this point since the socket is
> > new/unlabeled.
>
> Here's what bothers me. libvirt currently opens tun and passes
> fd to qemu. What would prevent qemu from attaching fd using TUNSETQUEUE
> to another device it does not own?

True, assuming all the above is correct and that I'm understanding it 
correctly (Jason?), we should probably add a new SELinux access control for 
TUNSETQUEUE.

The current DAC code exists in tun_not_capable().

-- 
paul moore
security and virtualization @ redhat


^ permalink raw reply

* Re: [GIT PULL] Remove __dev* markings from the networking drivers
From: David Miller @ 2012-12-04 18:17 UTC (permalink / raw)
  To: gregkh; +Cc: netdev, wfp5p
In-Reply-To: <20121203.153623.958379034481583512.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Mon, 03 Dec 2012 15:36:23 -0500 (EST)

> From: Greg KH <gregkh@linuxfoundation.org>
> Date: Mon, 3 Dec 2012 12:27:43 -0800
> 
>> Here's a pull request for all of Bill's __dev* removal patches for the
>> networking drivers, based on net-next.  I'll send out a separate one for
>> the wireless drivers in case those want to go through John's tree
>> 
>> If I should rework these in any way, please let me know.
> 
> Pulled, thanks Greg.

I had a reason to go look in detail at the changes in here.

It seemse the function declarations were not properly reformatted
after the __dev* tags were removed.  You can't just search and replace
this kind of stuff.  The result looks terrible.

Greg please check for things like this next time you send me changes
written by someone else.

Thanks.

^ permalink raw reply

* Re: [PATCH -next] net: neterion: use for_each_pci_dev to simplify the code
From: David Miller @ 2012-12-04 18:20 UTC (permalink / raw)
  To: weiyj.lk; +Cc: jdmason, yongjun_wei, netdev
In-Reply-To: <CAPgLHd-W5Lpa7EPFMit6Go+zt6ZrsV9dhEg9B2eQDxFCzau0Zg@mail.gmail.com>

From: Wei Yongjun <weiyj.lk@gmail.com>
Date: Tue, 4 Dec 2012 00:05:13 -0500

> From: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
> 
> Use for_each_pci_dev to simplify the code.
> 
> Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH] net/macb: increase RX buffer size for GEM
From: David Miller @ 2012-12-04 18:22 UTC (permalink / raw)
  To: nicolas.ferre; +Cc: netdev, linux-arm-kernel, linux-kernel, manabian, plagnioj
In-Reply-To: <1354536943-6356-1-git-send-email-nicolas.ferre@atmel.com>

From: Nicolas Ferre <nicolas.ferre@atmel.com>
Date: Mon, 3 Dec 2012 13:15:43 +0100

> Macb Ethernet controller requires a RX buffer of 128 bytes. It is
> highly sub-optimal for Gigabit-capable GEM that is able to use
> a bigger DMA buffer. Change this constant and associated macros
> with data stored in the private structure.
> I also kept the result of buffers per page calculation to lower the
> impact of this move to a variable rx buffer size on rx hot path.
> RX DMA buffer size has to be multiple of 64 bytes as indicated in
> DMA Configuration Register specification.
> 
> Signed-off-by: Nicolas Ferre <nicolas.ferre@atmel.com>

This looks like it will waste a couple hundred bytes for 1500 MTU
frames, am I right?

^ permalink raw reply

* Re: [PATCH net-next v2] bridge: export multicast database via netlink
From: David Miller @ 2012-12-04 18:25 UTC (permalink / raw)
  To: amwang; +Cc: netdev, bridge, herbert, shemminger, tgraf, brouer
In-Reply-To: <1354539824-7898-1-git-send-email-amwang@redhat.com>

From: Cong Wang <amwang@redhat.com>
Date: Mon,  3 Dec 2012 21:03:43 +0800

> TODO: remove debugging printk's

Can you please take care of this so we can consider this patch
seriously for inclusion?

Thanks.

^ permalink raw reply

* Re: [PATCH] dev_change_net_namespace: send a KOBJ_REMOVED/KOBJ_ADD
From: David Miller @ 2012-12-04 18:26 UTC (permalink / raw)
  To: serge.hallyn; +Cc: linux-kernel, netdev, ebiederm, dlezcano
In-Reply-To: <20121204021712.GA10268@sergelap>

From: Serge Hallyn <serge.hallyn@canonical.com>
Date: Mon, 3 Dec 2012 20:17:12 -0600

> When a new nic is created in namespace ns1, the kernel sends a KOBJ_ADD uevent
> to ns1.  When the nic is moved to ns2, we only send a KOBJ_MOVE to ns2, and
> nothing to ns1.
> 
> This patch changes that behavior so that when moving a nic from ns1 to ns2, we
> send a KOBJ_REMOVED to ns1 and KOBJ_ADD to ns2.  (The KOBJ_MOVE is still
> sent to ns2).
> 
> The effects of this can be seen when starting and stopping containers in
> an upstart based host.  Lxc will create a pair of veth nics, the kernel
> sends KOBJ_ADD, and upstart starts network-instance jobs for each.  When
> one nic is moved to the container, because no KOBJ_REMOVED event is
> received, the network-instance job for that veth never goes away.  This
> was reported at https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1065589
> With this patch the networ-instance jobs properly go away.
> 
> The other oddness solved here is that if a nic is passed into a running
> upstart-based container, without this patch no network-instance job is
> started in the container.  But when the container creates a new nic
> itself (ip link add new type veth) then network-interface jobs are
> created.  With this patch, behavior comes in line with a regular host.
> 
> v2: also send KOBJ_ADD to new netns.  There will then be a
> _MOVE event from the device_rename() call, but that should
> be innocuous.
> 
> Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH] ip6mr: fix rtm_family of rtnl msg
From: David Miller @ 2012-12-04 18:27 UTC (permalink / raw)
  To: nicolas.dichtel; +Cc: netdev, kaber
In-Reply-To: <1354618909-15631-1-git-send-email-nicolas.dichtel@6wind.com>

From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Date: Tue,  4 Dec 2012 12:01:49 +0100

> We talk about IPv6, hence the family is RTNL_FAMILY_IP6MR!
> rtnl_register() is already called with RTNL_FAMILY_IP6MR.
> 
> The bug is here since the beginning of this function (commit 5b285cac3570).
> 
> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>

Applied, thanks.

^ permalink raw reply

* [PATCH v2] net: phy: smsc: force all capable mode if the phy is started in powerdown mode
From: Philippe Reynes @ 2012-12-04 18:52 UTC (permalink / raw)
  To: netdev, linux-kernel, davem
  Cc: otavio, javier, jkosina, eric.jarrige, julien.boibessot,
	thomas.petazzoni, Philippe Reynes

A SMSC PHY in power down mode can't be used.
If a SMSC PHY is in this mode in the config_init
stage, the mode "all capable" is set. So the PHY
could then be used.

Signed-off-by: Philippe Reynes <tremyfr@yahoo.fr>
---
Difference between v1 and v2:
- move comment after first block
- update comment to network rules

 drivers/net/phy/smsc.c  |   26 +++++++++++++++++++++++++-
 include/linux/smscphy.h |    5 +++++
 2 files changed, 30 insertions(+), 1 deletions(-)

diff --git a/drivers/net/phy/smsc.c b/drivers/net/phy/smsc.c
index 88e3991..ad4fa0e 100644
--- a/drivers/net/phy/smsc.c
+++ b/drivers/net/phy/smsc.c
@@ -43,7 +43,31 @@ static int smsc_phy_ack_interrupt(struct phy_device *phydev)
 
 static int smsc_phy_config_init(struct phy_device *phydev)
 {
-	int rc = phy_read(phydev, MII_LAN83C185_CTRL_STATUS);
+	int rc = phy_read(phydev, MII_LAN83C185_SPECIAL_MODES);
+	if (rc < 0)
+		return rc;
+
+	/* If the SMSC PHY is in power down mode, then set it
+	 * in all capable mode before using it.
+	 */
+	if ((rc & MII_LAN83C185_MODE_MASK) == MII_LAN83C185_MODE_POWERDOWN) {
+		int timeout = 50000;
+
+		/* set "all capable" mode and reset the phy */
+		rc |= MII_LAN83C185_MODE_ALL;
+		phy_write(phydev, MII_LAN83C185_SPECIAL_MODES, rc);
+		phy_write(phydev, MII_BMCR, BMCR_RESET);
+
+		/* wait end of reset (max 500 ms) */
+		do {
+			udelay(10);
+			if (timeout-- == 0)
+				return -1;
+			rc = phy_read(phydev, MII_BMCR);
+		} while (rc & BMCR_RESET);
+	}
+
+	rc = phy_read(phydev, MII_LAN83C185_CTRL_STATUS);
 	if (rc < 0)
 		return rc;
 
diff --git a/include/linux/smscphy.h b/include/linux/smscphy.h
index ce718cb..f4bf16e 100644
--- a/include/linux/smscphy.h
+++ b/include/linux/smscphy.h
@@ -4,6 +4,7 @@
 #define MII_LAN83C185_ISF 29 /* Interrupt Source Flags */
 #define MII_LAN83C185_IM  30 /* Interrupt Mask */
 #define MII_LAN83C185_CTRL_STATUS 17 /* Mode/Status Register */
+#define MII_LAN83C185_SPECIAL_MODES 18 /* Special Modes Register */
 
 #define MII_LAN83C185_ISF_INT1 (1<<1) /* Auto-Negotiation Page Received */
 #define MII_LAN83C185_ISF_INT2 (1<<2) /* Parallel Detection Fault */
@@ -22,4 +23,8 @@
 #define MII_LAN83C185_EDPWRDOWN (1 << 13) /* EDPWRDOWN */
 #define MII_LAN83C185_ENERGYON  (1 << 1)  /* ENERGYON */
 
+#define MII_LAN83C185_MODE_MASK      0xE0
+#define MII_LAN83C185_MODE_POWERDOWN 0xC0 /* Power Down mode */
+#define MII_LAN83C185_MODE_ALL       0xE0 /* All capable mode */
+
 #endif /* __LINUX_SMSCPHY_H__ */
-- 
1.7.4.4

^ permalink raw reply related

* Re: [RFT PATCH] 8139cp: properly support change of MTU values [v2]
From: John Greene @ 2012-12-04 19:04 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Francois Romieu, David Miller, netdev
In-Reply-To: <1354635895.24281.144.camel@shinybook.infradead.org>

On 12/04/2012 10:44 AM, David Woodhouse wrote:
> On Mon, 2012-12-03 at 21:46 +0100, Francois Romieu wrote:
>> David Miller <davem@davemloft.net> :
>> [...]
>>> I've applied this to net-next, if it triggers any problems we have
>>> some time to work it out before 3.8 is released.
>>
>> I have bounced the messages to David Woodhouse since he authored the
>> last 8139cp changes in net-next and owns the hardware to notice
>> regressions.
>
> Thanks.
>
> This almost works. The patch itself is fine, but the device can't
> receive packets larger than 2266 bytes (ping -s 2238). After that, I get
> rx_fifo errors. I think the RX FIFO is only 2KiB on the 8139cp, isn't
> it? So after that it's dependent on how fast it can shovel it out across
> the PCI bus. Which is "not fast" in this case.
>
> Transmit appears to be fine.
>
Checked the datasheet (admittedly old v1.5 12/6/2001): yes FIFOs are 2k 
on both Rx and Tx.  I need to check this on the emulator again, but it 
didn't fail with pings up to nearly 9000 bytes, apparently a difference 
of real vs. virtual hardware (perhaps an interesting science experiment 
to adjust MTU to what the underlying hardware does, but not today ;)

Still, this does fix reported problem that driver could be set to huge 
MTU erroneously, and now rejects really weird values as it should.

Thanks for the test, David.

Anything to add/subtract?

Francois: I noted your follow on patch, find merit in it as well.  I 
need to digest it more fully and expect it should be a follow up to this 
barring any other issues from me: appreciate your help also!



-- 
John Greene
jogreene@redhat.com

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox