Netdev List


* [PATCH 7/7] netfilter: add user-space connection tracking helper infrastructure
From: pablo @ 2012-06-04 12:21 UTC
  To: netfilter-devel; +Cc: netdev
In-Reply-To: <1338812485-4232-1-git-send-email-pablo@netfilter.org>

From: Pablo Neira Ayuso <pablo@netfilter.org>

There are good reasons to support helpers in user-space instead of
kernel-space:

* Rapid connection tracking helper development, as developing code
  in user-space is usually faster.

* Reliability: A buggy helper does not crash the kernel. Moreover,
  we can monitor the helper process and restart it in case of problems.

* Security: Avoid complex string matching and mangling in kernel-space
  running in privileged mode. Going further, we can even think about
  running user-space helpers as a non-root process.

* It allows the development of very specific helpers (most likely for
  non-standard proprietary protocols) that are very likely to be rejected
  for mainline inclusion in the form of kernel-space connection tracking
  helpers.

This patch adds the infrastructure to allow the implementation of
user-space conntrack helpers by means of the new nfnetlink subsystem
`nfnetlink_cthelper' and the existing queueing infrastructure
(nfnetlink_queue).
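
As a rough illustration of how the two subsystems meet: the kernel-side
stub that nfnetlink_cthelper registers for each user-space helper returns
a queue verdict instead of inspecting the packet itself. The sketch below
re-declares the verdict constants from linux/netfilter.h so that it builds
outside the kernel; the three function names are made up for this example
and are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Verdict encoding, re-declared here from linux/netfilter.h. */
#define NF_DROP				0u
#define NF_ACCEPT			1u
#define NF_QUEUE			3u
#define NF_VERDICT_MASK			0x000000ffu
#define NF_VERDICT_FLAG_QUEUE_BYPASS	0x00008000u
#define NF_VERDICT_QMASK		0xffff0000u
#define NF_QUEUE_NR(x)	((((x) << 16) & NF_VERDICT_QMASK) | NF_QUEUE)

/* Hypothetical name: builds the verdict that the kernel-side stub
 * returns for a helper bound to queue_num. The bypass flag asks the
 * kernel to let the packet through if nobody listens on that queue. */
static uint32_t userspace_helper_verdict(uint16_t queue_num)
{
	return NF_QUEUE_NR((uint32_t)queue_num) | NF_VERDICT_FLAG_QUEUE_BYPASS;
}

/* Hypothetical name: mirrors the condition used by ipv[4|6]_helper to
 * decide whether to log "dropping packet". Queue verdicts are not drops. */
static bool verdict_is_logged_as_drop(uint32_t verdict)
{
	return verdict != NF_ACCEPT &&
	       (verdict & NF_VERDICT_MASK) != NF_QUEUE;
}

/* Hypothetical name: extracts the queue number from a queue verdict. */
static uint16_t verdict_queue_num(uint32_t verdict)
{
	assert((verdict & NF_VERDICT_MASK) == NF_QUEUE);
	return (uint16_t)(verdict >> 16);
}
```

With queue number 5, userspace_helper_verdict() yields 0x00058003: the
queue number in the upper 16 bits, the bypass flag in bit 15 and NF_QUEUE
in the low byte.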

I had to add the new hook priorities NF_IP_PRI_CONNTRACK_HELPER and
NF_IP6_PRI_CONNTRACK_HELPER to register ipv[4|6]_helper, which results
from splitting ipv[4|6]_confirm into two pieces. This change is required
so as not to break NAT sequence adjustment and conntrack confirmation
for traffic that is enqueued to our user-space conntrack helpers.
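
Why the split matters can be sketched with a toy model of hook traversal.
It assumes that a packet reinjected with an accept verdict resumes at the
hook after the one that queued it; the types and function names below are
invented for this example and greatly simplify the real netfilter core.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

enum verdict { V_ACCEPT, V_QUEUE };

/* Counts how often each stage has processed the packet. */
struct packet {
	int helped;
	int confirmed;
};

struct hook {
	int priority;
	enum verdict (*fn)(struct packet *);
};

/* Stand-in for ipv4_helper: hands the packet to user-space. */
static enum verdict helper_stage(struct packet *p)
{
	p->helped++;
	return V_QUEUE;
}

/* Stand-in for ipv4_confirm: sequence adjustment and confirmation. */
static enum verdict confirm_stage(struct packet *p)
{
	p->confirmed++;
	return V_ACCEPT;
}

/* Hooks sorted by ascending priority, as in the POST_ROUTING chain. */
static const struct hook post_routing[] = {
	{ 300,     helper_stage  },	/* NF_IP_PRI_CONNTRACK_HELPER */
	{ INT_MAX, confirm_stage },	/* NF_IP_PRI_CONNTRACK_CONFIRM */
};
#define N_HOOKS (sizeof(post_routing) / sizeof(post_routing[0]))

/* Runs hooks starting at index start. Returns the index of the hook
 * that queued the packet, or N_HOOKS if every hook accepted it. */
static size_t run_hooks(struct packet *p, size_t start)
{
	size_t i;

	for (i = start; i < N_HOOKS; i++)
		if (post_routing[i].fn(p) == V_QUEUE)
			return i;
	return N_HOOKS;
}

/* Assumption: reinjection continues with the next hook in the chain. */
static size_t reinject(struct packet *p, size_t queued_at)
{
	assert(queued_at < N_HOOKS);
	return run_hooks(p, queued_at + 1);
}
```

In this model the confirm stage still runs exactly once after
reinjection. Had helper and confirm remained a single hook function, the
code following the helper call would never run for a queued packet.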

Basic operation, in a few steps:

1) Register user-space helper by means of `nfct':

 nfct helper add ftp inet

 [ It must be a valid existing helper supported by conntrack-tools. ]

2) Add rules to enable the FTP user-space helper which is
   used to track traffic going to TCP port 21.

For locally generated packets:

 iptables -I OUTPUT -t raw -p tcp --dport 21 -j CT --helper ftp

For non-locally generated packets:

 iptables -I PREROUTING -t raw -p tcp --dport 21 -j CT --helper ftp

3) Run the test conntrackd in helper mode (see example files under
   doc/helper/conntrackd.conf):

 conntrackd

4) Generate some FTP traffic. If everything is OK, conntrackd
   should create expectations (you can check that with `conntrack'):

 conntrack -E expect

    [NEW] 301 proto=6 src=192.168.1.136 dst=130.89.148.12 sport=0 dport=54037 mask-src=255.255.255.255 mask-dst=255.255.255.255 sport=0 dport=65535 master-src=192.168.1.136 master-dst=130.89.148.12 sport=57127 dport=21 class=0 helper=ftp
[DESTROY] 301 proto=6 src=192.168.1.136 dst=130.89.148.12 sport=0 dport=54037 mask-src=255.255.255.255 mask-dst=255.255.255.255 sport=0 dport=65535 master-src=192.168.1.136 master-dst=130.89.148.12 sport=57127 dport=21 class=0 helper=ftp

This confirms that our test helper is receiving packets including the
conntrack information, and adding expectations in kernel-space.

The user-space helper can also store its private tracking information
in the conntrack structure in the kernel via the CTA_HELP_INFO
attribute. The kernel treats this as a binary blob whose layout is
unknown to it. The blob is included in the information that is
transferred to user-space via glue code that integrates nfnetlink_queue
and ctnetlink.
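
The contract for that blob can be sketched as a fixed-length byte store:
the helper declares a length once (NFCTH_PRIV_DATA_LEN) and every later
read or write moves exactly that many bytes. Everything below is a
user-space model with invented names. The length check on write is a
defensive choice of this sketch, not something taken from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PRIV_BLOB_MAX 32	/* hypothetical upper bound for this sketch */

/* Models the per-conntrack storage: the kernel side only knows the
 * length, never the layout. */
struct opaque_store {
	size_t data_len;	/* declared once at helper registration */
	unsigned char data[PRIV_BLOB_MAX];
};

/* Hypothetical private state of a user-space helper. */
struct example_priv {
	uint32_t last_seq;
	uint8_t dir;
	uint8_t pad[3];
};

/* Copies a blob in. Fails if the helper declared no private data or
 * if the caller's length does not match the declared one. */
static int store_set(struct opaque_store *s, const void *blob, size_t len)
{
	if (s->data_len == 0 || len != s->data_len)
		return -1;
	memcpy(s->data, blob, len);
	return 0;
}

/* Copies the blob back out, unchanged and uninterpreted. */
static int store_get(const struct opaque_store *s, void *blob, size_t len)
{
	if (s->data_len == 0 || len != s->data_len)
		return -1;
	memcpy(blob, s->data, len);
	return 0;
}
```

Only the user-space helper interprets the bytes, so the layout of its
private structure can change without touching the kernel, as long as the
declared length is kept in sync.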

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/linux/netfilter/Kbuild                 |    1 +
 include/linux/netfilter/nfnetlink.h            |    3 +-
 include/linux/netfilter/nfnetlink_cthelper.h   |   55 ++
 include/linux/netfilter_ipv4.h                 |    1 +
 include/linux/netfilter_ipv6.h                 |    1 +
 include/net/netfilter/nf_conntrack_helper.h    |   11 +
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c |   56 +-
 net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c |   56 +-
 net/netfilter/Kconfig                          |    8 +
 net/netfilter/Makefile                         |    1 +
 net/netfilter/nf_conntrack_helper.c            |   24 +-
 net/netfilter/nfnetlink_cthelper.c             |  668 ++++++++++++++++++++++++
 12 files changed, 858 insertions(+), 27 deletions(-)
 create mode 100644 include/linux/netfilter/nfnetlink_cthelper.h
 create mode 100644 net/netfilter/nfnetlink_cthelper.c

diff --git a/include/linux/netfilter/Kbuild b/include/linux/netfilter/Kbuild
index 1697036..874ae8f 100644
--- a/include/linux/netfilter/Kbuild
+++ b/include/linux/netfilter/Kbuild
@@ -10,6 +10,7 @@ header-y += nfnetlink.h
 header-y += nfnetlink_acct.h
 header-y += nfnetlink_compat.h
 header-y += nfnetlink_conntrack.h
+header-y += nfnetlink_cthelper.h
 header-y += nfnetlink_cttimeout.h
 header-y += nfnetlink_log.h
 header-y += nfnetlink_queue.h
diff --git a/include/linux/netfilter/nfnetlink.h b/include/linux/netfilter/nfnetlink.h
index a1048c1..18341cd 100644
--- a/include/linux/netfilter/nfnetlink.h
+++ b/include/linux/netfilter/nfnetlink.h
@@ -50,7 +50,8 @@ struct nfgenmsg {
 #define NFNL_SUBSYS_IPSET		6
 #define NFNL_SUBSYS_ACCT		7
 #define NFNL_SUBSYS_CTNETLINK_TIMEOUT	8
-#define NFNL_SUBSYS_COUNT		9
+#define NFNL_SUBSYS_CTHELPER		9
+#define NFNL_SUBSYS_COUNT		10
 
 #ifdef __KERNEL__
 
diff --git a/include/linux/netfilter/nfnetlink_cthelper.h b/include/linux/netfilter/nfnetlink_cthelper.h
new file mode 100644
index 0000000..33659f6
--- /dev/null
+++ b/include/linux/netfilter/nfnetlink_cthelper.h
@@ -0,0 +1,55 @@
+#ifndef _NFNL_CTHELPER_H_
+#define _NFNL_CTHELPER_H_
+
+#define NFCT_HELPER_STATUS_DISABLED	0
+#define NFCT_HELPER_STATUS_ENABLED	1
+
+enum nfnl_cthelper_msg_types {
+	NFNL_MSG_CTHELPER_NEW,
+	NFNL_MSG_CTHELPER_GET,
+	NFNL_MSG_CTHELPER_DEL,
+	NFNL_MSG_CTHELPER_MAX
+};
+
+enum nfnl_cthelper_type {
+	NFCTH_UNSPEC,
+	NFCTH_NAME,
+	NFCTH_TUPLE,
+	NFCTH_QUEUE_NUM,
+	NFCTH_POLICY,
+	NFCTH_PRIV_DATA_LEN,
+	NFCTH_STATUS,
+	__NFCTH_MAX
+};
+#define NFCTH_MAX (__NFCTH_MAX - 1)
+
+enum nfnl_cthelper_policy_type {
+	NFCTH_POLICY_SET_UNSPEC,
+	NFCTH_POLICY_SET_NUM,
+	NFCTH_POLICY_SET,
+	NFCTH_POLICY_SET1	= NFCTH_POLICY_SET,
+	NFCTH_POLICY_SET2,
+	NFCTH_POLICY_SET3,
+	NFCTH_POLICY_SET4,
+	__NFCTH_POLICY_SET_MAX
+};
+#define NFCTH_POLICY_SET_MAX (__NFCTH_POLICY_SET_MAX - 1)
+
+enum nfnl_cthelper_pol_type {
+	NFCTH_POLICY_UNSPEC,
+	NFCTH_POLICY_NAME,
+	NFCTH_POLICY_EXPECT_MAX,
+	NFCTH_POLICY_EXPECT_TIMEOUT,
+	__NFCTH_POLICY_MAX
+};
+#define NFCTH_POLICY_MAX (__NFCTH_POLICY_MAX - 1)
+
+enum nfnl_cthelper_tuple_type {
+	NFCTH_TUPLE_UNSPEC,
+	NFCTH_TUPLE_L3PROTONUM,
+	NFCTH_TUPLE_L4PROTONUM,
+	__NFCTH_TUPLE_MAX,
+};
+#define NFCTH_TUPLE_MAX (__NFCTH_TUPLE_MAX - 1)
+
+#endif /* _NFNL_CTHELPER_H_ */
diff --git a/include/linux/netfilter_ipv4.h b/include/linux/netfilter_ipv4.h
index fa0946c..e2b1280 100644
--- a/include/linux/netfilter_ipv4.h
+++ b/include/linux/netfilter_ipv4.h
@@ -66,6 +66,7 @@ enum nf_ip_hook_priorities {
 	NF_IP_PRI_SECURITY = 50,
 	NF_IP_PRI_NAT_SRC = 100,
 	NF_IP_PRI_SELINUX_LAST = 225,
+	NF_IP_PRI_CONNTRACK_HELPER = 300,
 	NF_IP_PRI_CONNTRACK_CONFIRM = INT_MAX,
 	NF_IP_PRI_LAST = INT_MAX,
 };
diff --git a/include/linux/netfilter_ipv6.h b/include/linux/netfilter_ipv6.h
index 57c0251..7c8a513 100644
--- a/include/linux/netfilter_ipv6.h
+++ b/include/linux/netfilter_ipv6.h
@@ -71,6 +71,7 @@ enum nf_ip6_hook_priorities {
 	NF_IP6_PRI_SECURITY = 50,
 	NF_IP6_PRI_NAT_SRC = 100,
 	NF_IP6_PRI_SELINUX_LAST = 225,
+	NF_IP6_PRI_CONNTRACK_HELPER = 300,
 	NF_IP6_PRI_LAST = INT_MAX,
 };
 
diff --git a/include/net/netfilter/nf_conntrack_helper.h b/include/net/netfilter/nf_conntrack_helper.h
index e5091a9..f499aa5 100644
--- a/include/net/netfilter/nf_conntrack_helper.h
+++ b/include/net/netfilter/nf_conntrack_helper.h
@@ -15,6 +15,11 @@
 
 struct module;
 
+enum nf_ct_helper_flags {
+	NF_CT_HELPER_F_USERSPACE	= (1 << 0),
+	NF_CT_HELPER_F_CONFIGURED	= (1 << 1),
+};
+
 #define NF_CT_HELPER_NAME_LEN	16
 
 struct nf_conntrack_helper {
@@ -42,6 +47,9 @@ struct nf_conntrack_helper {
 	int (*from_nlattr)(struct nlattr *attr, struct nf_conn *ct);
 	int (*to_nlattr)(struct sk_buff *skb, const struct nf_conn *ct);
 	unsigned int expect_class_max;
+
+	unsigned int flags;
+	unsigned int queue_num;		/* For user-space helpers. */
 };
 
 extern struct nf_conntrack_helper *
@@ -96,4 +104,7 @@ nf_ct_helper_expectfn_find_by_name(const char *name);
 struct nf_ct_helper_expectfn *
 nf_ct_helper_expectfn_find_by_symbol(const void *symbol);
 
+extern struct hlist_head *nf_ct_helper_hash;
+extern unsigned int nf_ct_helper_hsize;
+
 #endif /*_NF_CONNTRACK_HELPER_H*/
diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
index 91747d4..d3cb34d 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
@@ -95,11 +95,11 @@ static int ipv4_get_l4proto(const struct sk_buff *skb, unsigned int nhoff,
 	return NF_ACCEPT;
 }
 
-static unsigned int ipv4_confirm(unsigned int hooknum,
-				 struct sk_buff *skb,
-				 const struct net_device *in,
-				 const struct net_device *out,
-				 int (*okfn)(struct sk_buff *))
+static unsigned int ipv4_helper(unsigned int hooknum,
+				struct sk_buff *skb,
+				const struct net_device *in,
+				const struct net_device *out,
+				int (*okfn)(struct sk_buff *))
 {
 	struct nf_conn *ct;
 	enum ip_conntrack_info ctinfo;
@@ -110,24 +110,45 @@ static unsigned int ipv4_confirm(unsigned int hooknum,
 	/* This is where we call the helper: as the packet goes out. */
 	ct = nf_ct_get(skb, &ctinfo);
 	if (!ct || ctinfo == IP_CT_RELATED_REPLY)
-		goto out;
+		return NF_ACCEPT;
 
 	help = nfct_help(ct);
 	if (!help)
-		goto out;
+		return NF_ACCEPT;
 
 	/* rcu_read_lock()ed by nf_hook_slow */
 	helper = rcu_dereference(help->helper);
 	if (!helper)
-		goto out;
+		return NF_ACCEPT;
+
+	/* This is a user-space helper not yet configured, skip. */
+	if ((helper->flags &
+		(NF_CT_HELPER_F_USERSPACE | NF_CT_HELPER_F_CONFIGURED)) ==
+		 NF_CT_HELPER_F_USERSPACE) {
+		return NF_ACCEPT;
+	}
 
 	ret = helper->help(skb, skb_network_offset(skb) + ip_hdrlen(skb),
 			   ct, ctinfo);
-	if (ret != NF_ACCEPT) {
+	if (ret != NF_ACCEPT && (ret & NF_VERDICT_MASK) != NF_QUEUE) {
 		nf_log_packet(NFPROTO_IPV4, hooknum, skb, in, out, NULL,
 			      "nf_ct_%s: dropping packet", helper->name);
-		return ret;
 	}
+	return ret;
+}
+
+static unsigned int ipv4_confirm(unsigned int hooknum,
+				 struct sk_buff *skb,
+				 const struct net_device *in,
+				 const struct net_device *out,
+				 int (*okfn)(struct sk_buff *))
+{
+	struct nf_conn *ct;
+	enum ip_conntrack_info ctinfo;
+
+	ct = nf_ct_get(skb, &ctinfo);
+	if (!ct || ctinfo == IP_CT_RELATED_REPLY)
+		return NF_ACCEPT;
 
 	/* adjust seqs for loopback traffic only in outgoing direction */
 	if (test_bit(IPS_SEQ_ADJUST_BIT, &ct->status) &&
@@ -140,7 +161,6 @@ static unsigned int ipv4_confirm(unsigned int hooknum,
 			return NF_DROP;
 		}
 	}
-out:
 	/* We've seen it coming out the other side: confirm it */
 	return nf_conntrack_confirm(skb);
 }
@@ -185,6 +205,13 @@ static struct nf_hook_ops ipv4_conntrack_ops[] __read_mostly = {
 		.priority	= NF_IP_PRI_CONNTRACK,
 	},
 	{
+		.hook		= ipv4_helper,
+		.owner		= THIS_MODULE,
+		.pf		= NFPROTO_IPV4,
+		.hooknum	= NF_INET_POST_ROUTING,
+		.priority	= NF_IP_PRI_CONNTRACK_HELPER,
+	},
+	{
 		.hook		= ipv4_confirm,
 		.owner		= THIS_MODULE,
 		.pf		= NFPROTO_IPV4,
@@ -192,6 +219,13 @@ static struct nf_hook_ops ipv4_conntrack_ops[] __read_mostly = {
 		.priority	= NF_IP_PRI_CONNTRACK_CONFIRM,
 	},
 	{
+		.hook		= ipv4_helper,
+		.owner		= THIS_MODULE,
+		.pf		= NFPROTO_IPV4,
+		.hooknum	= NF_INET_LOCAL_IN,
+		.priority	= NF_IP_PRI_CONNTRACK_HELPER,
+	},
+	{
 		.hook		= ipv4_confirm,
 		.owner		= THIS_MODULE,
 		.pf		= NFPROTO_IPV4,
diff --git a/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c b/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
index fe925e4..f9b3693 100644
--- a/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
+++ b/net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c
@@ -143,11 +143,11 @@ static int ipv6_get_l4proto(const struct sk_buff *skb, unsigned int nhoff,
 	return NF_ACCEPT;
 }
 
-static unsigned int ipv6_confirm(unsigned int hooknum,
-				 struct sk_buff *skb,
-				 const struct net_device *in,
-				 const struct net_device *out,
-				 int (*okfn)(struct sk_buff *))
+static unsigned int ipv6_helper(unsigned int hooknum,
+				struct sk_buff *skb,
+				const struct net_device *in,
+				const struct net_device *out,
+				int (*okfn)(struct sk_buff *))
 {
 	struct nf_conn *ct;
 	const struct nf_conn_help *help;
@@ -161,15 +161,15 @@ static unsigned int ipv6_confirm(unsigned int hooknum,
 	/* This is where we call the helper: as the packet goes out. */
 	ct = nf_ct_get(skb, &ctinfo);
 	if (!ct || ctinfo == IP_CT_RELATED_REPLY)
-		goto out;
+		return NF_ACCEPT;
 
 	help = nfct_help(ct);
 	if (!help)
-		goto out;
+		return NF_ACCEPT;
 	/* rcu_read_lock()ed by nf_hook_slow */
 	helper = rcu_dereference(help->helper);
 	if (!helper)
-		goto out;
+		return NF_ACCEPT;
 
 	protoff = nf_ct_ipv6_skip_exthdr(skb, extoff, &pnum,
 					 skb->len - extoff);
@@ -178,13 +178,35 @@ static unsigned int ipv6_confirm(unsigned int hooknum,
 		return NF_ACCEPT;
 	}
 
+	/* This is a user-space helper not yet configured, skip. */
+	if ((helper->flags &
+		(NF_CT_HELPER_F_USERSPACE | NF_CT_HELPER_F_CONFIGURED)) ==
+		 NF_CT_HELPER_F_USERSPACE) {
+		return NF_ACCEPT;
+	}
+
 	ret = helper->help(skb, protoff, ct, ctinfo);
-	if (ret != NF_ACCEPT) {
+	if (ret != NF_ACCEPT && (ret & NF_VERDICT_MASK) != NF_QUEUE) {
 		nf_log_packet(NFPROTO_IPV6, hooknum, skb, in, out, NULL,
 			      "nf_ct_%s: dropping packet", helper->name);
 		return ret;
 	}
-out:
+	return ret;
+}
+
+static unsigned int ipv6_confirm(unsigned int hooknum,
+				 struct sk_buff *skb,
+				 const struct net_device *in,
+				 const struct net_device *out,
+				 int (*okfn)(struct sk_buff *))
+{
+	struct nf_conn *ct;
+	enum ip_conntrack_info ctinfo;
+
+	ct = nf_ct_get(skb, &ctinfo);
+	if (!ct || ctinfo == IP_CT_RELATED_REPLY)
+		return NF_ACCEPT;
+
 	/* We've seen it coming out the other side: confirm it */
 	return nf_conntrack_confirm(skb);
 }
@@ -255,6 +277,13 @@ static struct nf_hook_ops ipv6_conntrack_ops[] __read_mostly = {
 		.priority	= NF_IP6_PRI_CONNTRACK,
 	},
 	{
+		.hook		= ipv6_helper,
+		.owner		= THIS_MODULE,
+		.pf		= NFPROTO_IPV6,
+		.hooknum	= NF_INET_POST_ROUTING,
+		.priority	= NF_IP6_PRI_CONNTRACK_HELPER,
+	},
+	{
 		.hook		= ipv6_confirm,
 		.owner		= THIS_MODULE,
 		.pf		= NFPROTO_IPV6,
@@ -262,6 +291,13 @@ static struct nf_hook_ops ipv6_conntrack_ops[] __read_mostly = {
 		.priority	= NF_IP6_PRI_LAST,
 	},
 	{
+		.hook		= ipv6_helper,
+		.owner		= THIS_MODULE,
+		.pf		= NFPROTO_IPV6,
+		.hooknum	= NF_INET_LOCAL_IN,
+		.priority	= NF_IP6_PRI_CONNTRACK_HELPER,
+	},
+	{
 		.hook		= ipv6_confirm,
 		.owner		= THIS_MODULE,
 		.pf		= NFPROTO_IPV6,
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 209c1ed..cd5668e 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -12,6 +12,14 @@ tristate "Netfilter NFACCT over NFNETLINK interface"
 	  If this option is enabled, the kernel will include support
 	  for extended accounting via NFNETLINK.
 
+config NETFILTER_NETLINK_CTHELPER
+tristate "Netfilter NFCT_HELPER over NFNETLINK interface"
+	depends on NETFILTER_ADVANCED
+	select NETFILTER_NETLINK
+	help
+	  If this option is enabled, the kernel will include support
+	  for user-space connection tracking helpers via NFNETLINK.
+
 config NETFILTER_NETLINK_QUEUE
 	tristate "Netfilter NFQUEUE over NFNETLINK interface"
 	depends on NETFILTER_ADVANCED
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 4e7960c..2f3bc0f 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -9,6 +9,7 @@ obj-$(CONFIG_NETFILTER) = netfilter.o
 
 obj-$(CONFIG_NETFILTER_NETLINK) += nfnetlink.o
 obj-$(CONFIG_NETFILTER_NETLINK_ACCT) += nfnetlink_acct.o
+obj-$(CONFIG_NETFILTER_NETLINK_CTHELPER) += nfnetlink_cthelper.o
 obj-$(CONFIG_NETFILTER_NETLINK_QUEUE) += nfnetlink_queue.o
 obj-$(CONFIG_NETFILTER_NETLINK_LOG) += nfnetlink_log.o
 
diff --git a/net/netfilter/nf_conntrack_helper.c b/net/netfilter/nf_conntrack_helper.c
index e0d3f4b..3d91d11 100644
--- a/net/netfilter/nf_conntrack_helper.c
+++ b/net/netfilter/nf_conntrack_helper.c
@@ -30,8 +30,10 @@
 #include <net/netfilter/nf_conntrack_extend.h>
 
 static DEFINE_MUTEX(nf_ct_helper_mutex);
-static struct hlist_head *nf_ct_helper_hash __read_mostly;
-static unsigned int nf_ct_helper_hsize __read_mostly;
+struct hlist_head *nf_ct_helper_hash __read_mostly;
+EXPORT_SYMBOL_GPL(nf_ct_helper_hash);
+unsigned int nf_ct_helper_hsize __read_mostly;
+EXPORT_SYMBOL_GPL(nf_ct_helper_hsize);
 static unsigned int nf_ct_helper_count __read_mostly;
 
 static bool nf_ct_auto_assign_helper __read_mostly = true;
@@ -322,18 +324,30 @@ EXPORT_SYMBOL_GPL(nf_ct_helper_expectfn_find_by_symbol);
 
 int nf_conntrack_helper_register(struct nf_conntrack_helper *me)
 {
+	int ret = 0;
+	struct nf_conntrack_helper *cur;
+	struct hlist_node *n;
 	unsigned int h = helper_hash(&me->tuple);
 
-	BUG_ON(me->expect_policy == NULL);
+	BUG_ON(me->expect_policy == NULL &&
+	       !(me->flags & NF_CT_HELPER_F_USERSPACE));
 	BUG_ON(me->expect_class_max >= NF_CT_MAX_EXPECT_CLASSES);
 	BUG_ON(strlen(me->name) > NF_CT_HELPER_NAME_LEN - 1);
 
 	mutex_lock(&nf_ct_helper_mutex);
+	hlist_for_each_entry(cur, n, &nf_ct_helper_hash[h], hnode) {
+		if (strncmp(cur->name, me->name, NF_CT_HELPER_NAME_LEN) == 0 &&
+		    cur->tuple.src.l3num == me->tuple.src.l3num &&
+		    cur->tuple.dst.protonum == me->tuple.dst.protonum) {
+			ret = -EEXIST;
+			goto out;
+		}
+	}
 	hlist_add_head_rcu(&me->hnode, &nf_ct_helper_hash[h]);
 	nf_ct_helper_count++;
+out:
 	mutex_unlock(&nf_ct_helper_mutex);
-
-	return 0;
+	return ret;
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_helper_register);
 
diff --git a/net/netfilter/nfnetlink_cthelper.c b/net/netfilter/nfnetlink_cthelper.c
new file mode 100644
index 0000000..8c80a34
--- /dev/null
+++ b/net/netfilter/nfnetlink_cthelper.c
@@ -0,0 +1,668 @@
+/*
+ * (C) 2012 Pablo Neira Ayuso <pablo@netfilter.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation (or any later at your option).
+ *
+ * This software has been sponsored by Vyatta Inc. <http://www.vyatta.com>
+ */
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/skbuff.h>
+#include <linux/netlink.h>
+#include <linux/rculist.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/errno.h>
+#include <net/netlink.h>
+#include <net/sock.h>
+
+#include <net/netfilter/nf_conntrack_helper.h>
+#include <net/netfilter/nf_conntrack_expect.h>
+#include <net/netfilter/nf_conntrack_ecache.h>
+
+#include <linux/netfilter/nfnetlink.h>
+#include <linux/netfilter/nfnetlink_conntrack.h>
+#include <linux/netfilter/nfnetlink_cthelper.h>
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Pablo Neira Ayuso <pablo@netfilter.org>");
+MODULE_DESCRIPTION("nfnl_cthelper: User-space connection tracking helpers");
+
+static int
+nfnl_userspace_cthelper(struct sk_buff *skb, unsigned int protoff,
+			struct nf_conn *ct, enum ip_conntrack_info ctinfo)
+{
+	const struct nf_conn_help *help;
+	struct nf_conntrack_helper *helper;
+
+	help = nfct_help(ct);
+	if (help == NULL)
+		return NF_DROP;
+
+	/* rcu_read_lock()ed by nf_hook_slow */
+	helper = rcu_dereference(help->helper);
+	if (helper == NULL)
+		return NF_DROP;
+
+	/* If the user-space helper is not available, don't block traffic. */
+	return NF_QUEUE_NR(helper->queue_num) | NF_VERDICT_FLAG_QUEUE_BYPASS;
+}
+
+static const struct nla_policy nfnl_cthelper_tuple_pol[NFCTH_TUPLE_MAX+1] = {
+	[NFCTH_TUPLE_L3PROTONUM] = { .type = NLA_U16, },
+	[NFCTH_TUPLE_L4PROTONUM] = { .type = NLA_U8, },
+};
+
+static int
+nfnl_cthelper_parse_tuple(struct nf_conntrack_tuple *tuple,
+			  const struct nlattr *attr)
+{
+	struct nlattr *tb[NFCTH_TUPLE_MAX+1];
+
+	nla_parse_nested(tb, NFCTH_TUPLE_MAX, attr, nfnl_cthelper_tuple_pol);
+
+	if (!tb[NFCTH_TUPLE_L3PROTONUM] || !tb[NFCTH_TUPLE_L4PROTONUM])
+		return -EINVAL;
+
+	tuple->src.l3num = ntohs(nla_get_u16(tb[NFCTH_TUPLE_L3PROTONUM]));
+	tuple->dst.protonum = nla_get_u8(tb[NFCTH_TUPLE_L4PROTONUM]);
+
+	return 0;
+}
+
+static int
+nfnl_cthelper_from_nlattr(struct nlattr *attr, struct nf_conn *ct)
+{
+	const struct nf_conn_help *help = nfct_help(ct);
+
+	if (help->helper->data_len == 0)
+		return -EINVAL;
+
+	memcpy(&help->data, nla_data(attr), help->helper->data_len);
+	return 0;
+}
+
+static int
+nfnl_cthelper_to_nlattr(struct sk_buff *skb, const struct nf_conn *ct)
+{
+	const struct nf_conn_help *help = nfct_help(ct);
+
+	if (help->helper->data_len &&
+	    nla_put(skb, CTA_HELP_INFO, help->helper->data_len, &help->data))
+		goto nla_put_failure;
+
+	return 0;
+
+nla_put_failure:
+	return -ENOSPC;
+}
+
+static const struct nla_policy nfnl_cthelper_expect_pol[NFCTH_POLICY_MAX+1] = {
+	[NFCTH_POLICY_NAME] = { .type = NLA_NUL_STRING,
+				.len = NF_CT_HELPER_NAME_LEN-1 },
+	[NFCTH_POLICY_EXPECT_MAX] = { .type = NLA_U32, },
+	[NFCTH_POLICY_EXPECT_TIMEOUT] = { .type = NLA_U32, },
+};
+
+static int
+nfnl_cthelper_expect_policy(struct nf_conntrack_expect_policy *expect_policy,
+			    const struct nlattr *attr)
+{
+	struct nlattr *tb[NFCTH_POLICY_MAX+1];
+
+	nla_parse_nested(tb, NFCTH_POLICY_MAX, attr, nfnl_cthelper_expect_pol);
+
+	if (!tb[NFCTH_POLICY_NAME] ||
+	    !tb[NFCTH_POLICY_EXPECT_MAX] ||
+	    !tb[NFCTH_POLICY_EXPECT_TIMEOUT])
+		return -EINVAL;
+
+	strncpy(expect_policy->name,
+		nla_data(tb[NFCTH_POLICY_NAME]), NF_CT_HELPER_NAME_LEN);
+	expect_policy->max_expected =
+		ntohl(nla_get_be32(tb[NFCTH_POLICY_EXPECT_MAX]));
+	expect_policy->timeout =
+		ntohl(nla_get_be32(tb[NFCTH_POLICY_EXPECT_TIMEOUT]));
+
+	return 0;
+}
+
+static const struct nla_policy
+nfnl_cthelper_expect_policy_set[NFCTH_POLICY_SET_MAX+1] = {
+	[NFCTH_POLICY_SET_NUM] = { .type = NLA_U32, },
+};
+
+static int
+nfnl_cthelper_parse_expect_policy(struct nf_conntrack_helper *helper,
+				  const struct nlattr *attr)
+{
+	int i, ret;
+	struct nf_conntrack_expect_policy *expect_policy;
+	struct nlattr *tb[NFCTH_POLICY_SET_MAX+1];
+
+	nla_parse_nested(tb, NFCTH_POLICY_SET_MAX, attr,
+					nfnl_cthelper_expect_policy_set);
+
+	if (!tb[NFCTH_POLICY_SET_NUM])
+		return -EINVAL;
+
+	helper->expect_class_max =
+		ntohl(nla_get_be32(tb[NFCTH_POLICY_SET_NUM]));
+
+	if (helper->expect_class_max != 0 &&
+	    helper->expect_class_max > NF_CT_MAX_EXPECT_CLASSES)
+		return -EOVERFLOW;
+
+	expect_policy = kzalloc(sizeof(struct nf_conntrack_expect_policy) *
+				helper->expect_class_max, GFP_KERNEL);
+	if (expect_policy == NULL)
+		return -ENOMEM;
+
+	for (i=0; i<helper->expect_class_max; i++) {
+		if (!tb[NFCTH_POLICY_SET+i])
+			goto err;
+
+		ret = nfnl_cthelper_expect_policy(&expect_policy[i],
+						  tb[NFCTH_POLICY_SET+i]);
+		if (ret < 0)
+			goto err;
+	}
+	helper->expect_policy = expect_policy;
+	return 0;
+err:
+	kfree(expect_policy);
+	return -EINVAL;
+}
+
+static int
+nfnl_cthelper_create(const struct nlattr * const tb[],
+		     struct nf_conntrack_tuple *tuple)
+{
+	struct nf_conntrack_helper *helper;
+	int ret;
+
+	if (!tb[NFCTH_TUPLE] || !tb[NFCTH_POLICY] || !tb[NFCTH_PRIV_DATA_LEN])
+		return -EINVAL;
+
+	helper = kzalloc(sizeof(struct nf_conntrack_helper), GFP_KERNEL);
+	if (helper == NULL)
+		return -ENOMEM;
+
+	ret = nfnl_cthelper_parse_expect_policy(helper, tb[NFCTH_POLICY]);
+	if (ret < 0)
+		goto err;
+
+	strncpy(helper->name, nla_data(tb[NFCTH_NAME]), NF_CT_HELPER_NAME_LEN);
+	helper->data_len = ntohl(nla_get_be32(tb[NFCTH_PRIV_DATA_LEN]));
+	helper->flags |= NF_CT_HELPER_F_USERSPACE;
+	memcpy(&helper->tuple, tuple, sizeof(struct nf_conntrack_tuple));
+
+	helper->me = THIS_MODULE;
+	helper->help = nfnl_userspace_cthelper;
+	helper->from_nlattr = nfnl_cthelper_from_nlattr;
+	helper->to_nlattr = nfnl_cthelper_to_nlattr;
+
+	/* Default to queue number zero, this can be updated at any time. */
+	if (tb[NFCTH_QUEUE_NUM])
+		helper->queue_num = ntohl(nla_get_be32(tb[NFCTH_QUEUE_NUM]));
+
+	if (tb[NFCTH_STATUS]) {
+		int status = ntohl(nla_get_be32(tb[NFCTH_STATUS]));
+
+		switch(status) {
+		case NFCT_HELPER_STATUS_ENABLED:
+			helper->flags |= NF_CT_HELPER_F_CONFIGURED;
+			break;
+		case NFCT_HELPER_STATUS_DISABLED:
+			helper->flags &= ~NF_CT_HELPER_F_CONFIGURED;
+			break;
+		}
+	}
+
+	ret = nf_conntrack_helper_register(helper);
+	if (ret < 0)
+		goto err;
+
+	return 0;
+err:
+	kfree(helper);
+	return ret;
+}
+
+static int
+nfnl_cthelper_update(const struct nlattr * const tb[],
+		     struct nf_conntrack_helper *helper)
+{
+	int ret;
+
+	if (tb[NFCTH_PRIV_DATA_LEN])
+		return -EBUSY;
+
+	if (tb[NFCTH_POLICY]) {
+		ret = nfnl_cthelper_parse_expect_policy(helper,
+							tb[NFCTH_POLICY]);
+		if (ret < 0)
+			return ret;
+	}
+	if (tb[NFCTH_QUEUE_NUM])
+		helper->queue_num = ntohl(nla_get_be32(tb[NFCTH_QUEUE_NUM]));
+
+	if (tb[NFCTH_STATUS]) {
+		int status = ntohl(nla_get_be32(tb[NFCTH_STATUS]));
+
+		switch(status) {
+		case NFCT_HELPER_STATUS_ENABLED:
+			helper->flags |= NF_CT_HELPER_F_CONFIGURED;
+			break;
+		case NFCT_HELPER_STATUS_DISABLED:
+			helper->flags &= ~NF_CT_HELPER_F_CONFIGURED;
+			break;
+		}
+	}
+	return 0;
+}
+
+static int
+nfnl_cthelper_new(struct sock *nfnl, struct sk_buff *skb,
+		  const struct nlmsghdr *nlh, const struct nlattr * const tb[])
+{
+	const char *helper_name;
+	struct nf_conntrack_helper *cur, *helper = NULL;
+	struct nf_conntrack_tuple tuple;
+	struct hlist_node *n;
+	int ret = 0, i;
+
+	if (!tb[NFCTH_NAME] || !tb[NFCTH_TUPLE])
+		return -EINVAL;
+
+	helper_name = nla_data(tb[NFCTH_NAME]);
+
+	ret = nfnl_cthelper_parse_tuple(&tuple, tb[NFCTH_TUPLE]);
+	if (ret < 0)
+		return ret;
+
+	rcu_read_lock();
+	for (i = 0; i < nf_ct_helper_hsize && !helper; i++) {
+		hlist_for_each_entry_rcu(cur, n, &nf_ct_helper_hash[i], hnode) {
+
+			/* skip non-userspace conntrack helpers. */
+			if (!(cur->flags & NF_CT_HELPER_F_USERSPACE))
+				continue;
+
+			if (strncmp(cur->name, helper_name,
+					NF_CT_HELPER_NAME_LEN) != 0)
+				continue;
+
+			if ((tuple.src.l3num != cur->tuple.src.l3num ||
+			     tuple.dst.protonum != cur->tuple.dst.protonum))
+				continue;
+
+			if (nlh->nlmsg_flags & NLM_F_EXCL) {
+				ret = -EEXIST;
+				goto err;
+			}
+			helper = cur;
+			break;
+		}
+	}
+	rcu_read_unlock();
+
+	if (helper == NULL)
+		ret = nfnl_cthelper_create(tb, &tuple);
+	else
+		ret = nfnl_cthelper_update(tb, helper);
+
+	return ret;
+err:
+	rcu_read_unlock();
+	return ret;
+}
+
+static int
+nfnl_cthelper_dump_tuple(struct sk_buff *skb,
+			 struct nf_conntrack_helper *helper)
+{
+	struct nlattr *nest_parms;
+
+	nest_parms = nla_nest_start(skb, NFCTH_TUPLE | NLA_F_NESTED);
+	if (nest_parms == NULL)
+		goto nla_put_failure;
+
+	if (nla_put_u16(skb, NFCTH_TUPLE_L3PROTONUM,
+			htons(helper->tuple.src.l3num)))
+		goto nla_put_failure;
+
+	if (nla_put_u8(skb, NFCTH_TUPLE_L4PROTONUM, helper->tuple.dst.protonum))
+		goto nla_put_failure;
+
+	nla_nest_end(skb, nest_parms);
+	return 0;
+
+nla_put_failure:
+	return -1;
+}
+
+static int
+nfnl_cthelper_dump_policy(struct sk_buff *skb,
+			struct nf_conntrack_helper *helper)
+{
+	int i;
+	struct nlattr *nest_parms1, *nest_parms2;
+
+	nest_parms1 = nla_nest_start(skb, NFCTH_POLICY | NLA_F_NESTED);
+	if (nest_parms1 == NULL)
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, NFCTH_POLICY_SET_NUM,
+			htonl(helper->expect_class_max)))
+		goto nla_put_failure;
+
+	for (i=0; i<helper->expect_class_max; i++) {
+		nest_parms2 = nla_nest_start(skb,
+				(NFCTH_POLICY_SET+i) | NLA_F_NESTED);
+		if (nest_parms2 == NULL)
+			goto nla_put_failure;
+
+		if (nla_put_string(skb, NFCTH_POLICY_NAME,
+				   helper->expect_policy[i].name))
+			goto nla_put_failure;
+
+		if (nla_put_u32(skb, NFCTH_POLICY_EXPECT_MAX,
+				htonl(helper->expect_policy[i].max_expected)))
+			goto nla_put_failure;
+
+		if (nla_put_u32(skb, NFCTH_POLICY_EXPECT_TIMEOUT,
+				htonl(helper->expect_policy[i].timeout)))
+			goto nla_put_failure;
+
+		nla_nest_end(skb, nest_parms2);
+	}
+	nla_nest_end(skb, nest_parms1);
+	return 0;
+
+nla_put_failure:
+	return -1;
+}
+
+static int
+nfnl_cthelper_fill_info(struct sk_buff *skb, u32 pid, u32 seq, u32 type,
+			int event, struct nf_conntrack_helper *helper)
+{
+	struct nlmsghdr *nlh;
+	struct nfgenmsg *nfmsg;
+	unsigned int flags = pid ? NLM_F_MULTI : 0;
+	int status;
+
+	event |= NFNL_SUBSYS_CTHELPER << 8;
+	nlh = nlmsg_put(skb, pid, seq, event, sizeof(*nfmsg), flags);
+	if (nlh == NULL)
+		goto nlmsg_failure;
+
+	nfmsg = nlmsg_data(nlh);
+	nfmsg->nfgen_family = AF_UNSPEC;
+	nfmsg->version = NFNETLINK_V0;
+	nfmsg->res_id = 0;
+
+	if (nla_put_string(skb, NFCTH_NAME, helper->name))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, NFCTH_QUEUE_NUM, htonl(helper->queue_num)))
+		goto nla_put_failure;
+
+	if (nfnl_cthelper_dump_tuple(skb, helper) < 0)
+		goto nla_put_failure;
+
+	if (nfnl_cthelper_dump_policy(skb, helper) < 0)
+		goto nla_put_failure;
+
+	if (nla_put_be32(skb, NFCTH_PRIV_DATA_LEN, htonl(helper->data_len)))
+		goto nla_put_failure;
+
+	if (helper->flags & NF_CT_HELPER_F_CONFIGURED)
+		status = NFCT_HELPER_STATUS_ENABLED;
+	else
+		status = NFCT_HELPER_STATUS_DISABLED;
+
+	if (nla_put_be32(skb, NFCTH_STATUS, htonl(status)))
+		goto nla_put_failure;
+
+	nlmsg_end(skb, nlh);
+	return skb->len;
+
+nlmsg_failure:
+nla_put_failure:
+	nlmsg_cancel(skb, nlh);
+	return -1;
+}
+
+static int
+nfnl_cthelper_dump_table(struct sk_buff *skb, struct netlink_callback *cb)
+{
+	struct nf_conntrack_helper *cur, *last;
+	struct hlist_node *n;
+
+	rcu_read_lock();
+	last = (struct nf_conntrack_helper *)cb->args[1];
+	for (; cb->args[0] < nf_ct_helper_hsize; cb->args[0]++) {
+restart:
+		hlist_for_each_entry_rcu(cur, n,
+				&nf_ct_helper_hash[cb->args[0]], hnode) {
+
+			/* skip non-userspace conntrack helpers. */
+			if (!(cur->flags & NF_CT_HELPER_F_USERSPACE))
+				continue;
+
+			if (cb->args[1]) {
+				if (cur != last)
+					continue;
+				cb->args[1] = 0;
+			}
+			if (nfnl_cthelper_fill_info(skb,
+					    NETLINK_CB(cb->skb).pid,
+					    cb->nlh->nlmsg_seq,
+					    NFNL_MSG_TYPE(cb->nlh->nlmsg_type),
+					    NFNL_MSG_CTHELPER_NEW, cur) < 0) {
+				cb->args[1] = (unsigned long)cur;
+				goto out;
+			}
+		}
+	}
+	if (cb->args[1]) {
+		cb->args[1] = 0;
+		goto restart;
+	}
+out:
+	rcu_read_unlock();
+	return skb->len;
+}
+
+static int
+nfnl_cthelper_get(struct sock *nfnl, struct sk_buff *skb,
+		  const struct nlmsghdr *nlh, const struct nlattr * const tb[])
+{
+	int ret = -ENOENT, i;
+	struct nf_conntrack_helper *cur;
+	struct hlist_node *n;
+	struct sk_buff *skb2;
+	char *helper_name = NULL;
+	struct nf_conntrack_tuple tuple;
+	bool tuple_set = false;
+
+	if (nlh->nlmsg_flags & NLM_F_DUMP) {
+		struct netlink_dump_control c = {
+			.dump = nfnl_cthelper_dump_table,
+		};
+		return netlink_dump_start(nfnl, skb, nlh, &c);
+	}
+
+	if (tb[NFCTH_NAME])
+		helper_name = nla_data(tb[NFCTH_NAME]);
+
+	if (tb[NFCTH_TUPLE]) {
+		ret = nfnl_cthelper_parse_tuple(&tuple, tb[NFCTH_TUPLE]);
+		if (ret < 0)
+			return ret;
+
+		tuple_set = true;
+	}
+
+	for (i = 0; i < nf_ct_helper_hsize; i++) {
+		hlist_for_each_entry_rcu(cur, n, &nf_ct_helper_hash[i], hnode) {
+
+			/* skip non-userspace conntrack helpers. */
+			if (!(cur->flags & NF_CT_HELPER_F_USERSPACE))
+				continue;
+
+			if (helper_name && strncmp(cur->name, helper_name,
+						NF_CT_HELPER_NAME_LEN) != 0) {
+				continue;
+			}
+			if (tuple_set &&
+			    (tuple.src.l3num != cur->tuple.src.l3num ||
+			     tuple.dst.protonum != cur->tuple.dst.protonum))
+				continue;
+
+			skb2 = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+			if (skb2 == NULL) {
+				ret = -ENOMEM;
+				break;
+			}
+
+			ret = nfnl_cthelper_fill_info(skb2, NETLINK_CB(skb).pid,
+						nlh->nlmsg_seq,
+						NFNL_MSG_TYPE(nlh->nlmsg_type),
+						NFNL_MSG_CTHELPER_NEW, cur);
+			if (ret <= 0) {
+				kfree_skb(skb2);
+				break;
+			}
+
+			ret = netlink_unicast(nfnl, skb2, NETLINK_CB(skb).pid,
+						MSG_DONTWAIT);
+			if (ret > 0)
+				ret = 0;
+
+			/* this avoids a loop in nfnetlink. */
+			return ret == -EAGAIN ? -ENOBUFS : ret;
+		}
+	}
+	return ret;
+}
+
+static int
+nfnl_cthelper_del(struct sock *nfnl, struct sk_buff *skb,
+	     const struct nlmsghdr *nlh, const struct nlattr * const tb[])
+{
+	char *helper_name = NULL;
+	struct nf_conntrack_helper *cur;
+	struct hlist_node *n, *tmp;
+	struct nf_conntrack_tuple tuple;
+	bool tuple_set = false, found = false;
+	int i, j = 0, ret;
+
+	if (tb[NFCTH_NAME])
+		helper_name = nla_data(tb[NFCTH_NAME]);
+
+	if (tb[NFCTH_TUPLE]) {
+		ret = nfnl_cthelper_parse_tuple(&tuple, tb[NFCTH_TUPLE]);
+		if (ret < 0)
+			return ret;
+
+		tuple_set = true;
+	}
+
+	for (i = 0; i < nf_ct_helper_hsize; i++) {
+		hlist_for_each_entry_safe(cur, n, tmp, &nf_ct_helper_hash[i],
+								hnode) {
+			/* skip non-userspace conntrack helpers. */
+			if (!(cur->flags & NF_CT_HELPER_F_USERSPACE))
+				continue;
+
+			j++;
+
+			if (helper_name && strncmp(cur->name, helper_name,
+						NF_CT_HELPER_NAME_LEN) != 0) {
+				continue;
+			}
+			if (tuple_set &&
+			    (tuple.src.l3num != cur->tuple.src.l3num ||
+			     tuple.dst.protonum != cur->tuple.dst.protonum))
+				continue;
+
+			found = true;
+			nf_conntrack_helper_unregister(cur);
+		}
+	}
+	/* Make sure we return success if we flush and there are no helpers */
+	return (found || j == 0) ? 0 : -ENOENT;
+}
+
+static const struct nla_policy nfnl_cthelper_policy[NFCTH_MAX+1] = {
+	[NFCTH_NAME] = { .type = NLA_NUL_STRING,
+			 .len = NF_CT_HELPER_NAME_LEN-1 },
+	[NFCTH_QUEUE_NUM] = { .type = NLA_U32, },
+};
+
+static const struct nfnl_callback nfnl_cthelper_cb[NFNL_MSG_CTHELPER_MAX] = {
+	[NFNL_MSG_CTHELPER_NEW]		= { .call = nfnl_cthelper_new,
+					    .attr_count = NFCTH_MAX,
+					    .policy = nfnl_cthelper_policy },
+	[NFNL_MSG_CTHELPER_GET]		= { .call = nfnl_cthelper_get,
+					    .attr_count = NFCTH_MAX,
+					    .policy = nfnl_cthelper_policy },
+	[NFNL_MSG_CTHELPER_DEL]		= { .call = nfnl_cthelper_del,
+					    .attr_count = NFCTH_MAX,
+					    .policy = nfnl_cthelper_policy },
+};
+
+static const struct nfnetlink_subsystem nfnl_cthelper_subsys = {
+	.name				= "cthelper",
+	.subsys_id			= NFNL_SUBSYS_CTHELPER,
+	.cb_count			= NFNL_MSG_CTHELPER_MAX,
+	.cb				= nfnl_cthelper_cb,
+};
+
+MODULE_ALIAS_NFNL_SUBSYS(NFNL_SUBSYS_CTHELPER);
+
+static int __init nfnl_cthelper_init(void)
+{
+	int ret;
+
+	pr_info("nfnl_cthelper: registering with nfnetlink.\n");
+	ret = nfnetlink_subsys_register(&nfnl_cthelper_subsys);
+	if (ret < 0) {
+		pr_err("nfnl_cthelper: cannot register with nfnetlink.\n");
+		goto err_out;
+	}
+	return 0;
+err_out:
+	return ret;
+}
+
+static void __exit nfnl_cthelper_exit(void)
+{
+	struct nf_conntrack_helper *cur;
+	struct hlist_node *n, *tmp;
+	int i;
+
+	pr_info("nfnl_cthelper: unregistering from nfnetlink.\n");
+	nfnetlink_subsys_unregister(&nfnl_cthelper_subsys);
+
+	for (i=0; i<nf_ct_helper_hsize; i++) {
+		hlist_for_each_entry_safe(cur, n, tmp, &nf_ct_helper_hash[i],
+									hnode) {
+			/* skip non-userspace conntrack helpers. */
+			if (!(cur->flags & NF_CT_HELPER_F_USERSPACE))
+				continue;
+
+			nf_conntrack_helper_unregister(cur);
+		}
+	}
+}
+
+module_init(nfnl_cthelper_init);
+module_exit(nfnl_cthelper_exit);
-- 
1.7.10


^ permalink raw reply related

* [PATCH 0/7] [RFC] new user-space connection tracking helper infrastructure
From: pablo @ 2012-06-04 12:21 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev

From: Pablo Neira Ayuso <pablo@netfilter.org>

Hi!

This is a new attempt to provide a full user-space connection tracking helper
infrastructure. Some of you who follow my tree already know that I have been
working on this for some time.

Previous approaches had important limitations, and their integration with
iptables was not particularly clean.

The initial patches prepare the ground for the introduction of the
cthelper infrastructure:

1) allocate a fixed area for the helper name. As a side effect, the
   initialization code of the kernel-space helpers looks better IMO.

2) allow variable length conntrack extensions.

3) add support for variable length helper extensions.

4) improve integration between nfnetlink_queue and ctnetlink. You no longer
   have to open two handles, one to listen to packets via nfqueue and one to
   receive events via ctnetlink. Instead, you can enable one flag to get the
   conntrack data together with the packet via nfqueue.

5) improve integration of packet mangling and nf_conntrack. This has been
   a long-standing issue: if you mangle a TCP packet in user-space while
   connection tracking is enabled, nf_ct_tcp reports sequence tracking
   errors. This patch aims to resolve this issue.

6) add the CTA_HELP_INFO attribute. This is used to store the private helper
   data, so we don't need to keep a redundant cache of conntrack entries
   in user-space.

7) finally, the netlink cthelper infrastructure.

Of course, this patchset makes no sense without the user-space changes, which are:

* updates in the conntrack-tools (see cthelper11 branch):
http://git.netfilter.org/cgi-bin/gitweb.cgi?p=conntrack-tools.git;a=shortlog;h=refs/heads/cthelper11

It includes the FTP user-space helper, one RPC helper (for NFSv3) and one TNS
helper (for Oracle).

* libnetfilter_cthelper
http://git.netfilter.org/cgi-bin/gitweb.cgi?p=libnetfilter_cthelper.git;a=summary

* libnetfilter_conntrack (new libmnl API)
http://git.netfilter.org/cgi-bin/gitweb.cgi?p=libnetfilter_conntrack.git;a=summary

* libnetfilter_queue
http://git.netfilter.org/cgi-bin/gitweb.cgi?p=libnetfilter_queue.git;a=shortlog;h=refs/heads/cthelper2

WARNING: Changes may occur on the user-space side until all those cthelper
branches are merged into master. Keep in mind that this is work in progress.

Pablo Neira Ayuso (7):
  netfilter: nf_ct_helper: allocate 16 bytes for the helper and policy names
  netfilter: nf_ct_ext: support variable length extensions
  netfilter: nf_ct_helper: implement variable length helper private data
  netfilter: add glue code to integrate nfnetlink_queue and ctnetlink
  netfilter: nfnl_queue: support NAT TCP sequence adjustment if packet mangled
  netfilter: ctnetlink: add CTA_HELP_INFO attribute
  netfilter: add user-space connection tracking helper infrastructure

 include/linux/netfilter.h                      |   10 +
 include/linux/netfilter/Kbuild                 |    1 +
 include/linux/netfilter/nf_conntrack_sip.h     |    1 +
 include/linux/netfilter/nfnetlink.h            |    3 +-
 include/linux/netfilter/nfnetlink_conntrack.h  |    1 +
 include/linux/netfilter/nfnetlink_cthelper.h   |   55 ++
 include/linux/netfilter/nfnetlink_queue.h      |    7 +
 include/linux/netfilter_ipv4.h                 |    1 +
 include/linux/netfilter_ipv6.h                 |    1 +
 include/net/netfilter/nf_conntrack.h           |   35 +-
 include/net/netfilter/nf_conntrack_expect.h    |    4 +-
 include/net/netfilter/nf_conntrack_extend.h    |    7 +-
 include/net/netfilter/nf_conntrack_helper.h    |   29 +-
 include/net/netfilter/nf_nat_helper.h          |    7 +
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c |   56 +-
 net/ipv4/netfilter/nf_nat_amanda.c             |    4 +-
 net/ipv4/netfilter/nf_nat_h323.c               |    8 +-
 net/ipv4/netfilter/nf_nat_helper.c             |   13 +
 net/ipv4/netfilter/nf_nat_pptp.c               |    6 +-
 net/ipv4/netfilter/nf_nat_sip.c                |   14 +-
 net/ipv4/netfilter/nf_nat_tftp.c               |    4 +-
 net/ipv6/netfilter/nf_conntrack_l3proto_ipv6.c |   56 +-
 net/netfilter/Kconfig                          |    8 +
 net/netfilter/Makefile                         |    1 +
 net/netfilter/core.c                           |    4 +
 net/netfilter/nf_conntrack_core.c              |    3 +-
 net/netfilter/nf_conntrack_extend.c            |   16 +-
 net/netfilter/nf_conntrack_ftp.c               |   11 +-
 net/netfilter/nf_conntrack_h323_main.c         |   16 +-
 net/netfilter/nf_conntrack_helper.c            |   35 +-
 net/netfilter/nf_conntrack_irc.c               |    8 +-
 net/netfilter/nf_conntrack_netlink.c           |  190 ++++++-
 net/netfilter/nf_conntrack_pptp.c              |   17 +-
 net/netfilter/nf_conntrack_proto_gre.c         |   16 +-
 net/netfilter/nf_conntrack_sane.c              |   12 +-
 net/netfilter/nf_conntrack_sip.c               |   36 +-
 net/netfilter/nf_conntrack_tftp.c              |    8 +-
 net/netfilter/nfnetlink_cthelper.c             |  668 ++++++++++++++++++++++++
 net/netfilter/nfnetlink_queue.c                |   84 ++-
 net/netfilter/xt_CT.c                          |   44 +-
 40 files changed, 1309 insertions(+), 191 deletions(-)
 create mode 100644 include/linux/netfilter/nfnetlink_cthelper.h
 create mode 100644 net/netfilter/nfnetlink_cthelper.c

-- 
1.7.10

^ permalink raw reply

* [PATCH 4/7] netfilter: add glue code to integrate nfnetlink_queue and ctnetlink
From: pablo @ 2012-06-04 12:21 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev
In-Reply-To: <1338812485-4232-1-git-send-email-pablo@netfilter.org>

From: Pablo Neira Ayuso <pablo@netfilter.org>

This patch allows you to include the conntrack information together
with the packet that is sent to user-space via NFQUEUE.

Previously, there was no integration between ctnetlink and
nfnetlink_queue. If you wanted to access conntrack information
from your libnetfilter_queue program, you had to query
ctnetlink from user-space to obtain it, which delayed packet
processing even more.

Including the conntrack information is optional: you can enable it
by setting the NFQNL_F_CONNTRACK flag in the new NFQA_CFG_FLAGS attribute.

This change provides the required features to use nfnetlink_queue
as the user-space queueing infrastructure for the follow-up patch
that introduces user-space conntrack helpers.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/linux/netfilter.h                 |   10 ++
 include/linux/netfilter/nfnetlink_queue.h |    7 ++
 net/netfilter/core.c                      |    4 +
 net/netfilter/nf_conntrack_netlink.c      |  158 ++++++++++++++++++++++++++++-
 net/netfilter/nfnetlink_queue.c           |   62 +++++++++++
 5 files changed, 240 insertions(+), 1 deletion(-)
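
As a side note on the flag described above: the NFQA_CFG_FLAGS payload is a
32-bit value that the receiving side converts with ntohl(), so user-space is
expected to send it in network byte order. The following stand-alone sketch
shows that round trip. The enum mirrors the one added by this patch; the
helper names (cfg_flags_encode, cfg_flags_decode, queue_wants_conntrack) are
hypothetical and do not belong to any existing library.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Mirrors the enum this patch adds to nfnetlink_queue.h. */
enum nfqnl_flags {
	NFQNL_F_NONE		= 0,
	NFQNL_F_CONNTRACK	= (1 << 0),
};

/* Hypothetical helper: fill the attribute payload as a big-endian u32. */
void cfg_flags_encode(unsigned char payload[4], uint32_t flags)
{
	uint32_t be = htonl(flags);

	memcpy(payload, &be, sizeof(be));
}

/* Hypothetical helper: read the payload back into host byte order,
 * analogous to the ntohl(*flags) done when the config message is parsed. */
uint32_t cfg_flags_decode(const unsigned char payload[4])
{
	uint32_t be;

	memcpy(&be, payload, sizeof(be));
	return ntohl(be);
}

/* Same test the queue code applies before attaching conntrack data. */
int queue_wants_conntrack(uint32_t queue_flags)
{
	return (queue_flags & NFQNL_F_CONNTRACK) != 0;
}
```

A queue whose decoded flags include NFQNL_F_CONNTRACK gets the conntrack
attributes attached to each packet message; with NFQNL_F_NONE the messages
look the same as before this patch.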

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index ff9c84c..a08dcb6 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -383,6 +383,16 @@ nf_nat_decode_session(struct sk_buff *skb, struct flowi *fl, u_int8_t family)
 extern void (*ip_ct_attach)(struct sk_buff *, struct sk_buff *) __rcu;
 extern void nf_ct_attach(struct sk_buff *, struct sk_buff *);
 extern void (*nf_ct_destroy)(struct nf_conntrack *) __rcu;
+
+struct nf_conn;
+struct nlattr;
+
+struct nfq_ct_hook {
+	size_t (*build_size)(const struct nf_conn *ct);
+	int (*build)(struct sk_buff *skb, struct nf_conn *ct);
+	int (*parse)(const struct nlattr *attr, struct nf_conn *ct);
+};
+extern struct nfq_ct_hook *nfq_ct_hook;
 #else
 static inline void nf_ct_attach(struct sk_buff *new, struct sk_buff *skb) {}
 #endif
diff --git a/include/linux/netfilter/nfnetlink_queue.h b/include/linux/netfilter/nfnetlink_queue.h
index 24b32e6..da44b33 100644
--- a/include/linux/netfilter/nfnetlink_queue.h
+++ b/include/linux/netfilter/nfnetlink_queue.h
@@ -42,6 +42,8 @@ enum nfqnl_attr_type {
 	NFQA_IFINDEX_PHYSOUTDEV,	/* __u32 ifindex */
 	NFQA_HWADDR,			/* nfqnl_msg_packet_hw */
 	NFQA_PAYLOAD,			/* opaque data payload */
+	NFQA_CT,			/* nf_conntrack_netlink.h */
+	NFQA_CT_INFO,			/* enum ip_conntrack_info */
 
 	__NFQA_MAX
 };
@@ -78,12 +80,17 @@ struct nfqnl_msg_config_params {
 	__u8	copy_mode;	/* enum nfqnl_config_mode */
 } __attribute__ ((packed));
 
+enum nfqnl_flags {
+	NFQNL_F_NONE		= 0,
+	NFQNL_F_CONNTRACK	= (1 << 0),
+};
 
 enum nfqnl_attr_config {
 	NFQA_CFG_UNSPEC,
 	NFQA_CFG_CMD,			/* nfqnl_msg_config_cmd */
 	NFQA_CFG_PARAMS,		/* nfqnl_msg_config_params */
 	NFQA_CFG_QUEUE_MAXLEN,		/* __u32 */
+	NFQA_CFG_FLAGS,			/* __u32 */
 	__NFQA_CFG_MAX
 };
 #define NFQA_CFG_MAX (__NFQA_CFG_MAX-1)
diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index e19f365..7eef845 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -264,6 +264,10 @@ void nf_conntrack_destroy(struct nf_conntrack *nfct)
 	rcu_read_unlock();
 }
 EXPORT_SYMBOL(nf_conntrack_destroy);
+
+struct nfq_ct_hook *nfq_ct_hook;
+EXPORT_SYMBOL_GPL(nfq_ct_hook);
+
 #endif /* CONFIG_NF_CONNTRACK */
 
 #ifdef CONFIG_PROC_FS
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 30f5e12..28ac04c 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -1620,6 +1620,152 @@ ctnetlink_new_conntrack(struct sock *ctnl, struct sk_buff *skb,
 	return err;
 }
 
+#if defined(CONFIG_NETFILTER_NETLINK_QUEUE) ||	\
+    defined(CONFIG_NETFILTER_NETLINK_QUEUE_MODULE)
+static size_t
+ctnetlink_nfqueue_build_size(const struct nf_conn *ct)
+{
+	return 3 * nla_total_size(0) /* CTA_TUPLE_ORIG|REPL|MASTER */
+	       + 3 * nla_total_size(0) /* CTA_TUPLE_IP */
+	       + 3 * nla_total_size(0) /* CTA_TUPLE_PROTO */
+	       + 3 * nla_total_size(sizeof(u_int8_t)) /* CTA_PROTO_NUM */
+	       + nla_total_size(sizeof(u_int32_t)) /* CTA_ID */
+	       + nla_total_size(sizeof(u_int32_t)) /* CTA_STATUS */
+	       + nla_total_size(sizeof(u_int32_t)) /* CTA_TIMEOUT */
+	       + nla_total_size(0) /* CTA_PROTOINFO */
+	       + nla_total_size(0) /* CTA_HELP */
+	       + nla_total_size(NF_CT_HELPER_NAME_LEN) /* CTA_HELP_NAME */
+	       + ctnetlink_secctx_size(ct)
+#ifdef CONFIG_NF_NAT_NEEDED
+	       + 2 * nla_total_size(0) /* CTA_NAT_SEQ_ADJ_ORIG|REPL */
+	       + 6 * nla_total_size(sizeof(u_int32_t)) /* CTA_NAT_SEQ_OFFSET */
+#endif
+#ifdef CONFIG_NF_CONNTRACK_MARK
+	       + nla_total_size(sizeof(u_int32_t)) /* CTA_MARK */
+#endif
+	       + ctnetlink_proto_size(ct)
+	       ;
+}
+
+static int
+ctnetlink_nfqueue_build(struct sk_buff *skb, struct nf_conn *ct)
+{
+	struct nlattr *nest_parms;
+
+	rcu_read_lock();
+	nest_parms = nla_nest_start(skb, CTA_TUPLE_ORIG | NLA_F_NESTED);
+	if (!nest_parms)
+		goto nla_put_failure;
+	if (ctnetlink_dump_tuples(skb, nf_ct_tuple(ct, IP_CT_DIR_ORIGINAL)) < 0)
+		goto nla_put_failure;
+	nla_nest_end(skb, nest_parms);
+
+	nest_parms = nla_nest_start(skb, CTA_TUPLE_REPLY | NLA_F_NESTED);
+	if (!nest_parms)
+		goto nla_put_failure;
+	if (ctnetlink_dump_tuples(skb, nf_ct_tuple(ct, IP_CT_DIR_REPLY)) < 0)
+		goto nla_put_failure;
+	nla_nest_end(skb, nest_parms);
+
+	if (nf_ct_zone(ct)) {
+		if (nla_put_be16(skb, CTA_ZONE, htons(nf_ct_zone(ct))))
+			goto nla_put_failure;
+	}
+
+	if (ctnetlink_dump_id(skb, ct) < 0)
+		goto nla_put_failure;
+
+	if (ctnetlink_dump_status(skb, ct) < 0)
+		goto nla_put_failure;
+
+	if (ctnetlink_dump_timeout(skb, ct) < 0)
+		goto nla_put_failure;
+
+	if (ctnetlink_dump_protoinfo(skb, ct) < 0)
+		goto nla_put_failure;
+
+	if (ctnetlink_dump_helpinfo(skb, ct) < 0)
+		goto nla_put_failure;
+
+#ifdef CONFIG_NF_CONNTRACK_SECMARK
+	if (ct->secmark && ctnetlink_dump_secctx(skb, ct) < 0)
+		goto nla_put_failure;
+#endif
+	if (ct->master && ctnetlink_dump_master(skb, ct) < 0)
+		goto nla_put_failure;
+
+	if ((ct->status & IPS_SEQ_ADJUST) &&
+	    ctnetlink_dump_nat_seq_adj(skb, ct) < 0)
+		goto nla_put_failure;
+
+#ifdef CONFIG_NF_CONNTRACK_MARK
+	if (ct->mark && ctnetlink_dump_mark(skb, ct) < 0)
+		goto nla_put_failure;
+#endif
+	rcu_read_unlock();
+	return 0;
+
+nla_put_failure:
+	rcu_read_unlock();
+	return -ENOSPC;
+}
+
+static int
+ctnetlink_nfqueue_parse(const struct nlattr *attr, struct nf_conn *ct)
+{
+	const struct nlattr * const cda[CTA_MAX+1];
+	struct nf_conntrack_tuple otuple, rtuple;
+	u16 u3 = nf_ct_l3num(ct);
+	int err;
+
+	nla_parse_nested((struct nlattr **)cda, CTA_MAX, attr, ct_nla_policy);
+
+	if (cda[CTA_TUPLE_ORIG]) {
+		err = ctnetlink_parse_tuple(cda, &otuple, CTA_TUPLE_ORIG, u3);
+		if (err < 0)
+			return err;
+	}
+	if (cda[CTA_TUPLE_REPLY]) {
+		err = ctnetlink_parse_tuple(cda, &rtuple, CTA_TUPLE_REPLY, u3);
+		if (err < 0)
+			return err;
+	}
+	if (cda[CTA_TIMEOUT]) {
+		err = ctnetlink_change_timeout(ct, cda);
+		if (err < 0)
+			return err;
+	}
+	if (cda[CTA_STATUS]) {
+		err = ctnetlink_change_status(ct, cda);
+		if (err < 0)
+			return err;
+	}
+	if (cda[CTA_PROTOINFO]) {
+		err = ctnetlink_change_protoinfo(ct, cda);
+		if (err < 0)
+			return err;
+	}
+#if defined(CONFIG_NF_CONNTRACK_MARK)
+	if (cda[CTA_MARK])
+		ct->mark = ntohl(nla_get_be32(cda[CTA_MARK]));
+#endif
+#ifdef CONFIG_NF_NAT_NEEDED
+	if (cda[CTA_NAT_SEQ_ADJ_ORIG] || cda[CTA_NAT_SEQ_ADJ_REPLY]) {
+		err = ctnetlink_change_nat_seq_adj(ct, cda);
+		if (err < 0)
+			return err;
+	}
+#endif
+	return 0;
+}
+
+static struct nfq_ct_hook ctnetlink_nfqueue_hook = {
+	.build_size	= ctnetlink_nfqueue_build_size,
+	.build		= ctnetlink_nfqueue_build,
+	.parse		= ctnetlink_nfqueue_parse,
+};
+#endif /* CONFIG_NETFILTER_NETLINK_QUEUE */
+
 /***********************************************************************
  * EXPECT
  ***********************************************************************/
@@ -2424,7 +2570,12 @@ static int __init ctnetlink_init(void)
 		pr_err("ctnetlink_init: cannot register pernet operations\n");
 		goto err_unreg_exp_subsys;
 	}
-
+#if defined(CONFIG_NETFILTER_NETLINK_QUEUE) ||	\
+    defined(CONFIG_NETFILTER_NETLINK_QUEUE_MODULE)
+	/* setup interaction between nf_queue and nf_conntrack_netlink. */
+	RCU_INIT_POINTER(nfq_ct_hook, &ctnetlink_nfqueue_hook);
+	printk("registering nf_queue and ctnetlink interaction\n");
+#endif
 	return 0;
 
 err_unreg_exp_subsys:
@@ -2442,6 +2593,11 @@ static void __exit ctnetlink_exit(void)
 	unregister_pernet_subsys(&ctnetlink_net_ops);
 	nfnetlink_subsys_unregister(&ctnl_exp_subsys);
 	nfnetlink_subsys_unregister(&ctnl_subsys);
+#if defined(CONFIG_NETFILTER_NETLINK_QUEUE) ||	\
+    defined(CONFIG_NETFILTER_NETLINK_QUEUE_MODULE)
+	RCU_INIT_POINTER(nfq_ct_hook, NULL);
+	printk("unregistering nf_queue and ctnetlink interaction\n");
+#endif
 }
 
 module_init(ctnetlink_init);
diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c
index 8d6bcf3..b007779 100644
--- a/net/netfilter/nfnetlink_queue.c
+++ b/net/netfilter/nfnetlink_queue.c
@@ -30,6 +30,7 @@
 #include <linux/list.h>
 #include <net/sock.h>
 #include <net/netfilter/nf_queue.h>
+#include <net/netfilter/nf_conntrack.h>
 
 #include <linux/atomic.h>
 
@@ -44,6 +45,7 @@ struct nfqnl_instance {
 	struct rcu_head rcu;
 
 	int peer_pid;
+	unsigned int flags;
 	unsigned int queue_maxlen;
 	unsigned int copy_range;
 	unsigned int queue_dropped;
@@ -232,6 +234,9 @@ nfqnl_build_packet_message(struct nfqnl_instance *queue,
 	struct sk_buff *entskb = entry->skb;
 	struct net_device *indev;
 	struct net_device *outdev;
+	struct nfq_ct_hook *nfq_ct;
+	struct nf_conn *ct = NULL;
+	enum ip_conntrack_info ctinfo = 0; /* make gcc happy. */
 
 	size =    NLMSG_SPACE(sizeof(struct nfgenmsg))
 		+ nla_total_size(sizeof(struct nfqnl_msg_packet_hdr))
@@ -265,6 +270,17 @@ nfqnl_build_packet_message(struct nfqnl_instance *queue,
 		break;
 	}
 
+	/* rcu_read_lock()ed by __nf_queue already. */
+	nfq_ct = rcu_dereference(nfq_ct_hook);
+	if (nfq_ct != NULL && (queue->flags & NFQNL_F_CONNTRACK)) {
+		ct = nf_ct_get(entskb, &ctinfo);
+		if (ct) {
+			if (!nf_ct_is_untracked(ct))
+				size += nfq_ct->build_size(ct);
+			else
+				ct = NULL;
+		}
+	}
 
 	skb = alloc_skb(size, GFP_ATOMIC);
 	if (!skb)
@@ -388,6 +404,24 @@ nfqnl_build_packet_message(struct nfqnl_instance *queue,
 			BUG();
 	}
 
+	if (nfq_ct != NULL && (queue->flags & NFQNL_F_CONNTRACK) && ct) {
+		struct nlattr *nest_parms;
+		u_int32_t tmp;
+
+		nest_parms = nla_nest_start(skb, NFQA_CT | NLA_F_NESTED);
+		if (!nest_parms)
+			goto nla_put_failure;
+
+		if (nfq_ct->build(skb, ct) < 0)
+			goto nla_put_failure;
+
+		nla_nest_end(skb, nest_parms);
+
+		tmp = ctinfo;
+		if (nla_put_be32(skb, NFQA_CT_INFO, htonl(ctinfo)))
+			goto nla_put_failure;
+	}
+
 	nlh->nlmsg_len = skb->tail - old_tail;
 	return skb;
 
@@ -726,6 +760,7 @@ nfqnl_recv_verdict(struct sock *ctnl, struct sk_buff *skb,
 	struct nfqnl_instance *queue;
 	unsigned int verdict;
 	struct nf_queue_entry *entry;
+	struct nfq_ct_hook *nfq_ct;
 
 	queue = instance_lookup(queue_num);
 	if (!queue)
@@ -753,6 +788,19 @@ nfqnl_recv_verdict(struct sock *ctnl, struct sk_buff *skb,
 	if (nfqa[NFQA_MARK])
 		entry->skb->mark = ntohl(nla_get_be32(nfqa[NFQA_MARK]));
 
+	rcu_read_lock();
+	nfq_ct = rcu_dereference(nfq_ct_hook);
+	if (nfq_ct != NULL &&
+	    (queue->flags & NFQNL_F_CONNTRACK) && nfqa[NFQA_CT]) {
+		enum ip_conntrack_info ctinfo;
+		struct nf_conn *ct;
+
+		ct = nf_ct_get(entry->skb, &ctinfo);
+		if (ct && !nf_ct_is_untracked(ct))
+			nfq_ct->parse(nfqa[NFQA_CT], ct);
+	}
+	rcu_read_unlock();
+
 	nf_reinject(entry, verdict);
 	return 0;
 }
@@ -768,6 +816,7 @@ nfqnl_recv_unsupp(struct sock *ctnl, struct sk_buff *skb,
 static const struct nla_policy nfqa_cfg_policy[NFQA_CFG_MAX+1] = {
 	[NFQA_CFG_CMD]		= { .len = sizeof(struct nfqnl_msg_config_cmd) },
 	[NFQA_CFG_PARAMS]	= { .len = sizeof(struct nfqnl_msg_config_params) },
+	[NFQA_CFG_FLAGS]	= { .type = NLA_U32 }
 };
 
 static const struct nf_queue_handler nfqh = {
@@ -861,6 +910,19 @@ nfqnl_recv_config(struct sock *ctnl, struct sk_buff *skb,
 		spin_unlock_bh(&queue->lock);
 	}
 
+	if (nfqa[NFQA_CFG_FLAGS]) {
+		__be32 *flags;
+
+		if (!queue) {
+			ret = -ENODEV;
+			goto err_out_unlock;
+		}
+		flags = nla_data(nfqa[NFQA_CFG_FLAGS]);
+		spin_lock_bh(&queue->lock);
+		queue->flags = ntohl(*flags);
+		spin_unlock_bh(&queue->lock);
+	}
+
 err_out_unlock:
 	rcu_read_unlock();
 	return ret;
-- 
1.7.10

^ permalink raw reply related

* [BUG] vanilla 32bit 3.4.0, lockdep, l2tp_xmit_skb + sch_direct_xmit warning
From: Denys Fedoryshchenko @ 2012-06-04 12:37 UTC (permalink / raw)
  To: netdev, linux-kernel

CBSS_PPPoE ~ # ip l2tp show tunnel
Tunnel 2, encap IP
   From 194.146.153.XX to 194.146.153.YY
   Peer tunnel 2

CBSS_PPPoE ~ # ip l2tp show session
   Session 1 in tunnel 2
   Peer session 1, tunnel 2
   interface name: tun0
   offset 0, peer offset 0

CBSS_PPPoE ~ # ip link show dev tun0
303: tun0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1492 qdisc pfifo_fast state UNKNOWN mode DEFAULT qlen 1000
     link/ether 6e:25:18:ce:8e:3b brd ff:ff:ff:ff:ff:ff

CBSS_PPPoE ~ # ip addr show dev tun0
303: tun0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1492 qdisc pfifo_fast state UNKNOWN qlen 1000
     link/ether 6e:25:18:ce:8e:3b brd ff:ff:ff:ff:ff:ff
     inet 10.0.6.2/24 scope global tun0

The command that triggered the warning was: ping 10.0.6.1

         [  135.292915]
         [  135.293008] ======================================================
         [  135.293115] [ INFO: possible circular locking dependency detected ]
         [  135.293221] 3.4.0-build-0061 #12 Not tainted
         [  135.293316] -------------------------------------------------------
         [  135.293420] ping/6404 is trying to acquire lock:
         [  135.293517]  (slock-AF_INET){+.-...}, at: [<f88c83ec>] l2tp_xmit_skb+0x173/0x47e [l2tp_core]
         [  135.293780]
         [  135.293780] but task is already holding lock:
         [  135.293970]  (_xmit_ETHER#2){+.-...}, at: [<c02f09b9>] sch_direct_xmit+0x36/0x119
         [  135.294251]
         [  135.294252] which lock already depends on the new lock.
         [  135.294252]
         [  135.294532]
         [  135.294533] the existing dependency chain (in reverse order) is:
         [  135.294727]
         [  135.294728] -> #1 (_xmit_ETHER#2){+.-...}:
         [  135.295018]        [<c015a6d1>] lock_acquire+0x71/0x85
         [  135.295140]        [<c034ddad>] _raw_spin_lock+0x33/0x40
         [  135.295262]        [<c02e79f2>] neigh_update+0x1d9/0x385
         [  135.295384]        [<c031fff7>] arp_process+0x477/0x491
         [  135.295507]        [<c031f887>] NF_HOOK.clone.19+0x45/0x4c
         [  135.295628]        [<c031fb2c>] arp_rcv+0xb1/0xc3
         [  135.295748]        [<c02deca7>] __netif_receive_skb+0x329/0x378
         [  135.295873]        [<c02dee74>] netif_receive_skb+0x4e/0x7d
         [  135.295997]        [<c02def60>] napi_skb_finish+0x1e/0x34
         [  135.296121]        [<c02df389>] napi_gro_receive+0x20/0x24
         [  135.296245]        [<f8527213>] rtl8169_poll+0x2e6/0x52c [r8169]
         [  135.296374]        [<c02df48f>] net_rx_action+0x90/0x15d
         [  135.296496]        [<c012b42d>] __do_softirq+0x7b/0x118
         [  135.296620]
         [  135.296620] -> #0 (slock-AF_INET){+.-...}:
         [  135.296889]        [<c015a08b>] __lock_acquire+0x9a3/0xc27
         [  135.297010]        [<c015a6d1>] lock_acquire+0x71/0x85
         [  135.297130]        [<c034ddad>] _raw_spin_lock+0x33/0x40
         [  135.297251]        [<f88c83ec>] l2tp_xmit_skb+0x173/0x47e [l2tp_core]
         [  135.297376]        [<f86b11fb>] l2tp_eth_dev_xmit+0x1a/0x2f [l2tp_eth]
         [  135.297500]        [<c02e0573>] dev_hard_start_xmit+0x333/0x3f2
         [  135.297623]        [<c02f09d8>] sch_direct_xmit+0x55/0x119
         [  135.297745]        [<c02e08b4>] dev_queue_xmit+0x282/0x418
         [  135.297868]        [<c031f887>] NF_HOOK.clone.19+0x45/0x4c
         [  135.297992]        [<c031f8b0>] arp_xmit+0x22/0x24
         [  135.298113]        [<c031f8f3>] arp_send+0x41/0x48
         [  135.300267]        [<c031fa65>] arp_solicit+0x16b/0x181
         [  135.300388]        [<c02e6852>] neigh_probe+0x3c/0x52
         [  135.300509]        [<c02e6e46>] __neigh_event_send+0x1a3/0x1bc
         [  135.300630]        [<c02e8221>] neigh_resolve_output+0x59/0x149
         [  135.300750]        [<c03039e0>] ip_finish_output2+0x1e1/0x21c
         [  135.300871]        [<c0303a50>] ip_finish_output+0x35/0x39
         [  135.300989]        [<c03048c7>] ip_output+0x87/0x8c
         [  135.301110]        [<c03030c6>] dst_output+0x15/0x18
         [  135.301232]        [<c03042d7>] ip_local_out+0x17/0x1a
         [  135.301355]        [<c0304f59>] ip_send_skb+0x12/0x5c
         [  135.301478]        [<c0304fcd>] ip_push_pending_frames+0x2a/0x2e
         [  135.301603]        [<c031b98d>] raw_sendmsg+0x67a/0x749
         [  135.301726]        [<c032445f>] inet_sendmsg+0x53/0x5a
         [  135.301850]        [<c02d0162>] sock_sendmsg+0xaa/0xc2
         [  135.301974]        [<c02d0387>] __sys_sendmsg+0x182/0x20c
         [  135.302098]        [<c02d1518>] sys_sendmsg+0x36/0x4d
         [  135.302219]        [<c02d1a66>] sys_socketcall+0x214/0x27e
         [  135.302344]        [<c034e511>] syscall_call+0x7/0xb
         [  135.302467]
         [  135.302468] other info that might help us debug this:
         [  135.302468]
         [  135.302739]  Possible unsafe locking scenario:
         [  135.302739]
         [  135.302928]        CPU0                    CPU1
         [  135.303022]        ----                    ----
         [  135.303118]   lock(_xmit_ETHER#2);
         [  135.303274]                                lock(slock-AF_INET);
         [  135.303417]                                lock(_xmit_ETHER#2);
         [  135.303582]   lock(slock-AF_INET);
         [  135.303719]
         [  135.303719]  *** DEADLOCK ***
         [  135.303719]
         [  135.303990] 4 locks held by ping/6404:
         [  135.304087]  #0:  (sk_lock-AF_INET){+.+.+.}, at: [<c031b928>] raw_sendmsg+0x615/0x749
         [  135.304361]  #1:  (rcu_read_lock){.+.+..}, at: [<c0302fad>] rcu_read_lock+0x0/0x35
         [  135.304637]  #2:  (rcu_read_lock_bh){.+....}, at: [<c02dbf9c>] rcu_lock_acquire+0x0/0x30
         [  135.304913]  #3:  (_xmit_ETHER#2){+.-...}, at: [<c02f09b9>] sch_direct_xmit+0x36/0x119
         [  135.305209]
         [  135.305209] stack backtrace:
         [  135.305391] Pid: 6404, comm: ping Not tainted 3.4.0-build-0061 #12
         [  135.305492] Call Trace:
         [  135.305589]  [<c034c156>] ? printk+0x18/0x1a
         [  135.305689]  [<c0158a74>] print_circular_bug+0x1ac/0x1b6
         [  135.305790]  [<c015a08b>] __lock_acquire+0x9a3/0xc27
         [  135.305890]  [<c0159500>] ? valid_state+0x1d4/0x201
         [  135.305989]  [<c019d0d6>] ? __slab_alloc.clone.59.clone.64+0xc4/0x2de
         [  135.306097]  [<c015a6d1>] lock_acquire+0x71/0x85
         [  135.306197]  [<f88c83ec>] ? l2tp_xmit_skb+0x173/0x47e [l2tp_core]
         [  135.306301]  [<c034ddad>] _raw_spin_lock+0x33/0x40
         [  135.306401]  [<f88c83ec>] ? l2tp_xmit_skb+0x173/0x47e [l2tp_core]
         [  135.306506]  [<f88c83ec>] l2tp_xmit_skb+0x173/0x47e [l2tp_core]
         [  135.306609]  [<c014f946>] ? timekeeping_get_ns+0xf/0x46
         [  135.306710]  [<f86b11fb>] l2tp_eth_dev_xmit+0x1a/0x2f [l2tp_eth]
         [  135.306815]  [<c02e0573>] dev_hard_start_xmit+0x333/0x3f2
         [  135.306919]  [<c02f09d8>] sch_direct_xmit+0x55/0x119
         [  135.307016]  [<c02e08b4>] dev_queue_xmit+0x282/0x418
         [  135.307112]  [<c02e0632>] ? dev_hard_start_xmit+0x3f2/0x3f2
         [  135.307220]  [<c031f887>] NF_HOOK.clone.19+0x45/0x4c
         [  135.307322]  [<c031f8b0>] arp_xmit+0x22/0x24
         [  135.307428]  [<c02e0632>] ? dev_hard_start_xmit+0x3f2/0x3f2
         [  135.307529]  [<c031f8f3>] arp_send+0x41/0x48
         [  135.307625]  [<c031fa65>] arp_solicit+0x16b/0x181
         [  135.307721]  [<c02e6852>] neigh_probe+0x3c/0x52
         [  135.307821]  [<c02e6e46>] __neigh_event_send+0x1a3/0x1bc
         [  135.307923]  [<c02e8221>] neigh_resolve_output+0x59/0x149
         [  135.308025]  [<c0302fe0>] ? rcu_read_lock+0x33/0x35
         [  135.308125]  [<c03039e0>] ip_finish_output2+0x1e1/0x21c
         [  135.308225]  [<c02fcce6>] ? ipv4_mtu+0x36/0x65
         [  135.308326]  [<c0303a50>] ip_finish_output+0x35/0x39
         [  135.308426]  [<c03048c7>] ip_output+0x87/0x8c
         [  135.308523]  [<c0303a1b>] ? ip_finish_output2+0x21c/0x21c
         [  135.308624]  [<c03030c6>] dst_output+0x15/0x18
         [  135.308721]  [<c03042d7>] ip_local_out+0x17/0x1a
         [  135.308821]  [<c0304f59>] ip_send_skb+0x12/0x5c
         [  135.308922]  [<c0304fcd>] ip_push_pending_frames+0x2a/0x2e
         [  135.309022]  [<c031b98d>] raw_sendmsg+0x67a/0x749
         [  135.309119]  [<c032445f>] inet_sendmsg+0x53/0x5a
         [  135.309219]  [<c02d0162>] sock_sendmsg+0xaa/0xc2
         [  135.309318]  [<c015a397>] ? lock_release_non_nested+0x88/0x20b
         [  135.309420]  [<c01898a0>] ? might_fault+0x2d/0x79
         [  135.309520]  [<c01898e6>] ? might_fault+0x73/0x79
         [  135.309620]  [<c02d8400>] ? copy_from_user+0x8/0xa
         [  135.309717]  [<c02d8729>] ? verify_iovec+0x3e/0x75
         [  135.309817]  [<c02d0387>] __sys_sendmsg+0x182/0x20c
         [  135.309919]  [<c0159553>] ? mark_lock+0x26/0x1bb
         [  135.310019]  [<c0159bb6>] ? __lock_acquire+0x4ce/0xc27
         [  135.310120]  [<c018b700>] ? handle_pte_fault+0x284/0x93d
         [  135.310221]  [<c015a397>] ? lock_release_non_nested+0x88/0x20b
         [  135.310328]  [<c0188dee>] ? page_address+0x8a/0x9f
         [  135.310430]  [<c01898a0>] ? might_fault+0x2d/0x79
         [  135.310531]  [<c01a42e2>] ? fget_light+0x2b/0x7c
         [  135.310631]  [<c02d1518>] sys_sendmsg+0x36/0x4d
         [  135.310725]  [<c02d1a66>] sys_socketcall+0x214/0x27e
         [  135.310827]  [<c034e544>] ? restore_all+0xf/0xf
         [  135.310929]  [<c011e817>] ? vmalloc_sync_all+0x5/0x5
         [  135.311032]  [<c022d2ec>] ? trace_hardirqs_on_thunk+0xc/0x10
         [  135.311135]  [<c034e511>] syscall_call+0x7/0xb
---
Denys Fedoryshchenko, Network Engineer, Virtual ISP S.A.L.

^ permalink raw reply

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
From: Michael S. Tsirkin @ 2012-06-04 12:37 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Willy Tarreau, David Miller, netdev
In-Reply-To: <1337276056.3403.37.camel@edumazet-glaptop>

On Thu, May 17, 2012 at 07:34:16PM +0200, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Please note I havent tested yet this patch, lacking hardware for this.
> 
> (tg3/bnx2/bnx2x use build_skb, r8169 does a copy of incoming frames,
> ixgbe uses fragments...)

virtio-net uses netdev_alloc_skb but maybe it should call
build_skb instead?

Also, it's not uncommon for drivers to copy short packets out to be able
to reuse pages.  virtio does this but I am guessing the logic is not
really virtio specific.

We could do
	if (len < GOOD_COPY_LEN)
		netdev_alloc_skb
		memmov
	else
		build_skb

but maybe it makes sense to put this logic in build_skb?
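
To make the copy-break idea above concrete, here is a minimal user-space
sketch. It is not kernel code: struct rx_frame, rx_frame_init() and the
COPY_BREAK_LEN threshold are hypothetical names standing in for the skb,
the driver RX path and GOOD_COPY_LEN. Short frames are copied so the
receive buffer can be reused at once; long frames take over the buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define COPY_BREAK_LEN 128	/* hypothetical threshold, stands in for GOOD_COPY_LEN */

struct rx_frame {		/* hypothetical stand-in for an skb */
	unsigned char *data;	/* frame bytes */
	size_t len;
	bool copied;		/* true: private copy, false: data is the RX buffer itself */
};

/*
 * Short frame: copy it out, the caller may recycle rx_buf immediately.
 * Long frame: the frame takes ownership of rx_buf, no copy is made.
 * Returns 0 on success, -1 if the copy could not be allocated.
 */
static int rx_frame_init(struct rx_frame *f, unsigned char *rx_buf, size_t len)
{
	assert(f != NULL && rx_buf != NULL);

	if (len < COPY_BREAK_LEN) {
		f->data = malloc(len ? len : 1);
		if (!f->data)
			return -1;
		memcpy(f->data, rx_buf, len);
		f->copied = true;
	} else {
		f->data = rx_buf;
		f->copied = false;
	}
	f->len = len;
	return 0;
}
```

A caller would check f.copied afterwards to decide whether the receive
buffer can go straight back to the ring.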

-- 
MST

^ permalink raw reply

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
From: Michael S. Tsirkin @ 2012-06-04 12:39 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Eric Dumazet, David Miller, netdev
In-Reply-To: <20120517174551.GN14498@1wt.eu>

On Thu, May 17, 2012 at 07:45:51PM +0200, Willy Tarreau wrote:
> Impressed !
> 
> For the first time I could proxy HTTP traffic at gigabit speed on this
> little box powered by USB ! I've long believed that proper splicing
> would make this possible and now I'm seeing it is. Congrats Eric !

which userspace do you use?
anything I can try?

^ permalink raw reply

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
From: Willy Tarreau @ 2012-06-04 12:44 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Eric Dumazet, David Miller, netdev
In-Reply-To: <20120604123912.GB28992@redhat.com>

On Mon, Jun 04, 2012 at 03:39:12PM +0300, Michael S. Tsirkin wrote:
> On Thu, May 17, 2012 at 07:45:51PM +0200, Willy Tarreau wrote:
> > Impressed !
> > 
> > For the first time I could proxy HTTP traffic at gigabit speed on this
> > little box powered by USB ! I've long believed that proper splicing
> > would make this possible and now I'm seeing it is. Congrats Eric !
> 
> which userspace do you use?

It's haproxy-1.5-dev with splicing enabled.

> anything I can try?

Yes, feel free to download -dev11, build it for kernels >= 2.6.28 and
make a small config to relay TCP/HTTP to another host. Of course you
need a gigabit-capable client and server.

Willy

^ permalink raw reply

* Re: [PATCH 3/7] netfilter: nf_ct_helper: implement variable length helper private data
From: Jan Engelhardt @ 2012-06-04 13:06 UTC (permalink / raw)
  To: pablo; +Cc: netfilter-devel, netdev
In-Reply-To: <1338812485-4232-4-git-send-email-pablo@netfilter.org>

On Monday 2012-06-04 14:21, pablo@netfilter.org wrote:

>+static inline void *nfct_help_data(const struct nf_conn *ct)
>+{
>+	struct nf_conn_help *help;
>+
>+	help = nf_ct_ext_find(ct, NF_CT_EXT_HELPER);
>+
>+	return (void *)&help->data;
>+}

I think you wanted

	return help->data;

here. Remember that help->data is of type char[0] which is
convertible to char* - which is what you want,
while adding an extra & would turn that into the undesired (char[0])*.
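
A small user-space sketch of the difference, under the assumption of a
fixed-size array so that it stays within ISO C (struct conn_help and
HELP_DATA_LEN are hypothetical, not the kernel definitions). Both
expressions yield the same address, but the pointer types differ, which
shows up as soon as pointer arithmetic is applied.

```c
#include <assert.h>
#include <stddef.h>

#define HELP_DATA_LEN 16	/* hypothetical size, the patch uses char data[0] */

struct conn_help {		/* hypothetical stand-in for struct nf_conn_help */
	unsigned char expecting[4];
	char data[HELP_DATA_LEN];
};

/* help->data decays to char *, which converts to void * without a cast. */
static void *help_data(struct conn_help *help)
{
	assert(help != NULL);
	return help->data;
}

/* Stepping a char * advances by one byte. */
static size_t step_of_decayed(struct conn_help *help)
{
	char *p = help->data;

	return (size_t)((p + 1) - p);
}

/* Stepping a pointer-to-array advances by the whole array. */
static size_t step_of_array_pointer(struct conn_help *help)
{
	char (*p)[HELP_DATA_LEN] = &help->data;

	return (size_t)((char *)(p + 1) - (char *)p);
}
```

Since the result is returned as void * either way, the extra & is
harmless at run time; dropping it simply keeps the expression at the
type one actually means.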



>@@ -89,12 +59,13 @@ struct nf_conn_help {
> 	/* Helper. if any */
> 	struct nf_conntrack_helper __rcu *helper;
> 
>-	union nf_conntrack_help help;
>-
> 	struct hlist_head expectations;
> 
> 	/* Current number of expected connections */
> 	u8 expecting[NF_CT_MAX_EXPECT_CLASSES];
>+
>+	/* private helper information. */
>+	char data[0];

There is a now-standardized notation:

	char data[];
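
For reference, a minimal user-space sketch of how a C99 flexible array
member is typically allocated. struct conn_help and conn_help_alloc()
are hypothetical and only mirror the shape of the patch; calloc() stands
in for whatever allocator the extension code really uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct conn_help {		/* hypothetical stand-in for struct nf_conn_help */
	unsigned char expecting[4];
	char data[];		/* C99 flexible array member, must be the last field */
};

/* Allocate the fixed part plus data_len bytes of zeroed private data. */
static struct conn_help *conn_help_alloc(size_t data_len)
{
	struct conn_help *help;

	/* sizeof(*help) does not count data[], so the length is added explicitly. */
	assert(sizeof(*help) >= offsetof(struct conn_help, data));
	help = calloc(1, sizeof(*help) + data_len);
	return help;
}

/* Offset at which the private area starts. */
static size_t conn_help_header_size(void)
{
	return offsetof(struct conn_help, data);
}
```

The caller frees the whole object with a single free(); the private
area has no separate lifetime.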



>@@ -218,13 +221,13 @@ int __nf_ct_try_assign_helper(struct nf_conn *ct, struct nf_conn *tmpl,
> 	}
> 
> 	if (help == NULL) {
>-		help = nf_ct_helper_ext_add(ct, flags);
>+		help = nf_ct_helper_ext_add(ct, helper, flags);
> 		if (help == NULL) {
> 			ret = -ENOMEM;
> 			goto out;
> 		}
> 	} else {
>-		memset(&help->help, 0, sizeof(help->help));
>+		memset(&help->data, 0, sizeof(helper->data_len));
> 	}

memset(help->data, 0, sizeof(helper->data_len));

>index 6f4b00a..30f5e12 100644
>--- a/net/netfilter/nf_conntrack_netlink.c
>+++ b/net/netfilter/nf_conntrack_netlink.c
>@@ -1218,7 +1218,7 @@ ctnetlink_change_helper(struct nf_conn *ct, const struct nlattr * const cda[])
> 		if (help->helper)
> 			return -EBUSY;
> 		/* need to zero data of old helper */
>-		memset(&help->help, 0, sizeof(help->help));
>+		memset(&help->data, 0, help->helper->data_len);

Here too.. memset(help->data,...

^ permalink raw reply

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
From: Eric Dumazet @ 2012-06-04 13:06 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Willy Tarreau, David Miller, netdev
In-Reply-To: <20120604123738.GA28992@redhat.com>

On Mon, 2012-06-04 at 15:37 +0300, Michael S. Tsirkin wrote:
> On Thu, May 17, 2012 at 07:34:16PM +0200, Eric Dumazet wrote:
> > From: Eric Dumazet <edumazet@google.com>
> > 
> > Please note I havent tested yet this patch, lacking hardware for this.
> > 
> > (tg3/bnx2/bnx2x use build_skb, r8169 does a copy of incoming frames,
> > ixgbe uses fragments...)
> 
> virtio-net uses netdev_alloc_skb but maybe it should call
> build_skb instead?
> 
> Also, it's not uncommon for drivers to copy short packets out to be able
> to reuse pages.  virtio does this but I am guessing the logic is not
> really virtio specific.
> 
> We could do
> 	if (len < GOOD_COPY_LEN)
> 		netdev_alloc_skb
> 		memmov
> 	else
> 		build_skb
> 
> but maybe it makes sense to put this logic in build_skb?
> 
> 

I am not sure I understand the question.

If virtio-net uses netdev_alloc_skb(), all is good, you have nothing to
change.

build_skb() is for drivers that allocate the memory to hold frame, and
wait for NIC completion before allocating/populating the skb itself.

^ permalink raw reply

* Re: [PATCH 3/7] netfilter: nf_ct_helper: implement variable length helper private data
From: Joe Perches @ 2012-06-04 13:09 UTC (permalink / raw)
  To: Jan Engelhardt; +Cc: pablo, netfilter-devel, netdev
In-Reply-To: <alpine.LNX.2.01.1206041456480.16684@frira.zrqbmnf.qr>

On Mon, 2012-06-04 at 15:06 +0200, Jan Engelhardt wrote:
> On Monday 2012-06-04 14:21, pablo@netfilter.org wrote:

> >@@ -218,13 +221,13 @@ int __nf_ct_try_assign_helper(struct nf_conn *ct, struct nf_conn *tmpl,
> > 	}
> > 
> > 	if (help == NULL) {
> >-		help = nf_ct_helper_ext_add(ct, flags);
> >+		help = nf_ct_helper_ext_add(ct, helper, flags);
> > 		if (help == NULL) {
> > 			ret = -ENOMEM;
> > 			goto out;
> > 		}
> > 	} else {
> >-		memset(&help->help, 0, sizeof(help->help));
> >+		memset(&help->data, 0, sizeof(helper->data_len));
> > 	}
> 
> memset(help->data, 0, sizeof(helper->data_len));

	memset(help->data, 0, helper->data_len);
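
A quick user-space illustration of why the sizeof() has to go, assuming
data_len is an unsigned int (struct helper here is a hypothetical
stand-in, not the kernel structure): sizeof(helper->data_len) is the
size of the length field itself, not the length it holds.

```c
#include <assert.h>
#include <string.h>

struct helper {			/* hypothetical stand-in for struct nf_conntrack_helper */
	unsigned int data_len;	/* assumed type of the length field */
};

/* Clears only sizeof(unsigned int) bytes, whatever data_len says. */
static void clear_with_sizeof(char *data, const struct helper *helper)
{
	assert(data != NULL && helper != NULL);
	memset(data, 0, sizeof(helper->data_len));
}

/* Clears the whole private area. */
static void clear_with_len(char *data, const struct helper *helper)
{
	assert(data != NULL && helper != NULL);
	memset(data, 0, helper->data_len);
}
```

With a 32-byte private area, the first variant leaves everything past
the first few bytes untouched.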



^ permalink raw reply

* Re: [PATCH 3/7] netfilter: nf_ct_helper: implement variable length helper private data
From: Jan Engelhardt @ 2012-06-04 13:16 UTC (permalink / raw)
  To: Joe Perches; +Cc: pablo, netfilter-devel, netdev
In-Reply-To: <1338815399.8574.10.camel@joe2Laptop>

On Monday 2012-06-04 15:09, Joe Perches wrote:

>On Mon, 2012-06-04 at 15:06 +0200, Jan Engelhardt wrote:
>> On Monday 2012-06-04 14:21, pablo@netfilter.org wrote:
>
>> >@@ -218,13 +221,13 @@ int __nf_ct_try_assign_helper(struct nf_conn *ct, struct nf_conn *tmpl,
>> > 	}
>> > 
>> > 	if (help == NULL) {
>> >-		help = nf_ct_helper_ext_add(ct, flags);
>> >+		help = nf_ct_helper_ext_add(ct, helper, flags);
>> > 		if (help == NULL) {
>> > 			ret = -ENOMEM;
>> > 			goto out;
>> > 		}
>> > 	} else {
>> >-		memset(&help->help, 0, sizeof(help->help));
>> >+		memset(&help->data, 0, sizeof(helper->data_len));
>> > 	}
>> 
>> memset(help->data, 0, sizeof(helper->data_len));
>
>	memset(help->data, 0, helper->data_len);

I knew this looked suspect. With so many "sizeof"s, this spot was 
starting to look like a "mine is bigger" competition.

^ permalink raw reply

* [PATCH RFC] c_can_pci: generic module for c_can on PCI
From: Federico Vaga @ 2012-06-04 13:32 UTC (permalink / raw)
  To: Wolfgang Grandegger, Marc Kleine-Budde
  Cc: Federico Vaga, Giancarlo Asnaghi, Alan Cox, Alessandro Rubini,
	linux-can, netdev, linux-kernel
In-Reply-To: <1338816766-7089-1-git-send-email-federico.vaga@gmail.com>

Signed-off-by: Federico Vaga <federico.vaga@gmail.com>
Acked-by: Giancarlo Asnaghi <giancarlo.asnaghi@st.com>
Cc: Alan Cox <alan@linux.intel.com>
---
 drivers/net/can/c_can/Kconfig     |   11 +-
 drivers/net/can/c_can/Makefile    |    1 +
 drivers/net/can/c_can/c_can_pci.c |  221 +++++++++++++++++++++++++++++++++++++
 3 files changed, 230 insertions(+), 3 deletions(-)
 create mode 100644 drivers/net/can/c_can/c_can_pci.c

diff --git a/drivers/net/can/c_can/Kconfig b/drivers/net/can/c_can/Kconfig
index ffb9773..74ef97d 100644
--- a/drivers/net/can/c_can/Kconfig
+++ b/drivers/net/can/c_can/Kconfig
@@ -2,14 +2,19 @@ menuconfig CAN_C_CAN
 	tristate "Bosch C_CAN devices"
 	depends on CAN_DEV && HAS_IOMEM
 
-if CAN_C_CAN
-
 config CAN_C_CAN_PLATFORM
 	tristate "Generic Platform Bus based C_CAN driver"
+	depends on CAN_C_CAN
 	---help---
 	  This driver adds support for the C_CAN chips connected to
 	  the "platform bus" (Linux abstraction for directly to the
 	  processor attached devices) which can be found on various
 	  boards from ST Microelectronics (http://www.st.com)
 	  like the SPEAr1310 and SPEAr320 evaluation boards.
-endif
+
+config CAN_C_CAN_PCI
+	tristate "Generic PCI Bus based C_CAN driver"
+	depends on CAN_C_CAN
+	---help---
+	  This driver adds support for the C_CAN chips connected to
+	  the PCI bus.
diff --git a/drivers/net/can/c_can/Makefile b/drivers/net/can/c_can/Makefile
index 9273f6d..ad1cc84 100644
--- a/drivers/net/can/c_can/Makefile
+++ b/drivers/net/can/c_can/Makefile
@@ -4,5 +4,6 @@
 
 obj-$(CONFIG_CAN_C_CAN) += c_can.o
 obj-$(CONFIG_CAN_C_CAN_PLATFORM) += c_can_platform.o
+obj-$(CONFIG_CAN_C_CAN_PCI) += c_can_pci.o
 
 ccflags-$(CONFIG_CAN_DEBUG_DEVICES) := -DDEBUG
diff --git a/drivers/net/can/c_can/c_can_pci.c b/drivers/net/can/c_can/c_can_pci.c
new file mode 100644
index 0000000..b635375
--- /dev/null
+++ b/drivers/net/can/c_can/c_can_pci.c
@@ -0,0 +1,221 @@
+/*
+ * Platform CAN bus driver for Bosch C_CAN controller
+ *
+ * Copyright (C) 2012 Federico Vaga <federico.vaga@gmail.com>
+  *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/netdevice.h>
+#include <linux/clk.h>
+#include <linux/pci.h>
+#include <linux/can/dev.h>
+
+#include "c_can.h"
+
+enum c_can_pci_reg_align {
+	C_CAN_REG_ALIGN_16,
+	C_CAN_REG_ALIGN_32,
+};
+
+struct c_can_pci_data {
+	unsigned int reg_align;	/* Set the register alignment in the memory */
+	unsigned int freq;	/* Set the frequency if clk is not usable */
+};
+
+/*
+ * 16-bit c_can registers can be arranged differently in the memory
+ * architecture of different implementations. For example: 16-bit
+ * registers can be aligned to a 16-bit boundary or 32-bit boundary etc.
+ * Handle the same by providing a common read/write interface.
+ */
+static u16 c_can_pci_read_reg_aligned_to_16bit(struct c_can_priv *priv,
+						void *reg)
+{
+	return readw(reg);
+}
+
+static void c_can_pci_write_reg_aligned_to_16bit(struct c_can_priv *priv,
+						void *reg, u16 val)
+{
+	writew(val, reg);
+}
+
+static u16 c_can_pci_read_reg_aligned_to_32bit(struct c_can_priv *priv,
+						void *reg)
+{
+	return readw(reg + (long)reg - (long)priv->regs);
+}
+
+static void c_can_pci_write_reg_aligned_to_32bit(struct c_can_priv *priv,
+						void *reg, u16 val)
+{
+	writew(val, reg + (long)reg - (long)priv->regs);
+}
+
+static int __devinit c_can_pci_probe(struct pci_dev *pdev,
+				     const struct pci_device_id *ent)
+{
+	struct c_can_pci_data *c_can_pci_data = (void *)ent->driver_data;
+	struct c_can_priv *priv;
+	struct net_device *dev;
+	void __iomem *addr;
+	struct clk *clk;
+	int ret;
+
+	ret = pci_enable_device(pdev);
+	if (ret) {
+		dev_err(&pdev->dev, "pci_enable_device FAILED\n");
+		goto out;
+	}
+
+	ret = pci_request_regions(pdev, KBUILD_MODNAME);
+	if (ret) {
+		dev_err(&pdev->dev, "pci_request_regions FAILED\n");
+		goto out_disable_device;
+	}
+
+	pci_set_master(pdev);
+	pci_enable_msi(pdev);
+
+	addr = pci_iomap(pdev, 0, pci_resource_len(pdev, 0));
+	if (!addr) {
+		dev_err(&pdev->dev,
+			"device has no PCI memory resources, "
+			"failing adapter\n");
+		ret = -ENOMEM;
+		goto out_release_regions;
+	}
+
+	/* allocate the c_can device */
+	dev = alloc_c_can_dev();
+	if (!dev) {
+		ret = -ENOMEM;
+		goto out_iounmap;
+	}
+
+	priv = netdev_priv(dev);
+	pci_set_drvdata(pdev, dev);
+	SET_NETDEV_DEV(dev, &pdev->dev);
+
+	dev->irq = pdev->irq;
+	priv->regs = addr;
+
+	if (!c_can_pci_data->freq) {
+		/* get the appropriate clk */
+		clk = clk_get(&pdev->dev, NULL);
+		if (IS_ERR(clk)) {
+			dev_err(&pdev->dev, "no clock defined\n");
+			ret = -ENODEV;
+			goto out_free_c_can;
+		}
+		priv->can.clock.freq = clk_get_rate(clk);
+		priv->priv = clk;
+	} else {
+		priv->can.clock.freq = c_can_pci_data->freq;
+		priv->priv = NULL;
+	}
+
+	switch (c_can_pci_data->reg_align) {
+	case C_CAN_REG_ALIGN_32:
+		priv->read_reg = c_can_pci_read_reg_aligned_to_32bit;
+		priv->write_reg = c_can_pci_write_reg_aligned_to_32bit;
+		break;
+	case C_CAN_REG_ALIGN_16:
+	default:
+		priv->read_reg = c_can_pci_read_reg_aligned_to_16bit;
+		priv->write_reg = c_can_pci_write_reg_aligned_to_16bit;
+		break;
+	}
+
+	ret = register_c_can_dev(dev);
+	if (ret) {
+		dev_err(&pdev->dev, "registering %s failed (err=%d)\n",
+			KBUILD_MODNAME, ret);
+		goto out_free_clock;
+	}
+
+	dev_info(&pdev->dev, "%s device registered (regs=%p, irq=%d)\n",
+		 KBUILD_MODNAME, priv->regs, dev->irq);
+
+	return 0;
+
+out_free_clock:
+	if (!priv->priv)
+		clk_put(priv->priv);
+out_free_c_can:
+	pci_set_drvdata(pdev, NULL);
+	free_c_can_dev(dev);
+out_iounmap:
+	pci_iounmap(pdev, addr);
+out_release_regions:
+	pci_disable_msi(pdev);
+	pci_clear_master(pdev);
+	pci_release_regions(pdev);
+out_disable_device:
+	/*
+	 * do not call pci_disable_device on sta2x11 because it
+	 * break all other Bus masters on this EP
+	 */
+	if(pdev->vendor == PCI_VENDOR_ID_STMICRO &&
+	   pdev->device == PCI_DEVICE_ID_STMICRO_CAN)
+		goto out;
+	pci_disable_device(pdev);
+out:
+	return ret;
+}
+
+static void __devexit c_can_pci_remove(struct pci_dev *pdev)
+{
+	struct net_device *dev = pci_get_drvdata(pdev);
+	struct c_can_priv *priv = netdev_priv(dev);
+
+	pci_set_drvdata(pdev, NULL);
+	free_c_can_dev(dev);
+	if (!priv->priv)
+		clk_put(priv->priv);
+	pci_iounmap(pdev, priv->regs);
+	pci_disable_msi(pdev);
+	pci_clear_master(pdev);
+	pci_release_regions(pdev);
+	/*
+	 * do not call pci_disable_device on sta2x11 because it
+	 * break all other Bus masters on this EP
+	 */
+	if(pdev->vendor == PCI_VENDOR_ID_STMICRO &&
+	   pdev->device == PCI_DEVICE_ID_STMICRO_CAN)
+		return;
+	pci_disable_device(pdev);
+}
+
+static struct c_can_pci_data c_can_sta2x11= {
+	.reg_align = C_CAN_REG_ALIGN_32,
+	.freq = 52000000, /* 52 Mhz */
+};
+
+#define C_CAN_ID(_vend, _dev, _driverdata) {		\
+	PCI_DEVICE(_vend, _dev),			\
+	.driver_data = (unsigned long)&_driverdata,	\
+}
+DEFINE_PCI_DEVICE_TABLE(c_can_pci_tbl) = {
+	C_CAN_ID(PCI_VENDOR_ID_STMICRO, PCI_DEVICE_ID_STMICRO_CAN,
+		 c_can_sta2x11),
+	{},
+};
+static struct pci_driver sta2x11_pci_driver = {
+	.name = KBUILD_MODNAME,
+	.id_table = c_can_pci_tbl,
+	.probe = c_can_pci_probe,
+	.remove = __devexit_p(c_can_pci_remove),
+};
+
+module_pci_driver(sta2x11_pci_driver);
+
+MODULE_AUTHOR("Federico Vaga <federico.vaga@gmail.com>");
+MODULE_LICENSE("GPL V2");
+MODULE_DESCRIPTION("PCI CAN bus driver for Bosch C_CAN controller");
+MODULE_DEVICE_TABLE(pci, c_can_pci_tbl);
-- 
1.7.10.2

^ permalink raw reply related

* generic module for c-can on pci
From: Federico Vaga @ 2012-06-04 13:32 UTC (permalink / raw)
  To: Wolfgang Grandegger, Marc Kleine-Budde
  Cc: Federico Vaga, Giancarlo Asnaghi, Alan Cox, Alessandro Rubini,
	linux-can, netdev, linux-kernel
In-Reply-To: <4FC135C6.5030206@grandegger.com>

As suggested, I developed a generic module for C_CAN
on PCI. I will probably make some changes for our
specific board, but I think the module is generic
enough.

^ permalink raw reply

* Re: [PATCH 4/7] netfilter: add glue code to integrate nfnetlink_queue and ctnetlink
From: Jan Engelhardt @ 2012-06-04 13:38 UTC (permalink / raw)
  To: pablo; +Cc: netfilter-devel, netdev
In-Reply-To: <1338812485-4232-5-git-send-email-pablo@netfilter.org>


On Monday 2012-06-04 14:21, pablo@netfilter.org wrote:
>+static int
>+ctnetlink_nfqueue_parse(const struct nlattr *attr, struct nf_conn *ct)
>+{
>+	const struct nlattr * const cda[CTA_MAX+1];

I suppose you wrote that because the same appears in function
headers/signatures

	void foo(const struct nlattr *const tb[]) { ... }

But there, it is actually equal to

	void foo(const struct nlattr *const *tb) { ... }

In either case, tb is writable. IMHO, [] should be avoided in
signatures to avoid self-confusion, as it seemed to occur in your
case, where cda is - unlike tb - really marked const.
You likely wanted

	const struct nlattr *cda[CTA_MAX+1];
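
A user-space sketch of the point, with struct attr, ATTR_MAX and the
three functions as hypothetical stand-ins for struct nlattr, CTA_MAX and
the netlink helpers. As a parameter, the [] form is just a pointer that
can be advanced; as a local array, the slots must stay writable if
something is going to fill them.

```c
#include <assert.h>
#include <stddef.h>

struct attr {			/* hypothetical stand-in for struct nlattr */
	unsigned int type;
};

#define ATTR_MAX 3		/* hypothetical stand-in for CTA_MAX */

/* tb is really "const struct attr *const *": tb itself may be modified. */
static int count_present(const struct attr *const tb[], size_t n)
{
	const struct attr *const *end = tb + n;
	int present = 0;

	for (; tb != end; tb++)		/* advancing the parameter compiles fine */
		if (*tb != NULL)
			present++;
	return present;
}

/* Filling the table needs writable slots, so no const on the elements. */
static void parse_attrs(const struct attr *tb[], size_t n,
			const struct attr *attrs, size_t nattrs)
{
	size_t i;

	assert(tb != NULL && attrs != NULL);
	for (i = 0; i < n; i++)
		tb[i] = NULL;
	for (i = 0; i < nattrs; i++)
		if (attrs[i].type < n)
			tb[attrs[i].type] = &attrs[i];
}

static int demo_parse(void)
{
	static const struct attr attrs[] = { { 0 }, { 2 } };
	const struct attr *cda[ATTR_MAX + 1];	/* writable slots, const targets */

	parse_attrs(cda, ATTR_MAX + 1, attrs, 2);
	return count_present(cda, ATTR_MAX + 1);
}
```

Declaring the local as "const struct attr *const cda[...]" would make
the assignments in parse_attrs() impossible without casting the const
away.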

>+       nla_parse_nested((struct nlattr **)cda, CTA_MAX, attr, ct_nla_policy);



^ permalink raw reply

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
From: Michael S. Tsirkin @ 2012-06-04 13:41 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Willy Tarreau, David Miller, netdev
In-Reply-To: <1338815213.2760.1806.camel@edumazet-glaptop>

On Mon, Jun 04, 2012 at 03:06:53PM +0200, Eric Dumazet wrote:
> On Mon, 2012-06-04 at 15:37 +0300, Michael S. Tsirkin wrote:
> > On Thu, May 17, 2012 at 07:34:16PM +0200, Eric Dumazet wrote:
> > > From: Eric Dumazet <edumazet@google.com>
> > > 
> > > Please note I havent tested yet this patch, lacking hardware for this.
> > > 
> > > (tg3/bnx2/bnx2x use build_skb, r8169 does a copy of incoming frames,
> > > ixgbe uses fragments...)
> > 
> > virtio-net uses netdev_alloc_skb but maybe it should call
> > build_skb instead?
> > 
> > Also, it's not uncommon for drivers to copy short packets out to be able
> > to reuse pages.  virtio does this but I am guessing the logic is not
> > really virtio specific.
> > 
> > We could do
> > 	if (len < GOOD_COPY_LEN)
> > 		netdev_alloc_skb
> > 		memmov
> > 	else
> > 		build_skb
> > 
> > but maybe it makes sense to put this logic in build_skb?
> > 
> > 
> 
> I am not sure I understand the question.
> 
> If virtio-net uses netdev_alloc_skb(), all is good, you have nothing to
> change.
> 
> build_skb() is for drivers that allocate the memory to hold frame, and
> wait for NIC completion before allocating/populating the skb itself.
> 

This is generally what virtio does, take a look:
page_to_skb fills the first fragment and receive_mergeable fills the
rest (other modes are for legacy hardware).

The way hypervisor now works is this (we call it mergeable buffers):

- pages are passed to hardware
- hypervisor puts virtio specific stuff in first 12 bytes
  on first page
- following this, the rest of the first page and all following
  pages have data

The driver gets the 1st page, allocates the skb, copies out the 12 byte
header and copies the first 128 bytes of data into skb.
The rest if any is populated by the pages.
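
The layout described above can be sketched as plain arithmetic. This is
only a model of the description, not the virtio-net code: VNET_HDR_LEN,
COPY_LEN, struct first_page_split and split_first_page() are
hypothetical names, with 12 and 128 taken from the text.

```c
#include <assert.h>
#include <stddef.h>

#define VNET_HDR_LEN 12		/* virtio-specific header at the start of the first page */
#define COPY_LEN 128		/* bytes of data copied into the skb linear area */

struct first_page_split {	/* hypothetical description of how the first page is used */
	size_t hdr_len;		/* header bytes copied out */
	size_t copy_len;	/* data bytes copied into the skb */
	size_t frag_off;	/* offset of the data left in the page */
	size_t frag_len;	/* length of the data left in the page */
};

/* page_used is the number of bytes the hypervisor wrote to the first page. */
static int split_first_page(size_t page_used, struct first_page_split *s)
{
	size_t payload;

	assert(s != NULL);
	if (page_used < VNET_HDR_LEN)
		return -1;	/* not even a complete header */

	payload = page_used - VNET_HDR_LEN;
	s->hdr_len = VNET_HDR_LEN;
	s->copy_len = payload < COPY_LEN ? payload : COPY_LEN;
	s->frag_off = VNET_HDR_LEN + s->copy_len;
	s->frag_len = payload - s->copy_len;
	return 0;
}
```

For a frame small enough to fit in the copy, frag_len comes out as 0 and
the page can be reused right away.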

So I guess I'm asking for advice, would it make sense to switch to build_skb
and how best to handle the data copying above? Maybe it would help
if we changed the hypervisor to write the 12 bytes separately?

-- 
MST

^ permalink raw reply

* [PATCH net-next] sock_diag: add SK_MEMINFO_BACKLOG
From: Eric Dumazet @ 2012-06-04 13:50 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

From: Eric Dumazet <edumazet@google.com>

Adding socket backlog len in INET_DIAG_SKMEMINFO is really useful to
diagnose various TCP problems.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/sock_diag.h |    1 +
 net/core/sock_diag.c      |    1 +
 2 files changed, 2 insertions(+)

diff --git a/include/linux/sock_diag.h b/include/linux/sock_diag.h
index db4bae7..6793fac 100644
--- a/include/linux/sock_diag.h
+++ b/include/linux/sock_diag.h
@@ -18,6 +18,7 @@ enum {
 	SK_MEMINFO_FWD_ALLOC,
 	SK_MEMINFO_WMEM_QUEUED,
 	SK_MEMINFO_OPTMEM,
+	SK_MEMINFO_BACKLOG,
 
 	SK_MEMINFO_VARS,
 };
diff --git a/net/core/sock_diag.c b/net/core/sock_diag.c
index 5fd1467..0d934ce 100644
--- a/net/core/sock_diag.c
+++ b/net/core/sock_diag.c
@@ -46,6 +46,7 @@ int sock_diag_put_meminfo(struct sock *sk, struct sk_buff *skb, int attrtype)
 	mem[SK_MEMINFO_FWD_ALLOC] = sk->sk_forward_alloc;
 	mem[SK_MEMINFO_WMEM_QUEUED] = sk->sk_wmem_queued;
 	mem[SK_MEMINFO_OPTMEM] = atomic_read(&sk->sk_omem_alloc);
+	mem[SK_MEMINFO_BACKLOG] = sk->sk_backlog.len;
 
 	return 0;
 

^ permalink raw reply related

* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
From: Eric Dumazet @ 2012-06-04 14:01 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Willy Tarreau, David Miller, netdev
In-Reply-To: <20120604134138.GA29814@redhat.com>

On Mon, 2012-06-04 at 16:41 +0300, Michael S. Tsirkin wrote:

> This is generally what virtio does, take a look:
> page_to_skb fills the first fragment and receive_mergeable fills the
> rest (other modes are for legacy hardware).
> 
> The way hypervisor now works is this (we call it mergeable buffers):
> 
> - pages are passed to hardware
> - hypervisor puts virtio specific stuff in first 12 bytes
>   on first page
> - following this, the rest of the first page and all following
>   pages have data
> 
> The driver gets the 1st page, allocates the skb, copies out the 12 byte
> header and copies the first 128 bytes of data into skb.
> The rest if any is populated by the pages.
> 
> So I guess I'm asking for advice, would it make sense to switch to build_skb
> and how best to handle the data copying above? Maybe it would help
> if we changed the hypervisor to write the 12 bytes separately?
>   

Thanks for these details.

Not sure 12 bytes of headroom would be enough (instead of the
NET_SKB_PAD reserved in netdev_alloc_skb_ip_align()), but what could be
done is to use the first page as the skb->head, so using
build_skb(), removing one fragment, one (small) copy and one
{put|get}_page() pair.

^ permalink raw reply

* Re: [PATCH 7/7] netfilter: add user-space connection tracking helper infrastructure
From: Jan Engelhardt @ 2012-06-04 14:04 UTC (permalink / raw)
  To: pablo; +Cc: netfilter-devel, netdev
In-Reply-To: <1338812485-4232-8-git-send-email-pablo@netfilter.org>


On Monday 2012-06-04 14:21, pablo@netfilter.org wrote:
>+static int
>+nfnl_cthelper_from_nlattr(struct nlattr *attr, struct nf_conn *ct)
>+{
>+	const struct nf_conn_help *help = nfct_help(ct);
>+
>+	if (help->helper->data_len == 0)
>+		return -EINVAL;
>+
>+	memcpy(&help->data, nla_data(attr), help->helper->data_len);

memcpy(help->data, ...)

>+static int
>+nfnl_cthelper_to_nlattr(struct sk_buff *skb, const struct nf_conn *ct)
>+{
>+	const struct nf_conn_help *help = nfct_help(ct);
>+
>+	if (help->helper->data_len &&
>+	    nla_put(skb, CTA_HELP_INFO, help->helper->data_len, &help->data))
>+		goto nla_put_failure;

help->data

^ permalink raw reply

* Re: [PATCH RFC] c_can_pci: generic module for c_can on PCI
From: Marc Kleine-Budde @ 2012-06-04 14:04 UTC (permalink / raw)
  To: Federico Vaga
  Cc: Wolfgang Grandegger, Giancarlo Asnaghi, Alan Cox,
	Alessandro Rubini, linux-can, netdev, linux-kernel
In-Reply-To: <1338816766-7089-2-git-send-email-federico.vaga@gmail.com>

On 06/04/2012 03:32 PM, Federico Vaga wrote:
> Signed-off-by: Federico Vaga <federico.vaga@gmail.com>
> Acked-by: Giancarlo Asnaghi <giancarlo.asnaghi@st.com>
> Cc: Alan Cox <alan@linux.intel.com>

Please port your driver to the recent c_can changes. Use the c_can branch
of the linux-can-next repo[1] as the base for your work. You have to rework
the register access functions. Please check whether there are devm_
variants for the registration/mapping of the PCI resources and the clock.

[1] https://gitorious.org/linux-can/linux-can-next
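
For reference when reworking the accessors, this is my reading of what
the two variants in the patch compute, written as plain address
arithmetic in user space. The function names are hypothetical; the
expression is the one from the patch, reg + (reg - regs), which doubles
the offset from the base for the 32-bit aligned layout.

```c
#include <assert.h>
#include <stdint.h>

/* Registers packed on 16-bit boundaries: the address is used as is. */
static uintptr_t reg_addr_aligned_to_16bit(uintptr_t base, uintptr_t reg)
{
	assert(reg >= base);
	return reg;
}

/* Registers on 32-bit boundaries: base + 2 * (reg - base). */
static uintptr_t reg_addr_aligned_to_32bit(uintptr_t base, uintptr_t reg)
{
	assert(reg >= base);
	return reg + (reg - base);
}
```

So a register at offset 0x04 in the 16-bit layout ends up at offset 0x08
in the 32-bit layout.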

More comments inline. Marc

> ---
>  drivers/net/can/c_can/Kconfig     |   11 +-
>  drivers/net/can/c_can/Makefile    |    1 +
>  drivers/net/can/c_can/c_can_pci.c |  221 +++++++++++++++++++++++++++++++++++++
>  3 files changed, 230 insertions(+), 3 deletions(-)
>  create mode 100644 drivers/net/can/c_can/c_can_pci.c
> 
> diff --git a/drivers/net/can/c_can/Kconfig b/drivers/net/can/c_can/Kconfig
> index ffb9773..74ef97d 100644
> --- a/drivers/net/can/c_can/Kconfig
> +++ b/drivers/net/can/c_can/Kconfig
> @@ -2,14 +2,19 @@ menuconfig CAN_C_CAN
>  	tristate "Bosch C_CAN devices"
>  	depends on CAN_DEV && HAS_IOMEM
>  
> -if CAN_C_CAN

please keep the if CAN_C_CAN...

> -
>  config CAN_C_CAN_PLATFORM
>  	tristate "Generic Platform Bus based C_CAN driver"
> +	depends on CAN_C_CAN

...then you don't have to add the depends on here.

>  	---help---
>  	  This driver adds support for the C_CAN chips connected to
>  	  the "platform bus" (Linux abstraction for directly to the
>  	  processor attached devices) which can be found on various
>  	  boards from ST Microelectronics (http://www.st.com)
>  	  like the SPEAr1310 and SPEAr320 evaluation boards.
> -endif

... Just move your PCI driver inside the if...endif block...
> +
> +config CAN_C_CAN_PCI
> +	tristate "Generic PCI Bus based C_CAN driver"
> +	depends on CAN_C_CAN

...and remove the depends on CAN_C_CAN. You probably have to add a
depends on PCI.

> +	---help---
> +	  This driver adds support for the C_CAN chips connected to
> +	  the PCI bus.
> diff --git a/drivers/net/can/c_can/Makefile b/drivers/net/can/c_can/Makefile
> index 9273f6d..ad1cc84 100644
> --- a/drivers/net/can/c_can/Makefile
> +++ b/drivers/net/can/c_can/Makefile
> @@ -4,5 +4,6 @@
>  
>  obj-$(CONFIG_CAN_C_CAN) += c_can.o
>  obj-$(CONFIG_CAN_C_CAN_PLATFORM) += c_can_platform.o
> +obj-$(CONFIG_CAN_C_CAN_PCI) += c_can_pci.o
>  
>  ccflags-$(CONFIG_CAN_DEBUG_DEVICES) := -DDEBUG
> diff --git a/drivers/net/can/c_can/c_can_pci.c b/drivers/net/can/c_can/c_can_pci.c
> new file mode 100644
> index 0000000..b635375
> --- /dev/null
> +++ b/drivers/net/can/c_can/c_can_pci.c
> @@ -0,0 +1,221 @@
> +/*
> + * Platform CAN bus driver for Bosch C_CAN controller
> + *
> + * Copyright (C) 2012 Federico Vaga <federico.vaga@gmail.com>
> +  *
   ^^^ double space :)

> + * This file is licensed under the terms of the GNU General Public
> + * License version 2. This program is licensed "as is" without any
> + * warranty of any kind, whether express or implied.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/netdevice.h>
> +#include <linux/clk.h>
> +#include <linux/pci.h>
> +#include <linux/can/dev.h>
> +
> +#include "c_can.h"
> +
> +enum c_can_pci_reg_align {
> +	C_CAN_REG_ALIGN_16,
> +	C_CAN_REG_ALIGN_32,
> +};
> +
> +struct c_can_pci_data {
> +	unsigned int reg_align;	/* Set the register alignment in the memory */
        ^^^^^^^^^^^^
use the enum you defined above.

> +	unsigned int freq;	/* Set the frequency if clk is not usable */
> +};
> +
> +/*
> + * 16-bit c_can registers can be arranged differently in the memory
> + * architecture of different implementations. For example: 16-bit
> + * registers can be aligned to a 16-bit boundary or 32-bit boundary etc.
> + * Handle the same by providing a common read/write interface.
> + */
> +static u16 c_can_pci_read_reg_aligned_to_16bit(struct c_can_priv *priv,
> +						void *reg)
> +{
> +	return readw(reg);
> +}
> +
> +static void c_can_pci_write_reg_aligned_to_16bit(struct c_can_priv *priv,
> +						void *reg, u16 val)
> +{
> +	writew(val, reg);
> +}
> +
> +static u16 c_can_pci_read_reg_aligned_to_32bit(struct c_can_priv *priv,
> +						void *reg)
> +{
> +	return readw(reg + (long)reg - (long)priv->regs);
> +}
> +
> +static void c_can_pci_write_reg_aligned_to_32bit(struct c_can_priv *priv,
> +						void *reg, u16 val)
> +{
> +	writew(val, reg + (long)reg - (long)priv->regs);
> +}
> +
> +static int __devinit c_can_pci_probe(struct pci_dev *pdev,
> +				     const struct pci_device_id *ent)
> +{
> +	struct c_can_pci_data *c_can_pci_data = (void *)ent->driver_data;
> +	struct c_can_priv *priv;
> +	struct net_device *dev;
> +	void __iomem *addr;
> +	struct clk *clk;
> +	int ret;
> +
> +	ret = pci_enable_device(pdev);
> +	if (ret) {
> +		dev_err(&pdev->dev, "pci_enable_device FAILED\n");
> +		goto out;
> +	}
> +
> +	ret = pci_request_regions(pdev, KBUILD_MODNAME);
> +	if (ret) {
> +		dev_err(&pdev->dev, "pci_request_regions FAILED\n");
> +		goto out_disable_device;
> +	}
> +
> +	pci_set_master(pdev);
> +	pci_enable_msi(pdev);
> +
> +	addr = pci_iomap(pdev, 0, pci_resource_len(pdev, 0));
> +	if (!addr) {
> +		dev_err(&pdev->dev,
> +			"device has no PCI memory resources, "
> +			"failing adapter\n");
> +		ret = -ENOMEM;
> +		goto out_release_regions;
> +	}
> +
> +	/* allocate the c_can device */
> +	dev = alloc_c_can_dev();
> +	if (!dev) {
> +		ret = -ENOMEM;
> +		goto out_iounmap;
> +	}
> +
> +	priv = netdev_priv(dev);
> +	pci_set_drvdata(pdev, dev);
> +	SET_NETDEV_DEV(dev, &pdev->dev);
> +
> +	dev->irq = pdev->irq;
> +	priv->regs = addr;
> +
> +	if (!c_can_pci_data->freq) {
> +		/* get the appropriate clk */
> +		clk = clk_get(&pdev->dev, NULL);
> +		if (IS_ERR(clk)) {
> +			dev_err(&pdev->dev, "no clock defined\n");
> +			ret = -ENODEV;
> +			goto out_free_c_can;
> +		}
> +		priv->can.clock.freq = clk_get_rate(clk);
> +		priv->priv = clk;
> +	} else {
> +		priv->can.clock.freq = c_can_pci_data->freq;
> +		priv->priv = NULL;
> +	}
> +
> +	switch (c_can_pci_data->reg_align) {
> +	case C_CAN_REG_ALIGN_32:
> +		priv->read_reg = c_can_pci_read_reg_aligned_to_32bit;
> +		priv->write_reg = c_can_pci_write_reg_aligned_to_32bit;
> +		break;
> +	case C_CAN_REG_ALIGN_16:
> +	default:
> +		priv->read_reg = c_can_pci_read_reg_aligned_to_16bit;
> +		priv->write_reg = c_can_pci_write_reg_aligned_to_16bit;
> +		break;
> +	}
> +
> +	ret = register_c_can_dev(dev);
> +	if (ret) {
> +		dev_err(&pdev->dev, "registering %s failed (err=%d)\n",
> +			KBUILD_MODNAME, ret);
> +		goto out_free_clock;
> +	}
> +
> +	dev_info(&pdev->dev, "%s device registered (regs=%p, irq=%d)\n",
> +		 KBUILD_MODNAME, priv->regs, dev->irq);
> +
> +	return 0;
> +
> +out_free_clock:
> +	if (!priv->priv)
           ^^^

looks fishy

> +		clk_put(priv->priv);
> +out_free_c_can:
> +	pci_set_drvdata(pdev, NULL);
> +	free_c_can_dev(dev);
> +out_iounmap:
> +	pci_iounmap(pdev, addr);
> +out_release_regions:
> +	pci_disable_msi(pdev);
> +	pci_clear_master(pdev);
> +	pci_release_regions(pdev);
> +out_disable_device:
> +	/*
> +	 * do not call pci_disable_device on sta2x11 because it
> +	 * break all other Bus masters on this EP
> +	 */
> +	if(pdev->vendor == PCI_VENDOR_ID_STMICRO &&
> +	   pdev->device == PCI_DEVICE_ID_STMICRO_CAN)
> +		goto out;
> +	pci_disable_device(pdev);
> +out:
> +	return ret;
> +}
> +
> +static void __devexit c_can_pci_remove(struct pci_dev *pdev)
> +{
> +	struct net_device *dev = pci_get_drvdata(pdev);
> +	struct c_can_priv *priv = netdev_priv(dev);
> +
> +	pci_set_drvdata(pdev, NULL);
> +	free_c_can_dev(dev);
> +	if (!priv->priv)
ditto
> +		clk_put(priv->priv);
> +	pci_iounmap(pdev, priv->regs);
> +	pci_disable_msi(pdev);
> +	pci_clear_master(pdev);
> +	pci_release_regions(pdev);
> +	/*
> +	 * do not call pci_disable_device on sta2x11 because it
> +	 * break all other Bus masters on this EP
> +	 */
> +	if(pdev->vendor == PCI_VENDOR_ID_STMICRO &&
> +	   pdev->device == PCI_DEVICE_ID_STMICRO_CAN)
> +		return;
> +	pci_disable_device(pdev);
> +}
> +
> +static struct c_can_pci_data c_can_sta2x11= {
> +	.reg_align = C_CAN_REG_ALIGN_32,
> +	.freq = 52000000, /* 52 Mhz */
> +};
> +
> +#define C_CAN_ID(_vend, _dev, _driverdata) {		\
> +	PCI_DEVICE(_vend, _dev),			\
> +	.driver_data = (unsigned long)&_driverdata,	\
> +}
> +DEFINE_PCI_DEVICE_TABLE(c_can_pci_tbl) = {
^^^^

static?

> +	C_CAN_ID(PCI_VENDOR_ID_STMICRO, PCI_DEVICE_ID_STMICRO_CAN,
> +		 c_can_sta2x11),
> +	{},
> +};
> +static struct pci_driver sta2x11_pci_driver = {
> +	.name = KBUILD_MODNAME,
> +	.id_table = c_can_pci_tbl,
> +	.probe = c_can_pci_probe,
> +	.remove = __devexit_p(c_can_pci_remove),
> +};
> +
> +module_pci_driver(sta2x11_pci_driver);
> +
> +MODULE_AUTHOR("Federico Vaga <federico.vaga@gmail.com>");
> +MODULE_LICENSE("GPL V2");

IIRC, the correct case is "GPL v2"

> +MODULE_DESCRIPTION("PCI CAN bus driver for Bosch C_CAN controller");
> +MODULE_DEVICE_TABLE(pci, c_can_pci_tbl);


-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |



* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
From: Eric Dumazet @ 2012-06-04 14:09 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Willy Tarreau, David Miller, netdev
In-Reply-To: <1338818501.2760.1821.camel@edumazet-glaptop>

On Mon, 2012-06-04 at 16:01 +0200, Eric Dumazet wrote:

> Not sure 12 bytes of headroom would be enough (instead of the
> NET_SKB_PAD reserved in netdev_alloc_skb_ip_align(), but what could be
> done indeed is to use the first page as the skb->head, so using
> build_skb() indeed, removing one fragment, one (small) copy and one
> {put|get}_page() pair.

It would also avoid 'pulling' tcp data payload in linear part.

page_to_skb() does:

copy = len;
if (copy > skb_tailroom(skb))
	copy = skb_tailroom(skb);
memcpy(skb_put(skb, copy), p, copy);

This means GRO or TCP coalescing (or splice()) has to handle two
segments to fetch data.
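
To make the split concrete, here is a minimal user-space sketch of the capped copy quoted above. It is not the virtio_net code: struct linear_buf, tailroom() and copy_head() are hypothetical stand-ins, and the 128-byte capacity is only an assumption taken from the "first 128 bytes" mentioned earlier in the thread. The return value is what stays behind in the page, i.e. the second segment a consumer has to visit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an skb's linear area; not struct sk_buff. */
struct linear_buf {
	unsigned char data[128];	/* assumed capacity of the linear part */
	size_t len;			/* bytes already used */
};

/* Free space left at the end of the linear area. */
static size_t tailroom(const struct linear_buf *b)
{
	return sizeof(b->data) - b->len;
}

/*
 * Copy at most tailroom() bytes of the payload into the linear area.
 * Returns the number of payload bytes left in the page, which would be
 * attached as a fragment.
 */
static size_t copy_head(struct linear_buf *b, const unsigned char *p, size_t len)
{
	size_t copy = len;

	assert(b->len <= sizeof(b->data));
	if (copy > tailroom(b))
		copy = tailroom(b);
	memcpy(b->data + b->len, p, copy);
	b->len += copy;
	return len - copy;
}
```

With a 1000-byte payload and an empty buffer, copy_head() fills the 128-byte linear part and leaves 872 bytes in the page, so the data ends up in two places.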


* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
From: Michael S. Tsirkin @ 2012-06-04 14:17 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Willy Tarreau, David Miller, netdev
In-Reply-To: <1338818501.2760.1821.camel@edumazet-glaptop>

On Mon, Jun 04, 2012 at 04:01:41PM +0200, Eric Dumazet wrote:
> On Mon, 2012-06-04 at 16:41 +0300, Michael S. Tsirkin wrote:
> 
> > This is generally what virtio does, take a look:
> > page_to_skb fills the first fragment and receive_mergeable fills the
> > rest (other modes are for legacy hardware).
> > 
> > The way hypervisor now works is this (we call it mergeable buffers):
> > 
> > - pages are passed to hardware
> > - hypervisor puts virtio specific stuff in first 12 bytes
> >   on first page
> > - following this, the rest of the first page and all following
> >   pages have data
> > 
> > The driver gets the 1st page, allocates the skb, copies out the 12 byte
> > header and copies the first 128 bytes of data into skb.
> > The rest if any is populated by the pages.
> > 
> > So I guess I'm asking for advice, would it make sense to switch to build_skb
> > and how best to handle the data copying above? Maybe it would help
> > if we changed the hypervisor to write the 12 bytes separately?
> >   
> 
> Thanks for these details.
> 
> Not sure 12 bytes of headroom would be enough (instead of the
> NET_SKB_PAD reserved in netdev_alloc_skb_ip_align(), but what could be
> done indeed is to use the first page as the skb->head, so using
> build_skb() indeed, removing one fragment, one (small) copy and one
> {put|get}_page() pair.
> 

bnx2 and tg3 both do skb_reserve of at least NET_SKB_PAD
after build_skb. You are saying it's not a must?

Hmm so maybe we should teach the hypervisor to write data
out at an offset. Interesting.

Another question is about truesize for very small packets.
build_skb sets truesize to frag_size, but isn't
this too small? We keep the whole page around, no?


* [PATCH 0/1] net/hyperv: Use wait_event on outstanding sends during device removal
From: Haiyang Zhang @ 2012-06-04 14:35 UTC (permalink / raw)
  To: davem, netdev; +Cc: devel, haiyangz, olaf, linux-kernel

This patch is targeting net-next tree (when it's available for check in).

Haiyang Zhang (1):
  net/hyperv: Use wait_event on outstanding sends during device removal

 drivers/net/hyperv/hyperv_net.h |    1 +
 drivers/net/hyperv/netvsc.c     |   12 ++++++------
 2 files changed, 7 insertions(+), 6 deletions(-)

-- 
1.7.4.1


* [PATCH 1/1] net/hyperv: Use wait_event on outstanding sends during device removal
From: Haiyang Zhang @ 2012-06-04 14:35 UTC (permalink / raw)
  To: davem, netdev; +Cc: haiyangz, kys, olaf, linux-kernel, devel
In-Reply-To: <1338820532-2345-1-git-send-email-haiyangz@microsoft.com>

Change the busy-waiting/udelay to wait_event on outstanding sends.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>

---
 drivers/net/hyperv/hyperv_net.h |    1 +
 drivers/net/hyperv/netvsc.c     |   12 ++++++------
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 4ffcd57..2857ab0 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -478,6 +478,7 @@ struct netvsc_device {
 	u32 nvsp_version;
 
 	atomic_t num_outstanding_sends;
+	wait_queue_head_t wait_drain;
 	bool start_remove;
 	bool destroy;
 	/*
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 8b91947..dee7b23e 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -42,6 +42,7 @@ static struct netvsc_device *alloc_net_device(struct hv_device *device)
 	if (!net_device)
 		return NULL;
 
+	init_waitqueue_head(&net_device->wait_drain);
 	net_device->start_remove = false;
 	net_device->destroy = false;
 	net_device->dev = device;
@@ -387,12 +388,8 @@ int netvsc_device_remove(struct hv_device *device)
 	spin_unlock_irqrestore(&device->channel->inbound_lock, flags);
 
 	/* Wait for all send completions */
-	while (atomic_read(&net_device->num_outstanding_sends)) {
-		dev_info(&device->device,
-			"waiting for %d requests to complete...\n",
-			atomic_read(&net_device->num_outstanding_sends));
-		udelay(100);
-	}
+	wait_event(net_device->wait_drain,
+		atomic_read(&net_device->num_outstanding_sends) == 0);
 
 	netvsc_disconnect_vsp(net_device);
 
@@ -486,6 +483,9 @@ static void netvsc_send_completion(struct hv_device *device,
 		num_outstanding_sends =
 			atomic_dec_return(&net_device->num_outstanding_sends);
 
+		if (net_device->destroy && num_outstanding_sends == 0)
+			wake_up(&net_device->wait_drain);
+
 		if (netif_queue_stopped(ndev) && !net_device->start_remove &&
 			(hv_ringbuf_avail_percent(&device->channel->outbound)
 			> RING_AVAIL_PERCENT_HIWATER ||
-- 
1.7.4.1
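
The wake-up condition in the patch above can be modelled in user space as follows. This is only a sketch with C11 atomics, not kernel code: struct drain_model, dec_return(), send_completion() and drained() are hypothetical names, and it assumes that destroy is set before the remover starts waiting. wait_event() itself checks its condition before sleeping and again after every wake-up, which is what drained() stands for here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model; field names mirror the patch, the struct does not. */
struct drain_model {
	atomic_int num_outstanding_sends;
	bool destroy;
};

/* Models atomic_dec_return(): decrement and return the new value. */
static int dec_return(atomic_int *v)
{
	return atomic_fetch_sub(v, 1) - 1;
}

/*
 * Models the completion path: returns true when the waiter on wait_drain
 * should be woken, i.e. removal is in progress and the last send completed.
 */
static bool send_completion(struct drain_model *m)
{
	int left = dec_return(&m->num_outstanding_sends);

	assert(left >= 0);
	return m->destroy && left == 0;
}

/* Models the condition wait_event() evaluates before and after sleeping. */
static bool drained(struct drain_model *m)
{
	return atomic_load(&m->num_outstanding_sends) == 0;
}
```

With two sends outstanding and destroy set, only the second call to send_completion() reports that a wake-up is due. With destroy clear, no wake-up is issued even when the counter reaches zero, which is harmless as long as nobody is waiting at that point.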


* Re: [RFC PATCH v1 2/3] net: add VEPA, VEB bridge mode
From: Krishna Kumar2 @ 2012-06-04 14:59 UTC (permalink / raw)
  To: John Fastabend
  Cc: bhutchings, buytenh, eilong, eric.w.multanen, gregory.v.rose,
	hadi, jeffrey.t.kirsher, mst, netdev, shemminger, sri
In-Reply-To: <20120530030705.7443.22155.stgit@jf-dev1-dcblab>

John Fastabend <john.r.fastabend@intel.com> wrote on 05/30/2012 08:37:06 AM:

Some comments below:

> +static int rtnl_bridge_notify(struct net_device *dev, u16 flags)
> +{
> ...
> +		 if (!flags && master && master->netdev_ops->ndo_bridge_getlink)
> +		 		 err = master->netdev_ops->ndo_bridge_getlink(skb, 0, 0, dev);
> +		 else if (dev->netdev_ops->ndo_bridge_getlink)
> +		 		 err = dev->netdev_ops->ndo_bridge_getlink(skb, 0, 0, dev);

I think you should do something like:

        if ((flags == BRIDGE_FLAGS_MASTER) && ...)
                ...

Also you could use BRIDGE_FLAGS_MASTER=1, SELF=2, and use
"if (flags & BRIDGE_FLAGS_MASTER)" for consistency?

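A minimal sketch of the bit-flag dispatch suggested above, in plain C so it can be compiled outside the kernel. The values 1 and 2 are the ones proposed here, not taken from the patch, and enum bridge_target and pick_target() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values follow the suggestion above; they are not taken from the patch. */
#define BRIDGE_FLAGS_MASTER	1
#define BRIDGE_FLAGS_SELF	2

enum bridge_target { TARGET_NONE, TARGET_MASTER, TARGET_SELF };

/*
 * Pick which ndo_bridge_* callback would be used. master_has_op and
 * self_has_op stand for "the corresponding callback pointer is non-NULL".
 */
static enum bridge_target pick_target(uint16_t flags, bool master_has_op,
				      bool self_has_op)
{
	/* This sketch treats unknown bits as a caller bug. */
	assert((flags & ~(BRIDGE_FLAGS_MASTER | BRIDGE_FLAGS_SELF)) == 0);

	if ((flags & BRIDGE_FLAGS_MASTER) && master_has_op)
		return TARGET_MASTER;
	if ((flags & BRIDGE_FLAGS_SELF) && self_has_op)
		return TARGET_SELF;
	return TARGET_NONE;
}
```

Note that flags == 0 selects nothing in this sketch, whereas the quoted code treats it as "master". A caller that wants to keep that default would have to map 0 to BRIDGE_FLAGS_MASTER before dispatching.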

> +		 if (!err)
> +		 		 err = rtnl_bridge_notify(dev, flags);

This can return a notification error even though the
operation itself succeeded. Could something be done
here to indicate that the operation succeeded, or is
that a TODO?

>  static int rtnl_bridge_setlink(struct sk_buff *skb, struct nlmsghdr *nlh,
>                  void *arg)
>  {
> ...
> +   if (!flags && dev->master &&
> +       dev->master->netdev_ops->ndo_bridge_setlink)
> +      err = dev->master->netdev_ops->ndo_bridge_setlink(dev, nlh);
> +   else if ((flags & BRIDGE_FLAGS_SELF) &&
> +         dev->netdev_ops->ndo_bridge_setlink)

Same usage of MASTER here.

Thanks,
- KK


* Re: [PATCH net-next] net: netdev_alloc_skb() use build_skb()
From: Eric Dumazet @ 2012-06-04 15:01 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Willy Tarreau, David Miller, netdev
In-Reply-To: <20120604141731.GA30226@redhat.com>

On Mon, 2012-06-04 at 17:17 +0300, Michael S. Tsirkin wrote:

> bnx2 and tg3 both do skb_reserve of at least NET_SKB_PAD
> after build_skb. You are saying it's not a must?
> 

32 would be the minimum. NET_SKB_PAD uses a full cache line (64 bytes
on most current x86 CPUs) to avoid using half a cache line.

> Hmm so maybe we should teach the hypervisor to write data
> out at an offset. Interesting.
> 
> Another question is about very small packets truesize.
> build_skb sets truesize to frag_size but isn't
> this too small? We keep the whole page around, no?

We keep one page per cpu, at most.

For example, with MTU=1500 and PAGE_SIZE=4096, one page is split into
two fragments of 1500 + NET_SKB_PAD + align(shared_info), so it's good
enough (this is very close to a 2048 'truesize').
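
The arithmetic can be checked with a small user-space sketch. align_up() and frag_size() are hypothetical helpers; the 64-byte cache line follows the x86 figure above, while the 320-byte shared-info size and the cache-line alignment of the payload part are assumptions of this sketch, not values stated in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Round x up to a multiple of align (align must be a power of two). */
static size_t align_up(size_t x, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/*
 * Size of one receive fragment: payload plus headroom, cache-line
 * aligned, followed by the aligned shared-info area.
 */
static size_t frag_size(size_t mtu, size_t pad, size_t shinfo, size_t cacheline)
{
	return align_up(mtu + pad, cacheline) + align_up(shinfo, cacheline);
}
```

Under these assumptions frag_size(1500, 64, 320, 64) is 1600 + 320 = 1920 bytes, so two fragments fit in a 4096-byte page and each accounts for slightly less than 2048 bytes.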

But yes, for some uses (wifi for example), we might use a full page per
skb, yet underestimate skb->truesize. Hopefully we can track these uses
and fix them.

ath9k, for example, could be changed to reassemble up to 3 frags
instead of 2, i.e. extending what commit
0d95521ea74735826cb2e28bebf6a07392c75bfa (ath9k: use split rx
buffers to get rid of order-1 skb allocations) did.
