Netdev List
 help / color / mirror / Atom feed
* Re: [PATCHv5] virtio-spec: virtio network device RFS support
From: Ben Hutchings @ 2012-12-05 20:39 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Jason Wang, rusty, virtualization, netdev, kvm
In-Reply-To: <20121203105843.GA26194@redhat.com>

On Mon, 2012-12-03 at 12:58 +0200, Michael S. Tsirkin wrote:
> Add RFS support to virtio network device.
> Add a new feature flag VIRTIO_NET_F_RFS for this feature, a new
> configuration field max_virtqueue_pairs to detect supported number of
> virtqueues as well as a new command VIRTIO_NET_CTRL_RFS to program
> packet steering for unidirectional protocols.
[...]
> +Programming of the receive flow classificator is implicit.
> + Transmitting a packet of a specific flow on transmitqX will cause incoming
> + packets for this flow to be steered to receiveqX.
> + For uni-directional protocols, or where no packets have been transmitted
> + yet, device will steer a packet to a random queue out of the specified
> + receiveq0..receiveqn.
[...]

It doesn't seem like this is usable to implement accelerated RFS in the
guest, though perhaps that doesn't matter.  On the host side, presumably
you'll want vhost_net to do the equivalent of sock_rps_record_flow() -
only without a socket?  But in any case, that requires an rxhash, so I
don't see how this is supposed to work.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* [RFC PATCH v2 3/3] tun: fix LSM/SELinux labeling of tun/tap devices
From: Paul Moore @ 2012-12-05 20:26 UTC (permalink / raw)
  To: netdev, linux-security-module, selinux; +Cc: jasowang, mst
In-Reply-To: <20121205202144.18626.61966.stgit@localhost>

This patch corrects some problems with LSM/SELinux that were introduced
with the multiqueue patchset.  The problem stems from the fact that the
multiqueue work changed the relationship between the tun device and its
associated socket; before the socket persisted for the life of the
device, however after the multiqueue changes the socket only persisted
for the life of the userspace connection (fd open).  For non-persistent
devices this is not an issue, but for persistent devices this can cause
the tun device to lose its SELinux label.

We correct this problem by adding an opaque LSM security blob to the
tun device struct which allows us to have the LSM security state, e.g.
SELinux labeling information, persist for the lifetime of the tun
device.  In the process we tweak the LSM hooks to work with this new
approach to TUN device/socket labeling and introduce a new LSM hook,
security_tun_dev_create_queue(), to approve requests to create a new
TUN queue via TUNSETQUEUE.

The SELinux code has been adjusted to match the new LSM hooks, the
other LSMs do not make use of the LSM TUN controls.  This patch makes
use of the recently added "tun_socket:create_queue" permission to
restrict access to the TUNSETQUEUE operation.  On older SELinux
policies which do not define the "tun_socket:create_queue" permission
the access control decision for TUNSETQUEUE will be handled according
to the SELinux policy's unknown permission setting.

Signed-off-by: Paul Moore <pmoore@redhat.com>
---
 drivers/net/tun.c                 |   26 +++++++++++++---
 include/linux/security.h          |   59 +++++++++++++++++++++++++++++--------
 security/capability.c             |   24 +++++++++++++--
 security/security.c               |   28 ++++++++++++++----
 security/selinux/hooks.c          |   50 ++++++++++++++++++++++++-------
 security/selinux/include/objsec.h |    4 +++
 6 files changed, 153 insertions(+), 38 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 14a0454..fb8148b 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -182,6 +182,7 @@ struct tun_struct {
 	struct hlist_head flows[TUN_NUM_FLOW_ENTRIES];
 	struct timer_list flow_gc_timer;
 	unsigned long ageing_time;
+	void *security;
 };
 
 static inline u32 tun_hashfn(u32 rxhash)
@@ -465,6 +466,10 @@ static int tun_attach(struct tun_struct *tun, struct file *file)
 	struct tun_file *tfile = file->private_data;
 	int err;
 
+	err = security_tun_dev_attach(tfile->socket.sk, tun->security);
+	if (err < 0)
+		goto out;
+
 	err = -EINVAL;
 	if (rcu_dereference_protected(tfile->tun, lockdep_rtnl_is_held()))
 		goto out;
@@ -1348,6 +1353,7 @@ static void tun_free_netdev(struct net_device *dev)
 	struct tun_struct *tun = netdev_priv(dev);
 
 	tun_flow_uninit(tun);
+	security_tun_dev_free_security(tun->security);
 	free_netdev(dev);
 }
 
@@ -1534,7 +1540,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 
 		if (tun_not_capable(tun))
 			return -EPERM;
-		err = security_tun_dev_attach(tfile->socket.sk);
+		err = security_tun_dev_open(tun->security);
 		if (err < 0)
 			return err;
 
@@ -1587,7 +1593,9 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 
 		spin_lock_init(&tun->lock);
 
-		security_tun_dev_post_create(&tfile->sk);
+		err = security_tun_dev_alloc_security(&tun->security);
+		if (err < 0)
+			goto err_free_dev;
 
 		tun_net_init(dev);
 
@@ -1767,12 +1775,18 @@ static int tun_set_queue(struct file *file, struct ifreq *ifr)
 
 		tun = netdev_priv(dev);
 		if (dev->netdev_ops != &tap_netdev_ops &&
-			dev->netdev_ops != &tun_netdev_ops)
+			dev->netdev_ops != &tun_netdev_ops) {
 			ret = -EINVAL;
-		else if (tun_not_capable(tun))
+			goto unlock;
+		}
+		if (tun_not_capable(tun)) {
 			ret = -EPERM;
-		else
-			ret = tun_attach(tun, file);
+			goto unlock;
+		}
+		ret = security_tun_dev_create_queue(tun->security);
+		if (ret < 0)
+			goto unlock;
+		ret = tun_attach(tun, file);
 	} else if (ifr->ifr_flags & IFF_DETACH_QUEUE)
 		__tun_detach(tfile, false);
 	else
diff --git a/include/linux/security.h b/include/linux/security.h
index 05e88bd..8ea923b 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -983,17 +983,29 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts)
  *	tells the LSM to decrement the number of secmark labeling rules loaded
  * @req_classify_flow:
  *	Sets the flow's sid to the openreq sid.
+ * @tun_dev_alloc_security:
+ *	This hook allows a module to allocate a security structure for a TUN
+ *	device.
+ *	@security pointer to a security structure pointer.
+ *	Returns a zero on success, negative values on failure.
+ * @tun_dev_free_security:
+ *	This hook allows a module to free the security structure for a TUN
+ *	device.
+ *	@security pointer to the TUN device's security structure
  * @tun_dev_create:
  *	Check permissions prior to creating a new TUN device.
- * @tun_dev_post_create:
- *	This hook allows a module to update or allocate a per-socket security
- *	structure.
- *	@sk contains the newly created sock structure.
+ * @tun_dev_create_queue:
+ *	Check permissions prior to creating a new TUN device queue.
+ *	@security pointer to the TUN device's security structure.
  * @tun_dev_attach:
- *	Check permissions prior to attaching to a persistent TUN device.  This
- *	hook can also be used by the module to update any security state
+ *	This hook can be used by the module to update any security state
  *	associated with the TUN device's sock structure.
  *	@sk contains the existing sock structure.
+ *	@security pointer to the TUN device's security structure.
+ * @tun_dev_open:
+ *	This hook can be used by the module to update any security state
+ *	associated with the TUN device's security structure.
+ *	@security pointer to the TUN devices's security structure.
  *
  * Security hooks for XFRM operations.
  *
@@ -1613,9 +1625,12 @@ struct security_operations {
 	void (*secmark_refcount_inc) (void);
 	void (*secmark_refcount_dec) (void);
 	void (*req_classify_flow) (const struct request_sock *req, struct flowi *fl);
-	int (*tun_dev_create)(void);
-	void (*tun_dev_post_create)(struct sock *sk);
-	int (*tun_dev_attach)(struct sock *sk);
+	int (*tun_dev_alloc_security) (void **security);
+	void (*tun_dev_free_security) (void *security);
+	int (*tun_dev_create) (void);
+	int (*tun_dev_create_queue) (void *security);
+	int (*tun_dev_attach) (struct sock *sk, void *security);
+	int (*tun_dev_open) (void *security);
 #endif	/* CONFIG_SECURITY_NETWORK */
 
 #ifdef CONFIG_SECURITY_NETWORK_XFRM
@@ -2553,9 +2568,12 @@ void security_inet_conn_established(struct sock *sk,
 int security_secmark_relabel_packet(u32 secid);
 void security_secmark_refcount_inc(void);
 void security_secmark_refcount_dec(void);
+int security_tun_dev_alloc_security(void **security);
+void security_tun_dev_free_security(void *security);
 int security_tun_dev_create(void);
-void security_tun_dev_post_create(struct sock *sk);
-int security_tun_dev_attach(struct sock *sk);
+int security_tun_dev_create_queue(void *security);
+int security_tun_dev_attach(struct sock *sk, void *security);
+int security_tun_dev_open(void *security);
 
 #else	/* CONFIG_SECURITY_NETWORK */
 static inline int security_unix_stream_connect(struct sock *sock,
@@ -2720,16 +2738,31 @@ static inline void security_secmark_refcount_dec(void)
 {
 }
 
+static inline int security_tun_dev_alloc_security(void **security)
+{
+	return 0;
+}
+
+static inline void security_tun_dev_free_security(void *security)
+{
+}
+
 static inline int security_tun_dev_create(void)
 {
 	return 0;
 }
 
-static inline void security_tun_dev_post_create(struct sock *sk)
+static inline int security_tun_dev_create_queue(void *security)
+{
+	return 0;
+}
+
+static inline int security_tun_dev_attach(struct sock *sk, void *security)
 {
+	return 0;
 }
 
-static inline int security_tun_dev_attach(struct sock *sk)
+static inline int security_tun_dev_open(void *security)
 {
 	return 0;
 }
diff --git a/security/capability.c b/security/capability.c
index b14a30c..bf4cbf2 100644
--- a/security/capability.c
+++ b/security/capability.c
@@ -704,16 +704,31 @@ static void cap_req_classify_flow(const struct request_sock *req,
 {
 }
 
+static int cap_tun_dev_alloc_security(void **security)
+{
+	return 0;
+}
+
+static void cap_tun_dev_free_security(void *security)
+{
+}
+
 static int cap_tun_dev_create(void)
 {
 	return 0;
 }
 
-static void cap_tun_dev_post_create(struct sock *sk)
+static int cap_tun_dev_create_queue(void *security)
+{
+	return 0;
+}
+
+static int cap_tun_dev_attach(struct sock *sk, void *security)
 {
+	return 0;
 }
 
-static int cap_tun_dev_attach(struct sock *sk)
+static int cap_tun_dev_open(void *security)
 {
 	return 0;
 }
@@ -1044,8 +1059,11 @@ void __init security_fixup_ops(struct security_operations *ops)
 	set_to_cap_if_null(ops, secmark_refcount_inc);
 	set_to_cap_if_null(ops, secmark_refcount_dec);
 	set_to_cap_if_null(ops, req_classify_flow);
+	set_to_cap_if_null(ops, tun_dev_alloc_security);
+	set_to_cap_if_null(ops, tun_dev_free_security);
 	set_to_cap_if_null(ops, tun_dev_create);
-	set_to_cap_if_null(ops, tun_dev_post_create);
+	set_to_cap_if_null(ops, tun_dev_create_queue);
+	set_to_cap_if_null(ops, tun_dev_open);
 	set_to_cap_if_null(ops, tun_dev_attach);
 #endif	/* CONFIG_SECURITY_NETWORK */
 #ifdef CONFIG_SECURITY_NETWORK_XFRM
diff --git a/security/security.c b/security/security.c
index 8dcd4ae..4d82654 100644
--- a/security/security.c
+++ b/security/security.c
@@ -1244,24 +1244,42 @@ void security_secmark_refcount_dec(void)
 }
 EXPORT_SYMBOL(security_secmark_refcount_dec);
 
+int security_tun_dev_alloc_security(void **security)
+{
+	return security_ops->tun_dev_alloc_security(security);
+}
+EXPORT_SYMBOL(security_tun_dev_alloc_security);
+
+void security_tun_dev_free_security(void *security)
+{
+	security_ops->tun_dev_free_security(security);
+}
+EXPORT_SYMBOL(security_tun_dev_free_security);
+
 int security_tun_dev_create(void)
 {
 	return security_ops->tun_dev_create();
 }
 EXPORT_SYMBOL(security_tun_dev_create);
 
-void security_tun_dev_post_create(struct sock *sk)
+int security_tun_dev_create_queue(void *security)
 {
-	return security_ops->tun_dev_post_create(sk);
+	return security_ops->tun_dev_create_queue(security);
 }
-EXPORT_SYMBOL(security_tun_dev_post_create);
+EXPORT_SYMBOL(security_tun_dev_create_queue);
 
-int security_tun_dev_attach(struct sock *sk)
+int security_tun_dev_attach(struct sock *sk, void *security)
 {
-	return security_ops->tun_dev_attach(sk);
+	return security_ops->tun_dev_attach(sk, security);
 }
 EXPORT_SYMBOL(security_tun_dev_attach);
 
+int security_tun_dev_open(void *security)
+{
+	return security_ops->tun_dev_open(security);
+}
+EXPORT_SYMBOL(security_tun_dev_open);
+
 #endif	/* CONFIG_SECURITY_NETWORK */
 
 #ifdef CONFIG_SECURITY_NETWORK_XFRM
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 61a5336..f1efb08 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -4399,6 +4399,24 @@ static void selinux_req_classify_flow(const struct request_sock *req,
 	fl->flowi_secid = req->secid;
 }
 
+static int selinux_tun_dev_alloc_security(void **security)
+{
+	struct tun_security_struct *tunsec;
+
+	tunsec = kzalloc(sizeof(*tunsec), GFP_KERNEL);
+	if (!tunsec)
+		return -ENOMEM;
+	tunsec->sid = current_sid();
+
+	*security = tunsec;
+	return 0;
+}
+
+static void selinux_tun_dev_free_security(void *security)
+{
+	kfree(security);
+}
+
 static int selinux_tun_dev_create(void)
 {
 	u32 sid = current_sid();
@@ -4414,8 +4432,17 @@ static int selinux_tun_dev_create(void)
 			    NULL);
 }
 
-static void selinux_tun_dev_post_create(struct sock *sk)
+static int selinux_tun_dev_create_queue(void *security)
 {
+	struct tun_security_struct *tunsec = security;
+
+	return avc_has_perm(current_sid(), tunsec->sid, SECCLASS_TUN_SOCKET,
+			    TUN_SOCKET__CREATE_QUEUE, NULL);
+}
+
+static int selinux_tun_dev_attach(struct sock *sk, void *security)
+{
+	struct tun_security_struct *tunsec = security;
 	struct sk_security_struct *sksec = sk->sk_security;
 
 	/* we don't currently perform any NetLabel based labeling here and it
@@ -4425,20 +4452,19 @@ static void selinux_tun_dev_post_create(struct sock *sk)
 	 * cause confusion to the TUN user that had no idea network labeling
 	 * protocols were being used */
 
-	/* see the comments in selinux_tun_dev_create() about why we don't use
-	 * the sockcreate SID here */
-
-	sksec->sid = current_sid();
+	sksec->sid = tunsec->sid;
 	sksec->sclass = SECCLASS_TUN_SOCKET;
+
+	return 0;
 }
 
-static int selinux_tun_dev_attach(struct sock *sk)
+static int selinux_tun_dev_open(void *security)
 {
-	struct sk_security_struct *sksec = sk->sk_security;
+	struct tun_security_struct *tunsec = security;
 	u32 sid = current_sid();
 	int err;
 
-	err = avc_has_perm(sid, sksec->sid, SECCLASS_TUN_SOCKET,
+	err = avc_has_perm(sid, tunsec->sid, SECCLASS_TUN_SOCKET,
 			   TUN_SOCKET__RELABELFROM, NULL);
 	if (err)
 		return err;
@@ -4446,8 +4472,7 @@ static int selinux_tun_dev_attach(struct sock *sk)
 			   TUN_SOCKET__RELABELTO, NULL);
 	if (err)
 		return err;
-
-	sksec->sid = sid;
+	tunsec->sid = sid;
 
 	return 0;
 }
@@ -5642,9 +5667,12 @@ static struct security_operations selinux_ops = {
 	.secmark_refcount_inc =		selinux_secmark_refcount_inc,
 	.secmark_refcount_dec =		selinux_secmark_refcount_dec,
 	.req_classify_flow =		selinux_req_classify_flow,
+	.tun_dev_alloc_security =	selinux_tun_dev_alloc_security,
+	.tun_dev_free_security =	selinux_tun_dev_free_security,
 	.tun_dev_create =		selinux_tun_dev_create,
-	.tun_dev_post_create = 		selinux_tun_dev_post_create,
+	.tun_dev_create_queue =		selinux_tun_dev_create_queue,
 	.tun_dev_attach =		selinux_tun_dev_attach,
+	.tun_dev_open =			selinux_tun_dev_open,
 
 #ifdef CONFIG_SECURITY_NETWORK_XFRM
 	.xfrm_policy_alloc_security =	selinux_xfrm_policy_alloc,
diff --git a/security/selinux/include/objsec.h b/security/selinux/include/objsec.h
index 26c7eee..aa47bca 100644
--- a/security/selinux/include/objsec.h
+++ b/security/selinux/include/objsec.h
@@ -110,6 +110,10 @@ struct sk_security_struct {
 	u16 sclass;			/* sock security class */
 };
 
+struct tun_security_struct {
+	u32 sid;			/* SID for the tun device sockets */
+};
+
 struct key_security_struct {
 	u32 sid;	/* SID of key */
 };

^ permalink raw reply related

* [RFC PATCH v2 2/3] selinux: add the "create_queue" permission to the "tun_socket" class
From: Paul Moore @ 2012-12-05 20:26 UTC (permalink / raw)
  To: netdev, linux-security-module, selinux; +Cc: jasowang, mst
In-Reply-To: <20121205202144.18626.61966.stgit@localhost>

Add a new permission to align with the new TUN multiqueue support,
"tun_socket:create_queue".

The corresponding SELinux reference policy patch is show below:

 diff --git a/policy/flask/access_vectors b/policy/flask/access_vectors
 index 28802c5..a0664a1 100644
 --- a/policy/flask/access_vectors
 +++ b/policy/flask/access_vectors
 @@ -827,6 +827,9 @@ class kernel_service

  class tun_socket
  inherits socket
 +{
 +       create_queue
 +}

  class x_pointer
  inherits x_device

Signed-off-by: Paul Moore <pmoore@redhat.com>
---
 security/selinux/include/classmap.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index df2de54..7e9a3d1 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -150,6 +150,6 @@ struct security_class_mapping secclass_map[] = {
 	    NULL } },
 	{ "kernel_service", { "use_as_override", "create_files_as", NULL } },
 	{ "tun_socket",
-	  { COMMON_SOCK_PERMS, NULL } },
+	  { COMMON_SOCK_PERMS, "create_queue", NULL } },
 	{ NULL }
   };

^ permalink raw reply related

* [RFC PATCH v2 1/3] tun: correctly report an error in tun_flow_init()
From: Paul Moore @ 2012-12-05 20:26 UTC (permalink / raw)
  To: netdev, linux-security-module, selinux; +Cc: jasowang, mst
In-Reply-To: <20121205202144.18626.61966.stgit@localhost>

On error, the error code from tun_flow_init() is lost inside
tun_set_iff(), this patch fixes this by assigning the tun_flow_init()
error code to the "err" variable which is returned by
the tun_flow_init() function on error.

Signed-off-by: Paul Moore <pmoore@redhat.com>
---
 drivers/net/tun.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index a1b2389..14a0454 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1591,7 +1591,8 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 
 		tun_net_init(dev);
 
-		if (tun_flow_init(tun))
+		err = tun_flow_init(tun);
+		if (err < 0)
 			goto err_free_dev;
 
 		dev->hw_features = NETIF_F_SG | NETIF_F_FRAGLIST |

^ permalink raw reply related

* [RFC PATCH v2 0/3] Fix some multiqueue TUN problems
From: Paul Moore @ 2012-12-05 20:25 UTC (permalink / raw)
  To: netdev, linux-security-module, selinux; +Cc: jasowang, mst

Second draft of the LSM/SELinux fixes to the upcoming multiqueue TUN
functionality.  This draft incorporates all the comments/decisions
from the first draft, notably the new LSM and SELinux hook for the
TUNSETQUEUE operation.  Other LSMs do not provide TUN controls so they
are not affected.

Once we decide this is the right approach I'll push the associated
SELinux policy FLASK definitions upstream; for those who are interested
the SELinux policy diff in included in the description of patch 1/2.

I don't expect this to be the final patch, just a starting point for
further discussion so I didn't really do any testing, simply making
sure that it compiled cleanly.

---

Paul Moore (3):
      tun: correctly report an error in tun_flow_init()
      selinux: add the "create_queue" permission to the "tun_socket" class
      tun: fix LSM/SELinux labeling of tun/tap devices


 drivers/net/tun.c                   |   29 +++++++++++++----
 include/linux/security.h            |   59 +++++++++++++++++++++++++++--------
 security/capability.c               |   24 ++++++++++++--
 security/security.c                 |   28 ++++++++++++++---
 security/selinux/hooks.c            |   50 +++++++++++++++++++++++-------
 security/selinux/include/classmap.h |    2 +
 security/selinux/include/objsec.h   |    4 ++
 7 files changed, 156 insertions(+), 40 deletions(-)

^ permalink raw reply

* Re: [PATCH] 3com: make 3c59x depend on HAS_IOPORT
From: David Miller @ 2012-12-05 20:22 UTC (permalink / raw)
  To: jang; +Cc: netdev
In-Reply-To: <1354736666.14042.7.camel@localhost.localdomain>

From: Jan Glauber <jang@linux.vnet.ibm.com>
Date: Wed, 05 Dec 2012 20:44:26 +0100

> On Wed, 2012-12-05 at 12:58 -0500, David Miller wrote:
>> From: Jan Glauber <jang@linux.vnet.ibm.com>
>> Date: Wed, 05 Dec 2012 15:04:40 +0100
>> 
>> > From: Jan Glauber <jang@linux.vnet.ibm.com>
>> > 
>> > The 3com driver for 3c59x requires ioport_map. Since not all
>> > architectures support IO port mapping make 3c59x dependent on HAS_IOPORT.
>> > 
>> > Signed-off-by: Jan Glauber <jang@linux.vnet.ibm.com>
>> 
>> Which platforms support PCI or EISA yet do not set HAS_IOPORT?
>> 
> 
> s390. We wont get support for port I/O in the PCI hardware/firmware
> layer. (The patches for PCI on s390 are currently in linux-next).

Ok, I'll apply this to net-next, thanks.

^ permalink raw reply

* Re: [PATCH 2/2] netfilter: add xt_bpf xtables match
From: Willem de Bruijn @ 2012-12-05 20:10 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: netfilter-devel, netdev, Eric Dumazet, David Miller, kaber
In-Reply-To: <20121205194854.GB28730@1984>

On Wed, Dec 5, 2012 at 2:48 PM, Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> Hi Willem,
>
> On Wed, Dec 05, 2012 at 02:22:19PM -0500, Willem de Bruijn wrote:
>> A new match that executes sk_run_filter on every packet. BPF filters
>> can access skbuff fields that are out of scope for existing iptables
>> rules, allow more expressive logic, and on platforms with JIT support
>> can even be faster.
>>
>> I have a corresponding iptables patch that takes `tcpdump -ddd`
>> output, as used in the examples below. The two parts communicate
>> using a variable length structure. This is similar to ebt_among,
>> but new for iptables.
>>
>> Verified functionality by inserting an ip source filter on chain
>> INPUT and an ip dest filter on chain OUTPUT and noting that ping
>> failed while a rule was active:
>>
>> iptables -v -A INPUT -m bpf --bytecode '4,32 0 0 12,21 0 1 $SADDR,6 0 0 96,6 0 0 0,' -j DROP
>> iptables -v -A OUTPUT -m bpf --bytecode '4,32 0 0 16,21 0 1 $DADDR,6 0 0 96,6 0 0 0,' -j DROP
>
> I like this BPF idea for iptables.
>
> I made a similar extension time ago, but it was taking a file as
> parameter. That file contained in BPF code. I made a simple bison
> parser that takes BPF code and put it into the bpf array of
> instructions. It would be a bit more intuitive to define a filter and
> we can distribute it with iptables.

That's cleaner, indeed. I actually like how tcpdump operates as a
code generator if you pass -ddd. Unfortunately, it generates code only
for link layer types of its supported devices, such as DLT_EN10MB and
DLT_LINUX_SLL. The network layer interface of basic iptables
(forgetting device dependent mechanisms as used in xt_mac) is DLT_RAW,
but that is rarely supported.

> Let me check on my internal trees, I can put that user-space code
> somewhere in case you're interested.

Absolutely. I'll be happy to revise to get it in. I'm also considering
sending a patch to tcpdump to make it generate code independent of the
installed hardware when specifying -y.

>> Evaluated throughput by running netperf TCP_STREAM over loopback on
>> x86_64. I expected the BPF filter to outperform hardcoded iptables
>> filters when replacing multiple matches with a single bpf match, but
>> even a single comparison to u32 appears to do better. Relative to the
>> benchmark with no filter applied, rate with 100 BPF filters dropped
>> to 81%. With 100 U32 filters it dropped to 55%. The difference sounds
>> excessive to me, but was consistent on my hardware. Commands used:
>>
>> for i in `seq 100`; do iptables -A OUTPUT -m bpf --bytecode '4,48 0 0 9,21 0 1 20,6 0 0 96,6 0 0 0,' -j DROP; done
>> for i in `seq 3`; do netperf -t TCP_STREAM -I 99 -H localhost; done
>>
>> iptables -F OUTPUT
>>
>> for i in `seq 100`; do iptables -A OUTPUT -m u32 --u32 '6&0xFF=0x20' -j DROP; done
>> for i in `seq 3`; do netperf -t TCP_STREAM -I 99 -H localhost; done
>>
>> FYI: perf top
>>
>> [bpf]
>>     33.94%  [kernel]          [k] copy_user_generic_string
>>      8.92%  [kernel]          [k] sk_run_filter
>>      7.77%  [ip_tables]       [k] ipt_do_table
>>
>> [u32]
>>     22.63%  [kernel]          [k] copy_user_generic_string
>>     14.46%  [kernel]          [k] memcpy
>>      9.19%  [ip_tables]       [k] ipt_do_table
>>      8.47%  [xt_u32]          [k] u32_mt
>>      5.32%  [kernel]          [k] skb_copy_bits
>>
>> The big difference appears to be in memory copying. I have not
>> looked into u32, so cannot explain this right now. More interestingly,
>> at higher rate, sk_run_filter appears to use as many cycles as u32_mt
>> (both traces have roughly the same number of events).
>>
>> One caveat: to work independent of device link layer, the filter
>> expects DLT_RAW style BPF programs, i.e., those that expect the
>> packet to start at the IP layer.
>> ---
>>  include/linux/netfilter/xt_bpf.h |   17 +++++++
>>  net/netfilter/Kconfig            |    9 ++++
>>  net/netfilter/Makefile           |    1 +
>>  net/netfilter/x_tables.c         |    5 +-
>>  net/netfilter/xt_bpf.c           |   88 ++++++++++++++++++++++++++++++++++++++
>>  5 files changed, 118 insertions(+), 2 deletions(-)
>>  create mode 100644 include/linux/netfilter/xt_bpf.h
>>  create mode 100644 net/netfilter/xt_bpf.c
>>
>> diff --git a/include/linux/netfilter/xt_bpf.h b/include/linux/netfilter/xt_bpf.h
>> new file mode 100644
>> index 0000000..23502c0
>> --- /dev/null
>> +++ b/include/linux/netfilter/xt_bpf.h
>> @@ -0,0 +1,17 @@
>> +#ifndef _XT_BPF_H
>> +#define _XT_BPF_H
>> +
>> +#include <linux/filter.h>
>> +#include <linux/types.h>
>> +
>> +struct xt_bpf_info {
>> +     __u16 bpf_program_num_elem;
>> +
>> +     /* only used in kernel */
>> +     struct sk_filter *filter __attribute__((aligned(8)));
>> +
>> +     /* variable size, based on program_num_elem */
>> +     struct sock_filter bpf_program[0];
>> +};
>> +
>> +#endif /*_XT_BPF_H */
>> diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
>> index c9739c6..c7cc0b8 100644
>> --- a/net/netfilter/Kconfig
>> +++ b/net/netfilter/Kconfig
>> @@ -798,6 +798,15 @@ config NETFILTER_XT_MATCH_ADDRTYPE
>>         If you want to compile it as a module, say M here and read
>>         <file:Documentation/kbuild/modules.txt>.  If unsure, say `N'.
>>
>> +config NETFILTER_XT_MATCH_BPF
>> +     tristate '"bpf" match support'
>> +     depends on NETFILTER_ADVANCED
>> +     help
>> +       BPF matching applies a linux socket filter to each packet and
>> +          accepts those for which the filter returns non-zero.
>> +
>> +       To compile it as a module, choose M here.  If unsure, say N.
>> +
>>  config NETFILTER_XT_MATCH_CLUSTER
>>       tristate '"cluster" match support'
>>       depends on NF_CONNTRACK
>> diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
>> index 8e5602f..9f12eeb 100644
>> --- a/net/netfilter/Makefile
>> +++ b/net/netfilter/Makefile
>> @@ -98,6 +98,7 @@ obj-$(CONFIG_NETFILTER_XT_TARGET_IDLETIMER) += xt_IDLETIMER.o
>>
>>  # matches
>>  obj-$(CONFIG_NETFILTER_XT_MATCH_ADDRTYPE) += xt_addrtype.o
>> +obj-$(CONFIG_NETFILTER_XT_MATCH_BPF) += xt_bpf.o
>>  obj-$(CONFIG_NETFILTER_XT_MATCH_CLUSTER) += xt_cluster.o
>>  obj-$(CONFIG_NETFILTER_XT_MATCH_COMMENT) += xt_comment.o
>>  obj-$(CONFIG_NETFILTER_XT_MATCH_CONNBYTES) += xt_connbytes.o
>> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
>> index 8d987c3..26306be 100644
>> --- a/net/netfilter/x_tables.c
>> +++ b/net/netfilter/x_tables.c
>> @@ -379,8 +379,9 @@ int xt_check_match(struct xt_mtchk_param *par,
>>       if (XT_ALIGN(par->match->matchsize) != size &&
>>           par->match->matchsize != -1) {
>>               /*
>> -              * ebt_among is exempt from centralized matchsize checking
>> -              * because it uses a dynamic-size data set.
>> +              * matches of variable size length, such as ebt_among,
>> +              * are exempt from centralized matchsize checking. They
>> +              * skip the test by setting xt_match.matchsize to -1.
>>                */
>>               pr_err("%s_tables: %s.%u match: invalid size "
>>                      "%u (kernel) != (user) %u\n",
>> diff --git a/net/netfilter/xt_bpf.c b/net/netfilter/xt_bpf.c
>> new file mode 100644
>> index 0000000..07077c5
>> --- /dev/null
>> +++ b/net/netfilter/xt_bpf.c
>> @@ -0,0 +1,88 @@
>> +/* Xtables module to match packets using a BPF filter.
>> + * Copyright 2012 Google Inc.
>> + * Written by Willem de Bruijn <willemb@google.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License version 2 as
>> + * published by the Free Software Foundation.
>> + */
>> +
>> +#include <linux/module.h>
>> +#include <linux/skbuff.h>
>> +#include <linux/ipv6.h>
>> +#include <linux/filter.h>
>> +#include <net/ip.h>
>> +
>> +#include <linux/netfilter/xt_bpf.h>
>> +#include <linux/netfilter/x_tables.h>
>> +
>> +MODULE_AUTHOR("Willem de Bruijn <willemb@google.com>");
>> +MODULE_DESCRIPTION("Xtables: BPF filter match");
>> +MODULE_LICENSE("GPL");
>> +MODULE_ALIAS("ipt_bpf");
>> +MODULE_ALIAS("ip6t_bpf");
>> +
>> +static int bpf_mt_check(const struct xt_mtchk_param *par)
>> +{
>> +     struct xt_bpf_info *info = par->matchinfo;
>> +     const struct xt_entry_match *match;
>> +     struct sock_fprog program;
>> +     int expected_len;
>> +
>> +     match = container_of(par->matchinfo, const struct xt_entry_match, data);
>> +     expected_len = sizeof(struct xt_entry_match) +
>> +                    sizeof(struct xt_bpf_info) +
>> +                    (sizeof(struct sock_filter) *
>> +                     info->bpf_program_num_elem);
>> +
>> +     if (match->u.match_size != expected_len) {
>> +             pr_info("bpf: check failed: incorrect length\n");
>> +             return -EINVAL;
>> +     }
>> +
>> +     program.len = info->bpf_program_num_elem;
>> +     program.filter = info->bpf_program;
>> +     if (sk_unattached_filter_create(&info->filter, &program)) {
>> +             pr_info("bpf: check failed: parse error\n");
>> +             return -EINVAL;
>> +     }
>> +
>> +     return 0;
>> +}
>> +
>> +static bool bpf_mt(const struct sk_buff *skb, struct xt_action_param *par)
>> +{
>> +     const struct xt_bpf_info *info = par->matchinfo;
>> +
>> +     return SK_RUN_FILTER(info->filter, skb);
>> +}
>> +
>> +static void bpf_mt_destroy(const struct xt_mtdtor_param *par)
>> +{
>> +     const struct xt_bpf_info *info = par->matchinfo;
>> +     sk_unattached_filter_destroy(info->filter);
>> +}
>> +
>> +static struct xt_match bpf_mt_reg __read_mostly = {
>> +     .name           = "bpf",
>> +     .revision       = 0,
>> +     .family         = NFPROTO_UNSPEC,
>> +     .checkentry     = bpf_mt_check,
>> +     .match          = bpf_mt,
>> +     .destroy        = bpf_mt_destroy,
>> +     .matchsize      = -1, /* skip xt_check_match because of dynamic len */
>> +     .me             = THIS_MODULE,
>> +};
>> +
>> +static int __init bpf_mt_init(void)
>> +{
>> +     return xt_register_match(&bpf_mt_reg);
>> +}
>> +
>> +static void __exit bpf_mt_exit(void)
>> +{
>> +     xt_unregister_match(&bpf_mt_reg);
>> +}
>> +
>> +module_init(bpf_mt_init);
>> +module_exit(bpf_mt_exit);
>> --
>> 1.7.7.3
>>

^ permalink raw reply

* Re: [PATCH rfc] netfilter: two xtables matches
From: Jan Engelhardt @ 2012-12-05 20:00 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: netfilter-devel, netdev, Eric Dumazet, David Miller, kaber, pablo
In-Reply-To: <CA+FuTSeYvcdtNf7N=POD+RzsE01+WVsub6F_nqbn06RZyoWx8w@mail.gmail.com>

On Wednesday 2012-12-05 20:28, Willem de Bruijn wrote:

>Somehow, the first part of this email went missing. Not critical,
>but for completeness:
>
>These two patches each add an xtables match.
>
>The xt_priority match is a straighforward addition in the style of
>xt_mark, adding the option to filter on one more sk_buff field. I
>have an immediate application for this. The amount of code (in
>kernel + userspace) to add a single check proved quite large.

Hm so yeah, can't we just place this in xt_mark.c?

^ permalink raw reply

* [PATCH 2/2 net-next] cnic: Fix rare race condition during iSCSI disconnect.
From: Michael Chan @ 2012-12-05 20:10 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1354738215-6644-1-git-send-email-mchan@broadcom.com>

From: Eddie Wai <eddie.wai@broadcom.com>

If the initiator and target try to close the connection at about the same
time, there is a race condition in the termination sequence for bnx2x.
Fix the problem by waiting for the remote termination to complete before
deleting the Connection ID.  This will prevent the firmware assert.

Update version to 2.5.15.

Signed-off-by: Eddie Wai <eddie.wai@broadcom.com>
Signed-off-by: Michael Chan <mchan@broadcom.com>
---
 drivers/net/ethernet/broadcom/cnic.c    |   13 +++++++++++--
 drivers/net/ethernet/broadcom/cnic_if.h |    4 ++--
 2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/cnic.c b/drivers/net/ethernet/broadcom/cnic.c
index 2c1f66d..1c2a851 100644
--- a/drivers/net/ethernet/broadcom/cnic.c
+++ b/drivers/net/ethernet/broadcom/cnic.c
@@ -3853,12 +3853,17 @@ static int cnic_cm_abort(struct cnic_sock *csk)
 		return cnic_cm_abort_req(csk);
 
 	/* Getting here means that we haven't started connect, or
-	 * connect was not successful.
+	 * connect was not successful, or it has been reset by the target.
 	 */
 
 	cp->close_conn(csk, opcode);
-	if (csk->state != opcode)
+	if (csk->state != opcode) {
+		/* Wait for remote reset sequence to complete */
+		while (test_bit(SK_F_PG_OFFLD_COMPLETE, &csk->flags))
+			msleep(1);
+
 		return -EALREADY;
+	}
 
 	return 0;
 }
@@ -3872,6 +3877,10 @@ static int cnic_cm_close(struct cnic_sock *csk)
 		csk->state = L4_KCQE_OPCODE_VALUE_CLOSE_COMP;
 		return cnic_cm_close_req(csk);
 	} else {
+		/* Wait for remote reset sequence to complete */
+		while (test_bit(SK_F_PG_OFFLD_COMPLETE, &csk->flags))
+			msleep(1);
+
 		return -EALREADY;
 	}
 	return 0;
diff --git a/drivers/net/ethernet/broadcom/cnic_if.h b/drivers/net/ethernet/broadcom/cnic_if.h
index 865095a..502e11e 100644
--- a/drivers/net/ethernet/broadcom/cnic_if.h
+++ b/drivers/net/ethernet/broadcom/cnic_if.h
@@ -14,8 +14,8 @@
 
 #include "bnx2x/bnx2x_mfw_req.h"
 
-#define CNIC_MODULE_VERSION	"2.5.14"
-#define CNIC_MODULE_RELDATE	"Sep 30, 2012"
+#define CNIC_MODULE_VERSION	"2.5.15"
+#define CNIC_MODULE_RELDATE	"Dec 04, 2012"
 
 #define CNIC_ULP_RDMA		0
 #define CNIC_ULP_ISCSI		1
-- 
1.7.1

^ permalink raw reply related

* [PATCH 1/2 net-next] cnic: Reset iSCSI EQ during shutdown.
From: Michael Chan @ 2012-12-05 20:10 UTC (permalink / raw)
  To: davem; +Cc: netdev

Without the reset, reloading the cnic driver can cause the iSCSI
Event Queue to be out of sync with the driver and cause intermittent
crash.

Signed-off-by: Michael Chan <mchan@broadcom.com>
Acked-by: Ariel Elior <ariele@broadcom.com>
---
 .../net/ethernet/broadcom/bnx2x/bnx2x_fw_defs.h    |    5 +++++
 drivers/net/ethernet/broadcom/cnic.c               |   19 +++++++++++++++++++
 2 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_fw_defs.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_fw_defs.h
index 620fe93..60a83ad 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_fw_defs.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_fw_defs.h
@@ -23,6 +23,11 @@
 	(IRO[159].base + ((funcId) * IRO[159].m1))
 #define CSTORM_FUNC_EN_OFFSET(funcId) \
 	(IRO[149].base + ((funcId) * IRO[149].m1))
+#define CSTORM_HC_SYNC_LINE_INDEX_E1X_OFFSET(hcIndex, sbId) \
+	(IRO[139].base + ((hcIndex) * IRO[139].m1) + ((sbId) * IRO[139].m2))
+#define CSTORM_HC_SYNC_LINE_INDEX_E2_OFFSET(hcIndex, sbId) \
+	(IRO[138].base + (((hcIndex)>>2) * IRO[138].m1) + (((hcIndex)&3) \
+	* IRO[138].m2) + ((sbId) * IRO[138].m3))
 #define CSTORM_IGU_MODE_OFFSET (IRO[157].base)
 #define CSTORM_ISCSI_CQ_SIZE_OFFSET(pfId) \
 	(IRO[316].base + ((pfId) * IRO[316].m1))
diff --git a/drivers/net/ethernet/broadcom/cnic.c b/drivers/net/ethernet/broadcom/cnic.c
index cc8434f..2c1f66d 100644
--- a/drivers/net/ethernet/broadcom/cnic.c
+++ b/drivers/net/ethernet/broadcom/cnic.c
@@ -5344,8 +5344,27 @@ static void cnic_stop_bnx2_hw(struct cnic_dev *dev)
 static void cnic_stop_bnx2x_hw(struct cnic_dev *dev)
 {
 	struct cnic_local *cp = dev->cnic_priv;
+	u32 hc_index = HC_INDEX_ISCSI_EQ_CONS;
+	u32 sb_id = cp->status_blk_num;
+	u32 idx_off, syn_off;
 
 	cnic_free_irq(dev);
+
+	if (BNX2X_CHIP_IS_E2_PLUS(cp->chip_id)) {
+		idx_off = offsetof(struct hc_status_block_e2, index_values) +
+			  (hc_index * sizeof(u16));
+
+		syn_off = CSTORM_HC_SYNC_LINE_INDEX_E2_OFFSET(hc_index, sb_id);
+	} else {
+		idx_off = offsetof(struct hc_status_block_e1x, index_values) +
+			  (hc_index * sizeof(u16));
+
+		syn_off = CSTORM_HC_SYNC_LINE_INDEX_E1X_OFFSET(hc_index, sb_id);
+	}
+	CNIC_WR16(dev, BAR_CSTRORM_INTMEM + syn_off, 0);
+	CNIC_WR16(dev, BAR_CSTRORM_INTMEM + CSTORM_STATUS_BLOCK_OFFSET(sb_id) +
+		  idx_off, 0);
+
 	*cp->kcq1.hw_prod_idx_ptr = 0;
 	CNIC_WR(dev, BAR_CSTRORM_INTMEM +
 		CSTORM_ISCSI_EQ_CONS_OFFSET(cp->pfid, 0), 0);
-- 
1.7.1

^ permalink raw reply related

* Re: [Patch 1/1] net/phy: Add interrupt support for dp83640 phy.
From: Stephan Gatzka @ 2012-12-05 19:52 UTC (permalink / raw)
  To: Richard Cochran; +Cc: netdev, davem
In-Reply-To: <20121205100544.GA2293@netboy.at.omicron.at>


> The patch looks okay to me, but I worry that this might fail on boards
> which have not connected the phyer's PWERDOWN/INTN pin to anything.
> Such designs really need the PHY_POLL working.
> Taking a brief glance at the drivers for two such boards I know of
> (m5234bcc and an IXP), it looks like their MAC drivers set mii_bus irq
> to PHY_POLL, so it might work fine, but this patch still makes me
> nervous that some other board might break.
>
> Maybe this should be a kconfig option?

I don't think so.

Systems using device tree just don't specify the interrupt tag in the 
mdio section.

I have to admit that I don't know how how systems without employing 
device tree get the phy interrupt configured, maybe someone can explain 
that shortly?

Nevertheless, other drivers for very common phys like the lxt971 also 
just set the function pointers to config_intr and ack_interrupt and also 
set the flag PHY_HAS_INTERRUPT. So I don't think that the my patch 
breaks something.

Regards,

Stephan

^ permalink raw reply

* Re: [PATCH 2/2] netfilter: add xt_bpf xtables match
From: Pablo Neira Ayuso @ 2012-12-05 19:48 UTC (permalink / raw)
  To: Willem de Bruijn; +Cc: netfilter-devel, netdev, edumazet, davem, kaber
In-Reply-To: <1354735339-13402-3-git-send-email-willemb@google.com>

Hi Willem,

On Wed, Dec 05, 2012 at 02:22:19PM -0500, Willem de Bruijn wrote:
> A new match that executes sk_run_filter on every packet. BPF filters
> can access skbuff fields that are out of scope for existing iptables
> rules, allow more expressive logic, and on platforms with JIT support
> can even be faster.
> 
> I have a corresponding iptables patch that takes `tcpdump -ddd`
> output, as used in the examples below. The two parts communicate
> using a variable length structure. This is similar to ebt_among,
> but new for iptables.
> 
> Verified functionality by inserting an ip source filter on chain
> INPUT and an ip dest filter on chain OUTPUT and noting that ping
> failed while a rule was active:
> 
> iptables -v -A INPUT -m bpf --bytecode '4,32 0 0 12,21 0 1 $SADDR,6 0 0 96,6 0 0 0,' -j DROP
> iptables -v -A OUTPUT -m bpf --bytecode '4,32 0 0 16,21 0 1 $DADDR,6 0 0 96,6 0 0 0,' -j DROP

I like this BPF idea for iptables.

I made a similar extension time ago, but it was taking a file as
parameter. That file contained in BPF code. I made a simple bison
parser that takes BPF code and put it into the bpf array of
instructions. It would be a bit more intuitive to define a filter and
we can distribute it with iptables.

Let me check on my internal trees, I can put that user-space code
somewhere in case you're interested.

> Evaluated throughput by running netperf TCP_STREAM over loopback on
> x86_64. I expected the BPF filter to outperform hardcoded iptables
> filters when replacing multiple matches with a single bpf match, but
> even a single comparison to u32 appears to do better. Relative to the
> benchmark with no filter applied, rate with 100 BPF filters dropped
> to 81%. With 100 U32 filters it dropped to 55%. The difference sounds
> excessive to me, but was consistent on my hardware. Commands used:
> 
> for i in `seq 100`; do iptables -A OUTPUT -m bpf --bytecode '4,48 0 0 9,21 0 1 20,6 0 0 96,6 0 0 0,' -j DROP; done
> for i in `seq 3`; do netperf -t TCP_STREAM -I 99 -H localhost; done
> 
> iptables -F OUTPUT
> 
> for i in `seq 100`; do iptables -A OUTPUT -m u32 --u32 '6&0xFF=0x20' -j DROP; done
> for i in `seq 3`; do netperf -t TCP_STREAM -I 99 -H localhost; done
> 
> FYI: perf top
> 
> [bpf]
>     33.94%  [kernel]          [k] copy_user_generic_string
>      8.92%  [kernel]          [k] sk_run_filter
>      7.77%  [ip_tables]       [k] ipt_do_table
> 
> [u32]
>     22.63%  [kernel]          [k] copy_user_generic_string
>     14.46%  [kernel]          [k] memcpy
>      9.19%  [ip_tables]       [k] ipt_do_table
>      8.47%  [xt_u32]          [k] u32_mt
>      5.32%  [kernel]          [k] skb_copy_bits
> 
> The big difference appears to be in memory copying. I have not
> looked into u32, so cannot explain this right now. More interestingly,
> at higher rate, sk_run_filter appears to use as many cycles as u32_mt
> (both traces have roughly the same number of events).
> 
> One caveat: to work independent of device link layer, the filter
> expects DLT_RAW style BPF programs, i.e., those that expect the
> packet to start at the IP layer.
> ---
>  include/linux/netfilter/xt_bpf.h |   17 +++++++
>  net/netfilter/Kconfig            |    9 ++++
>  net/netfilter/Makefile           |    1 +
>  net/netfilter/x_tables.c         |    5 +-
>  net/netfilter/xt_bpf.c           |   88 ++++++++++++++++++++++++++++++++++++++
>  5 files changed, 118 insertions(+), 2 deletions(-)
>  create mode 100644 include/linux/netfilter/xt_bpf.h
>  create mode 100644 net/netfilter/xt_bpf.c
> 
> diff --git a/include/linux/netfilter/xt_bpf.h b/include/linux/netfilter/xt_bpf.h
> new file mode 100644
> index 0000000..23502c0
> --- /dev/null
> +++ b/include/linux/netfilter/xt_bpf.h
> @@ -0,0 +1,17 @@
> +#ifndef _XT_BPF_H
> +#define _XT_BPF_H
> +
> +#include <linux/filter.h>
> +#include <linux/types.h>
> +
> +struct xt_bpf_info {
> +	__u16 bpf_program_num_elem;
> +
> +	/* only used in kernel */
> +	struct sk_filter *filter __attribute__((aligned(8)));
> +
> +	/* variable size, based on program_num_elem */
> +	struct sock_filter bpf_program[0];
> +};
> +
> +#endif /*_XT_BPF_H */
> diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
> index c9739c6..c7cc0b8 100644
> --- a/net/netfilter/Kconfig
> +++ b/net/netfilter/Kconfig
> @@ -798,6 +798,15 @@ config NETFILTER_XT_MATCH_ADDRTYPE
>  	  If you want to compile it as a module, say M here and read
>  	  <file:Documentation/kbuild/modules.txt>.  If unsure, say `N'.
>  
> +config NETFILTER_XT_MATCH_BPF
> +	tristate '"bpf" match support'
> +	depends on NETFILTER_ADVANCED
> +	help
> +	  BPF matching applies a linux socket filter to each packet and
> +          accepts those for which the filter returns non-zero.
> +
> +	  To compile it as a module, choose M here.  If unsure, say N.
> +
>  config NETFILTER_XT_MATCH_CLUSTER
>  	tristate '"cluster" match support'
>  	depends on NF_CONNTRACK
> diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
> index 8e5602f..9f12eeb 100644
> --- a/net/netfilter/Makefile
> +++ b/net/netfilter/Makefile
> @@ -98,6 +98,7 @@ obj-$(CONFIG_NETFILTER_XT_TARGET_IDLETIMER) += xt_IDLETIMER.o
>  
>  # matches
>  obj-$(CONFIG_NETFILTER_XT_MATCH_ADDRTYPE) += xt_addrtype.o
> +obj-$(CONFIG_NETFILTER_XT_MATCH_BPF) += xt_bpf.o
>  obj-$(CONFIG_NETFILTER_XT_MATCH_CLUSTER) += xt_cluster.o
>  obj-$(CONFIG_NETFILTER_XT_MATCH_COMMENT) += xt_comment.o
>  obj-$(CONFIG_NETFILTER_XT_MATCH_CONNBYTES) += xt_connbytes.o
> diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> index 8d987c3..26306be 100644
> --- a/net/netfilter/x_tables.c
> +++ b/net/netfilter/x_tables.c
> @@ -379,8 +379,9 @@ int xt_check_match(struct xt_mtchk_param *par,
>  	if (XT_ALIGN(par->match->matchsize) != size &&
>  	    par->match->matchsize != -1) {
>  		/*
> -		 * ebt_among is exempt from centralized matchsize checking
> -		 * because it uses a dynamic-size data set.
> +		 * matches of variable size length, such as ebt_among,
> +		 * are exempt from centralized matchsize checking. They
> +		 * skip the test by setting xt_match.matchsize to -1.
>  		 */
>  		pr_err("%s_tables: %s.%u match: invalid size "
>  		       "%u (kernel) != (user) %u\n",
> diff --git a/net/netfilter/xt_bpf.c b/net/netfilter/xt_bpf.c
> new file mode 100644
> index 0000000..07077c5
> --- /dev/null
> +++ b/net/netfilter/xt_bpf.c
> @@ -0,0 +1,88 @@
> +/* Xtables module to match packets using a BPF filter.
> + * Copyright 2012 Google Inc.
> + * Written by Willem de Bruijn <willemb@google.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/skbuff.h>
> +#include <linux/ipv6.h>
> +#include <linux/filter.h>
> +#include <net/ip.h>
> +
> +#include <linux/netfilter/xt_bpf.h>
> +#include <linux/netfilter/x_tables.h>
> +
> +MODULE_AUTHOR("Willem de Bruijn <willemb@google.com>");
> +MODULE_DESCRIPTION("Xtables: BPF filter match");
> +MODULE_LICENSE("GPL");
> +MODULE_ALIAS("ipt_bpf");
> +MODULE_ALIAS("ip6t_bpf");
> +
> +static int bpf_mt_check(const struct xt_mtchk_param *par)
> +{
> +	struct xt_bpf_info *info = par->matchinfo;
> +	const struct xt_entry_match *match;
> +	struct sock_fprog program;
> +	int expected_len;
> +
> +	match = container_of(par->matchinfo, const struct xt_entry_match, data);
> +	expected_len = sizeof(struct xt_entry_match) +
> +		       sizeof(struct xt_bpf_info) +
> +		       (sizeof(struct sock_filter) *
> +			info->bpf_program_num_elem);
> +
> +	if (match->u.match_size != expected_len) {
> +		pr_info("bpf: check failed: incorrect length\n");
> +		return -EINVAL;
> +	}
> +
> +	program.len = info->bpf_program_num_elem;
> +	program.filter = info->bpf_program;
> +	if (sk_unattached_filter_create(&info->filter, &program)) {
> +		pr_info("bpf: check failed: parse error\n");
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static bool bpf_mt(const struct sk_buff *skb, struct xt_action_param *par)
> +{
> +	const struct xt_bpf_info *info = par->matchinfo;
> +
> +	return SK_RUN_FILTER(info->filter, skb);
> +}
> +
> +static void bpf_mt_destroy(const struct xt_mtdtor_param *par)
> +{
> +	const struct xt_bpf_info *info = par->matchinfo;
> +	sk_unattached_filter_destroy(info->filter);
> +}
> +
> +static struct xt_match bpf_mt_reg __read_mostly = {
> +	.name		= "bpf",
> +	.revision	= 0,
> +	.family		= NFPROTO_UNSPEC,
> +	.checkentry	= bpf_mt_check,
> +	.match		= bpf_mt,
> +	.destroy	= bpf_mt_destroy,
> +	.matchsize	= -1, /* skip xt_check_match because of dynamic len */
> +	.me		= THIS_MODULE,
> +};
> +
> +static int __init bpf_mt_init(void)
> +{
> +	return xt_register_match(&bpf_mt_reg);
> +}
> +
> +static void __exit bpf_mt_exit(void)
> +{
> +	xt_unregister_match(&bpf_mt_reg);
> +}
> +
> +module_init(bpf_mt_init);
> +module_exit(bpf_mt_exit);
> -- 
> 1.7.7.3
> 

^ permalink raw reply

* Re: [PATCH] 3com: make 3c59x depend on HAS_IOPORT
From: Jan Glauber @ 2012-12-05 19:44 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20121205.125804.979819753893520607.davem@davemloft.net>

On Wed, 2012-12-05 at 12:58 -0500, David Miller wrote:
> From: Jan Glauber <jang@linux.vnet.ibm.com>
> Date: Wed, 05 Dec 2012 15:04:40 +0100
> 
> > From: Jan Glauber <jang@linux.vnet.ibm.com>
> > 
> > The 3com driver for 3c59x requires ioport_map. Since not all
> > architectures support IO port mapping make 3c59x dependent on HAS_IOPORT.
> > 
> > Signed-off-by: Jan Glauber <jang@linux.vnet.ibm.com>
> 
> Which platforms support PCI or EISA yet do not set HAS_IOPORT?
> 

s390. We wont get support for port I/O in the PCI hardware/firmware
layer. (The patches for PCI on s390 are currently in linux-next).

--Jan

^ permalink raw reply

* Re: [RFT PATCH] 8139cp: properly support change of MTU values [v2]
From: John Greene @ 2012-12-05 19:41 UTC (permalink / raw)
  To: Francois Romieu; +Cc: David Miller, netdev, dwmw2
In-Reply-To: <20121203204608.GA9815@electric-eye.fr.zoreil.com>

On 12/03/2012 03:46 PM, Francois Romieu wrote:
> David Miller <davem@davemloft.net> :
> [...]
>> I've applied this to net-next, if it triggers any problems we have
>> some time to work it out before 3.8 is released.
>
> I have bounced the messages to David Woodhouse since he authored the
> last 8139cp changes in net-next and owns the hardware to notice
> regressions.
>
> My message of two days ago was wrong : it is not possible for the irq
> handler to process a Tx event after the rings have been freed. Things
> still look racy wrt netpoll though.
>
> Any objection against the patch below ?
>
> (I did not gotoize the dev == NULL test: it is really unlikely and
> should go away).
>
> diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c
> index 0da3f5e..57cd542 100644
> --- a/drivers/net/ethernet/realtek/8139cp.c
> +++ b/drivers/net/ethernet/realtek/8139cp.c
> @@ -577,28 +577,30 @@ static irqreturn_t cp_interrupt (int irq, void *dev_instance)
>   {
>   	struct net_device *dev = dev_instance;
>   	struct cp_private *cp;
> +	int handled = 0;
>   	u16 status;
>
>   	if (unlikely(dev == NULL))
>   		return IRQ_NONE;
>   	cp = netdev_priv(dev);
>
> +	spin_lock(&cp->lock);
> +
>   	status = cpr16(IntrStatus);
>   	if (!status || (status == 0xFFFF))
> -		return IRQ_NONE;
> +		goto out_unlock;
> +
> +	handled = 1;
>
>   	netif_dbg(cp, intr, dev, "intr, status %04x cmd %02x cpcmd %04x\n",
>   		  status, cpr8(Cmd), cpr16(CpCmd));
>
>   	cpw16(IntrStatus, status & ~cp_rx_intr_mask);
>
> -	spin_lock(&cp->lock);
> -
>   	/* close possible race's with dev_close */
>   	if (unlikely(!netif_running(dev))) {
>   		cpw16(IntrMask, 0);
> -		spin_unlock(&cp->lock);
> -		return IRQ_HANDLED;
> +		goto out_unlock;
>   	}
>
>   	if (status & (RxOK | RxErr | RxEmpty | RxFIFOOvr))
> @@ -612,8 +614,6 @@ static irqreturn_t cp_interrupt (int irq, void *dev_instance)
>   	if (status & LinkChg)
>   		mii_check_media(&cp->mii_if, netif_msg_link(cp), false);
>
> -	spin_unlock(&cp->lock);
> -
>   	if (status & PciErr) {
>   		u16 pci_status;
>
> @@ -625,7 +625,10 @@ static irqreturn_t cp_interrupt (int irq, void *dev_instance)
>   		/* TODO: reset hardware */
>   	}
>
> -	return IRQ_HANDLED;
> +out_unlock:
> +	spin_unlock(&cp->lock);
> +
> +	return IRQ_RETVAL(handled);
>   }
>
>   #ifdef CONFIG_NET_POLL_CONTROLLER
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
I think this is a good change, interesting it isn't in already or 
causing more issues on multi-processor boxes already. (perhaps it is?).

So do you think these patches need to go together? I could make a case 
either way.

Is this upstream yet?

-- 
John Greene
jogreene@redhat.com

^ permalink raw reply

* Re: [PATCH rfc] netfilter: two xtables matches
From: Willem de Bruijn @ 2012-12-05 19:28 UTC (permalink / raw)
  To: netfilter-devel, netdev, Eric Dumazet, David Miller, kaber, pablo
In-Reply-To: <1354735339-13402-1-git-send-email-willemb@google.com>

Somehow, the first part of this email went missing. Not critical,
but for completeness:

These two patches each add an xtables match.

The xt_priority match is a straighforward addition in the style of
xt_mark, adding the option to filter on one more sk_buff field. I
have an immediate application for this. The amount of code (in
kernel + userspace) to add a single check proved quite large.

On Wed, Dec 5, 2012 at 2:22 PM, Willem de Bruijn <willemb@google.com> wrote:
> The second patch is more speculative and aims to be a more general
> workaround, as well as a performance optimization: support
> (preferably JIT compiled) BPF programs as iptables match rules.
>
> Potentially, the skb->priority match can be implemented by applying
> only the second patch and adding a new BPF_S_ANC ancillary field to
> Linux Socket Filters.
>
> I also wrote corresponding userspace patches to iptables. The process
> for submitting both kernel and user patches is not 100% clear to me.
> Sending the kernel bits to both netdev and netfilter-devel for
> initial feedback. Please correct me if you want it another way.
>
> The patches apply to net-next.
>

^ permalink raw reply

* [PATCH 2/2] netfilter: add xt_bpf xtables match
From: Willem de Bruijn @ 2012-12-05 19:22 UTC (permalink / raw)
  To: netfilter-devel, netdev, edumazet, davem, kaber, pablo; +Cc: Willem de Bruijn
In-Reply-To: <1354735339-13402-1-git-send-email-willemb@google.com>

A new match that executes sk_run_filter on every packet. BPF filters
can access skbuff fields that are out of scope for existing iptables
rules, allow more expressive logic, and on platforms with JIT support
can even be faster.

I have a corresponding iptables patch that takes `tcpdump -ddd`
output, as used in the examples below. The two parts communicate
using a variable length structure. This is similar to ebt_among,
but new for iptables.

Verified functionality by inserting an ip source filter on chain
INPUT and an ip dest filter on chain OUTPUT and noting that ping
failed while a rule was active:

iptables -v -A INPUT -m bpf --bytecode '4,32 0 0 12,21 0 1 $SADDR,6 0 0 96,6 0 0 0,' -j DROP
iptables -v -A OUTPUT -m bpf --bytecode '4,32 0 0 16,21 0 1 $DADDR,6 0 0 96,6 0 0 0,' -j DROP

Evaluated throughput by running netperf TCP_STREAM over loopback on
x86_64. I expected the BPF filter to outperform hardcoded iptables
filters when replacing multiple matches with a single bpf match, but
even a single comparison to u32 appears to do better. Relative to the
benchmark with no filter applied, rate with 100 BPF filters dropped
to 81%. With 100 U32 filters it dropped to 55%. The difference sounds
excessive to me, but was consistent on my hardware. Commands used:

for i in `seq 100`; do iptables -A OUTPUT -m bpf --bytecode '4,48 0 0 9,21 0 1 20,6 0 0 96,6 0 0 0,' -j DROP; done
for i in `seq 3`; do netperf -t TCP_STREAM -I 99 -H localhost; done

iptables -F OUTPUT

for i in `seq 100`; do iptables -A OUTPUT -m u32 --u32 '6&0xFF=0x20' -j DROP; done
for i in `seq 3`; do netperf -t TCP_STREAM -I 99 -H localhost; done

FYI: perf top

[bpf]
    33.94%  [kernel]          [k] copy_user_generic_string
     8.92%  [kernel]          [k] sk_run_filter
     7.77%  [ip_tables]       [k] ipt_do_table

[u32]
    22.63%  [kernel]          [k] copy_user_generic_string
    14.46%  [kernel]          [k] memcpy
     9.19%  [ip_tables]       [k] ipt_do_table
     8.47%  [xt_u32]          [k] u32_mt
     5.32%  [kernel]          [k] skb_copy_bits

The big difference appears to be in memory copying. I have not
looked into u32, so cannot explain this right now. More interestingly,
at higher rate, sk_run_filter appears to use as many cycles as u32_mt
(both traces have roughly the same number of events).

One caveat: to work independent of device link layer, the filter
expects DLT_RAW style BPF programs, i.e., those that expect the
packet to start at the IP layer.
---
 include/linux/netfilter/xt_bpf.h |   17 +++++++
 net/netfilter/Kconfig            |    9 ++++
 net/netfilter/Makefile           |    1 +
 net/netfilter/x_tables.c         |    5 +-
 net/netfilter/xt_bpf.c           |   88 ++++++++++++++++++++++++++++++++++++++
 5 files changed, 118 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/netfilter/xt_bpf.h
 create mode 100644 net/netfilter/xt_bpf.c

diff --git a/include/linux/netfilter/xt_bpf.h b/include/linux/netfilter/xt_bpf.h
new file mode 100644
index 0000000..23502c0
--- /dev/null
+++ b/include/linux/netfilter/xt_bpf.h
@@ -0,0 +1,17 @@
+#ifndef _XT_BPF_H
+#define _XT_BPF_H
+
+#include <linux/filter.h>
+#include <linux/types.h>
+
+struct xt_bpf_info {
+	__u16 bpf_program_num_elem;
+
+	/* only used in kernel */
+	struct sk_filter *filter __attribute__((aligned(8)));
+
+	/* variable size, based on program_num_elem */
+	struct sock_filter bpf_program[0];
+};
+
+#endif /*_XT_BPF_H */
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index c9739c6..c7cc0b8 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -798,6 +798,15 @@ config NETFILTER_XT_MATCH_ADDRTYPE
 	  If you want to compile it as a module, say M here and read
 	  <file:Documentation/kbuild/modules.txt>.  If unsure, say `N'.
 
+config NETFILTER_XT_MATCH_BPF
+	tristate '"bpf" match support'
+	depends on NETFILTER_ADVANCED
+	help
+	  BPF matching applies a linux socket filter to each packet and
+          accepts those for which the filter returns non-zero.
+
+	  To compile it as a module, choose M here.  If unsure, say N.
+
 config NETFILTER_XT_MATCH_CLUSTER
 	tristate '"cluster" match support'
 	depends on NF_CONNTRACK
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 8e5602f..9f12eeb 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -98,6 +98,7 @@ obj-$(CONFIG_NETFILTER_XT_TARGET_IDLETIMER) += xt_IDLETIMER.o
 
 # matches
 obj-$(CONFIG_NETFILTER_XT_MATCH_ADDRTYPE) += xt_addrtype.o
+obj-$(CONFIG_NETFILTER_XT_MATCH_BPF) += xt_bpf.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_CLUSTER) += xt_cluster.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_COMMENT) += xt_comment.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_CONNBYTES) += xt_connbytes.o
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index 8d987c3..26306be 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -379,8 +379,9 @@ int xt_check_match(struct xt_mtchk_param *par,
 	if (XT_ALIGN(par->match->matchsize) != size &&
 	    par->match->matchsize != -1) {
 		/*
-		 * ebt_among is exempt from centralized matchsize checking
-		 * because it uses a dynamic-size data set.
+		 * matches of variable size length, such as ebt_among,
+		 * are exempt from centralized matchsize checking. They
+		 * skip the test by setting xt_match.matchsize to -1.
 		 */
 		pr_err("%s_tables: %s.%u match: invalid size "
 		       "%u (kernel) != (user) %u\n",
diff --git a/net/netfilter/xt_bpf.c b/net/netfilter/xt_bpf.c
new file mode 100644
index 0000000..07077c5
--- /dev/null
+++ b/net/netfilter/xt_bpf.c
@@ -0,0 +1,88 @@
+/* Xtables module to match packets using a BPF filter.
+ * Copyright 2012 Google Inc.
+ * Written by Willem de Bruijn <willemb@google.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/module.h>
+#include <linux/skbuff.h>
+#include <linux/ipv6.h>
+#include <linux/filter.h>
+#include <net/ip.h>
+
+#include <linux/netfilter/xt_bpf.h>
+#include <linux/netfilter/x_tables.h>
+
+MODULE_AUTHOR("Willem de Bruijn <willemb@google.com>");
+MODULE_DESCRIPTION("Xtables: BPF filter match");
+MODULE_LICENSE("GPL");
+MODULE_ALIAS("ipt_bpf");
+MODULE_ALIAS("ip6t_bpf");
+
+static int bpf_mt_check(const struct xt_mtchk_param *par)
+{
+	struct xt_bpf_info *info = par->matchinfo;
+	const struct xt_entry_match *match;
+	struct sock_fprog program;
+	int expected_len;
+
+	match = container_of(par->matchinfo, const struct xt_entry_match, data);
+	expected_len = sizeof(struct xt_entry_match) +
+		       sizeof(struct xt_bpf_info) +
+		       (sizeof(struct sock_filter) *
+			info->bpf_program_num_elem);
+
+	if (match->u.match_size != expected_len) {
+		pr_info("bpf: check failed: incorrect length\n");
+		return -EINVAL;
+	}
+
+	program.len = info->bpf_program_num_elem;
+	program.filter = info->bpf_program;
+	if (sk_unattached_filter_create(&info->filter, &program)) {
+		pr_info("bpf: check failed: parse error\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static bool bpf_mt(const struct sk_buff *skb, struct xt_action_param *par)
+{
+	const struct xt_bpf_info *info = par->matchinfo;
+
+	return SK_RUN_FILTER(info->filter, skb);
+}
+
+static void bpf_mt_destroy(const struct xt_mtdtor_param *par)
+{
+	const struct xt_bpf_info *info = par->matchinfo;
+	sk_unattached_filter_destroy(info->filter);
+}
+
+static struct xt_match bpf_mt_reg __read_mostly = {
+	.name		= "bpf",
+	.revision	= 0,
+	.family		= NFPROTO_UNSPEC,
+	.checkentry	= bpf_mt_check,
+	.match		= bpf_mt,
+	.destroy	= bpf_mt_destroy,
+	.matchsize	= -1, /* skip xt_check_match because of dynamic len */
+	.me		= THIS_MODULE,
+};
+
+static int __init bpf_mt_init(void)
+{
+	return xt_register_match(&bpf_mt_reg);
+}
+
+static void __exit bpf_mt_exit(void)
+{
+	xt_unregister_match(&bpf_mt_reg);
+}
+
+module_init(bpf_mt_init);
+module_exit(bpf_mt_exit);
-- 
1.7.7.3

^ permalink raw reply related

* [PATCH 1/2] netfilter: add xt_priority xtables match
From: Willem de Bruijn @ 2012-12-05 19:22 UTC (permalink / raw)
  To: netfilter-devel, netdev, edumazet, davem, kaber, pablo; +Cc: Willem de Bruijn
In-Reply-To: <1354735339-13402-1-git-send-email-willemb@google.com>

Add an iptables match based on the skb->priority field. This field
can be set by socket option SO_PRIORITY, among others.

The match supports range based matching on packet priority, with
optional inversion. Before matching, a mask can be applied to the
priority field to handle the case where different regions of the
bitfield are reserved for unrelated uses.
---
 include/linux/netfilter/xt_priority.h |   13 ++++++++
 net/netfilter/Kconfig                 |    9 ++++++
 net/netfilter/Makefile                |    1 +
 net/netfilter/xt_priority.c           |   51 +++++++++++++++++++++++++++++++++
 4 files changed, 74 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/netfilter/xt_priority.h
 create mode 100644 net/netfilter/xt_priority.c

diff --git a/include/linux/netfilter/xt_priority.h b/include/linux/netfilter/xt_priority.h
new file mode 100644
index 0000000..da9a288
--- /dev/null
+++ b/include/linux/netfilter/xt_priority.h
@@ -0,0 +1,13 @@
+#ifndef _XT_PRIORITY_H
+#define _XT_PRIORITY_H
+
+#include <linux/types.h>
+
+struct xt_priority_info {
+	__u32 min;
+	__u32 max;
+	__u32 mask;
+	__u8  invert;
+};
+
+#endif /*_XT_PRIORITY_H */
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index fefa514..c9739c6 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -1093,6 +1093,15 @@ config NETFILTER_XT_MATCH_PKTTYPE
 
 	  To compile it as a module, choose M here.  If unsure, say N.
 
+config NETFILTER_XT_MATCH_PRIORITY
+	tristate '"priority" match support'
+	depends on NETFILTER_ADVANCED
+	help
+	  This option adds a match based on the value of the sk_buff
+	  priority field.
+
+	  To compile it as a module, choose M here.  If unsure, say N.
+
 config NETFILTER_XT_MATCH_QUOTA
 	tristate '"quota" match support'
 	depends on NETFILTER_ADVANCED
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 3259697..8e5602f 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -124,6 +124,7 @@ obj-$(CONFIG_NETFILTER_XT_MATCH_OWNER) += xt_owner.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_PHYSDEV) += xt_physdev.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_PKTTYPE) += xt_pkttype.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_POLICY) += xt_policy.o
+obj-$(CONFIG_NETFILTER_XT_MATCH_PRIORITY) += xt_priority.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_QUOTA) += xt_quota.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_RATEEST) += xt_rateest.o
 obj-$(CONFIG_NETFILTER_XT_MATCH_REALM) += xt_realm.o
diff --git a/net/netfilter/xt_priority.c b/net/netfilter/xt_priority.c
new file mode 100644
index 0000000..4982eee
--- /dev/null
+++ b/net/netfilter/xt_priority.c
@@ -0,0 +1,51 @@
+/* Xtables module to match packets based on their sk_buff priority field.
+ * Copyright 2012 Google Inc.
+ * Written by Willem de Bruijn <willemb@google.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/module.h>
+#include <linux/skbuff.h>
+
+#include <linux/netfilter/xt_priority.h>
+#include <linux/netfilter/x_tables.h>
+
+MODULE_AUTHOR("Willem de Bruijn <willemb@google.com>");
+MODULE_DESCRIPTION("Xtables: priority filter match");
+MODULE_LICENSE("GPL");
+MODULE_ALIAS("ipt_priority");
+MODULE_ALIAS("ip6t_priority");
+
+static bool priority_mt(const struct sk_buff *skb,
+			struct xt_action_param *par)
+{
+	const struct xt_priority_info *info = par->matchinfo;
+
+	__u32 priority = skb->priority & info->mask;
+	return (priority >= info->min && priority <= info->max) ^ info->invert;
+}
+
+static struct xt_match priority_mt_reg __read_mostly = {
+	.name		= "priority",
+	.revision	= 0,
+	.family		= NFPROTO_UNSPEC,
+	.match		= priority_mt,
+	.matchsize	= sizeof(struct xt_priority_info),
+	.me		= THIS_MODULE,
+};
+
+static int __init priority_mt_init(void)
+{
+	return xt_register_match(&priority_mt_reg);
+}
+
+static void __exit priority_mt_exit(void)
+{
+	xt_unregister_match(&priority_mt_reg);
+}
+
+module_init(priority_mt_init);
+module_exit(priority_mt_exit);
-- 
1.7.7.3

^ permalink raw reply related

* [PATCH rfc] netfilter: two xtables matches
From: Willem de Bruijn @ 2012-12-05 19:22 UTC (permalink / raw)
  To: netfilter-devel, netdev, edumazet, davem, kaber, pablo

The second patch is more speculative and aims to be a more general
workaround, as well as a performance optimization: support
(preferably JIT compiled) BPF programs as iptables match rules.

Potentially, the skb->priority match can be implemented by applying
only the second patch and adding a new BPF_S_ANC ancillary field to
Linux Socket Filters.

I also wrote corresponding userspace patches to iptables. The process
for submitting both kernel and user patches is not 100% clear to me.
Sending the kernel bits to both netdev and netfilter-devel for
initial feedback. Please correct me if you want it another way.

The patches apply to net-next.


^ permalink raw reply

* [PATCH net-next] ipv6: avoid taking locks at socket dismantle
From: Eric Dumazet @ 2012-12-05 19:18 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

From: Eric Dumazet <edumazet@google.com>

ipv6_sock_mc_close() is called for ipv6 sockets at close time, and most
of them don't use multicast.

Add a test to avoid contention on a shared spinlock.

Same heuristic applies for ipv6_sock_ac_close(), to avoid contention
on a shared rwlock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv6/anycast.c |    3 +++
 net/ipv6/mcast.c   |    3 +++
 2 files changed, 6 insertions(+)

diff --git a/net/ipv6/anycast.c b/net/ipv6/anycast.c
index 2f4f584..757a810 100644
--- a/net/ipv6/anycast.c
+++ b/net/ipv6/anycast.c
@@ -189,6 +189,9 @@ void ipv6_sock_ac_close(struct sock *sk)
 	struct net *net = sock_net(sk);
 	int	prev_index;
 
+	if (!np->ipv6_ac_list)
+		return;
+
 	write_lock_bh(&ipv6_sk_ac_lock);
 	pac = np->ipv6_ac_list;
 	np->ipv6_ac_list = NULL;
diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c
index b19ed51..28dfa5f 100644
--- a/net/ipv6/mcast.c
+++ b/net/ipv6/mcast.c
@@ -284,6 +284,9 @@ void ipv6_sock_mc_close(struct sock *sk)
 	struct ipv6_mc_socklist *mc_lst;
 	struct net *net = sock_net(sk);
 
+	if (!rcu_access_pointer(np->ipv6_mc_list))
+		return;
+
 	spin_lock(&ipv6_sk_mc_lock);
 	while ((mc_lst = rcu_dereference_protected(np->ipv6_mc_list,
 				lockdep_is_held(&ipv6_sk_mc_lock))) != NULL) {

^ permalink raw reply related

* Re: [PATCH] net/macb: increase RX buffer size for GEM
From: David Miller @ 2012-12-05 17:58 UTC (permalink / raw)
  To: nicolas.ferre; +Cc: netdev, linux-arm-kernel, linux-kernel, manabian, plagnioj
In-Reply-To: <50BF6366.8080600@atmel.com>

From: Nicolas Ferre <nicolas.ferre@atmel.com>
Date: Wed, 5 Dec 2012 16:08:22 +0100

> On 12/04/2012 07:22 PM, David Miller :
>> From: Nicolas Ferre <nicolas.ferre@atmel.com>
>> Date: Mon, 3 Dec 2012 13:15:43 +0100
>> 
>>> Macb Ethernet controller requires a RX buffer of 128 bytes. It is
>>> highly sub-optimal for Gigabit-capable GEM that is able to use
>>> a bigger DMA buffer. Change this constant and associated macros
>>> with data stored in the private structure.
>>> I also kept the result of buffers per page calculation to lower the
>>> impact of this move to a variable rx buffer size on rx hot path.
>>> RX DMA buffer size has to be multiple of 64 bytes as indicated in
>>> DMA Configuration Register specification.
>>>
>>> Signed-off-by: Nicolas Ferre <nicolas.ferre@atmel.com>
>> 
>> This looks like it will waste a couple hundred bytes for 1500 MTU
>> frames, am I right?
> 
> Yep! But buffers get recycled, and with the current memory management by
> pages, it seems that I have to rework some part of it to optimize this
> memory usage (8KB memory blocks split into 5 buffers each as David said...).
> 
> Do you think it is worth digging this way or may I rework the rx buffer
> management in case of the GEM interface. If I implement a different path
> for GEM interface, I will have the possibility to tailor rx DMA buffers
> from 1500 Bytes up to 10KB jumbo frames...

I almost think you have to.

^ permalink raw reply

* Re: [PATCH] 3com: make 3c59x depend on HAS_IOPORT
From: David Miller @ 2012-12-05 17:58 UTC (permalink / raw)
  To: jang; +Cc: netdev
In-Reply-To: <1354716280.29038.2.camel@hal>

From: Jan Glauber <jang@linux.vnet.ibm.com>
Date: Wed, 05 Dec 2012 15:04:40 +0100

> From: Jan Glauber <jang@linux.vnet.ibm.com>
> 
> The 3com driver for 3c59x requires ioport_map. Since not all
> architectures support IO port mapping make 3c59x dependent on HAS_IOPORT.
> 
> Signed-off-by: Jan Glauber <jang@linux.vnet.ibm.com>

Which platforms support PCI or EISA yet do not set HAS_IOPORT?

^ permalink raw reply

* Re: [PATCH] net: ICMPv6 packets transmitted on wrong interface if nfmark is mangled
From: David Miller @ 2012-12-05 17:57 UTC (permalink / raw)
  To: dries.dewinter; +Cc: pablo, kaber, netdev, netfilter-devel
In-Reply-To: <CA+e04fgiMDoaCx0fpdK5F6YfJ_CSOZ13q1fn9xs7De4JO02c6g@mail.gmail.com>

From: Dries De Winter <dries.dewinter@gmail.com>
Date: Wed, 5 Dec 2012 14:41:59 +0100

> My "noreroute" patch will not fix this. Therefore it's indeed maybe
> better to add a simple check to ip6_route_me_harder(): not a check for
> ICMPv6, but a check for (ipv6_addr_type(&iph->daddr) &
> IPV6_ADDR_LINKLOCAL) instead. What do you think?

What if a packet is rewritten from a non-link-local destination address
into a link-local one?  Or vice versa?

Your test will fail in those cases.

^ permalink raw reply

* Re: [PATCH net-next 0/7] Allow to monitor multicast cache event via rtnetlink
From: David Miller @ 2012-12-05 17:54 UTC (permalink / raw)
  To: David.Laight; +Cc: nicolas.dichtel, netdev
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6026B70DA@saturn3.aculab.com>

From: "David Laight" <David.Laight@ACULAB.COM>
Date: Wed, 5 Dec 2012 11:41:33 -0000

> Probably worth commenting that the 64bit items might only be 32bit aligned.
> Just to stop anyone trying to read/write them with pointer casts.

Rather, let's not create this situation at all.

It's totally inappropriate to have special code to handle every single
time we want to put 64-bit values into netlink messages.

We need a real solution to this issue.

^ permalink raw reply

* Re: [PATCH net-next 0/7] Allow to monitor multicast cache event via rtnetlink
From: David Miller @ 2012-12-05 17:53 UTC (permalink / raw)
  To: nicolas.dichtel; +Cc: netdev
In-Reply-To: <50BF29DA.7020903@6wind.com>

From: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Date: Wed, 05 Dec 2012 12:02:50 +0100

> Le 04/12/2012 21:02, Nicolas Dichtel a écrit :
>> I can have a try on a tile platform. I don't have access to sparc or
>> mips.
> Hmm, I've read arm instead of mips! So I've tried on mips. Data are
> aligned on 32-bit, like for all netlink messages. nla_put_u64() will
> do the same, as it calls nla_put().
> 
> And the kernel will only use memcpy() to treat this attribute. Reader
> will be in userland.

Then userland will trap if the 64-bit values are only 32-bit aligned.

That's the problem I'm talking about.

I don't want to export any more unaligned 64-bit values in netlink
messages, it's a complete mess.

^ permalink raw reply

* Re: WARNING: drivers/net/ethernet/dlink/sundance.o(.text+0x2e87): Section mismatch in reference from the function sundance_probe1() to the variable .devinit.rodata:sundance_pci_tbl
From: David Miller @ 2012-12-05 17:50 UTC (permalink / raw)
  To: kda; +Cc: fengguang.wu, wfp5p, netdev, gregkh
In-Reply-To: <CAOJe8K3xBAuhVND-9dJgSxShhpHQ3zko=bE2CNrAKVa-gvDouA@mail.gmail.com>

From: Denis Kirjanov <kda@linux-powerpc.org>
Date: Wed, 5 Dec 2012 11:12:32 +0300

> I"ll fix it.

You can't.

It's only going to get fixed by a change in Greg KH's device tree
which updates the PCI table macros to not use __dev* section tags.

Fengguant, _PLEASE_, as Greg requested, stop reporting these section
mismatch errors, they aren't helpful and are wasting valuable
developer time.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox