Netdev List
* Re: [patch 1/2 -next] cxgb4: clean up a type issue
From: David Miller @ 2014-10-03 22:46 UTC
  To: dan.carpenter; +Cc: hariprasad, netdev, kernel-janitors
In-Reply-To: <20141002112219.GA25606@mwanda>

From: Dan Carpenter <dan.carpenter@oracle.com>
Date: Thu, 2 Oct 2014 14:22:19 +0300

> The tx_desc struct holds 8 __be64 values.  The original code took a
> tx_desc pointer then cast it to an int pointer and then cast it to a
> u64 pointer.  It was confusing and triggered some static checker
> warnings.
> 
> I have changed the cxgb_pio_copy() to only take tx_desc pointers.  This
> isn't really a loss of flexibility because anything else was buggy to
> begin with.
> 
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

Please address the feedback you've received, resubmit this series, and actually
number this second change "2/2" instead of "1/2" :-)

Thanks!


* [PATCH v2 net-next 0/4] net: Generic UDP Encapsulation
From: Tom Herbert @ 2014-10-03 22:48 UTC
  To: davem, netdev

Generic UDP Encapsulation (GUE) is a UDP encapsulation protocol that
encapsulates packets of various IP protocols. The GUE protocol is
described in http://tools.ietf.org/html/draft-herbert-gue-01.

The receive path of GUE is implemented in the foo-over-UDP (FOU) module.
This includes a UDP encap receive function for GUE as well as
GUE-specific GRO functions. Management and configuration of GUE ports
share most of their code with FOU.

For the transmit path, the previous FOU support for IPIP, sit, and GRE
was simply extended for GUE: when GUE is enabled, the GUE header is
inserted on transmit in addition to the UDP header inserted for FOU.

Semantically GUE is the same as FOU in that the encapsulation (UDP
and GUE headers) is inserted on transmission and removed on
reception, so that the IP packet is processed with the inner header.

This patch set includes:
 - Some fixes to FOU, removal of IPv4,v6 specific GRO functions
 - Support to configure a GUE receive port
 - Implementation of GUE receive path (normal and GRO)
 - Additions to ip_tunnel netlink to configure GUE
 - GUE header insertion in ip_tunnel transmit path

v2:
 - Include net/gue.h in patch set

Testing:

I ran performance numbers using netperf TCP_RR with 200 streams,
comparing encapsulation without GUE, encapsulation with GUE, and
encapsulation with FOU.

  GRE
    TCP_STREAM
      IPv4, FOU, UDP checksum enabled
        14.04% TX CPU utilization
        13.17% RX CPU utilization
        9211 Mbps
      IPv4, GUE, UDP checksum enabled
        14.99% TX CPU utilization
        13.79% RX CPU utilization
        9185 Mbps
      IPv4, FOU, UDP checksum disabled
        13.14% TX CPU utilization
        23.18% RX CPU utilization
        9277 Mbps
      IPv4, GUE, UDP checksum disabled
        13.66% TX CPU utilization
        23.57% RX CPU utilization
        9184 Mbps
    TCP_RR
      IPv4, FOU, UDP checksum enabled
        94.2% CPU utilization
        155/249/460 90/95/99% latencies
        1.17018e+06 tps
      IPv4, GUE, UDP checksum enabled
        93.9% CPU utilization
        158/253/472 90/95/99% latencies
        1.15045e+06 tps

  IPIP
    TCP_STREAM
      FOU, UDP checksum enabled
        15.28% TX CPU utilization
        13.92% RX CPU utilization
        9342 Mbps
      GUE, UDP checksum enabled
        13.99% TX CPU utilization
        13.34% RX CPU utilization
        9210 Mbps
      FOU, UDP checksum disabled
        15.08% TX CPU utilization
        24.64% RX CPU utilization
        9226 Mbps
      GUE, UDP checksum disabled
        15.90% TX CPU utilization
        24.77% RX CPU utilization
        9197 Mbps
    TCP_RR
      FOU, UDP checksum enabled
        94.23% CPU utilization
        149/237/429 90/95/99% latencies
        1.19553e+06 tps
      GUE, UDP checksum enabled
        93.75% CPU utilization
        152/243/442 90/95/99% latencies
        1.17027e+06 tps

  SIT
    TCP_STREAM
      FOU, UDP checksum enabled
        14.47% TX CPU utilization
        14.58% RX CPU utilization
        9106 Mbps
      GUE, UDP checksum enabled
        15.09% TX CPU utilization
        14.84% RX CPU utilization
        9080 Mbps
      FOU, UDP checksum disabled
        15.70% TX CPU utilization
        27.93% RX CPU utilization
        9097 Mbps
      GUE, UDP checksum disabled
        15.04% TX CPU utilization
        27.54% RX CPU utilization
        9073 Mbps
    TCP_RR
      FOU, UDP checksum enabled
        96.9% CPU utilization
        170/281/581 90/95/99% latencies
        1.03372e+06 tps
      GUE, UDP checksum enabled
        97.16% CPU utilization
        172/286/576 90/95/99% latencies
        1.00469e+06 tps

Tom Herbert (4):
  ip_tunnel: Account for secondary encapsulation header in max_headroom
  fou: eliminate IPv4,v6 specific GRO functions
  gue: Receive side for Generic UDP Encapsulation
  ip_tunnel: Add GUE support

 include/linux/netdevice.h      |   3 +
 include/net/gue.h              |  23 +++++
 include/uapi/linux/fou.h       |   7 ++
 include/uapi/linux/if_tunnel.h |   1 +
 net/ipv4/fou.c                 | 224 ++++++++++++++++++++++++++++++++++-------
 net/ipv4/ip_tunnel.c           |  15 ++-
 net/ipv4/udp_offload.c         |   1 +
 net/ipv6/udp_offload.c         |   1 +
 8 files changed, 235 insertions(+), 40 deletions(-)
 create mode 100644 include/net/gue.h

-- 
2.1.0.rc2.206.gedb03e5


* [PATCH v2 net-next 1/4] ip_tunnel: Account for secondary encapsulation header in max_headroom
From: Tom Herbert @ 2014-10-03 22:48 UTC
  To: davem, netdev
In-Reply-To: <1412376490-8774-1-git-send-email-therbert@google.com>

When adjusting max_header for the tunnel interface based on egress
device we need to account for any extra bytes in secondary encapsulation
(e.g. FOU).

Signed-off-by: Tom Herbert <therbert@google.com>
---
 net/ipv4/ip_tunnel.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index b75b47b..54ace25 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -759,7 +759,7 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
 		df |= (inner_iph->frag_off&htons(IP_DF));
 
 	max_headroom = LL_RESERVED_SPACE(rt->dst.dev) + sizeof(struct iphdr)
-			+ rt->dst.header_len;
+			+ rt->dst.header_len + ip_encap_hlen(&tunnel->encap);
 	if (max_headroom > dev->needed_headroom)
 		dev->needed_headroom = max_headroom;
 
-- 
2.1.0.rc2.206.gedb03e5
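
Editorial sketch (not part of the patch): the headroom arithmetic that the
change above adjusts can be illustrated in userspace C. Everything here is
an assumption made for illustration: struct egress_info, IPV4_HDR_LEN and
the two helper functions are hypothetical stand-ins for
LL_RESERVED_SPACE(), sizeof(struct iphdr), rt->dst.header_len and
ip_encap_hlen(); it is not the kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for sizeof(struct iphdr) without IP options. */
#define IPV4_HDR_LEN 20u

/* Hypothetical summary of what the egress route and device contribute. */
struct egress_info {
	unsigned int ll_reserved;    /* link-layer space, cf. LL_RESERVED_SPACE() */
	unsigned int dst_header_len; /* extra route header bytes, cf. rt->dst.header_len */
};

/*
 * Headroom the tunnel device may need in front of the inner packet:
 * link layer + outer IPv4 header + route extras + secondary encapsulation
 * (for example 8 bytes for the UDP header used by FOU).
 */
static unsigned int tunnel_max_headroom(const struct egress_info *eg,
					unsigned int encap_hlen)
{
	assert(eg != NULL);
	return eg->ll_reserved + IPV4_HDR_LEN + eg->dst_header_len + encap_hlen;
}

/* Mirrors the "only ever grow needed_headroom" update in the hunk above. */
static unsigned int raise_needed_headroom(unsigned int needed,
					  unsigned int max_headroom)
{
	return max_headroom > needed ? max_headroom : needed;
}
```

Under these assumptions, leaving out encap_hlen under-counts the required
headroom by exactly the size of the secondary encapsulation header, which
is the gap the one-line change closes.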


* [PATCH v2 net-next 2/4] fou: eliminate IPv4,v6 specific GRO functions
From: Tom Herbert @ 2014-10-03 22:48 UTC
  To: davem, netdev
In-Reply-To: <1412376490-8774-1-git-send-email-therbert@google.com>

This patch removes the fou[46]_gro_receive and fou[46]_gro_complete
functions. The v4 or v6 variants were chosen for the UDP offloads
based on the address family of the socket; this is neither necessary
nor correct. Instead, this patch adds is_ipv6 to napi_gro_cb.
This is set in udp6_gro_receive and cleared in udp4_gro_receive. In
fou_gro_receive the value is used to select the correct inet_offloads
for the protocol of the outer IP header.

Signed-off-by: Tom Herbert <therbert@google.com>
---
 include/linux/netdevice.h |  3 +++
 net/ipv4/fou.c            | 48 ++++++++---------------------------------------
 net/ipv4/udp_offload.c    |  1 +
 net/ipv6/udp_offload.c    |  1 +
 4 files changed, 13 insertions(+), 40 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9b7fbac..640f8d8 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1886,6 +1886,9 @@ struct napi_gro_cb {
 	/* Number of checksums via CHECKSUM_UNNECESSARY */
 	u8	csum_cnt:3;
 
+	/* Used in foo-over-udp, set in udp[46]_gro_receive */
+	u8	is_ipv6:1;
+
 	/* used to support CHECKSUM_COMPLETE for tunneling protocols */
 	__wsum	csum;
 
diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index dced89f..7e2126a 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -65,14 +65,15 @@ static int fou_udp_recv(struct sock *sk, struct sk_buff *skb)
 }
 
 static struct sk_buff **fou_gro_receive(struct sk_buff **head,
-					struct sk_buff *skb,
-					const struct net_offload **offloads)
+					struct sk_buff *skb)
 {
 	const struct net_offload *ops;
 	struct sk_buff **pp = NULL;
 	u8 proto = NAPI_GRO_CB(skb)->proto;
+	const struct net_offload **offloads;
 
 	rcu_read_lock();
+	offloads = NAPI_GRO_CB(skb)->is_ipv6 ? inet6_offloads : inet_offloads;
 	ops = rcu_dereference(offloads[proto]);
 	if (!ops || !ops->callbacks.gro_receive)
 		goto out_unlock;
@@ -85,14 +86,15 @@ out_unlock:
 	return pp;
 }
 
-static int fou_gro_complete(struct sk_buff *skb, int nhoff,
-			    const struct net_offload **offloads)
+static int fou_gro_complete(struct sk_buff *skb, int nhoff)
 {
 	const struct net_offload *ops;
 	u8 proto = NAPI_GRO_CB(skb)->proto;
 	int err = -ENOSYS;
+	const struct net_offload **offloads;
 
 	rcu_read_lock();
+	offloads = NAPI_GRO_CB(skb)->is_ipv6 ? inet6_offloads : inet_offloads;
 	ops = rcu_dereference(offloads[proto]);
 	if (WARN_ON(!ops || !ops->callbacks.gro_complete))
 		goto out_unlock;
@@ -105,28 +107,6 @@ out_unlock:
 	return err;
 }
 
-static struct sk_buff **fou4_gro_receive(struct sk_buff **head,
-					 struct sk_buff *skb)
-{
-	return fou_gro_receive(head, skb, inet_offloads);
-}
-
-static int fou4_gro_complete(struct sk_buff *skb, int nhoff)
-{
-	return fou_gro_complete(skb, nhoff, inet_offloads);
-}
-
-static struct sk_buff **fou6_gro_receive(struct sk_buff **head,
-					 struct sk_buff *skb)
-{
-	return fou_gro_receive(head, skb, inet6_offloads);
-}
-
-static int fou6_gro_complete(struct sk_buff *skb, int nhoff)
-{
-	return fou_gro_complete(skb, nhoff, inet6_offloads);
-}
-
 static int fou_add_to_port_list(struct fou *fou)
 {
 	struct fou *fout;
@@ -199,20 +179,8 @@ static int fou_create(struct net *net, struct fou_cfg *cfg,
 
 	sk->sk_allocation = GFP_ATOMIC;
 
-	switch (cfg->udp_config.family) {
-	case AF_INET:
-		fou->udp_offloads.callbacks.gro_receive = fou4_gro_receive;
-		fou->udp_offloads.callbacks.gro_complete = fou4_gro_complete;
-		break;
-	case AF_INET6:
-		fou->udp_offloads.callbacks.gro_receive = fou6_gro_receive;
-		fou->udp_offloads.callbacks.gro_complete = fou6_gro_complete;
-		break;
-	default:
-		err = -EPFNOSUPPORT;
-		goto error;
-	}
-
+	fou->udp_offloads.callbacks.gro_receive = fou_gro_receive;
+	fou->udp_offloads.callbacks.gro_complete = fou_gro_complete;
 	fou->udp_offloads.port = cfg->udp_config.local_udp_port;
 	fou->udp_offloads.ipproto = cfg->protocol;
 
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 8c35f2c..507310e 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -334,6 +334,7 @@ static struct sk_buff **udp4_gro_receive(struct sk_buff **head,
 		skb_gro_checksum_try_convert(skb, IPPROTO_UDP, uh->check,
 					     inet_gro_compute_pseudo);
 skip:
+	NAPI_GRO_CB(skb)->is_ipv6 = 0;
 	return udp_gro_receive(head, skb, uh);
 
 flush:
diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c
index 8f96988..6b8f543 100644
--- a/net/ipv6/udp_offload.c
+++ b/net/ipv6/udp_offload.c
@@ -140,6 +140,7 @@ static struct sk_buff **udp6_gro_receive(struct sk_buff **head,
 					     ip6_gro_compute_pseudo);
 
 skip:
+	NAPI_GRO_CB(skb)->is_ipv6 = 1;
 	return udp_gro_receive(head, skb, uh);
 
 flush:
-- 
2.1.0.rc2.206.gedb03e5
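
Editorial sketch (not part of the patch): the table selection described in
the commit message can be illustrated in userspace C. All names below
(struct offload, struct gro_cb, v4_offloads, v6_offloads,
fou_gro_receive_sketch) are hypothetical stand-ins for struct net_offload,
struct napi_gro_cb, inet_offloads and inet6_offloads; there is no RCU and
no real GRO here, only the "pick a table from a per-packet flag, then index
it by protocol" idea.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PROTO 256

typedef int (*gro_receive_fn)(const uint8_t *pkt, size_t len);

/* Hypothetical stand-in for struct net_offload. */
struct offload {
	gro_receive_fn gro_receive;
};

/* Dummy handlers that only report which table they were found in. */
static int ipip_over_v4(const uint8_t *pkt, size_t len)
{
	(void)pkt; (void)len;
	return 4;
}

static int ipip_over_v6(const uint8_t *pkt, size_t len)
{
	(void)pkt; (void)len;
	return 6;
}

static const struct offload ipip4_ops = { ipip_over_v4 };
static const struct offload ipip6_ops = { ipip_over_v6 };

/* One table per outer address family, indexed by IP protocol number. */
static const struct offload *const v4_offloads[MAX_PROTO] = { [4] = &ipip4_ops };
static const struct offload *const v6_offloads[MAX_PROTO] = { [4] = &ipip6_ops };

/* Hypothetical stand-in for the per-packet GRO control block. */
struct gro_cb {
	uint8_t proto;            /* inner protocol configured for the port */
	unsigned int is_ipv6 : 1; /* set by the outer UDP layer, as in the patch */
};

static int fou_gro_receive_sketch(const struct gro_cb *cb,
				  const uint8_t *pkt, size_t len)
{
	const struct offload *const *table;
	const struct offload *ops;

	assert(cb != NULL);
	/* The flag records the outer IP version, not the socket's family. */
	table = cb->is_ipv6 ? v6_offloads : v4_offloads;
	ops = table[cb->proto];
	if (!ops || !ops->gro_receive)
		return -1;
	return ops->gro_receive(pkt, len);
}
```

The point of the design, as far as the commit message states it, is that
the decision is made per packet from the outer header rather than once per
socket, so a single pair of callbacks can serve both families.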


* [PATCH v2 net-next 3/4] gue: Receive side for Generic UDP Encapsulation
From: Tom Herbert @ 2014-10-03 22:48 UTC
  To: davem, netdev
In-Reply-To: <1412376490-8774-1-git-send-email-therbert@google.com>

This patch adds support for receiving GUE packets in the fou module. The
fou module now supports direct foo-over-udp (no encapsulation header)
and GUE. To support this, a type attribute is added to the fou netlink
parameters.

For a GUE socket we define gue_udp_recv, gue_gro_receive, and
gue_gro_complete to handle the specifics of the GUE protocol. Most
of the code to manage and configure sockets is shared with fou.

Signed-off-by: Tom Herbert <therbert@google.com>
---
 include/net/gue.h        |  23 ++++++
 include/uapi/linux/fou.h |   7 ++
 net/ipv4/fou.c           | 196 ++++++++++++++++++++++++++++++++++++++++++++---
 3 files changed, 217 insertions(+), 9 deletions(-)
 create mode 100644 include/net/gue.h

diff --git a/include/net/gue.h b/include/net/gue.h
new file mode 100644
index 0000000..b6c3327
--- /dev/null
+++ b/include/net/gue.h
@@ -0,0 +1,23 @@
+#ifndef __NET_GUE_H
+#define __NET_GUE_H
+
+struct guehdr {
+	union {
+		struct {
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+			__u8	hlen:4,
+			version:4;
+#elif defined (__BIG_ENDIAN_BITFIELD)
+			__u8	version:4,
+				hlen:4;
+#else
+#error  "Please fix <asm/byteorder.h>"
+#endif
+			__u8    next_hdr;
+			__u16   flags;
+		};
+		__u32 word;
+	};
+};
+
+#endif
diff --git a/include/uapi/linux/fou.h b/include/uapi/linux/fou.h
index e03376d..8df0689 100644
--- a/include/uapi/linux/fou.h
+++ b/include/uapi/linux/fou.h
@@ -13,6 +13,7 @@ enum {
 	FOU_ATTR_PORT,				/* u16 */
 	FOU_ATTR_AF,				/* u8 */
 	FOU_ATTR_IPPROTO,			/* u8 */
+	FOU_ATTR_TYPE,				/* u8 */
 
 	__FOU_ATTR_MAX,
 };
@@ -27,6 +28,12 @@ enum {
 	__FOU_CMD_MAX,
 };
 
+enum {
+	FOU_ENCAP_UNSPEC,
+	FOU_ENCAP_DIRECT,
+	FOU_ENCAP_GUE,
+};
+
 #define FOU_CMD_MAX	(__FOU_CMD_MAX - 1)
 
 #endif /* _UAPI_LINUX_FOU_H */
diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index 7e2126a..efa70ad 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -7,6 +7,7 @@
 #include <linux/types.h>
 #include <linux/kernel.h>
 #include <net/genetlink.h>
+#include <net/gue.h>
 #include <net/ip.h>
 #include <net/protocol.h>
 #include <net/udp.h>
@@ -27,6 +28,7 @@ struct fou {
 };
 
 struct fou_cfg {
+	u16 type;
 	u8 protocol;
 	struct udp_port_cfg udp_config;
 };
@@ -64,6 +66,41 @@ static int fou_udp_recv(struct sock *sk, struct sk_buff *skb)
 					  sizeof(struct udphdr));
 }
 
+static int gue_udp_recv(struct sock *sk, struct sk_buff *skb)
+{
+	struct fou *fou = fou_from_sock(sk);
+	size_t len;
+	struct guehdr *guehdr;
+	struct udphdr *uh;
+
+	if (!fou)
+		return 1;
+
+	len = sizeof(struct udphdr) + sizeof(struct guehdr);
+	if (!pskb_may_pull(skb, len))
+		goto drop;
+
+	uh = udp_hdr(skb);
+	guehdr = (struct guehdr *)&uh[1];
+
+	len += guehdr->hlen << 2;
+	if (!pskb_may_pull(skb, len))
+		goto drop;
+
+	if (guehdr->version != 0)
+		goto drop;
+
+	if (guehdr->flags) {
+		/* No support yet */
+		goto drop;
+	}
+
+	return fou_udp_encap_recv_deliver(skb, guehdr->next_hdr, len);
+drop:
+	kfree_skb(skb);
+	return 0;
+}
+
 static struct sk_buff **fou_gro_receive(struct sk_buff **head,
 					struct sk_buff *skb)
 {
@@ -107,6 +144,112 @@ out_unlock:
 	return err;
 }
 
+static struct sk_buff **gue_gro_receive(struct sk_buff **head,
+					struct sk_buff *skb)
+{
+	const struct net_offload **offloads;
+	const struct net_offload *ops;
+	struct sk_buff **pp = NULL;
+	struct sk_buff *p;
+	u8 proto;
+	struct guehdr *guehdr;
+	unsigned int hlen, guehlen;
+	unsigned int off;
+	int flush = 1;
+
+	off = skb_gro_offset(skb);
+	hlen = off + sizeof(*guehdr);
+	guehdr = skb_gro_header_fast(skb, off);
+	if (skb_gro_header_hard(skb, hlen)) {
+		guehdr = skb_gro_header_slow(skb, hlen, off);
+		if (unlikely(!guehdr))
+			goto out;
+	}
+
+	proto = guehdr->next_hdr;
+
+	rcu_read_lock();
+	offloads = NAPI_GRO_CB(skb)->is_ipv6 ? inet6_offloads : inet_offloads;
+	ops = rcu_dereference(offloads[proto]);
+	if (WARN_ON(!ops || !ops->callbacks.gro_receive))
+		goto out_unlock;
+
+	guehlen = sizeof(*guehdr) + (guehdr->hlen << 2);
+
+	hlen = off + guehlen;
+	if (skb_gro_header_hard(skb, hlen)) {
+		guehdr = skb_gro_header_slow(skb, hlen, off);
+		if (unlikely(!guehdr))
+			goto out_unlock;
+	}
+
+	flush = 0;
+
+	for (p = *head; p; p = p->next) {
+		const struct guehdr *guehdr2;
+
+		if (!NAPI_GRO_CB(p)->same_flow)
+			continue;
+
+		guehdr2 = (struct guehdr *)(p->data + off);
+
+		/* Compare base GUE header to be equal (covers
+		 * hlen, version, next_hdr, and flags).
+		 */
+		if (guehdr->word != guehdr2->word) {
+			NAPI_GRO_CB(p)->same_flow = 0;
+			continue;
+		}
+
+		/* Compare optional fields are the same. */
+		if (guehdr->hlen && memcmp(&guehdr[1], &guehdr2[1],
+					   guehdr->hlen << 2)) {
+			NAPI_GRO_CB(p)->same_flow = 0;
+			continue;
+		}
+	}
+
+	skb_gro_pull(skb, guehlen);
+
+	/* Adjust NAPI_GRO_CB(skb)->csum after skb_gro_pull() */
+	skb_gro_postpull_rcsum(skb, guehdr, guehlen);
+
+	pp = ops->callbacks.gro_receive(head, skb);
+
+out_unlock:
+	rcu_read_unlock();
+out:
+	NAPI_GRO_CB(skb)->flush |= flush;
+
+	return pp;
+}
+
+static int gue_gro_complete(struct sk_buff *skb, int nhoff)
+{
+	const struct net_offload **offloads;
+	struct guehdr *guehdr = (struct guehdr *)(skb->data + nhoff);
+	const struct net_offload *ops;
+	unsigned int guehlen;
+	u8 proto;
+	int err = -ENOENT;
+
+	proto = guehdr->next_hdr;
+
+	guehlen = sizeof(*guehdr) + (guehdr->hlen << 2);
+
+	rcu_read_lock();
+	offloads = NAPI_GRO_CB(skb)->is_ipv6 ? inet6_offloads : inet_offloads;
+	ops = rcu_dereference(offloads[proto]);
+	if (WARN_ON(!ops || !ops->callbacks.gro_complete))
+		goto out_unlock;
+
+	err = ops->callbacks.gro_complete(skb, nhoff + guehlen);
+
+out_unlock:
+	rcu_read_unlock();
+	return err;
+}
+
 static int fou_add_to_port_list(struct fou *fou)
 {
 	struct fou *fout;
@@ -142,6 +285,28 @@ static void fou_release(struct fou *fou)
 	kfree(fou);
 }
 
+static int fou_encap_init(struct sock *sk, struct fou *fou, struct fou_cfg *cfg)
+{
+	udp_sk(sk)->encap_rcv = fou_udp_recv;
+	fou->protocol = cfg->protocol;
+	fou->udp_offloads.callbacks.gro_receive = fou_gro_receive;
+	fou->udp_offloads.callbacks.gro_complete = fou_gro_complete;
+	fou->udp_offloads.port = cfg->udp_config.local_udp_port;
+	fou->udp_offloads.ipproto = cfg->protocol;
+
+	return 0;
+}
+
+static int gue_encap_init(struct sock *sk, struct fou *fou, struct fou_cfg *cfg)
+{
+	udp_sk(sk)->encap_rcv = gue_udp_recv;
+	fou->udp_offloads.callbacks.gro_receive = gue_gro_receive;
+	fou->udp_offloads.callbacks.gro_complete = gue_gro_complete;
+	fou->udp_offloads.port = cfg->udp_config.local_udp_port;
+
+	return 0;
+}
+
 static int fou_create(struct net *net, struct fou_cfg *cfg,
 		      struct socket **sockp)
 {
@@ -164,10 +329,24 @@ static int fou_create(struct net *net, struct fou_cfg *cfg,
 
 	sk = sock->sk;
 
-	/* Mark socket as an encapsulation socket. See net/ipv4/udp.c */
-	fou->protocol = cfg->protocol;
-	fou->port =  cfg->udp_config.local_udp_port;
-	udp_sk(sk)->encap_rcv = fou_udp_recv;
+	fou->port = cfg->udp_config.local_udp_port;
+
+	/* Initialize for fou type */
+	switch (cfg->type) {
+	case FOU_ENCAP_DIRECT:
+		err = fou_encap_init(sk, fou, cfg);
+		if (err)
+			goto error;
+		break;
+	case FOU_ENCAP_GUE:
+		err = gue_encap_init(sk, fou, cfg);
+		if (err)
+			goto error;
+		break;
+	default:
+		err = -EINVAL;
+		goto error;
+	}
 
 	udp_sk(sk)->encap_type = 1;
 	udp_encap_enable();
@@ -179,11 +358,6 @@ static int fou_create(struct net *net, struct fou_cfg *cfg,
 
 	sk->sk_allocation = GFP_ATOMIC;
 
-	fou->udp_offloads.callbacks.gro_receive = fou_gro_receive;
-	fou->udp_offloads.callbacks.gro_complete = fou_gro_complete;
-	fou->udp_offloads.port = cfg->udp_config.local_udp_port;
-	fou->udp_offloads.ipproto = cfg->protocol;
-
 	if (cfg->udp_config.family == AF_INET) {
 		err = udp_add_offload(&fou->udp_offloads);
 		if (err)
@@ -240,6 +414,7 @@ static struct nla_policy fou_nl_policy[FOU_ATTR_MAX + 1] = {
 	[FOU_ATTR_PORT] = { .type = NLA_U16, },
 	[FOU_ATTR_AF] = { .type = NLA_U8, },
 	[FOU_ATTR_IPPROTO] = { .type = NLA_U8, },
+	[FOU_ATTR_TYPE] = { .type = NLA_U8, },
 };
 
 static int parse_nl_config(struct genl_info *info,
@@ -267,6 +442,9 @@ static int parse_nl_config(struct genl_info *info,
 	if (info->attrs[FOU_ATTR_IPPROTO])
 		cfg->protocol = nla_get_u8(info->attrs[FOU_ATTR_IPPROTO]);
 
+	if (info->attrs[FOU_ATTR_TYPE])
+		cfg->type = nla_get_u8(info->attrs[FOU_ATTR_TYPE]);
+
 	return 0;
 }
 
-- 
2.1.0.rc2.206.gedb03e5
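
Editorial sketch (not part of the patch): the checks that gue_udp_recv
performs can be illustrated on a raw byte buffer in userspace C. The
function name gue_rx_check and the constants are hypothetical. The byte
layout assumed here follows struct guehdr from this patch (version in the
high nibble and hlen in the low nibble of the first byte, then next_hdr,
then 16 bits of flags). Unlike the kernel code, which re-checks the skb
with pskb_may_pull(), this sketch only compares lengths of a flat buffer
that starts at the UDP header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UDP_HDR_LEN  8u
#define GUE_BASE_LEN 4u /* matches the 4-byte struct guehdr in the patch */

/*
 * Returns the total encapsulation length (UDP + GUE base + optional
 * fields), i.e. the offset of the inner packet, and stores the inner
 * protocol in *next_hdr. Returns -1 if the packet should be dropped.
 */
static int gue_rx_check(const uint8_t *pkt, size_t len, uint8_t *next_hdr)
{
	size_t need = UDP_HDR_LEN + GUE_BASE_LEN;
	const uint8_t *gue;

	assert(pkt != NULL && next_hdr != NULL);

	/* The base header must be present before any field is read. */
	if (len < need)
		return -1;
	gue = pkt + UDP_HDR_LEN;

	/* hlen counts optional data in 32-bit words, hence the shift by 2. */
	need += (size_t)(gue[0] & 0x0f) << 2;
	if (len < need)
		return -1;

	/* Only version 0 is accepted. */
	if ((gue[0] >> 4) != 0)
		return -1;

	/* Any flag bit set means an extension that is not supported yet. */
	if (gue[2] != 0 || gue[3] != 0)
		return -1;

	*next_hdr = gue[1];
	return (int)need;
}
```

A caller would strip the returned number of bytes and hand the remainder
to the handler for *next_hdr, which corresponds to the call to
fou_udp_encap_recv_deliver() in the patch.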


* [PATCH v2 net-next 4/4] ip_tunnel: Add GUE support
From: Tom Herbert @ 2014-10-03 22:48 UTC
  To: davem, netdev
In-Reply-To: <1412376490-8774-1-git-send-email-therbert@google.com>

This patch allows configuring IPIP, sit, and GRE tunnels to use GUE.
This is very similar to fou except that we need to insert the GUE header
in addition to the UDP header on transmit.

Signed-off-by: Tom Herbert <therbert@google.com>
---
 include/uapi/linux/if_tunnel.h |  1 +
 net/ipv4/ip_tunnel.c           | 13 +++++++++++++
 2 files changed, 14 insertions(+)

diff --git a/include/uapi/linux/if_tunnel.h b/include/uapi/linux/if_tunnel.h
index 7c832af..280d9e0 100644
--- a/include/uapi/linux/if_tunnel.h
+++ b/include/uapi/linux/if_tunnel.h
@@ -64,6 +64,7 @@ enum {
 enum tunnel_encap_types {
 	TUNNEL_ENCAP_NONE,
 	TUNNEL_ENCAP_FOU,
+	TUNNEL_ENCAP_GUE,
 };
 
 #define TUNNEL_ENCAP_FLAG_CSUM		(1<<0)
diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index 54ace25..79f2ac0 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -56,6 +56,7 @@
 #include <net/netns/generic.h>
 #include <net/rtnetlink.h>
 #include <net/udp.h>
+#include <net/gue.h>
 
 #if IS_ENABLED(CONFIG_IPV6)
 #include <net/ipv6.h>
@@ -495,6 +496,8 @@ static int ip_encap_hlen(struct ip_tunnel_encap *e)
 		return 0;
 	case TUNNEL_ENCAP_FOU:
 		return sizeof(struct udphdr);
+	case TUNNEL_ENCAP_GUE:
+		return sizeof(struct udphdr) + sizeof(struct guehdr);
 	default:
 		return -EINVAL;
 	}
@@ -546,6 +549,15 @@ static int fou_build_header(struct sk_buff *skb, struct ip_tunnel_encap *e,
 	skb_reset_transport_header(skb);
 	uh = udp_hdr(skb);
 
+	if (e->type == TUNNEL_ENCAP_GUE) {
+		struct guehdr *guehdr = (struct guehdr *)&uh[1];
+
+		guehdr->version = 0;
+		guehdr->hlen = 0;
+		guehdr->flags = 0;
+		guehdr->next_hdr = *protocol;
+	}
+
 	uh->dest = e->dport;
 	uh->source = sport;
 	uh->len = htons(skb->len);
@@ -565,6 +577,7 @@ int ip_tunnel_encap(struct sk_buff *skb, struct ip_tunnel *t,
 	case TUNNEL_ENCAP_NONE:
 		return 0;
 	case TUNNEL_ENCAP_FOU:
+	case TUNNEL_ENCAP_GUE:
 		return fou_build_header(skb, &t->encap, t->encap_hlen,
 					protocol, fl4);
 	default:
-- 
2.1.0.rc2.206.gedb03e5
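
Editorial sketch (not part of the patch): the transmit-side difference
between FOU and GUE can be illustrated in userspace C. The enum, the
constants and the functions encap_hlen() and encap_build() below are
hypothetical; they mimic ip_encap_hlen() and fou_build_header() only in
outline. Assumptions: the header is written into a caller-supplied byte
buffer rather than pushed onto an skb, ports are given in host byte order,
and the UDP checksum field is left at zero instead of being computed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum encap_type { ENCAP_NONE, ENCAP_FOU, ENCAP_GUE };

#define UDP_HDR_LEN  8u
#define GUE_BASE_LEN 4u /* matches the 4-byte struct guehdr */

/* Bytes of secondary encapsulation for each type, or -1 if unknown. */
static int encap_hlen(enum encap_type t)
{
	switch (t) {
	case ENCAP_NONE:
		return 0;
	case ENCAP_FOU:
		return (int)UDP_HDR_LEN;
	case ENCAP_GUE:
		return (int)(UDP_HDR_LEN + GUE_BASE_LEN);
	}
	return -1;
}

/* Store a 16-bit value in network (big-endian) byte order. */
static void put_be16(uint8_t *p, uint16_t v)
{
	p[0] = (uint8_t)(v >> 8);
	p[1] = (uint8_t)(v & 0xff);
}

/*
 * Write the UDP header and, for GUE, the base GUE header into hdr.
 * Returns the number of header bytes written, or -1 on error.
 */
static int encap_build(uint8_t *hdr, size_t hdr_cap, enum encap_type t,
		       uint16_t sport, uint16_t dport, uint8_t proto,
		       size_t payload_len)
{
	int hlen = encap_hlen(t);

	assert(hdr != NULL);
	if (hlen < 0)
		return -1;
	if (hlen == 0)
		return 0; /* no secondary encapsulation configured */
	if ((size_t)hlen > hdr_cap || payload_len > 0xffffu - (size_t)hlen)
		return -1;

	put_be16(hdr, sport);
	put_be16(hdr + 2, dport);
	/* UDP length covers the UDP header, the GUE header and the payload. */
	put_be16(hdr + 4, (uint16_t)((size_t)hlen + payload_len));
	put_be16(hdr + 6, 0); /* checksum not computed in this sketch */

	if (t == ENCAP_GUE) {
		hdr[8] = 0;      /* version 0, hlen 0: no optional fields */
		hdr[9] = proto;  /* next_hdr: the encapsulated IP protocol */
		hdr[10] = 0;     /* flags, high byte */
		hdr[11] = 0;     /* flags, low byte */
	}
	return hlen;
}
```

Under these assumptions the only per-type difference on transmit is four
extra bytes written directly after the UDP header, which is why the patch
can reuse fou_build_header() for both encapsulation types.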


* [Patch net-next] net_sched: refactor out tcf_exts
From: Cong Wang @ 2014-10-03 22:51 UTC
  To: netdev; +Cc: Cong Wang, Jamal Hadi Salim, John Fastabend, David S. Miller

As Jamal pointed out, tcf_exts is really unnecessary;
we can refactor it out without losing any functionality.
This also removes a layer of indirection, which makes the code
much easier to read.

This patch:

1) moves exts->action and exts->police into tp->ops, since they
are statically assigned

2) moves exts->actions list head out

3) removes exts->type, act->type does the same thing

4) renames tcf_exts_*() functions to tcf_act_*()

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 include/net/pkt_cls.h     | 80 ++++++++++++++++++-----------------------------
 include/net/sch_generic.h |  2 ++
 net/sched/act_api.c       |  9 ++++--
 net/sched/cls_api.c       | 77 ++++++++++++++++++++++++---------------------
 net/sched/cls_basic.c     | 23 +++++++-------
 net/sched/cls_bpf.c       | 23 +++++++-------
 net/sched/cls_cgroup.c    | 24 +++++++-------
 net/sched/cls_flow.c      | 23 +++++++-------
 net/sched/cls_fw.c        | 27 ++++++++--------
 net/sched/cls_route.c     | 25 ++++++++-------
 net/sched/cls_rsvp.h      | 27 ++++++++--------
 net/sched/cls_tcindex.c   | 36 ++++++++++-----------
 net/sched/cls_u32.c       | 26 +++++++--------
 13 files changed, 198 insertions(+), 204 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index ef44ad9..24bc41f 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -56,88 +56,68 @@ tcf_unbind_filter(struct tcf_proto *tp, struct tcf_result *r)
 		tp->q->ops->cl_ops->unbind_tcf(tp->q, cl);
 }
 
-struct tcf_exts {
-#ifdef CONFIG_NET_CLS_ACT
-	__u32	type; /* for backward compat(TCA_OLD_COMPAT) */
-	struct list_head actions;
-#endif
-	/* Map to export classifier specific extension TLV types to the
-	 * generic extensions API. Unsupported extensions must be set to 0.
-	 */
-	int action;
-	int police;
-};
-
-static inline void tcf_exts_init(struct tcf_exts *exts, int action, int police)
-{
-#ifdef CONFIG_NET_CLS_ACT
-	exts->type = 0;
-	INIT_LIST_HEAD(&exts->actions);
-#endif
-	exts->action = action;
-	exts->police = police;
-}
-
 /**
- * tcf_exts_is_predicative - check if a predicative extension is present
- * @exts: tc filter extensions handle
+ * tcf_act_is_predicative - check if a predicative action is present
+ * @actions: tc filter actions
  *
- * Returns 1 if a predicative extension is present, i.e. an extension which
+ * Returns 1 if a predicative action is present, i.e. an action which
  * might cause further actions and thus overrule the regular tcf_result.
  */
 static inline int
-tcf_exts_is_predicative(struct tcf_exts *exts)
+tcf_act_is_predicative(struct list_head *actions)
 {
 #ifdef CONFIG_NET_CLS_ACT
-	return !list_empty(&exts->actions);
+	return !list_empty(actions);
 #else
 	return 0;
 #endif
 }
 
 /**
- * tcf_exts_is_available - check if at least one extension is present
- * @exts: tc filter extensions handle
+ * tcf_act_is_available - check if at least one action is present
+ * @actions: tc filter actions
  *
- * Returns 1 if at least one extension is present.
+ * Returns 1 if at least one action is present.
  */
 static inline int
-tcf_exts_is_available(struct tcf_exts *exts)
+tcf_act_is_available(struct list_head *actions)
 {
-	/* All non-predicative extensions must be added here. */
-	return tcf_exts_is_predicative(exts);
+	/* All non-predicative actions must be added here. */
+	return tcf_act_is_predicative(actions);
 }
 
 /**
- * tcf_exts_exec - execute tc filter extensions
+ * tcf_act_exec - execute tc filter actions
  * @skb: socket buffer
- * @exts: tc filter extensions handle
+ * @actions: list of actions
  * @res: desired result
  *
- * Executes all configured extensions. Returns 0 on a normal execution,
+ * Executes all configured actions. Returns 0 on a normal execution,
  * a negative number if the filter must be considered unmatched or
  * a positive action code (TC_ACT_*) which must be returned to the
  * underlying layer.
  */
-static inline int
-tcf_exts_exec(struct sk_buff *skb, struct tcf_exts *exts,
-	       struct tcf_result *res)
-{
 #ifdef CONFIG_NET_CLS_ACT
-	if (!list_empty(&exts->actions))
-		return tcf_action_exec(skb, &exts->actions, res);
-#endif
+int tcf_act_exec(struct sk_buff *skb, struct list_head *actions,
+		 struct tcf_result *res);
+#else
+static inline
+int tcf_act_exec(struct sk_buff *skb, struct list_head *actions,
+		 struct tcf_result *res)
+{
 	return 0;
 }
+#endif
 
-int tcf_exts_validate(struct net *net, struct tcf_proto *tp,
+int tcf_act_validate(struct net *net, struct tcf_proto *tp,
 		      struct nlattr **tb, struct nlattr *rate_tlv,
-		      struct tcf_exts *exts, bool ovr);
-void tcf_exts_destroy(struct tcf_exts *exts);
-void tcf_exts_change(struct tcf_proto *tp, struct tcf_exts *dst,
-		     struct tcf_exts *src);
-int tcf_exts_dump(struct sk_buff *skb, struct tcf_exts *exts);
-int tcf_exts_dump_stats(struct sk_buff *skb, struct tcf_exts *exts);
+		      struct list_head *actions, bool ovr);
+void tcf_act_destroy(struct list_head *actions);
+void tcf_act_change(struct tcf_proto *tp, struct list_head *dst,
+		     struct list_head *src);
+int tcf_act_dump(struct sk_buff *skb, const struct tcf_proto *tp,
+		 struct list_head *actions);
+int tcf_act_dump_stats(struct sk_buff *skb, struct list_head *actions);
 
 /**
  * struct tcf_pkt_info - packet information
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index d17ed6f..3d9fac9 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -211,6 +211,8 @@ struct tcf_result {
 struct tcf_proto_ops {
 	struct list_head	head;
 	char			kind[IFNAMSIZ];
+	int			action;
+	int			police;
 
 	int			(*classify)(struct sk_buff *,
 					    const struct tcf_proto *,
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 3d43e49..a350598 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -378,12 +378,15 @@ static struct tc_action_ops *tc_lookup_action(struct nlattr *kind)
 	return res;
 }
 
-int tcf_action_exec(struct sk_buff *skb, const struct list_head *actions,
-		    struct tcf_result *res)
+int tcf_act_exec(struct sk_buff *skb, const struct list_head *actions,
+		 struct tcf_result *res)
 {
 	const struct tc_action *a;
 	int ret = -1;
 
+	if (list_empty(actions))
+		return 0;
+
 	if (skb->tc_verd & TC_NCLS) {
 		skb->tc_verd = CLR_TC_NCLS(skb->tc_verd);
 		ret = TC_ACT_OK;
@@ -405,7 +408,7 @@ int tcf_action_exec(struct sk_buff *skb, const struct list_head *actions,
 exec_done:
 	return ret;
 }
-EXPORT_SYMBOL(tcf_action_exec);
+EXPORT_SYMBOL(tcf_act_exec);
 
 int tcf_action_destroy(struct list_head *actions, int bind)
 {
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 77147c8..d6f0059 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -496,89 +496,94 @@ static int tc_dump_tfilter(struct sk_buff *skb, struct netlink_callback *cb)
 	return skb->len;
 }
 
-void tcf_exts_destroy(struct tcf_exts *exts)
+void tcf_act_destroy(struct list_head *actions)
 {
 #ifdef CONFIG_NET_CLS_ACT
-	tcf_action_destroy(&exts->actions, TCA_ACT_UNBIND);
-	INIT_LIST_HEAD(&exts->actions);
+	tcf_action_destroy(actions, TCA_ACT_UNBIND);
+	INIT_LIST_HEAD(actions);
 #endif
 }
-EXPORT_SYMBOL(tcf_exts_destroy);
+EXPORT_SYMBOL(tcf_act_destroy);
 
-int tcf_exts_validate(struct net *net, struct tcf_proto *tp, struct nlattr **tb,
-		  struct nlattr *rate_tlv, struct tcf_exts *exts, bool ovr)
+int tcf_act_validate(struct net *net, struct tcf_proto *tp, struct nlattr **tb,
+		     struct nlattr *rate_tlv, struct list_head *actions,
+		     bool ovr)
 {
+	int police = tp->ops->police;
+	int action = tp->ops->action;
+
 #ifdef CONFIG_NET_CLS_ACT
 	{
 		struct tc_action *act;
 
-		INIT_LIST_HEAD(&exts->actions);
-		if (exts->police && tb[exts->police]) {
-			act = tcf_action_init_1(net, tb[exts->police], rate_tlv,
+		INIT_LIST_HEAD(actions);
+		if (police && tb[police]) {
+			act = tcf_action_init_1(net, tb[police], rate_tlv,
 						"police", ovr,
 						TCA_ACT_BIND);
 			if (IS_ERR(act))
 				return PTR_ERR(act);
 
-			act->type = exts->type = TCA_OLD_COMPAT;
-			list_add(&act->list, &exts->actions);
-		} else if (exts->action && tb[exts->action]) {
+			act->type = TCA_OLD_COMPAT;
+			list_add(&act->list, actions);
+		} else if (action && tb[action]) {
 			int err;
-			err = tcf_action_init(net, tb[exts->action], rate_tlv,
+			err = tcf_action_init(net, tb[action], rate_tlv,
 					      NULL, ovr,
-					      TCA_ACT_BIND, &exts->actions);
+					      TCA_ACT_BIND, actions);
 			if (err)
 				return err;
 		}
 	}
 #else
-	if ((exts->action && tb[exts->action]) ||
-	    (exts->police && tb[exts->police]))
+	if ((action && tb[action]) ||
+	    (police && tb[police]))
 		return -EOPNOTSUPP;
 #endif
 
 	return 0;
 }
-EXPORT_SYMBOL(tcf_exts_validate);
+EXPORT_SYMBOL(tcf_act_validate);
 
-void tcf_exts_change(struct tcf_proto *tp, struct tcf_exts *dst,
-		     struct tcf_exts *src)
+void tcf_act_change(struct tcf_proto *tp, struct list_head *dst,
+		    struct list_head *src)
 {
 #ifdef CONFIG_NET_CLS_ACT
 	LIST_HEAD(tmp);
 	tcf_tree_lock(tp);
-	list_splice_init(&dst->actions, &tmp);
-	list_splice(&src->actions, &dst->actions);
+	list_splice_init(dst, &tmp);
+	list_splice(src, dst);
 	tcf_tree_unlock(tp);
 	tcf_action_destroy(&tmp, TCA_ACT_UNBIND);
 #endif
 }
-EXPORT_SYMBOL(tcf_exts_change);
+EXPORT_SYMBOL(tcf_act_change);
 
-#define tcf_exts_first_act(ext) \
-		list_first_entry(&(exts)->actions, struct tc_action, list)
+#define tcf_act_first_act(actions) \
+		list_first_entry(actions, struct tc_action, list)
 
-int tcf_exts_dump(struct sk_buff *skb, struct tcf_exts *exts)
+int tcf_act_dump(struct sk_buff *skb, const struct tcf_proto *tp,
+		 struct list_head *actions)
 {
 #ifdef CONFIG_NET_CLS_ACT
 	struct nlattr *nest;
 
-	if (exts->action && !list_empty(&exts->actions)) {
+	if (tp->ops->action && !list_empty(actions)) {
+		struct tc_action *act = tcf_act_first_act(actions);
 		/*
 		 * again for backward compatible mode - we want
 		 * to work with both old and new modes of entering
 		 * tc data even if iproute2  was newer - jhs
 		 */
-		if (exts->type != TCA_OLD_COMPAT) {
-			nest = nla_nest_start(skb, exts->action);
+		if (act->type != TCA_OLD_COMPAT) {
+			nest = nla_nest_start(skb, tp->ops->action);
 			if (nest == NULL)
 				goto nla_put_failure;
-			if (tcf_action_dump(skb, &exts->actions, 0, 0) < 0)
+			if (tcf_action_dump(skb, actions, 0, 0) < 0)
 				goto nla_put_failure;
 			nla_nest_end(skb, nest);
-		} else if (exts->police) {
-			struct tc_action *act = tcf_exts_first_act(exts);
-			nest = nla_nest_start(skb, exts->police);
+		} else if (tp->ops->police) {
+			nest = nla_nest_start(skb, tp->ops->police);
 			if (nest == NULL || !act)
 				goto nla_put_failure;
 			if (tcf_action_dump_old(skb, act, 0, 0) < 0)
@@ -595,19 +600,19 @@ int tcf_exts_dump(struct sk_buff *skb, struct tcf_exts *exts)
 	return 0;
 #endif
 }
-EXPORT_SYMBOL(tcf_exts_dump);
+EXPORT_SYMBOL(tcf_act_dump);
 
 
-int tcf_exts_dump_stats(struct sk_buff *skb, struct tcf_exts *exts)
+int tcf_act_dump_stats(struct sk_buff *skb, struct list_head *actions)
 {
 #ifdef CONFIG_NET_CLS_ACT
-	struct tc_action *a = tcf_exts_first_act(exts);
+	struct tc_action *a = tcf_act_first_act(actions);
 	if (tcf_action_copy_stats(skb, a, 1) < 0)
 		return -1;
 #endif
 	return 0;
 }
-EXPORT_SYMBOL(tcf_exts_dump_stats);
+EXPORT_SYMBOL(tcf_act_dump_stats);
 
 static int __init tc_filter_init(void)
 {
diff --git a/net/sched/cls_basic.c b/net/sched/cls_basic.c
index fe20826..09c8db6 100644
--- a/net/sched/cls_basic.c
+++ b/net/sched/cls_basic.c
@@ -29,7 +29,7 @@ struct basic_head {
 
 struct basic_filter {
 	u32			handle;
-	struct tcf_exts		exts;
+	struct list_head	actions;
 	struct tcf_ematch_tree	ematches;
 	struct tcf_result	res;
 	struct tcf_proto	*tp;
@@ -48,7 +48,7 @@ static int basic_classify(struct sk_buff *skb, const struct tcf_proto *tp,
 		if (!tcf_em_tree_match(skb, &f->ematches, NULL))
 			continue;
 		*res = f->res;
-		r = tcf_exts_exec(skb, &f->exts, res);
+		r = tcf_act_exec(skb, &f->actions, res);
 		if (r < 0)
 			continue;
 		return r;
@@ -94,7 +94,7 @@ static void basic_delete_filter(struct rcu_head *head)
 	struct tcf_proto *tp = f->tp;
 
 	tcf_unbind_filter(tp, &f->res);
-	tcf_exts_destroy(&f->exts);
+	tcf_act_destroy(&f->actions);
 	tcf_em_tree_destroy(tp, &f->ematches);
 	kfree(f);
 }
@@ -138,11 +138,10 @@ static int basic_set_parms(struct net *net, struct tcf_proto *tp,
 			   struct nlattr *est, bool ovr)
 {
 	int err;
-	struct tcf_exts e;
 	struct tcf_ematch_tree t;
+	struct list_head actions;
 
-	tcf_exts_init(&e, TCA_BASIC_ACT, TCA_BASIC_POLICE);
-	err = tcf_exts_validate(net, tp, tb, est, &e, ovr);
+	err = tcf_act_validate(net, tp, tb, est, &actions, ovr);
 	if (err < 0)
 		return err;
 
@@ -155,13 +154,13 @@ static int basic_set_parms(struct net *net, struct tcf_proto *tp,
 		tcf_bind_filter(tp, &f->res, base);
 	}
 
-	tcf_exts_change(tp, &f->exts, &e);
+	tcf_act_change(tp, &f->actions, &actions);
 	tcf_em_tree_change(tp, &f->ematches, &t);
 	f->tp = tp;
 
 	return 0;
 errout:
-	tcf_exts_destroy(&e);
+	tcf_act_destroy(&actions);
 	return err;
 }
 
@@ -193,7 +192,7 @@ static int basic_change(struct net *net, struct sk_buff *in_skb,
 	if (fnew == NULL)
 		goto errout;
 
-	tcf_exts_init(&fnew->exts, TCA_BASIC_ACT, TCA_BASIC_POLICE);
+	INIT_LIST_HEAD(&fnew->actions);
 	err = -EINVAL;
 	if (handle) {
 		fnew->handle = handle;
@@ -270,13 +269,13 @@ static int basic_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
 	    nla_put_u32(skb, TCA_BASIC_CLASSID, f->res.classid))
 		goto nla_put_failure;
 
-	if (tcf_exts_dump(skb, &f->exts) < 0 ||
+	if (tcf_act_dump(skb, tp, &f->actions) < 0 ||
 	    tcf_em_tree_dump(skb, &f->ematches, TCA_BASIC_EMATCHES) < 0)
 		goto nla_put_failure;
 
 	nla_nest_end(skb, nest);
 
-	if (tcf_exts_dump_stats(skb, &f->exts) < 0)
+	if (tcf_act_dump_stats(skb, &f->actions) < 0)
 		goto nla_put_failure;
 
 	return skb->len;
@@ -288,6 +287,8 @@ static int basic_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
 
 static struct tcf_proto_ops cls_basic_ops __read_mostly = {
 	.kind		=	"basic",
+	.police		=	TCA_BASIC_POLICE,
+	.action		=	TCA_BASIC_ACT,
 	.classify	=	basic_classify,
 	.init		=	basic_init,
 	.destroy	=	basic_destroy,
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 4318d06..dbdf38c 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -33,7 +33,7 @@ struct cls_bpf_head {
 struct cls_bpf_prog {
 	struct bpf_prog *filter;
 	struct sock_filter *bpf_ops;
-	struct tcf_exts exts;
+	struct list_head actions;
 	struct tcf_result res;
 	struct list_head link;
 	u32 handle;
@@ -66,7 +66,7 @@ static int cls_bpf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
 		if (filter_res != -1)
 			res->classid = filter_res;
 
-		ret = tcf_exts_exec(skb, &prog->exts, res);
+		ret = tcf_act_exec(skb, &prog->actions, res);
 		if (ret < 0)
 			continue;
 
@@ -93,7 +93,7 @@ static int cls_bpf_init(struct tcf_proto *tp)
 static void cls_bpf_delete_prog(struct tcf_proto *tp, struct cls_bpf_prog *prog)
 {
 	tcf_unbind_filter(tp, &prog->res);
-	tcf_exts_destroy(&prog->exts);
+	tcf_act_destroy(&prog->actions);
 
 	bpf_prog_destroy(prog->filter);
 
@@ -167,7 +167,7 @@ static int cls_bpf_modify_existing(struct net *net, struct tcf_proto *tp,
 				   struct nlattr *est, bool ovr)
 {
 	struct sock_filter *bpf_ops;
-	struct tcf_exts exts;
+	struct list_head actions;
 	struct sock_fprog_kern tmp;
 	struct bpf_prog *fp;
 	u16 bpf_size, bpf_len;
@@ -177,8 +177,7 @@ static int cls_bpf_modify_existing(struct net *net, struct tcf_proto *tp,
 	if (!tb[TCA_BPF_OPS_LEN] || !tb[TCA_BPF_OPS] || !tb[TCA_BPF_CLASSID])
 		return -EINVAL;
 
-	tcf_exts_init(&exts, TCA_BPF_ACT, TCA_BPF_POLICE);
-	ret = tcf_exts_validate(net, tp, tb, est, &exts, ovr);
+	ret = tcf_act_validate(net, tp, tb, est, &actions, ovr);
 	if (ret < 0)
 		return ret;
 
@@ -211,13 +210,13 @@ static int cls_bpf_modify_existing(struct net *net, struct tcf_proto *tp,
 	prog->res.classid = classid;
 
 	tcf_bind_filter(tp, &prog->res, base);
-	tcf_exts_change(tp, &prog->exts, &exts);
+	tcf_act_change(tp, &prog->actions, &actions);
 
 	return 0;
 errout_free:
 	kfree(bpf_ops);
 errout:
-	tcf_exts_destroy(&exts);
+	tcf_act_destroy(&actions);
 	return ret;
 }
 
@@ -258,7 +257,7 @@ static int cls_bpf_change(struct net *net, struct sk_buff *in_skb,
 	if (!prog)
 		return -ENOBUFS;
 
-	tcf_exts_init(&prog->exts, TCA_BPF_ACT, TCA_BPF_POLICE);
+	INIT_LIST_HEAD(&prog->actions);
 
 	if (oldprog) {
 		if (handle && oldprog->handle != handle) {
@@ -322,12 +321,12 @@ static int cls_bpf_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
 
 	memcpy(nla_data(nla), prog->bpf_ops, nla_len(nla));
 
-	if (tcf_exts_dump(skb, &prog->exts) < 0)
+	if (tcf_act_dump(skb, tp, &prog->actions) < 0)
 		goto nla_put_failure;
 
 	nla_nest_end(skb, nest);
 
-	if (tcf_exts_dump_stats(skb, &prog->exts) < 0)
+	if (tcf_act_dump_stats(skb, &prog->actions) < 0)
 		goto nla_put_failure;
 
 	return skb->len;
@@ -356,6 +355,8 @@ static void cls_bpf_walk(struct tcf_proto *tp, struct tcf_walker *arg)
 
 static struct tcf_proto_ops cls_bpf_ops __read_mostly = {
 	.kind		=	"bpf",
+	.action		=	TCA_BPF_ACT,
+	.police		=	TCA_BPF_POLICE,
 	.owner		=	THIS_MODULE,
 	.classify	=	cls_bpf_classify,
 	.init		=	cls_bpf_init,
diff --git a/net/sched/cls_cgroup.c b/net/sched/cls_cgroup.c
index 3409f16..9640b20 100644
--- a/net/sched/cls_cgroup.c
+++ b/net/sched/cls_cgroup.c
@@ -20,7 +20,7 @@
 
 struct cls_cgroup_head {
 	u32			handle;
-	struct tcf_exts		exts;
+	struct list_head	actions;
 	struct tcf_ematch_tree	ematches;
 	struct tcf_proto	*tp;
 	struct rcu_head		rcu;
@@ -59,7 +59,7 @@ static int cls_cgroup_classify(struct sk_buff *skb, const struct tcf_proto *tp,
 
 	res->classid = classid;
 	res->class = 0;
-	return tcf_exts_exec(skb, &head->exts, res);
+	return tcf_act_exec(skb, &head->actions, res);
 }
 
 static unsigned long cls_cgroup_get(struct tcf_proto *tp, u32 handle)
@@ -86,7 +86,7 @@ static void cls_cgroup_destroy_rcu(struct rcu_head *root)
 						    struct cls_cgroup_head,
 						    rcu);
 
-	tcf_exts_destroy(&head->exts);
+	tcf_act_destroy(&head->actions);
 	tcf_em_tree_destroy(head->tp, &head->ematches);
 	kfree(head);
 }
@@ -100,7 +100,7 @@ static int cls_cgroup_change(struct net *net, struct sk_buff *in_skb,
 	struct cls_cgroup_head *head = rtnl_dereference(tp->root);
 	struct cls_cgroup_head *new;
 	struct tcf_ematch_tree t;
-	struct tcf_exts e;
+	struct list_head actions;
 	int err;
 
 	if (!tca[TCA_OPTIONS])
@@ -116,7 +116,6 @@ static int cls_cgroup_change(struct net *net, struct sk_buff *in_skb,
 	if (!new)
 		return -ENOBUFS;
 
-	tcf_exts_init(&new->exts, TCA_CGROUP_ACT, TCA_CGROUP_POLICE);
 	if (head)
 		new->handle = head->handle;
 	else
@@ -128,18 +127,17 @@ static int cls_cgroup_change(struct net *net, struct sk_buff *in_skb,
 	if (err < 0)
 		goto errout;
 
-	tcf_exts_init(&e, TCA_CGROUP_ACT, TCA_CGROUP_POLICE);
-	err = tcf_exts_validate(net, tp, tb, tca[TCA_RATE], &e, ovr);
+	err = tcf_act_validate(net, tp, tb, tca[TCA_RATE], &actions, ovr);
 	if (err < 0)
 		goto errout;
 
 	err = tcf_em_tree_validate(tp, tb[TCA_CGROUP_EMATCHES], &t);
 	if (err < 0) {
-		tcf_exts_destroy(&e);
+		tcf_act_destroy(&actions);
 		goto errout;
 	}
 
-	tcf_exts_change(tp, &new->exts, &e);
+	tcf_act_change(tp, &new->actions, &actions);
 	tcf_em_tree_change(tp, &new->ematches, &t);
 
 	rcu_assign_pointer(tp->root, new);
@@ -156,7 +154,7 @@ static void cls_cgroup_destroy(struct tcf_proto *tp)
 	struct cls_cgroup_head *head = rtnl_dereference(tp->root);
 
 	if (head) {
-		tcf_exts_destroy(&head->exts);
+		tcf_act_destroy(&head->actions);
 		tcf_em_tree_destroy(tp, &head->ematches);
 		RCU_INIT_POINTER(tp->root, NULL);
 		kfree_rcu(head, rcu);
@@ -196,13 +194,13 @@ static int cls_cgroup_dump(struct net *net, struct tcf_proto *tp, unsigned long
 	if (nest == NULL)
 		goto nla_put_failure;
 
-	if (tcf_exts_dump(skb, &head->exts) < 0 ||
+	if (tcf_act_dump(skb, tp, &head->actions) < 0 ||
 	    tcf_em_tree_dump(skb, &head->ematches, TCA_CGROUP_EMATCHES) < 0)
 		goto nla_put_failure;
 
 	nla_nest_end(skb, nest);
 
-	if (tcf_exts_dump_stats(skb, &head->exts) < 0)
+	if (tcf_act_dump_stats(skb, &head->actions) < 0)
 		goto nla_put_failure;
 
 	return skb->len;
@@ -214,6 +212,8 @@ static int cls_cgroup_dump(struct net *net, struct tcf_proto *tp, unsigned long
 
 static struct tcf_proto_ops cls_cgroup_ops __read_mostly = {
 	.kind		=	"cgroup",
+	.action		=	TCA_CGROUP_ACT,
+	.police		=	TCA_CGROUP_POLICE,
 	.init		=	cls_cgroup_init,
 	.change		=	cls_cgroup_change,
 	.classify	=	cls_cgroup_classify,
diff --git a/net/sched/cls_flow.c b/net/sched/cls_flow.c
index f18d27f7..5cbc556 100644
--- a/net/sched/cls_flow.c
+++ b/net/sched/cls_flow.c
@@ -39,7 +39,7 @@ struct flow_head {
 
 struct flow_filter {
 	struct list_head	list;
-	struct tcf_exts		exts;
+	struct list_head	actions;
 	struct tcf_ematch_tree	ematches;
 	struct tcf_proto	*tp;
 	struct timer_list	perturb_timer;
@@ -317,7 +317,7 @@ static int flow_classify(struct sk_buff *skb, const struct tcf_proto *tp,
 		res->class   = 0;
 		res->classid = TC_H_MAKE(f->baseclass, f->baseclass + classid);
 
-		r = tcf_exts_exec(skb, &f->exts, res);
+		r = tcf_act_exec(skb, &f->actions, res);
 		if (r < 0)
 			continue;
 		return r;
@@ -354,7 +354,7 @@ static void flow_destroy_filter(struct rcu_head *head)
 	struct flow_filter *f = container_of(head, struct flow_filter, rcu);
 
 	del_timer_sync(&f->perturb_timer);
-	tcf_exts_destroy(&f->exts);
+	tcf_act_destroy(&f->actions);
 	tcf_em_tree_destroy(f->tp, &f->ematches);
 	kfree(f);
 }
@@ -368,7 +368,7 @@ static int flow_change(struct net *net, struct sk_buff *in_skb,
 	struct flow_filter *fold, *fnew;
 	struct nlattr *opt = tca[TCA_OPTIONS];
 	struct nlattr *tb[TCA_FLOW_MAX + 1];
-	struct tcf_exts e;
+	struct list_head actions;
 	struct tcf_ematch_tree t;
 	unsigned int nkeys = 0;
 	unsigned int perturb_period = 0;
@@ -405,8 +405,7 @@ static int flow_change(struct net *net, struct sk_buff *in_skb,
 			return -EOPNOTSUPP;
 	}
 
-	tcf_exts_init(&e, TCA_FLOW_ACT, TCA_FLOW_POLICE);
-	err = tcf_exts_validate(net, tp, tb, tca[TCA_RATE], &e, ovr);
+	err = tcf_act_validate(net, tp, tb, tca[TCA_RATE], &actions, ovr);
 	if (err < 0)
 		return err;
 
@@ -483,14 +482,14 @@ static int flow_change(struct net *net, struct sk_buff *in_skb,
 		fnew->mask  = ~0U;
 		fnew->tp = tp;
 		get_random_bytes(&fnew->hashrnd, 4);
-		tcf_exts_init(&fnew->exts, TCA_FLOW_ACT, TCA_FLOW_POLICE);
+		INIT_LIST_HEAD(&fnew->actions);
 	}
 
 	fnew->perturb_timer.function = flow_perturbation;
 	fnew->perturb_timer.data = (unsigned long)fnew;
 	init_timer_deferrable(&fnew->perturb_timer);
 
-	tcf_exts_change(tp, &fnew->exts, &e);
+	tcf_act_change(tp, &fnew->actions, &actions);
 	tcf_em_tree_change(tp, &fnew->ematches, &t);
 
 	if (tb[TCA_FLOW_KEYS]) {
@@ -533,7 +532,7 @@ static int flow_change(struct net *net, struct sk_buff *in_skb,
 	tcf_em_tree_destroy(tp, &t);
 	kfree(fnew);
 err1:
-	tcf_exts_destroy(&e);
+	tcf_act_destroy(&actions);
 	return err;
 }
 
@@ -628,7 +627,7 @@ static int flow_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
 	    nla_put_u32(skb, TCA_FLOW_PERTURB, f->perturb_period / HZ))
 		goto nla_put_failure;
 
-	if (tcf_exts_dump(skb, &f->exts) < 0)
+	if (tcf_act_dump(skb, tp, &f->actions) < 0)
 		goto nla_put_failure;
 #ifdef CONFIG_NET_EMATCH
 	if (f->ematches.hdr.nmatches &&
@@ -637,7 +636,7 @@ static int flow_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
 #endif
 	nla_nest_end(skb, nest);
 
-	if (tcf_exts_dump_stats(skb, &f->exts) < 0)
+	if (tcf_act_dump_stats(skb, &f->actions) < 0)
 		goto nla_put_failure;
 
 	return skb->len;
@@ -666,6 +665,8 @@ static void flow_walk(struct tcf_proto *tp, struct tcf_walker *arg)
 
 static struct tcf_proto_ops cls_flow_ops __read_mostly = {
 	.kind		= "flow",
+	.action		= TCA_FLOW_ACT,
+	.police		= TCA_FLOW_POLICE,
 	.classify	= flow_classify,
 	.init		= flow_init,
 	.destroy	= flow_destroy,
diff --git a/net/sched/cls_fw.c b/net/sched/cls_fw.c
index da805ae..ba955c4 100644
--- a/net/sched/cls_fw.c
+++ b/net/sched/cls_fw.c
@@ -44,7 +44,7 @@ struct fw_filter {
 #ifdef CONFIG_NET_CLS_IND
 	int			ifindex;
 #endif /* CONFIG_NET_CLS_IND */
-	struct tcf_exts		exts;
+	struct list_head	actions;
 	struct tcf_proto	*tp;
 	struct rcu_head		rcu;
 };
@@ -75,7 +75,7 @@ static int fw_classify(struct sk_buff *skb, const struct tcf_proto *tp,
 				if (!tcf_match_indev(skb, f->ifindex))
 					continue;
 #endif /* CONFIG_NET_CLS_IND */
-				r = tcf_exts_exec(skb, &f->exts, res);
+				r = tcf_act_exec(skb, &f->actions, res);
 				if (r < 0)
 					continue;
 
@@ -126,7 +126,7 @@ static void fw_delete_filter(struct rcu_head *head)
 	struct tcf_proto *tp = f->tp;
 
 	tcf_unbind_filter(tp, &f->res);
-	tcf_exts_destroy(&f->exts);
+	tcf_act_destroy(&f->actions);
 	kfree(f);
 }
 
@@ -185,12 +185,11 @@ fw_change_attrs(struct net *net, struct tcf_proto *tp, struct fw_filter *f,
 	struct nlattr **tb, struct nlattr **tca, unsigned long base, bool ovr)
 {
 	struct fw_head *head = rtnl_dereference(tp->root);
-	struct tcf_exts e;
+	struct list_head actions;
 	u32 mask;
 	int err;
 
-	tcf_exts_init(&e, TCA_FW_ACT, TCA_FW_POLICE);
-	err = tcf_exts_validate(net, tp, tb, tca[TCA_RATE], &e, ovr);
+	err = tcf_act_validate(net, tp, tb, tca[TCA_RATE], &actions, ovr);
 	if (err < 0)
 		return err;
 
@@ -219,11 +218,11 @@ fw_change_attrs(struct net *net, struct tcf_proto *tp, struct fw_filter *f,
 	} else if (head->mask != 0xFFFFFFFF)
 		goto errout;
 
-	tcf_exts_change(tp, &f->exts, &e);
+	tcf_act_change(tp, &f->actions, &actions);
 
 	return 0;
 errout:
-	tcf_exts_destroy(&e);
+	tcf_act_destroy(&actions);
 	return err;
 }
 
@@ -264,7 +263,7 @@ static int fw_change(struct net *net, struct sk_buff *in_skb,
 #endif /* CONFIG_NET_CLS_IND */
 		fnew->tp = f->tp;
 
-		tcf_exts_init(&fnew->exts, TCA_FW_ACT, TCA_FW_POLICE);
+		INIT_LIST_HEAD(&fnew->actions);
 
 		err = fw_change_attrs(net, tp, fnew, tb, tca, base, ovr);
 		if (err < 0) {
@@ -306,7 +305,7 @@ static int fw_change(struct net *net, struct sk_buff *in_skb,
 	if (f == NULL)
 		return -ENOBUFS;
 
-	tcf_exts_init(&f->exts, TCA_FW_ACT, TCA_FW_POLICE);
+	INIT_LIST_HEAD(&f->actions);
 	f->id = handle;
 	f->tp = tp;
 
@@ -367,7 +366,7 @@ static int fw_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
 
 	t->tcm_handle = f->id;
 
-	if (!f->res.classid && !tcf_exts_is_available(&f->exts))
+	if (!f->res.classid && !tcf_act_is_available(&f->actions))
 		return skb->len;
 
 	nest = nla_nest_start(skb, TCA_OPTIONS);
@@ -389,12 +388,12 @@ static int fw_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
 	    nla_put_u32(skb, TCA_FW_MASK, head->mask))
 		goto nla_put_failure;
 
-	if (tcf_exts_dump(skb, &f->exts) < 0)
+	if (tcf_act_dump(skb, tp, &f->actions) < 0)
 		goto nla_put_failure;
 
 	nla_nest_end(skb, nest);
 
-	if (tcf_exts_dump_stats(skb, &f->exts) < 0)
+	if (tcf_act_dump_stats(skb, &f->actions) < 0)
 		goto nla_put_failure;
 
 	return skb->len;
@@ -406,6 +405,8 @@ static int fw_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
 
 static struct tcf_proto_ops cls_fw_ops __read_mostly = {
 	.kind		=	"fw",
+	.action		=	TCA_FW_ACT,
+	.police		=	TCA_FW_POLICE,
 	.classify	=	fw_classify,
 	.init		=	fw_init,
 	.destroy	=	fw_destroy,
diff --git a/net/sched/cls_route.c b/net/sched/cls_route.c
index b665aee..03b6e20 100644
--- a/net/sched/cls_route.c
+++ b/net/sched/cls_route.c
@@ -53,7 +53,7 @@ struct route4_filter {
 	int			iif;
 
 	struct tcf_result	res;
-	struct tcf_exts		exts;
+	struct list_head	actions;
 	u32			handle;
 	struct route4_bucket	*bkt;
 	struct tcf_proto	*tp;
@@ -113,8 +113,8 @@ static inline int route4_hash_wild(void)
 #define ROUTE4_APPLY_RESULT()					\
 {								\
 	*res = f->res;						\
-	if (tcf_exts_is_available(&f->exts)) {			\
-		int r = tcf_exts_exec(skb, &f->exts, res);	\
+	if (tcf_act_is_available(&f->actions)) {		\
+		int r = tcf_act_exec(skb, &f->actions, res);	\
 		if (r < 0) {					\
 			dont_cache = 1;				\
 			continue;				\
@@ -272,7 +272,7 @@ route4_delete_filter(struct rcu_head *head)
 	struct tcf_proto *tp = f->tp;
 
 	tcf_unbind_filter(tp, &f->res);
-	tcf_exts_destroy(&f->exts);
+	tcf_act_destroy(&f->actions);
 	kfree(f);
 }
 
@@ -377,10 +377,9 @@ static int route4_set_parms(struct net *net, struct tcf_proto *tp,
 	struct route4_filter *fp;
 	unsigned int h1;
 	struct route4_bucket *b;
-	struct tcf_exts e;
+	struct list_head actions;
 
-	tcf_exts_init(&e, TCA_ROUTE4_ACT, TCA_ROUTE4_POLICE);
-	err = tcf_exts_validate(net, tp, tb, est, &e, ovr);
+	err = tcf_act_validate(net, tp, tb, est, &actions, ovr);
 	if (err < 0)
 		return err;
 
@@ -452,11 +451,11 @@ static int route4_set_parms(struct net *net, struct tcf_proto *tp,
 		tcf_bind_filter(tp, &f->res, base);
 	}
 
-	tcf_exts_change(tp, &f->exts, &e);
+	tcf_act_change(tp, &f->actions, &actions);
 
 	return 0;
 errout:
-	tcf_exts_destroy(&e);
+	tcf_act_destroy(&actions);
 	return err;
 }
 
@@ -499,7 +498,7 @@ static int route4_change(struct net *net, struct sk_buff *in_skb,
 	if (!f)
 		goto errout;
 
-	tcf_exts_init(&f->exts, TCA_ROUTE4_ACT, TCA_ROUTE4_POLICE);
+	INIT_LIST_HEAD(&f->actions);
 	if (fold) {
 		f->id = fold->id;
 		f->iif = fold->iif;
@@ -625,12 +624,12 @@ static int route4_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
 	    nla_put_u32(skb, TCA_ROUTE4_CLASSID, f->res.classid))
 		goto nla_put_failure;
 
-	if (tcf_exts_dump(skb, &f->exts) < 0)
+	if (tcf_act_dump(skb, tp, &f->actions) < 0)
 		goto nla_put_failure;
 
 	nla_nest_end(skb, nest);
 
-	if (tcf_exts_dump_stats(skb, &f->exts) < 0)
+	if (tcf_act_dump_stats(skb, &f->actions) < 0)
 		goto nla_put_failure;
 
 	return skb->len;
@@ -642,6 +641,8 @@ static int route4_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
 
 static struct tcf_proto_ops cls_route4_ops __read_mostly = {
 	.kind		=	"route",
+	.action		=	TCA_ROUTE4_ACT,
+	.police		=	TCA_ROUTE4_POLICE,
 	.classify	=	route4_classify,
 	.init		=	route4_init,
 	.destroy	=	route4_destroy,
diff --git a/net/sched/cls_rsvp.h b/net/sched/cls_rsvp.h
index 6bb55f2..64e5b31 100644
--- a/net/sched/cls_rsvp.h
+++ b/net/sched/cls_rsvp.h
@@ -93,7 +93,7 @@ struct rsvp_filter {
 	u8				tunnelhdr;
 
 	struct tcf_result		res;
-	struct tcf_exts			exts;
+	struct list_head		actions;
 
 	u32				handle;
 	struct rsvp_session		*sess;
@@ -121,7 +121,7 @@ static inline unsigned int hash_src(__be32 *src)
 
 #define RSVP_APPLY_RESULT()				\
 {							\
-	int r = tcf_exts_exec(skb, &f->exts, res);	\
+	int r = tcf_act_exec(skb, &f->actions, res);	\
 	if (r < 0)					\
 		continue;				\
 	else if (r > 0)					\
@@ -291,7 +291,7 @@ static void
 rsvp_delete_filter(struct tcf_proto *tp, struct rsvp_filter *f)
 {
 	tcf_unbind_filter(tp, &f->res);
-	tcf_exts_destroy(&f->exts);
+	tcf_act_destroy(&f->actions);
 	kfree_rcu(f, rcu);
 }
 
@@ -461,7 +461,7 @@ static int rsvp_change(struct net *net, struct sk_buff *in_skb,
 	struct tc_rsvp_pinfo *pinfo = NULL;
 	struct nlattr *opt = tca[TCA_OPTIONS];
 	struct nlattr *tb[TCA_RSVP_MAX + 1];
-	struct tcf_exts e;
+	struct list_head actions;
 	unsigned int h1, h2;
 	__be32 *dst;
 	int err;
@@ -473,8 +473,7 @@ static int rsvp_change(struct net *net, struct sk_buff *in_skb,
 	if (err < 0)
 		return err;
 
-	tcf_exts_init(&e, TCA_RSVP_ACT, TCA_RSVP_POLICE);
-	err = tcf_exts_validate(net, tp, tb, tca[TCA_RATE], &e, ovr);
+	err = tcf_act_validate(net, tp, tb, tca[TCA_RATE], &actions, ovr);
 	if (err < 0)
 		return err;
 
@@ -492,14 +491,14 @@ static int rsvp_change(struct net *net, struct sk_buff *in_skb,
 			goto errout2;
 		}
 
-		tcf_exts_init(&n->exts, TCA_RSVP_ACT, TCA_RSVP_POLICE);
+		INIT_LIST_HEAD(&n->actions);
 
 		if (tb[TCA_RSVP_CLASSID]) {
 			n->res.classid = nla_get_u32(tb[TCA_RSVP_CLASSID]);
 			tcf_bind_filter(tp, &n->res, base);
 		}
 
-		tcf_exts_change(tp, &n->exts, &e);
+		tcf_act_change(tp, &n->actions, &actions);
 		rsvp_replace(tp, n, handle);
 		return 0;
 	}
@@ -516,7 +515,7 @@ static int rsvp_change(struct net *net, struct sk_buff *in_skb,
 	if (f == NULL)
 		goto errout2;
 
-	tcf_exts_init(&f->exts, TCA_RSVP_ACT, TCA_RSVP_POLICE);
+	INIT_LIST_HEAD(&f->actions);
 	h2 = 16;
 	if (tb[TCA_RSVP_SRC]) {
 		memcpy(f->src, nla_data(tb[TCA_RSVP_SRC]), sizeof(f->src));
@@ -570,7 +569,7 @@ static int rsvp_change(struct net *net, struct sk_buff *in_skb,
 			if (f->tunnelhdr == 0)
 				tcf_bind_filter(tp, &f->res, base);
 
-			tcf_exts_change(tp, &f->exts, &e);
+			tcf_act_change(tp, &f->actions, &actions);
 
 			fp = &s->ht[h2];
 			for (nfp = rtnl_dereference(*fp); nfp;
@@ -615,7 +614,7 @@ static int rsvp_change(struct net *net, struct sk_buff *in_skb,
 errout:
 	kfree(f);
 errout2:
-	tcf_exts_destroy(&e);
+	tcf_act_destroy(&actions);
 	return err;
 }
 
@@ -688,12 +687,12 @@ static int rsvp_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
 	    nla_put(skb, TCA_RSVP_SRC, sizeof(f->src), f->src))
 		goto nla_put_failure;
 
-	if (tcf_exts_dump(skb, &f->exts) < 0)
+	if (tcf_act_dump(skb, tp, &f->actions) < 0)
 		goto nla_put_failure;
 
 	nla_nest_end(skb, nest);
 
-	if (tcf_exts_dump_stats(skb, &f->exts) < 0)
+	if (tcf_act_dump_stats(skb, &f->actions) < 0)
 		goto nla_put_failure;
 	return skb->len;
 
@@ -704,6 +703,8 @@ static int rsvp_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
 
 static struct tcf_proto_ops RSVP_OPS __read_mostly = {
 	.kind		=	RSVP_ID,
+	.action		=	TCA_RSVP_ACT,
+	.police		=	TCA_RSVP_POLICE,
 	.classify	=	rsvp_classify,
 	.init		=	rsvp_init,
 	.destroy	=	rsvp_destroy,
diff --git a/net/sched/cls_tcindex.c b/net/sched/cls_tcindex.c
index 30f10fb..c584195 100644
--- a/net/sched/cls_tcindex.c
+++ b/net/sched/cls_tcindex.c
@@ -25,7 +25,7 @@
 
 
 struct tcindex_filter_result {
-	struct tcf_exts		exts;
+	struct list_head	actions;
 	struct tcf_result	res;
 };
 
@@ -52,7 +52,7 @@ struct tcindex_data {
 static inline int
 tcindex_filter_is_set(struct tcindex_filter_result *r)
 {
-	return tcf_exts_is_predicative(&r->exts) || r->res.classid;
+	return tcf_act_is_predicative(&r->actions) || r->res.classid;
 }
 
 static struct tcindex_filter_result *
@@ -100,7 +100,7 @@ static int tcindex_classify(struct sk_buff *skb, const struct tcf_proto *tp,
 	*res = f->res;
 	pr_debug("map 0x%x\n", res->classid);
 
-	return tcf_exts_exec(skb, &f->exts, res);
+	return tcf_act_exec(skb, &f->actions, res);
 }
 
 
@@ -169,7 +169,7 @@ tcindex_delete(struct tcf_proto *tp, unsigned long arg)
 		rcu_assign_pointer(*walk, rtnl_dereference(f->next));
 	}
 	tcf_unbind_filter(tp, &r->res);
-	tcf_exts_destroy(&r->exts);
+	tcf_act_destroy(&r->actions);
 	if (f)
 		kfree_rcu(f, rcu);
 	return 0;
@@ -208,7 +208,7 @@ static const struct nla_policy tcindex_policy[TCA_TCINDEX_MAX + 1] = {
 static void tcindex_filter_result_init(struct tcindex_filter_result *r)
 {
 	memset(r, 0, sizeof(*r));
-	tcf_exts_init(&r->exts, TCA_TCINDEX_ACT, TCA_TCINDEX_POLICE);
+	INIT_LIST_HEAD(&r->actions);
 }
 
 static void __tcindex_partial_destroy(struct rcu_head *head)
@@ -230,10 +230,9 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base,
 	struct tcindex_filter_result cr;
 	struct tcindex_data *cp, *oldp;
 	struct tcindex_filter *f = NULL; /* make gcc behave */
-	struct tcf_exts e;
+	struct list_head actions;
 
-	tcf_exts_init(&e, TCA_TCINDEX_ACT, TCA_TCINDEX_POLICE);
-	err = tcf_exts_validate(net, tp, tb, est, &e, ovr);
+	err = tcf_act_validate(net, tp, tb, est, &actions, ovr);
 	if (err < 0)
 		return err;
 
@@ -261,8 +260,7 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base,
 		if (!cp->perfect)
 			goto errout;
 		for (i = 0; i < cp->hash; i++)
-			tcf_exts_init(&cp->perfect[i].exts,
-				      TCA_TCINDEX_ACT, TCA_TCINDEX_POLICE);
+			INIT_LIST_HEAD(&cp->perfect[i].actions);
 		balloc = 1;
 	}
 	cp->h = p->h;
@@ -330,9 +328,7 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base,
 			if (!cp->perfect)
 				goto errout_alloc;
 			for (i = 0; i < cp->hash; i++)
-				tcf_exts_init(&cp->perfect[i].exts,
-					      TCA_TCINDEX_ACT,
-					      TCA_TCINDEX_POLICE);
+				INIT_LIST_HEAD(&cp->perfect[i].actions);
 			balloc = 1;
 		} else {
 			struct tcindex_filter __rcu **hash;
@@ -369,9 +365,9 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base,
 	}
 
 	if (old_r)
-		tcf_exts_change(tp, &r->exts, &e);
+		tcf_act_change(tp, &r->actions, &actions);
 	else
-		tcf_exts_change(tp, &cr.exts, &e);
+		tcf_act_change(tp, &cr.actions, &actions);
 
 	if (old_r && old_r != r)
 		tcindex_filter_result_init(old_r);
@@ -384,7 +380,7 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base,
 		struct tcindex_filter *nfp;
 		struct tcindex_filter __rcu **fp;
 
-		tcf_exts_change(tp, &f->result.exts, &r->exts);
+		tcf_act_change(tp, &f->result.actions, &r->actions);
 
 		fp = cp->h + (handle % cp->hash);
 		for (nfp = rtnl_dereference(*fp);
@@ -406,7 +402,7 @@ tcindex_set_parms(struct net *net, struct tcf_proto *tp, unsigned long base,
 		kfree(cp->h);
 errout:
 	kfree(cp);
-	tcf_exts_destroy(&e);
+	tcf_act_destroy(&actions);
 	return err;
 }
 
@@ -539,11 +535,11 @@ static int tcindex_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
 		    nla_put_u32(skb, TCA_TCINDEX_CLASSID, r->res.classid))
 			goto nla_put_failure;
 
-		if (tcf_exts_dump(skb, &r->exts) < 0)
+		if (tcf_act_dump(skb, tp, &r->actions) < 0)
 			goto nla_put_failure;
 		nla_nest_end(skb, nest);
 
-		if (tcf_exts_dump_stats(skb, &r->exts) < 0)
+		if (tcf_act_dump_stats(skb, &r->actions) < 0)
 			goto nla_put_failure;
 	}
 
@@ -556,6 +552,8 @@ static int tcindex_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
 
 static struct tcf_proto_ops cls_tcindex_ops __read_mostly = {
 	.kind		=	"tcindex",
+	.action		=	TCA_TCINDEX_ACT,
+	.police		=	TCA_TCINDEX_POLICE,
 	.classify	=	tcindex_classify,
 	.init		=	tcindex_init,
 	.destroy	=	tcindex_destroy,
diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index 0472909..973e7f3 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -48,7 +48,7 @@ struct tc_u_knode {
 	struct tc_u_knode __rcu	*next;
 	u32			handle;
 	struct tc_u_hnode __rcu	*ht_up;
-	struct tcf_exts		exts;
+	struct list_head	actions;
 #ifdef CONFIG_NET_CLS_IND
 	int			ifindex;
 #endif
@@ -173,7 +173,7 @@ static int u32_classify(struct sk_buff *skb, const struct tcf_proto *tp, struct
 #ifdef CONFIG_CLS_U32_PERF
 				__this_cpu_inc(n->pf->rhit);
 #endif
-				r = tcf_exts_exec(skb, &n->exts, res);
+				r = tcf_act_exec(skb, &n->actions, res);
 				if (r < 0) {
 					n = rcu_dereference_bh(n->next);
 					goto next_knode;
@@ -358,7 +358,7 @@ static int u32_destroy_key(struct tcf_proto *tp,
 			   struct tc_u_knode *n,
 			   bool free_pf)
 {
-	tcf_exts_destroy(&n->exts);
+	tcf_act_destroy(&n->actions);
 	if (n->ht_down)
 		n->ht_down->refcnt--;
 #ifdef CONFIG_CLS_U32_PERF
@@ -560,10 +560,9 @@ static int u32_set_parms(struct net *net, struct tcf_proto *tp,
 			 struct nlattr *est, bool ovr)
 {
 	int err;
-	struct tcf_exts e;
+	struct list_head actions;
 
-	tcf_exts_init(&e, TCA_U32_ACT, TCA_U32_POLICE);
-	err = tcf_exts_validate(net, tp, tb, est, &e, ovr);
+	err = tcf_act_validate(net, tp, tb, est, &actions, ovr);
 	if (err < 0)
 		return err;
 
@@ -603,11 +602,11 @@ static int u32_set_parms(struct net *net, struct tcf_proto *tp,
 		n->ifindex = ret;
 	}
 #endif
-	tcf_exts_change(tp, &n->exts, &e);
+	tcf_act_change(tp, &n->actions, &actions);
 
 	return 0;
 errout:
-	tcf_exts_destroy(&e);
+	tcf_act_destroy(&actions);
 	return err;
 }
 
@@ -681,8 +680,7 @@ static struct tc_u_knode *u32_init_knode(struct tcf_proto *tp,
 #endif
 	new->tp = tp;
 	memcpy(&new->sel, s, sizeof(*s) + s->nkeys*sizeof(struct tc_u32_key));
-
-	tcf_exts_init(&new->exts, TCA_U32_ACT, TCA_U32_POLICE);
+	INIT_LIST_HEAD(&new->actions);
 
 	return new;
 }
@@ -810,7 +808,7 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
 	RCU_INIT_POINTER(n->ht_up, ht);
 	n->handle = handle;
 	n->fshift = s->hmask ? ffs(ntohl(s->hmask)) - 1 : 0;
-	tcf_exts_init(&n->exts, TCA_U32_ACT, TCA_U32_POLICE);
+	INIT_LIST_HEAD(&n->actions);
 	n->tp = tp;
 
 #ifdef CONFIG_CLS_U32_MARK
@@ -965,7 +963,7 @@ static int u32_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
 		}
 #endif
 
-		if (tcf_exts_dump(skb, &n->exts) < 0)
+		if (tcf_act_dump(skb, tp, &n->actions) < 0)
 			goto nla_put_failure;
 
 #ifdef CONFIG_NET_CLS_IND
@@ -1006,7 +1004,7 @@ static int u32_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
 	nla_nest_end(skb, nest);
 
 	if (TC_U32_KEY(n->handle))
-		if (tcf_exts_dump_stats(skb, &n->exts) < 0)
+		if (tcf_act_dump_stats(skb, &n->actions) < 0)
 			goto nla_put_failure;
 	return skb->len;
 
@@ -1017,6 +1015,8 @@ static int u32_dump(struct net *net, struct tcf_proto *tp, unsigned long fh,
 
 static struct tcf_proto_ops cls_u32_ops __read_mostly = {
 	.kind		=	"u32",
+	.action		=	TCA_U32_ACT,
+	.police		=	TCA_U32_POLICE,
 	.classify	=	u32_classify,
 	.init		=	u32_init,
 	.destroy	=	u32_destroy,
-- 
1.8.3.1

^ permalink raw reply related
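
For readers following the tcf_act_change() hunk in the patch above, the swap it performs can be sketched in user space: the old entries are spliced onto a temporary list, then the new entries are spliced into the destination. This is only a sketch under stated assumptions. struct list_node and every helper name below are hypothetical stand-ins for the kernel's struct list_head, list_splice_init() and list_splice(); the tree lock and the tcf_action_destroy() call are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for the kernel's struct list_head. */
struct list_node {
	struct list_node *next, *prev;
};

/* An empty list is a head that points at itself. */
static void list_init(struct list_node *head)
{
	head->next = head;
	head->prev = head;
}

static int list_is_empty(const struct list_node *head)
{
	return head->next == head;
}

static void list_add_tail_node(struct list_node *node, struct list_node *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Move all entries of src to the front of dst; src itself is left stale. */
static void list_splice_nodes(struct list_node *src, struct list_node *dst)
{
	if (list_is_empty(src))
		return;

	struct list_node *first = src->next, *last = src->prev;

	first->prev = dst;
	last->next = dst->next;
	dst->next->prev = last;
	dst->next = first;
}

/* Same as above, but src is reinitialised to an empty list afterwards. */
static void list_splice_nodes_init(struct list_node *src, struct list_node *dst)
{
	list_splice_nodes(src, dst);
	list_init(src);
}

static size_t list_count(const struct list_node *head)
{
	size_t n = 0;

	for (const struct list_node *p = head->next; p != head; p = p->next)
		n++;
	return n;
}

/*
 * Same shape as the patched function: old entries go to 'old' (the
 * caller would destroy them), new entries from 'src' take their place.
 * Locking is omitted in this sketch.
 */
static void swap_in_actions(struct list_node *dst, struct list_node *src,
			    struct list_node *old)
{
	list_init(old);
	list_splice_nodes_init(dst, old);
	list_splice_nodes(src, dst);
}
```

Because the destination is emptied before the new entries are spliced in, the old actions can be torn down afterwards without touching the list that the classifier now sees.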

* Re: [PATCHv1] xen-netfront: always keep the Rx ring full of requests
From: David Miller @ 2014-10-03 22:54 UTC (permalink / raw)
  To: david.vrabel; +Cc: netdev, xen-devel, konrad.wilk, boris.ostrovsky
In-Reply-To: <1412256826-18874-1-git-send-email-david.vrabel@citrix.com>

From: David Vrabel <david.vrabel@citrix.com>
Date: Thu, 2 Oct 2014 14:33:46 +0100

> A full Rx ring only requires 1 MiB of memory.  This is not enough
> memory that it is useful to dynamically scale the number of Rx
> requests in the ring based on traffic rates.
> 
> Keeping the ring full of Rx requests handles bursty traffic better
> than trying to converge on an optimal number of requests to keep
> filled.
> 
> On a 4 core host, an iperf -P 64 -t 60 run from dom0 to a 4 VCPU guest
> improved from 5.1 Gbit/s to 5.6 Gbit/s.  Gains with more bursty
> traffic are expected to be higher.
> 
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>

Can I get an ACK from someone else knowledgeable about this area?

Thanks!

^ permalink raw reply

* Re: [PATCH net-next] net: do not export skb_gro_receive()
From: David Miller @ 2014-10-03 22:54 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1412260726.16704.99.camel@edumazet-glaptop2.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 02 Oct 2014 07:38:46 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> skb_gro_receive() is only called from tcp_gro_receive() which is
> not in a module.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks Eric.

^ permalink raw reply

* Re: [PATCH] drivers/net/dsa/Kconfig: Let NET_DSA_BCM_SF2 depend on HAS_IOMEM
From: David Miller @ 2014-10-03 22:52 UTC (permalink / raw)
  To: gang.chen.5i5j; +Cc: f.fainelli, leitec, andrew, netdev, richard, linux-kernel
In-Reply-To: <542D5DAC.4010001@gmail.com>

From: Chen Gang <gang.chen.5i5j@gmail.com>
Date: Thu, 02 Oct 2014 22:14:04 +0800

> NET_DSA_BCM_SF2 needs HAS_IOMEM, so depend on it. The related error (with
> allmodconfig under um):
> 
>     CC [M]  drivers/net/dsa/bcm_sf2.o
>   drivers/net/dsa/bcm_sf2.c: In function ‘bcm_sf2_sw_setup’:
>   drivers/net/dsa/bcm_sf2.c:487:3: error: implicit declaration of function ‘iounmap’ [-Werror=implicit-function-declaration]
>      iounmap(*base);
>      ^
> 
> Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH] drivers/net/ethernet/marvell/Kconfig: Let PXA168_ETH depend on HAS_IOMEM
From: David Miller @ 2014-10-03 22:52 UTC (permalink / raw)
  To: gang.chen.5i5j
  Cc: antoine.tenart, arnd, jason, richard, mw, thomas.petazzoni,
	netdev, linux-kernel
In-Reply-To: <542D5FE5.2040400@gmail.com>

From: Chen Gang <gang.chen.5i5j@gmail.com>
Date: Thu, 02 Oct 2014 22:23:33 +0800

> PXA168_ETH needs HAS_IOMEM, so depend on it. The related error (with
> allmodconfig under um):
> 
>     CC [M]  drivers/net/ethernet/marvell/pxa168_eth.o
>   drivers/net/ethernet/marvell/pxa168_eth.c: In function ‘pxa168_eth_probe’:
>   drivers/net/ethernet/marvell/pxa168_eth.c:1605:2: error: implicit declaration of function ‘iounmap’ [-Werror=implicit-function-declaration]
>     iounmap(pep->base);
>     ^
> 
> Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH] drivers/net/irda/Kconfig: Let SH_IRDA depend on HAS_IOMEM
From: David Miller @ 2014-10-03 22:52 UTC (permalink / raw)
  To: gang.chen.5i5j; +Cc: samuel, richard, netdev, linux-kernel
In-Reply-To: <542D6218.3040609@gmail.com>

From: Chen Gang <gang.chen.5i5j@gmail.com>
Date: Thu, 02 Oct 2014 22:32:56 +0800

> SH_IRDA needs HAS_IOMEM, so depend on it. The related error (with
> allmodconfig under um):
> 
>     CC [M]  drivers/net/irda/sh_irda.o
>   drivers/net/irda/sh_irda.c: In function ‘sh_irda_probe’:
>   drivers/net/irda/sh_irda.c:776:2: error: implicit declaration of function ‘ioremap_nocache’ [-Werror=implicit-function-declaration]
>     self->membase = ioremap_nocache(res->start, resource_size(res));
>     ^
>   drivers/net/irda/sh_irda.c:776:16: warning: assignment makes pointer from integer without a cast [enabled by default]
>     self->membase = ioremap_nocache(res->start, resource_size(res));
>                   ^
>   drivers/net/irda/sh_irda.c:821:2: error: implicit declaration of function ‘iounmap’ [-Werror=implicit-function-declaration]
>     iounmap(self->membase);
>     ^
> 
> Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>

Applied.

^ permalink raw reply

* Re: [net-next PATCH V6 0/2] qdisc: bulk dequeue support
From: Eric Dumazet @ 2014-10-03 22:56 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David Miller, Jesper Dangaard Brouer, Linux Netdev List,
	Hannes Frederic Sowa, Florian Westphal, Daniel Borkmann,
	Jamal Hadi Salim, Alexander Duyck, John Fastabend, Dave Taht,
	Toke Høiland-Jørgensen
In-Reply-To: <CA+mtBx_TQhbS3VPVKjosaQjyO8YkNi=zBQMVYEbLqR8Hyczd3Q@mail.gmail.com>

On Fri, 2014-10-03 at 15:19 -0700, Tom Herbert wrote:
> On Fri, Oct 3, 2014 at 3:15 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Fri, 2014-10-03 at 14:57 -0700, Eric Dumazet wrote:
> >> On Fri, 2014-10-03 at 14:56 -0700, David Miller wrote:
> >>
> >> > I completely agree, and I sort of intended this to happen when
> >> > I split all the code into that new function.
> >> >
> >> > > GSO segmentation of TX checksumming should not prevent other
> >> > > cpus from queueing other skbs in the qdisc.
> >> > >
> >> > > I will spend some time on this.
> >> >
> >> > Thanks!
> >>
> >> I just did my first reboot, will make sure everything is working well
> >> before sending the patch ;)
> >
> > This is awesome...
> >
> > 40Gb rate, with TSO=on or TSO=off, it does not matter anymore.
> >
> This is with or without GSO?

GSO=on

But with GSO off, I also get line rate, only spending more cpu.

    25.05%  [kernel]          [k] copy_user_enhanced_fast_string
     3.17%  [kernel]          [k] _raw_spin_lock                
     2.65%  [kernel]          [k] tcp_sendmsg                   
     2.43%  [kernel]          [k] tcp_ack                       
     2.08%  [kernel]          [k] __netif_receive_skb_core      
     1.88%  [kernel]          [k] menu_select                   
     1.69%  [kernel]          [k] put_compound_page             
     1.50%  [kernel]          [k] flush_smp_call_function_queue 
     1.48%  [kernel]          [k] int_sqrt                      
     1.43%  [kernel]          [k] call_function_single_interrupt
     1.42%  [kernel]          [k] cpuidle_enter_state           
     1.21%  perf              [.] 0x00000000000353f2            
     1.10%  [kernel]          [k] tcp_init_tso_segs             
     1.05%  [kernel]          [k] memcpy                        
     1.01%  [kernel]          [k] __skb_clone                   
     0.91%  [kernel]          [k] irq_entries_start             
     0.84%  [kernel]          [k] get_nohz_timer_target         
     0.84%  [kernel]          [k] put_page                      
     0.82%  [kernel]          [k] tcp_write_xmit                
     0.82%  [kernel]          [k] get_page_from_freelist        
     0.82%  [kernel]          [k] llist_reverse_order           
     0.80%  [kernel]          [k] native_sched_clock            
     0.79%  [kernel]          [k] cpu_startup_entry             
     0.78%  [kernel]          [k] kmem_cache_free               

The killer combination is GSO=off and TX=off, of course.

     9.90%  [kernel]          [k] csum_partial_copy_generic     
     7.64%  [kernel]          [k] _raw_spin_lock                
     6.10%  [kernel]          [k] tcp_ack                       
     5.11%  [kernel]          [k] __skb_clone                   
     3.90%  [kernel]          [k] __alloc_skb                   
     3.78%  [kernel]          [k] skb_release_data              
     3.45%  [kernel]          [k] csum_partial                  
     3.17%  [kernel]          [k] __kfree_skb                   
     3.07%  [kernel]          [k] skb_clone                     
     2.44%  [kernel]          [k] __kmalloc_node_track_caller   
     2.26%  [kernel]          [k] tcp_init_tso_segs             
     2.25%  [kernel]          [k] kfree                         
     1.98%  [kernel]          [k] pfifo_fast_dequeue            
     1.61%  [kernel]          [k] tcp_set_skb_tso_segs          
     1.45%  [kernel]          [k] kmem_cache_free               

^ permalink raw reply

* Re: [PATCH net-next] qdisc: validate skb without holding lock
From: Eric Dumazet @ 2014-10-03 23:30 UTC (permalink / raw)
  To: David Miller
  Cc: brouer, netdev, therbert, hannes, fw, dborkman, jhs,
	alexander.duyck, john.r.fastabend, dave.taht, toke
In-Reply-To: <20141003.153645.72976986956341944.davem@davemloft.net>

On Fri, 2014-10-03 at 15:36 -0700, David Miller wrote:

> Applied, thanks Eric!

Thanks David

Another problem we need to address is that the quota in __qdisc_run()
is no longer meaningful if each qdisc_restart() can pump many packets.

An idea would be to use the bstats (or cpu_qstats if applicable)
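
A rough user-space sketch of what charging the quota per packet could
look like. This is not code from this thread or from the kernel: the
demo_* names, the struct layout and the tx_packets counter (standing in
for the bstats packet count) are all hypothetical, and it only
illustrates the accounting, not the real __qdisc_run() logic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical queue: 'backlog' packets waiting, up to 'bulk' sent per restart. */
struct demo_qdisc {
	unsigned int backlog;
	unsigned int bulk;
	unsigned long tx_packets;	/* stands in for the bstats packet counter */
};

/* One restart: dequeue up to q->bulk packets; returns how many were sent. */
static unsigned int demo_restart(struct demo_qdisc *q)
{
	unsigned int n = q->backlog < q->bulk ? q->backlog : q->bulk;

	q->backlog -= n;
	q->tx_packets += n;
	return n;
}

/* Run loop that charges the quota per packet sent, not per restart call. */
static unsigned int demo_run(struct demo_qdisc *q, int quota)
{
	unsigned int calls = 0;

	assert(q->bulk > 0);	/* a zero bulk size would never drain the backlog */

	while (q->backlog) {
		unsigned long before = q->tx_packets;

		demo_restart(q);
		calls++;
		/* Charge what was actually sent, read back from the counter. */
		quota -= (int)(q->tx_packets - before);
		if (quota <= 0)
			break;	/* a real implementation would reschedule here */
	}
	return calls;
}
```

With a backlog of 100, a bulk size of 8 and a quota of 64, the loop
stops after 8 restarts (64 packets) instead of after 64 restarts.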

^ permalink raw reply

* Re: [PATCH v2 net-next 15/15] tipc: remove old ASCII netlink API
From: David Miller @ 2014-10-03 23:50 UTC (permalink / raw)
  To: richard.alpe; +Cc: netdev, tipc-discussion
In-Reply-To: <1412261921-28510-16-git-send-email-richard.alpe@ericsson.com>

From: <richard.alpe@ericsson.com>
Date: Thu, 2 Oct 2014 16:58:41 +0200

> From: Richard Alpe <richard.alpe@ericsson.com>
> 
> The API has been deprecated along with its user-space tool
> "tipc-config". Users shall use the new kernel netlink API already in
> place along with the new user space tool "tipc" that's part of the
> tipc-utils package.
> 
> Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com>
> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
> Acked-by: Ying Xue <ying.xue@windriver.com>

Sorry, no matter what your circumstances, you cannot just break
binaries that might be out there.

The rest of this patch series is fine, but I'm really not going
to even entertain applying this one, sorry.

^ permalink raw reply

* Re: [PATCH v2 net-next 0/4] net: Generic UDP Encapsulation
From: David Miller @ 2014-10-03 23:57 UTC (permalink / raw)
  To: therbert; +Cc: netdev
In-Reply-To: <1412376490-8774-1-git-send-email-therbert@google.com>

From: Tom Herbert <therbert@google.com>
Date: Fri,  3 Oct 2014 15:48:06 -0700

> Generic UDP Encapsulation (GUE) is UDP encapsulation protocol which
> encapsulates packets of various IP protocols. The GUE protocol is
> described in http://tools.ietf.org/html/draft-herbert-gue-01.

This looks better, applied, thanks Tom.

^ permalink raw reply

* Re: [PATCH v1 2/2] net: sched: replace ematch calls to use struct net
From: John Fastabend @ 2014-10-04  0:19 UTC (permalink / raw)
  To: Cong Wang, John Fastabend
  Cc: Cong Wang, David Miller, netdev, Jamal Hadi Salim, Eric Dumazet
In-Reply-To: <CAHA+R7PWrAk1+ds8SNytf0dABwcSQYqSAvK9dTN44kQqRiq4Rw@mail.gmail.com>

On 10/03/2014 03:40 PM, Cong Wang wrote:
> On Thu, Oct 2, 2014 at 10:46 PM, John Fastabend
> <john.fastabend@gmail.com> wrote:
>> diff --git a/net/sched/cls_basic.c b/net/sched/cls_basic.c
>> index 81ddfa6..f37e4fb 100644
>> --- a/net/sched/cls_basic.c
>> +++ b/net/sched/cls_basic.c
>> @@ -32,7 +32,7 @@ struct basic_filter {
>>         struct tcf_exts         exts;
>>         struct tcf_ematch_tree  ematches;
>>         struct tcf_result       res;
>> -       struct tcf_proto        *tp;
>> +       struct net              *net;
>>         struct list_head        link;
>>         struct rcu_head         rcu;
>>  };
> 
> I guess storing this net pointer to struct tcf_ematch_tree is better,
> since it is only used by ematch?

Sure, that is fine. It is used by em_ipset and three classifiers. It
does simplify the API slightly if it's in tcf_ematch, I guess. So I'll
move it there.


^ permalink raw reply

* [PATCH v7 net-next 1/2] bonding: display xmit_hash_policy for non-dynamic-tlb mode
From: Mahesh Bandewar @ 2014-10-04  0:48 UTC (permalink / raw)
  To: Jay Vosburgh, Veaceslav Falico, Andy Gospodarek, David Miller
  Cc: netdev, Mahesh Bandewar, Eric Dumazet, Maciej Zenczykowski

It's a trivial fix to display xmit_hash_policy for this new TLB mode
since it uses transmit-hash-policy as part of bonding-master info
(/proc/net/bonding/<bonding-interface>).

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
---
v1
 Rebase
v2
 Added bond_mode_uses_xmit_hash() inline function
v3-v7
 Rebase

 drivers/net/bonding/bond_procfs.c | 3 +--
 drivers/net/bonding/bonding.h     | 7 +++++++
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/net/bonding/bond_procfs.c b/drivers/net/bonding/bond_procfs.c
index bb09d0442aa8..a3948f8d1e53 100644
--- a/drivers/net/bonding/bond_procfs.c
+++ b/drivers/net/bonding/bond_procfs.c
@@ -73,8 +73,7 @@ static void bond_info_show_master(struct seq_file *seq)
 
 	seq_printf(seq, "\n");
 
-	if (BOND_MODE(bond) == BOND_MODE_XOR ||
-		BOND_MODE(bond) == BOND_MODE_8023AD) {
+	if (bond_mode_uses_xmit_hash(bond)) {
 		optval = bond_opt_get_val(BOND_OPT_XMIT_HASH,
 					  bond->params.xmit_policy);
 		seq_printf(seq, "Transmit Hash Policy: %s (%d)\n",
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index 57917e63b4e6..5b022da9cad2 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -274,6 +274,13 @@ static inline bool bond_is_nondyn_tlb(const struct bonding *bond)
 	       (bond->params.tlb_dynamic_lb == 0);
 }
 
+static inline bool bond_mode_uses_xmit_hash(const struct bonding *bond)
+{
+	return (BOND_MODE(bond) == BOND_MODE_8023AD ||
+		BOND_MODE(bond) == BOND_MODE_XOR ||
+		bond_is_nondyn_tlb(bond));
+}
+
 static inline bool bond_mode_uses_arp(int mode)
 {
 	return mode != BOND_MODE_8023AD && mode != BOND_MODE_TLB &&
-- 
2.1.0.rc2.206.gedb03e5

^ permalink raw reply related

* [PATCH v7 net-next 2/2] bonding: Simplify the xmit function for modes that use xmit_hash
From: Mahesh Bandewar @ 2014-10-04  0:48 UTC (permalink / raw)
  To: Jay Vosburgh, Veaceslav Falico, Andy Gospodarek, David Miller
  Cc: netdev, Mahesh Bandewar, Eric Dumazet, Maciej Zenczykowski,
	Nikolay Aleksandrov

An earlier change to use a usable-slave array for TLB mode had an
additional performance advantage, so extend the same logic to all other
modes that use xmit-hash for slave selection (viz. 802.3ad and XOR
modes). Also consolidate this with the earlier TLB change.

The main idea is to build the usable slaves array in the control path
and use that array for slave selection during xmit operation.
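
As a rough user-space illustration of this idea (not code from the
patch; the demo_* names and structs are hypothetical stand-ins for the
kernel's slave and slave-array types, and RCU publication of the new
array is left out), the split between the two paths might look like:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's struct slave. */
struct demo_slave {
	int id;
	int can_tx;	/* non-zero when the link is usable for transmit */
};

/* Hypothetical stand-in for the usable-slave array. */
struct demo_slave_arr {
	unsigned int count;
	struct demo_slave *arr[];	/* usable slaves only */
};

/* Control path: rebuild the array of usable slaves, skipping 'skip'. */
static struct demo_slave_arr *demo_build_arr(struct demo_slave *slaves,
					     unsigned int n,
					     const struct demo_slave *skip)
{
	struct demo_slave_arr *new_arr;
	unsigned int i;

	assert(slaves != NULL || n == 0);

	new_arr = calloc(1, sizeof(*new_arr) + n * sizeof(new_arr->arr[0]));
	if (!new_arr)
		return NULL;

	for (i = 0; i < n; i++) {
		if (!slaves[i].can_tx || &slaves[i] == skip)
			continue;
		new_arr->arr[new_arr->count++] = &slaves[i];
	}
	return new_arr;
}

/* Transmit path: one modulo and one array load, no list walk. */
static struct demo_slave *demo_pick_slave(const struct demo_slave_arr *a,
					  uint32_t hash)
{
	unsigned int count = a ? a->count : 0;

	return count ? a->arr[hash % count] : NULL;
}
```

The per-packet cost is then independent of how many slaves are down or
excluded, because that filtering was already done when the array was
rebuilt.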

Measured performance in a setup with a bond of 4x1G NICs with 200
instances of netperf for the modes involved (3ad, xor, tlb)
cmd: netperf -t TCP_RR -H <TargetHost> -l 60 -s 5

Mode        TPS-Before   TPS-After

802.3ad   : 468,694      493,101
TLB (lb=0): 392,583      392,965
XOR       : 475,696      484,517

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
---
v1:
  (a) If bond_update_slave_arr() fails to allocate memory, it will overwrite
      the slave that needs to be removed.
  (b) Freeing of array will assign NULL (to handle bond->down to bond->up
      transition gracefully).
  (c) Change from pr_debug() to pr_err() if bond_update_slave_arr() returns
      failure.
  (d) XOR: bond_update_slave_arr() will consider mii-mon, arp-mon cases and
      will populate the array even if these parameters are not used.
  (e) 3AD: Should handle the ad_agg_selection_logic correctly.
v2:
  (a) Removed rcu_read_{un}lock() calls from array manipulation code.
  (b) Slave link-events now refresh array for all these modes.
  (c) Moved free-array call from bond_close() to bond_uninit().
v3:
  (a) Fixed null pointer dereference.
  (b) Removed bond->lock lockdep dependency.
v4:
  (a) Made changes to comply with Nikolay's locking changes
  (b) Added a work-queue to refresh slave-array when RTNL is not held
  (c) Array refresh happens ONLY with RTNL now.
  (d) alloc changed from GFP_ATOMIC to GFP_KERNEL
v5:
  (a) Consolidated all delayed slave-array updates at one place in
      3ad_state_machine_handler()
v6:
  (a) Free slave array when there is no active aggregator
v7:
  (a) Couple of trivial changes.

 drivers/net/bonding/bond_3ad.c  | 140 +++++++++++------------------
 drivers/net/bonding/bond_alb.c  |  51 ++---------
 drivers/net/bonding/bond_alb.h  |   8 --
 drivers/net/bonding/bond_main.c | 192 +++++++++++++++++++++++++++++++++++++---
 drivers/net/bonding/bonding.h   |  10 +++
 5 files changed, 249 insertions(+), 152 deletions(-)

diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
index 7e9e522fd476..2110215f3528 100644
--- a/drivers/net/bonding/bond_3ad.c
+++ b/drivers/net/bonding/bond_3ad.c
@@ -102,17 +102,20 @@ static const u8 lacpdu_mcast_addr[ETH_ALEN] = MULTICAST_LACPDU_ADDR;
 /* ================= main 802.3ad protocol functions ================== */
 static int ad_lacpdu_send(struct port *port);
 static int ad_marker_send(struct port *port, struct bond_marker *marker);
-static void ad_mux_machine(struct port *port);
+static void ad_mux_machine(struct port *port, bool *update_slave_arr);
 static void ad_rx_machine(struct lacpdu *lacpdu, struct port *port);
 static void ad_tx_machine(struct port *port);
 static void ad_periodic_machine(struct port *port);
-static void ad_port_selection_logic(struct port *port);
-static void ad_agg_selection_logic(struct aggregator *aggregator);
+static void ad_port_selection_logic(struct port *port, bool *update_slave_arr);
+static void ad_agg_selection_logic(struct aggregator *aggregator,
+				   bool *update_slave_arr);
 static void ad_clear_agg(struct aggregator *aggregator);
 static void ad_initialize_agg(struct aggregator *aggregator);
 static void ad_initialize_port(struct port *port, int lacp_fast);
-static void ad_enable_collecting_distributing(struct port *port);
-static void ad_disable_collecting_distributing(struct port *port);
+static void ad_enable_collecting_distributing(struct port *port,
+					      bool *update_slave_arr);
+static void ad_disable_collecting_distributing(struct port *port,
+					       bool *update_slave_arr);
 static void ad_marker_info_received(struct bond_marker *marker_info,
 				    struct port *port);
 static void ad_marker_response_received(struct bond_marker *marker,
@@ -796,8 +799,9 @@ static int ad_marker_send(struct port *port, struct bond_marker *marker)
 /**
  * ad_mux_machine - handle a port's mux state machine
  * @port: the port we're looking at
+ * @update_slave_arr: Does slave array need update?
  */
-static void ad_mux_machine(struct port *port)
+static void ad_mux_machine(struct port *port, bool *update_slave_arr)
 {
 	mux_states_t last_state;
 
@@ -901,7 +905,8 @@ static void ad_mux_machine(struct port *port)
 		switch (port->sm_mux_state) {
 		case AD_MUX_DETACHED:
 			port->actor_oper_port_state &= ~AD_STATE_SYNCHRONIZATION;
-			ad_disable_collecting_distributing(port);
+			ad_disable_collecting_distributing(port,
+							   update_slave_arr);
 			port->actor_oper_port_state &= ~AD_STATE_COLLECTING;
 			port->actor_oper_port_state &= ~AD_STATE_DISTRIBUTING;
 			port->ntt = true;
@@ -913,13 +918,15 @@ static void ad_mux_machine(struct port *port)
 			port->actor_oper_port_state |= AD_STATE_SYNCHRONIZATION;
 			port->actor_oper_port_state &= ~AD_STATE_COLLECTING;
 			port->actor_oper_port_state &= ~AD_STATE_DISTRIBUTING;
-			ad_disable_collecting_distributing(port);
+			ad_disable_collecting_distributing(port,
+							   update_slave_arr);
 			port->ntt = true;
 			break;
 		case AD_MUX_COLLECTING_DISTRIBUTING:
 			port->actor_oper_port_state |= AD_STATE_COLLECTING;
 			port->actor_oper_port_state |= AD_STATE_DISTRIBUTING;
-			ad_enable_collecting_distributing(port);
+			ad_enable_collecting_distributing(port,
+							  update_slave_arr);
 			port->ntt = true;
 			break;
 		default:
@@ -1187,12 +1194,13 @@ static void ad_periodic_machine(struct port *port)
 /**
  * ad_port_selection_logic - select aggregation groups
  * @port: the port we're looking at
+ * @update_slave_arr: Does slave array need update?
  *
  * Select aggregation groups, and assign each port for it's aggregetor. The
  * selection logic is called in the inititalization (after all the handshkes),
  * and after every lacpdu receive (if selected is off).
  */
-static void ad_port_selection_logic(struct port *port)
+static void ad_port_selection_logic(struct port *port, bool *update_slave_arr)
 {
 	struct aggregator *aggregator, *free_aggregator = NULL, *temp_aggregator;
 	struct port *last_port = NULL, *curr_port;
@@ -1347,7 +1355,7 @@ static void ad_port_selection_logic(struct port *port)
 			      __agg_ports_are_ready(port->aggregator));
 
 	aggregator = __get_first_agg(port);
-	ad_agg_selection_logic(aggregator);
+	ad_agg_selection_logic(aggregator, update_slave_arr);
 }
 
 /* Decide if "agg" is a better choice for the new active aggregator that
@@ -1435,6 +1443,7 @@ static int agg_device_up(const struct aggregator *agg)
 /**
  * ad_agg_selection_logic - select an aggregation group for a team
  * @aggregator: the aggregator we're looking at
+ * @update_slave_arr: Does slave array need update?
  *
  * It is assumed that only one aggregator may be selected for a team.
  *
@@ -1457,7 +1466,8 @@ static int agg_device_up(const struct aggregator *agg)
  * __get_active_agg() won't work correctly. This function should be better
  * called with the bond itself, and retrieve the first agg from it.
  */
-static void ad_agg_selection_logic(struct aggregator *agg)
+static void ad_agg_selection_logic(struct aggregator *agg,
+				   bool *update_slave_arr)
 {
 	struct aggregator *best, *active, *origin;
 	struct bonding *bond = agg->slave->bond;
@@ -1550,6 +1560,8 @@ static void ad_agg_selection_logic(struct aggregator *agg)
 				__disable_port(port);
 			}
 		}
+		/* Slave array needs update. */
+		*update_slave_arr = true;
 	}
 
 	/* if the selected aggregator is of join individuals
@@ -1678,24 +1690,30 @@ static void ad_initialize_port(struct port *port, int lacp_fast)
 /**
  * ad_enable_collecting_distributing - enable a port's transmit/receive
  * @port: the port we're looking at
+ * @update_slave_arr: Does slave array need update?
  *
  * Enable @port if it's in an active aggregator
  */
-static void ad_enable_collecting_distributing(struct port *port)
+static void ad_enable_collecting_distributing(struct port *port,
+					      bool *update_slave_arr)
 {
 	if (port->aggregator->is_active) {
 		pr_debug("Enabling port %d(LAG %d)\n",
 			 port->actor_port_number,
 			 port->aggregator->aggregator_identifier);
 		__enable_port(port);
+		/* Slave array needs update */
+		*update_slave_arr = true;
 	}
 }
 
 /**
  * ad_disable_collecting_distributing - disable a port's transmit/receive
  * @port: the port we're looking at
+ * @update_slave_arr: Does slave array need update?
  */
-static void ad_disable_collecting_distributing(struct port *port)
+static void ad_disable_collecting_distributing(struct port *port,
+					       bool *update_slave_arr)
 {
 	if (port->aggregator &&
 	    !MAC_ADDRESS_EQUAL(&(port->aggregator->partner_system),
@@ -1704,6 +1722,8 @@ static void ad_disable_collecting_distributing(struct port *port)
 			 port->actor_port_number,
 			 port->aggregator->aggregator_identifier);
 		__disable_port(port);
+		/* Slave array needs an update */
+		*update_slave_arr = true;
 	}
 }
 
@@ -1868,6 +1888,7 @@ void bond_3ad_unbind_slave(struct slave *slave)
 	struct bonding *bond = slave->bond;
 	struct slave *slave_iter;
 	struct list_head *iter;
+	bool dummy_slave_update; /* Ignore this value as caller updates array */
 
 	/* Sync against bond_3ad_state_machine_handler() */
 	spin_lock_bh(&bond->mode_lock);
@@ -1951,7 +1972,8 @@ void bond_3ad_unbind_slave(struct slave *slave)
 				ad_clear_agg(aggregator);
 
 				if (select_new_active_agg)
-					ad_agg_selection_logic(__get_first_agg(port));
+					ad_agg_selection_logic(__get_first_agg(port),
+							       &dummy_slave_update);
 			} else {
 				netdev_warn(bond->dev, "unbinding aggregator, and could not find a new aggregator for its ports\n");
 			}
@@ -1966,7 +1988,8 @@ void bond_3ad_unbind_slave(struct slave *slave)
 				/* select new active aggregator */
 				temp_aggregator = __get_first_agg(port);
 				if (temp_aggregator)
-					ad_agg_selection_logic(temp_aggregator);
+					ad_agg_selection_logic(temp_aggregator,
+							       &dummy_slave_update);
 			}
 		}
 	}
@@ -1996,7 +2019,8 @@ void bond_3ad_unbind_slave(struct slave *slave)
 					if (select_new_active_agg) {
 						netdev_info(bond->dev, "Removing an active aggregator\n");
 						/* select new active aggregator */
-						ad_agg_selection_logic(__get_first_agg(port));
+						ad_agg_selection_logic(__get_first_agg(port),
+							               &dummy_slave_update);
 					}
 				}
 				break;
@@ -2031,6 +2055,7 @@ void bond_3ad_state_machine_handler(struct work_struct *work)
 	struct slave *slave;
 	struct port *port;
 	bool should_notify_rtnl = BOND_SLAVE_NOTIFY_LATER;
+	bool update_slave_arr = false;
 
 	/* Lock to protect data accessed by all (e.g., port->sm_vars) and
 	 * against running with bond_3ad_unbind_slave. ad_rx_machine may run
@@ -2058,7 +2083,7 @@ void bond_3ad_state_machine_handler(struct work_struct *work)
 			}
 
 			aggregator = __get_first_agg(port);
-			ad_agg_selection_logic(aggregator);
+			ad_agg_selection_logic(aggregator, &update_slave_arr);
 		}
 		bond_3ad_set_carrier(bond);
 	}
@@ -2074,8 +2099,8 @@ void bond_3ad_state_machine_handler(struct work_struct *work)
 
 		ad_rx_machine(NULL, port);
 		ad_periodic_machine(port);
-		ad_port_selection_logic(port);
-		ad_mux_machine(port);
+		ad_port_selection_logic(port, &update_slave_arr);
+		ad_mux_machine(port, &update_slave_arr);
 		ad_tx_machine(port);
 
 		/* turn off the BEGIN bit, since we already handled it */
@@ -2093,6 +2118,9 @@ re_arm:
 	rcu_read_unlock();
 	spin_unlock_bh(&bond->mode_lock);
 
+	if (update_slave_arr)
+		bond_slave_arr_work_rearm(bond, 0);
+
 	if (should_notify_rtnl && rtnl_trylock()) {
 		bond_slave_state_notify(bond);
 		rtnl_unlock();
@@ -2283,6 +2311,11 @@ void bond_3ad_handle_link_change(struct slave *slave, char link)
 	port->sm_vars |= AD_PORT_BEGIN;
 
 	spin_unlock_bh(&slave->bond->mode_lock);
+
+	/* RTNL is held and mode_lock is released so it's safe
+	 * to update slave_array here.
+	 */
+	bond_update_slave_arr(slave->bond, NULL);
 }
 
 /**
@@ -2377,73 +2410,6 @@ int bond_3ad_get_active_agg_info(struct bonding *bond, struct ad_info *ad_info)
 	return ret;
 }
 
-int bond_3ad_xmit_xor(struct sk_buff *skb, struct net_device *dev)
-{
-	struct bonding *bond = netdev_priv(dev);
-	struct slave *slave, *first_ok_slave;
-	struct aggregator *agg;
-	struct ad_info ad_info;
-	struct list_head *iter;
-	int slaves_in_agg;
-	int slave_agg_no;
-	int agg_id;
-
-	if (__bond_3ad_get_active_agg_info(bond, &ad_info)) {
-		netdev_dbg(dev, "__bond_3ad_get_active_agg_info failed\n");
-		goto err_free;
-	}
-
-	slaves_in_agg = ad_info.ports;
-	agg_id = ad_info.aggregator_id;
-
-	if (slaves_in_agg == 0) {
-		netdev_dbg(dev, "active aggregator is empty\n");
-		goto err_free;
-	}
-
-	slave_agg_no = bond_xmit_hash(bond, skb) % slaves_in_agg;
-	first_ok_slave = NULL;
-
-	bond_for_each_slave_rcu(bond, slave, iter) {
-		agg = SLAVE_AD_INFO(slave)->port.aggregator;
-		if (!agg || agg->aggregator_identifier != agg_id)
-			continue;
-
-		if (slave_agg_no >= 0) {
-			if (!first_ok_slave && bond_slave_can_tx(slave))
-				first_ok_slave = slave;
-			slave_agg_no--;
-			continue;
-		}
-
-		if (bond_slave_can_tx(slave)) {
-			bond_dev_queue_xmit(bond, skb, slave->dev);
-			goto out;
-		}
-	}
-
-	if (slave_agg_no >= 0) {
-		netdev_err(dev, "Couldn't find a slave to tx on for aggregator ID %d\n",
-			   agg_id);
-		goto err_free;
-	}
-
-	/* we couldn't find any suitable slave after the agg_no, so use the
-	 * first suitable found, if found.
-	 */
-	if (first_ok_slave)
-		bond_dev_queue_xmit(bond, skb, first_ok_slave->dev);
-	else
-		goto err_free;
-
-out:
-	return NETDEV_TX_OK;
-err_free:
-	/* no suitable interface, frame not sent */
-	dev_kfree_skb_any(skb);
-	goto out;
-}
-
 int bond_3ad_lacpdu_recv(const struct sk_buff *skb, struct bonding *bond,
 			 struct slave *slave)
 {
diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index 615f3bebd019..d2eadab787c5 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -177,7 +177,6 @@ static int tlb_initialize(struct bonding *bond)
 static void tlb_deinitialize(struct bonding *bond)
 {
 	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
-	struct tlb_up_slave *arr;
 
 	spin_lock_bh(&bond->mode_lock);
 
@@ -185,10 +184,6 @@ static void tlb_deinitialize(struct bonding *bond)
 	bond_info->tx_hashtbl = NULL;
 
 	spin_unlock_bh(&bond->mode_lock);
-
-	arr = rtnl_dereference(bond_info->slave_arr);
-	if (arr)
-		kfree_rcu(arr, rcu);
 }
 
 static long long compute_gap(struct slave *slave)
@@ -1336,39 +1331,9 @@ out:
 	return NETDEV_TX_OK;
 }
 
-static int bond_tlb_update_slave_arr(struct bonding *bond,
-				     struct slave *skipslave)
-{
-	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
-	struct slave *tx_slave;
-	struct list_head *iter;
-	struct tlb_up_slave *new_arr, *old_arr;
-
-	new_arr = kzalloc(offsetof(struct tlb_up_slave, arr[bond->slave_cnt]),
-			  GFP_ATOMIC);
-	if (!new_arr)
-		return -ENOMEM;
-
-	bond_for_each_slave(bond, tx_slave, iter) {
-		if (!bond_slave_can_tx(tx_slave))
-			continue;
-		if (skipslave == tx_slave)
-			continue;
-		new_arr->arr[new_arr->count++] = tx_slave;
-	}
-
-	old_arr = rtnl_dereference(bond_info->slave_arr);
-	rcu_assign_pointer(bond_info->slave_arr, new_arr);
-	if (old_arr)
-		kfree_rcu(old_arr, rcu);
-
-	return 0;
-}
-
 int bond_tlb_xmit(struct sk_buff *skb, struct net_device *bond_dev)
 {
 	struct bonding *bond = netdev_priv(bond_dev);
-	struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
 	struct ethhdr *eth_data;
 	struct slave *tx_slave = NULL;
 	u32 hash_index;
@@ -1389,12 +1354,14 @@ int bond_tlb_xmit(struct sk_buff *skb, struct net_device *bond_dev)
 							      hash_index & 0xFF,
 							      skb->len);
 			} else {
-				struct tlb_up_slave *slaves;
+				struct bond_up_slave *slaves;
+				unsigned int count;
 
-				slaves = rcu_dereference(bond_info->slave_arr);
-				if (slaves && slaves->count)
+				slaves = rcu_dereference(bond->slave_arr);
+				count = slaves ? ACCESS_ONCE(slaves->count) : 0;
+				if (likely(count))
 					tx_slave = slaves->arr[hash_index %
-							       slaves->count];
+							       count];
 			}
 			break;
 		}
@@ -1641,10 +1608,6 @@ void bond_alb_deinit_slave(struct bonding *bond, struct slave *slave)
 		rlb_clear_slave(bond, slave);
 	}
 
-	if (bond_is_nondyn_tlb(bond))
-		if (bond_tlb_update_slave_arr(bond, slave))
-			pr_err("Failed to build slave-array for TLB mode.\n");
-
 }
 
 void bond_alb_handle_link_change(struct bonding *bond, struct slave *slave, char link)
@@ -1669,7 +1632,7 @@ void bond_alb_handle_link_change(struct bonding *bond, struct slave *slave, char
 	}
 
 	if (bond_is_nondyn_tlb(bond)) {
-		if (bond_tlb_update_slave_arr(bond, NULL))
+		if (bond_update_slave_arr(bond, NULL))
 			pr_err("Failed to build slave-array for TLB mode.\n");
 	}
 }
diff --git a/drivers/net/bonding/bond_alb.h b/drivers/net/bonding/bond_alb.h
index 3c6a7ff974d7..1ad473b4ade5 100644
--- a/drivers/net/bonding/bond_alb.h
+++ b/drivers/net/bonding/bond_alb.h
@@ -139,19 +139,11 @@ struct tlb_slave_info {
 			 */
 };
 
-struct tlb_up_slave {
-	unsigned int	count;
-	struct rcu_head rcu;
-	struct slave	*arr[0];
-};
-
 struct alb_bond_info {
 	struct tlb_client_info	*tx_hashtbl; /* Dynamically allocated */
 	u32			unbalanced_load;
 	int			tx_rebalance_counter;
 	int			lp_counter;
-	/* -------- non-dynamic tlb mode only ---------*/
-	struct tlb_up_slave __rcu *slave_arr;	  /* Up slaves */
 	/* -------- rlb parameters -------- */
 	int rlb_enabled;
 	struct rlb_client_info	*rx_hashtbl;	/* Receive hash table */
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index c2adc2755ff6..692dedf2b73b 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -210,6 +210,7 @@ static int bond_init(struct net_device *bond_dev);
 static void bond_uninit(struct net_device *bond_dev);
 static struct rtnl_link_stats64 *bond_get_stats(struct net_device *bond_dev,
 						struct rtnl_link_stats64 *stats);
+static void bond_slave_arr_handler(struct work_struct *work);
 
 /*---------------------------- General routines -----------------------------*/
 
@@ -1551,6 +1552,9 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 		unblock_netpoll_tx();
 	}
 
+	if (bond_mode_uses_xmit_hash(bond))
+		bond_update_slave_arr(bond, NULL);
+
 	netdev_info(bond_dev, "Enslaving %s as %s interface with %s link\n",
 		    slave_dev->name,
 		    bond_is_active_slave(new_slave) ? "an active" : "a backup",
@@ -1668,6 +1672,9 @@ static int __bond_release_one(struct net_device *bond_dev,
 	if (BOND_MODE(bond) == BOND_MODE_8023AD)
 		bond_3ad_unbind_slave(slave);
 
+	if (bond_mode_uses_xmit_hash(bond))
+		bond_update_slave_arr(bond, slave);
+
 	netdev_info(bond_dev, "Releasing %s interface %s\n",
 		    bond_is_active_slave(slave) ? "active" : "backup",
 		    slave_dev->name);
@@ -1970,6 +1977,9 @@ static void bond_miimon_commit(struct bonding *bond)
 				bond_alb_handle_link_change(bond, slave,
 							    BOND_LINK_UP);
 
+			if (BOND_MODE(bond) == BOND_MODE_XOR)
+				bond_update_slave_arr(bond, NULL);
+
 			if (!bond->curr_active_slave || slave == primary)
 				goto do_failover;
 
@@ -1997,6 +2007,9 @@ static void bond_miimon_commit(struct bonding *bond)
 				bond_alb_handle_link_change(bond, slave,
 							    BOND_LINK_DOWN);
 
+			if (BOND_MODE(bond) == BOND_MODE_XOR)
+				bond_update_slave_arr(bond, NULL);
+
 			if (slave == rcu_access_pointer(bond->curr_active_slave))
 				goto do_failover;
 
@@ -2453,6 +2466,8 @@ static void bond_loadbalance_arp_mon(struct work_struct *work)
 
 		if (slave_state_changed) {
 			bond_slave_state_change(bond);
+			if (BOND_MODE(bond) == BOND_MODE_XOR)
+				bond_update_slave_arr(bond, NULL);
 		} else if (do_failover) {
 			block_netpoll_tx();
 			bond_select_active_slave(bond);
@@ -2829,8 +2844,20 @@ static int bond_slave_netdev_event(unsigned long event,
 			if (old_duplex != slave->duplex)
 				bond_3ad_adapter_duplex_changed(slave);
 		}
+		/* Refresh slave-array if applicable!
+		 * If the setup does not use miimon or arpmon (mode-specific!),
+		 * then these events will not cause the slave-array to be
+		 * refreshed. This will cause xmit to use a slave that is not
+		 * usable. Avoid such situation by refreshing the array at these
+		 * events. If these (miimon/arpmon) parameters are configured
+		 * then array gets refreshed twice and that should be fine!
+		 */
+		if (bond_mode_uses_xmit_hash(bond))
+			bond_update_slave_arr(bond, NULL);
 		break;
 	case NETDEV_DOWN:
+		if (bond_mode_uses_xmit_hash(bond))
+			bond_update_slave_arr(bond, NULL);
 		break;
 	case NETDEV_CHANGEMTU:
 		/* TODO: Should slaves be allowed to
@@ -3010,6 +3037,7 @@ static void bond_work_init_all(struct bonding *bond)
 	else
 		INIT_DELAYED_WORK(&bond->arp_work, bond_loadbalance_arp_mon);
 	INIT_DELAYED_WORK(&bond->ad_work, bond_3ad_state_machine_handler);
+	INIT_DELAYED_WORK(&bond->slave_arr_work, bond_slave_arr_handler);
 }
 
 static void bond_work_cancel_all(struct bonding *bond)
@@ -3019,6 +3047,7 @@ static void bond_work_cancel_all(struct bonding *bond)
 	cancel_delayed_work_sync(&bond->alb_work);
 	cancel_delayed_work_sync(&bond->ad_work);
 	cancel_delayed_work_sync(&bond->mcast_work);
+	cancel_delayed_work_sync(&bond->slave_arr_work);
 }
 
 static int bond_open(struct net_device *bond_dev)
@@ -3068,6 +3097,9 @@ static int bond_open(struct net_device *bond_dev)
 		bond_3ad_initiate_agg_selection(bond, 1);
 	}
 
+	if (bond_mode_uses_xmit_hash(bond))
+		bond_update_slave_arr(bond, NULL);
+
 	return 0;
 }
 
@@ -3573,20 +3605,148 @@ static int bond_xmit_activebackup(struct sk_buff *skb, struct net_device *bond_d
 	return NETDEV_TX_OK;
 }
 
-/* In bond_xmit_xor() , we determine the output device by using a pre-
- * determined xmit_hash_policy(), If the selected device is not enabled,
- * find the next active slave.
+/* Use this to update slave_array when (a) it's not appropriate to update
+ * slave_array right away (note that update_slave_array() may sleep)
+ * and / or (b) RTNL is not held.
  */
-static int bond_xmit_xor(struct sk_buff *skb, struct net_device *bond_dev)
+void bond_slave_arr_work_rearm(struct bonding *bond, unsigned long delay)
 {
-	struct bonding *bond = netdev_priv(bond_dev);
-	int slave_cnt = ACCESS_ONCE(bond->slave_cnt);
+	queue_delayed_work(bond->wq, &bond->slave_arr_work, delay);
+}
 
-	if (likely(slave_cnt))
-		bond_xmit_slave_id(bond, skb,
-				   bond_xmit_hash(bond, skb) % slave_cnt);
-	else
+/* Slave array work handler. Holds only RTNL */
+static void bond_slave_arr_handler(struct work_struct *work)
+{
+	struct bonding *bond = container_of(work, struct bonding,
+					    slave_arr_work.work);
+	int ret;
+
+	if (!rtnl_trylock())
+		goto err;
+
+	ret = bond_update_slave_arr(bond, NULL);
+	rtnl_unlock();
+	if (ret) {
+		pr_warn_ratelimited("Failed to update slave array from WT\n");
+		goto err;
+	}
+	return;
+
+err:
+	bond_slave_arr_work_rearm(bond, 1);
+}
+
+/* Build the usable slaves array in control path for modes that use xmit-hash
+ * to determine the slave interface -
+ * (a) BOND_MODE_8023AD
+ * (b) BOND_MODE_XOR
+ * (c) BOND_MODE_TLB && tlb_dynamic_lb == 0
+ *
+ * The caller is expected to hold RTNL only and NO other lock!
+ */
+int bond_update_slave_arr(struct bonding *bond, struct slave *skipslave)
+{
+	struct slave *slave;
+	struct list_head *iter;
+	struct bond_up_slave *new_arr, *old_arr;
+	int slaves_in_agg;
+	int agg_id = 0;
+	int ret = 0;
+
+#ifdef CONFIG_LOCKDEP
+	lockdep_assert_held(&bond->mode_lock);
+#endif
+
+	new_arr = kzalloc(offsetof(struct bond_up_slave, arr[bond->slave_cnt]),
+			  GFP_KERNEL);
+	if (!new_arr) {
+		ret = -ENOMEM;
+		pr_err("Failed to build slave-array.\n");
+		goto out;
+	}
+	if (BOND_MODE(bond) == BOND_MODE_8023AD) {
+		struct ad_info ad_info;
+
+		if (bond_3ad_get_active_agg_info(bond, &ad_info)) {
+			pr_debug("bond_3ad_get_active_agg_info failed\n");
+			kfree_rcu(new_arr, rcu);
+			/* No active aggregator means it's not safe to use
+			 * the previous array.
+			 */
+			old_arr = rtnl_dereference(bond->slave_arr);
+			if (old_arr) {
+				RCU_INIT_POINTER(bond->slave_arr, NULL);
+				kfree_rcu(old_arr, rcu);
+			}
+			goto out;
+		}
+		slaves_in_agg = ad_info.ports;
+		agg_id = ad_info.aggregator_id;
+	}
+	bond_for_each_slave(bond, slave, iter) {
+		if (BOND_MODE(bond) == BOND_MODE_8023AD) {
+			struct aggregator *agg;
+
+			agg = SLAVE_AD_INFO(slave)->port.aggregator;
+			if (!agg || agg->aggregator_identifier != agg_id)
+				continue;
+		}
+		if (!bond_slave_can_tx(slave))
+			continue;
+		if (skipslave == slave)
+			continue;
+		new_arr->arr[new_arr->count++] = slave;
+	}
+
+	old_arr = rtnl_dereference(bond->slave_arr);
+	rcu_assign_pointer(bond->slave_arr, new_arr);
+	if (old_arr)
+		kfree_rcu(old_arr, rcu);
+out:
+	if (ret != 0 && skipslave) {
+		int idx;
+
+		/* Rare situation where caller has asked to skip a specific
+		 * slave but allocation failed (most likely!). BTW this is
+		 * only possible when the call is initiated from
+		 * __bond_release_one(). In this situation; overwrite the
+		 * skipslave entry in the array with the last entry from the
+		 * array to avoid a situation where the xmit path may choose
+		 * this to-be-skipped slave to send a packet out.
+		 */
+		old_arr = rtnl_dereference(bond->slave_arr);
+		for (idx = 0; idx < old_arr->count; idx++) {
+			if (skipslave == old_arr->arr[idx]) {
+				old_arr->arr[idx] =
+				    old_arr->arr[old_arr->count-1];
+				old_arr->count--;
+				break;
+			}
+		}
+	}
+	return ret;
+}
+
+/* Use this Xmit function for 3AD as well as XOR modes. The current
+ * usable slave array is formed in the control path. The xmit function
+ * just calculates hash and sends the packet out.
+ */
+int bond_3ad_xor_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct bonding *bond = netdev_priv(dev);
+	struct slave *slave;
+	struct bond_up_slave *slaves;
+	unsigned int count;
+
+	slaves = rcu_dereference(bond->slave_arr);
+	count = slaves ? ACCESS_ONCE(slaves->count) : 0;
+	if (likely(count)) {
+		slave = slaves->arr[bond_xmit_hash(bond, skb) % count];
+		bond_dev_queue_xmit(bond, skb, slave->dev);
+	} else {
 		dev_kfree_skb_any(skb);
+		atomic_long_inc(&dev->tx_dropped);
+	}
 
 	return NETDEV_TX_OK;
 }
@@ -3682,12 +3842,11 @@ static netdev_tx_t __bond_start_xmit(struct sk_buff *skb, struct net_device *dev
 		return bond_xmit_roundrobin(skb, dev);
 	case BOND_MODE_ACTIVEBACKUP:
 		return bond_xmit_activebackup(skb, dev);
+	case BOND_MODE_8023AD:
 	case BOND_MODE_XOR:
-		return bond_xmit_xor(skb, dev);
+		return bond_3ad_xor_xmit(skb, dev);
 	case BOND_MODE_BROADCAST:
 		return bond_xmit_broadcast(skb, dev);
-	case BOND_MODE_8023AD:
-		return bond_3ad_xmit_xor(skb, dev);
 	case BOND_MODE_ALB:
 		return bond_alb_xmit(skb, dev);
 	case BOND_MODE_TLB:
@@ -3861,6 +4020,7 @@ static void bond_uninit(struct net_device *bond_dev)
 	struct bonding *bond = netdev_priv(bond_dev);
 	struct list_head *iter;
 	struct slave *slave;
+	struct bond_up_slave *arr;
 
 	bond_netpoll_cleanup(bond_dev);
 
@@ -3869,6 +4029,12 @@ static void bond_uninit(struct net_device *bond_dev)
 		__bond_release_one(bond_dev, slave->dev, true);
 	netdev_info(bond_dev, "Released all slaves\n");
 
+	arr = rtnl_dereference(bond->slave_arr);
+	if (arr) {
+		RCU_INIT_POINTER(bond->slave_arr, NULL);
+		kfree_rcu(arr, rcu);
+	}
+
 	list_del(&bond->bond_list);
 
 	bond_debug_unregister(bond);
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index 5b022da9cad2..10920f0686e2 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -179,6 +179,12 @@ struct slave {
 	struct rtnl_link_stats64 slave_stats;
 };
 
+struct bond_up_slave {
+	unsigned int	count;
+	struct rcu_head rcu;
+	struct slave	*arr[0];
+};
+
 /*
  * Link pseudo-state only used internally by monitors
  */
@@ -193,6 +199,7 @@ struct bonding {
 	struct   slave __rcu *curr_active_slave;
 	struct   slave __rcu *current_arp_slave;
 	struct   slave __rcu *primary_slave;
+	struct   bond_up_slave __rcu *slave_arr; /* Array of usable slaves */
 	bool     force_primary;
 	s32      slave_cnt; /* never change this value outside the attach/detach wrappers */
 	int     (*recv_probe)(const struct sk_buff *, struct bonding *,
@@ -222,6 +229,7 @@ struct bonding {
 	struct   delayed_work alb_work;
 	struct   delayed_work ad_work;
 	struct   delayed_work mcast_work;
+	struct   delayed_work slave_arr_work;
 #ifdef CONFIG_DEBUG_FS
 	/* debugging support via debugfs */
 	struct	 dentry *debug_dir;
@@ -534,6 +542,8 @@ const char *bond_slave_link_status(s8 link);
 struct bond_vlan_tag *bond_verify_device_path(struct net_device *start_dev,
 					      struct net_device *end_dev,
 					      int level);
+int bond_update_slave_arr(struct bonding *bond, struct slave *skipslave);
+void bond_slave_arr_work_rearm(struct bonding *bond, unsigned long delay);
 
 #ifdef CONFIG_PROC_FS
 void bond_create_proc_entry(struct bonding *bond);
-- 
2.1.0.rc2.206.gedb03e5

^ permalink raw reply related

* [PATCH nf next 0/3] bridge: netfilter: fix handling of ipv4 packets w. options
From: Florian Westphal @ 2014-10-04  1:04 UTC (permalink / raw)
  To: netfilter-devel; +Cc: bsd, stephen, netdev, herbert, eric.dumazet, davidn

David Newall reported that bridge causes bad checksums:
http://thread.gmane.org/gmane.linux.network/315705/focus=1706769

The proposal was to revert
462fb2af9788a82a5 (bridge : Sanitize skb before it enters the IP stack).

However, this has some other adverse effects since bridge netfilter
and ip stack both use skb->cb (and we thus memset skb->cb whenever
we hand skb off to the ip stack).

So, this series attempts to resolve this a bit differently.

First, let's add the inet_skb_parm padding that Eric suggested previously.
This means that any earlier setup of IPCB will be preserved inside the
bridge layer.

This is also useful for netfilter since it will preserve
IPCB(skb)->frag_max_size set up by ip defrag.

Second, this gets rid of the option parsing/memset calls in
the forward and output cases.

Third, the pre-routing path is changed to not mangle the packets
but to only validate the ip options.

This patch series is vs. next instead of net/nf tree.

This has been broken for so long that I don't think we need
to rush this.


^ permalink raw reply

* [PATCH nf next 1/3] bridge: prepend inet_skb_param dummy to bridge cb
From: Florian Westphal @ 2014-10-04  1:04 UTC (permalink / raw)
  To: netfilter-devel
  Cc: bsd, stephen, netdev, herbert, eric.dumazet, davidn,
	Florian Westphal
In-Reply-To: <1412384670-17794-1-git-send-email-fw@strlen.de>

The bridge can make upcalls into the ip stack; especially when bridge
netfilter is involved, we can end up calling ip_fragment().

IPv4 functions, however, may (rightfully) depend on skb->cb[]
containing the IPCB area, where e.g. earlier-parsed ip options
reside.

However, since the bridge has its own cb area, this has caused several
crashes in the past, and several call sites in br_netfilter have since
been zeroing ->cb again before invoking netfilter hooks.

We've tried to cure these in the past by applying memsets of skb->cb
where needed, and parsing ip options within the bridge layer.

This isn't such a great idea since we e.g. lose max fragment size
information stored there via ipv4 defrag.

Also, since 462fb2af9788a82 (bridge : Sanitize skb before it enters the IP
stack) bridge handling of received packets with ipv4 options is broken
in different ways (crash, then discarding of such packets).

This patch, originally proposed by Eric Dumazet, prepends
inet_skb_parm padding so IPCB contents will be preserved (e.g.
ipv4 defrag info).

This is a first step in fixing handling of ipv4 packets with options.

br_input_skb_cb is now exactly 48 bytes.

Cc: Bandan Das <bsd@redhat.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
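Note (illustration only, not part of the patch): the layout idea described
above can be sketched in user space.  The struct names, field sizes and
CB_SIZE below are made-up stand-ins, not the kernel definitions.  The
sketch only shows that reserving an IPCB-sized area first keeps the
bridge's private fields from overwriting it, and that a compile-time
assertion, in the spirit of the BUILD_BUG_ON() added to br_init(),
catches the control block outgrowing the cb area.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; the sizes are illustrative, not the kernel's. */
#define CB_SIZE 48	/* assumed size of the skb->cb[] scratch area */

struct fake_inet4_parm { unsigned char bytes[16]; };
struct fake_inet6_parm { unsigned char bytes[24]; };

struct fake_br_cb {
	/* Reserved first, so the private fields below never overlap it. */
	union {
		struct fake_inet4_parm inet4_parm;
		struct fake_inet6_parm inet6_parm;
	} inet_parm;
	void *brdev;
	int igmp;
};

/* Compile-time guard: fails the build if the struct outgrows the area. */
_Static_assert(sizeof(struct fake_br_cb) <= CB_SIZE,
	       "fake_br_cb must fit in the cb area");

/* Returns 1 if writing the private fields leaves the IPCB bytes intact. */
int fake_br_cb_preserves_ipcb(void)
{
	struct fake_br_cb cb;
	struct fake_inet4_parm expect;

	memset(&cb, 0, sizeof(cb));
	memset(&cb.inet_parm.inet4_parm, 0xab,
	       sizeof(cb.inet_parm.inet4_parm));
	memset(&expect, 0xab, sizeof(expect));

	/* The bridge layer only touches its own fields. */
	cb.brdev = &cb;
	cb.igmp = 1;

	return memcmp(&cb.inet_parm.inet4_parm, &expect,
		      sizeof(expect)) == 0;
}
```
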
 net/bridge/br.c         | 2 ++
 net/bridge/br_private.h | 6 ++++++
 2 files changed, 8 insertions(+)

diff --git a/net/bridge/br.c b/net/bridge/br.c
index 44425af..4ee730e 100644
--- a/net/bridge/br.c
+++ b/net/bridge/br.c
@@ -147,6 +147,8 @@ static int __init br_init(void)
 {
 	int err;
 
+	BUILD_BUG_ON(sizeof(struct br_input_skb_cb) > FIELD_SIZEOF(struct sk_buff, cb));
+
 	err = stp_proto_register(&br_stp_proto);
 	if (err < 0) {
 		pr_err("bridge: can't register sap for STP\n");
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index f53592f..559938f 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -19,6 +19,8 @@
 #include <linux/u64_stats_sync.h>
 #include <net/route.h>
 #include <linux/if_vlan.h>
+#include <linux/ipv6.h>
+#include <net/ip.h>
 
 #define BR_HASH_BITS 8
 #define BR_HASH_SIZE (1 << BR_HASH_BITS)
@@ -304,6 +306,10 @@ struct net_bridge
 };
 
 struct br_input_skb_cb {
+	union {
+		struct inet_skb_parm inet4_parm;
+		struct inet6_skb_parm inet6_param;
+	} inet_parm;
 	struct net_device *brdev;
 #ifdef CONFIG_BRIDGE_IGMP_SNOOPING
 	int igmp;
-- 
2.0.4


^ permalink raw reply related

* [PATCH nf next 2/3] netfilter: bridge: don't parse ip headers in fwd and output path
From: Florian Westphal @ 2014-10-04  1:04 UTC (permalink / raw)
  To: netfilter-devel
  Cc: bsd, stephen, netdev, herbert, eric.dumazet, davidn,
	Florian Westphal
In-Reply-To: <1412384670-17794-1-git-send-email-fw@strlen.de>

We currently call ip_options_compile() for incoming, forwarded,
and outgoing packets.

This prevents ipv4 packets that have ip options set from being processed
by a linux bridge; both for the forward and the local delivery cases.

We first call ip_options_compile() from the pre-routing path.
This will mangle the ipv4 packet, which is already problematic since
it is a layering violation.
But this also will invalidate the ipv4 header checksum.

Since the checksum isn't fixed up, we then drop the packet in the ipv4
input path (for local delivery to the bridge) or when forwarding the
packet (since br_nf_forward_ip() invokes br_parse_ip_options()), because
it has a checksum error.

For the output path, this is no longer needed either: the
previous change added inet_skb_parm padding to br_input_skb_cb, so
the bridge can no longer overwrite any IPCB data.

With this patch, such packets are not dropped by the bridge anymore,
but they are still corrupted.  This is addressed by the next patch.

Cc: Bandan Das <bsd@redhat.com>
Reported-by: David Newall <davidn@davidnewall.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
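Note (illustration only, not part of the patch): the checksum problem
described above can be reproduced in user space.  The sketch below uses
the standard ones'-complement Internet checksum (RFC 1071) over a
hand-built 28-byte header carrying a Record Route option.  The helper
names and the header contents are assumptions made for the example; the
"mangling" step simply mimics what an in-place record-route update might
do and is not the kernel code path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 checksum over an IPv4 header of len bytes (len must be even). */
uint16_t ip_header_csum(const uint8_t *hdr, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	assert(len % 2 == 0);
	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)((hdr[i] << 8) | hdr[i + 1]);
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/* Zero the checksum field (bytes 10-11), recompute it and store it. */
static void ip_header_set_csum(uint8_t *hdr, size_t len)
{
	uint16_t c;

	hdr[10] = 0;
	hdr[11] = 0;
	c = ip_header_csum(hdr, len);
	hdr[10] = (uint8_t)(c >> 8);
	hdr[11] = (uint8_t)(c & 0xff);
}

/* Returns 1 if the header verifies before the option is rewritten,
 * fails to verify afterwards, and verifies again once fixed up.
 */
int demo_mangle_breaks_csum(void)
{
	uint8_t hdr[28] = {
		0x47, 0x00, 0x00, 0x1c,	/* version 4, ihl 7, total length 28 */
		0x00, 0x00, 0x00, 0x00,	/* id, flags, fragment offset */
		64, 17, 0x00, 0x00,	/* ttl, protocol, checksum (unset) */
		10, 0, 0, 1,		/* source address */
		10, 0, 0, 2,		/* destination address */
		7, 7, 4, 0, 0, 0, 0, 0	/* Record Route: one empty slot, EOL */
	};
	int ok_before, ok_after, ok_fixed;

	ip_header_set_csum(hdr, sizeof(hdr));
	ok_before = ip_header_csum(hdr, sizeof(hdr)) == 0;

	/* Fill the route slot and advance the pointer, in place. */
	hdr[22] = 8;
	hdr[23] = 10;
	hdr[24] = 0;
	hdr[25] = 0;
	hdr[26] = 254;
	ok_after = ip_header_csum(hdr, sizeof(hdr)) == 0;

	ip_header_set_csum(hdr, sizeof(hdr));
	ok_fixed = ip_header_csum(hdr, sizeof(hdr)) == 0;

	return ok_before && !ok_after && ok_fixed;
}
```
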
 net/bridge/br_netfilter.c | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index fa1270c..56c7ed8 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -718,9 +718,6 @@ static unsigned int br_nf_forward_ip(const struct nf_hook_ops *ops,
 		nf_bridge->mask |= BRNF_PKT_TYPE;
 	}
 
-	if (pf == NFPROTO_IPV4 && br_parse_ip_options(skb))
-		return NF_DROP;
-
 	/* The physdev module checks on this */
 	nf_bridge->mask |= BRNF_BRIDGED;
 	nf_bridge->physoutdev = skb->dev;
@@ -778,12 +775,9 @@ static int br_nf_dev_queue_xmit(struct sk_buff *skb)
 
 	if (skb->protocol == htons(ETH_P_IP) &&
 	    skb->len + nf_bridge_mtu_reduction(skb) > skb->dev->mtu &&
-	    !skb_is_gso(skb)) {
-		if (br_parse_ip_options(skb))
-			/* Drop invalid packet */
-			return NF_DROP;
+	    !skb_is_gso(skb))
 		ret = ip_fragment(skb, br_dev_queue_push_xmit);
-	} else
+	else
 		ret = br_dev_queue_push_xmit(skb);
 
 	return ret;
-- 
2.0.4


^ permalink raw reply related

* [PATCH nf-next 3/3] netfilter: bridge: don't mangle ipv4 header options
From: Florian Westphal @ 2014-10-04  1:04 UTC (permalink / raw)
  To: netfilter-devel
  Cc: bsd, stephen, netdev, herbert, eric.dumazet, davidn,
	Florian Westphal
In-Reply-To: <1412384670-17794-1-git-send-email-fw@strlen.de>

A bridge is meant to be L3 protocol agnostic, so we should not act on
ipv4 header options.  Thus, ensure that skb data isn't modified when
parsing options and also remove the ip_options_rcv_srr() call.

The only purpose of this function is to do sanity tests so upcalls into
netfilter will have the same checks applied as done by the ipv4 input path.

Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Bandan Das <bsd@redhat.com>
Reported-by: David Newall <davidn@davidnewall.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
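Note (illustration only, not part of the patch): the "validate without
modifying the packet" approach can be sketched in user space as copying
the option bytes into a scratch buffer and checking only the copy, so
that even a checker which rewrites options cannot touch the packet.  The
function names and the simple type/length walk below are assumptions
made for the example; the patch itself hands the copy to
ip_options_compile().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IP_HDR_MIN	20	/* header without options */
#define IP_OPTS_MAX	40	/* (15 - 5) * 4 bytes of options at most */

/* Hypothetical checker: walk the option list, return 1 if well-formed. */
static int ip_opts_well_formed(const uint8_t *opt, size_t len)
{
	size_t i = 0;

	while (i < len) {
		uint8_t olen;

		if (opt[i] == 0)	/* End of Option List */
			break;
		if (opt[i] == 1) {	/* No Operation, single byte */
			i++;
			continue;
		}
		if (i + 1 >= len)	/* no room for the length byte */
			return 0;
		olen = opt[i + 1];
		if (olen < 2 || i + olen > len)
			return 0;
		i += olen;
	}
	return 1;
}

/* Returns 1 if the header and its options look sane.  The packet bytes
 * are only read; the options are checked on a local copy.
 */
int ip_header_opts_valid(const uint8_t *hdr, size_t avail)
{
	uint8_t scratch[IP_OPTS_MAX];
	size_t ihl, optlen;

	if (avail < IP_HDR_MIN || (hdr[0] >> 4) != 4)
		return 0;
	ihl = (size_t)(hdr[0] & 0x0f) * 4;
	if (ihl < IP_HDR_MIN || ihl > avail)
		return 0;
	optlen = ihl - IP_HDR_MIN;
	if (optlen == 0)
		return 1;

	assert(optlen <= IP_OPTS_MAX);
	memcpy(scratch, hdr + IP_HDR_MIN, optlen);
	return ip_opts_well_formed(scratch, optlen);
}
```
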
 net/bridge/br_netfilter.c | 42 +++++++++++++++++++-----------------------
 1 file changed, 19 insertions(+), 23 deletions(-)

diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index 56c7ed8..f5cb2ef 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -185,14 +185,17 @@ static inline void nf_bridge_save_header(struct sk_buff *skb)
 					 skb->nf_bridge->data, header_size);
 }
 
-/* When handing a packet over to the IP layer
- * check whether we have a skb that is in the
- * expected format
+/* When handing an ipv4 packet over to netfilter, we must
+ * first replicate the sanity tests performed in the IP stack
+ * input path.  This includes making sure that the entire ip
+ * header is in the linear skb area, ip->ihl is sane, etc.
  */
-
-static int br_parse_ip_options(struct sk_buff *skb)
+static bool br_ip_input_valid(struct sk_buff *skb)
 {
-	struct ip_options *opt;
+	struct {
+		struct ip_options opt;
+		u8 hdrdata[0xf * 4];
+	} ip_opts;
 	const struct iphdr *iph;
 	struct net_device *dev = skb->dev;
 	u32 len;
@@ -201,7 +204,6 @@ static int br_parse_ip_options(struct sk_buff *skb)
 		goto inhdr_error;
 
 	iph = ip_hdr(skb);
-	opt = &(IPCB(skb)->opt);
 
 	/* Basic sanity checks */
 	if (iph->ihl < 5 || iph->version != 4)
@@ -228,28 +230,22 @@ static int br_parse_ip_options(struct sk_buff *skb)
 
 	memset(IPCB(skb), 0, sizeof(struct inet_skb_parm));
 	if (iph->ihl == 5)
-		return 0;
-
-	opt->optlen = iph->ihl*4 - sizeof(struct iphdr);
-	if (ip_options_compile(dev_net(dev), opt, skb))
-		goto inhdr_error;
+		return true;
 
-	/* Check correct handling of SRR option */
-	if (unlikely(opt->srr)) {
-		struct in_device *in_dev = __in_dev_get_rcu(dev);
-		if (in_dev && !IN_DEV_SOURCE_ROUTE(in_dev))
-			goto drop;
+	memset(&ip_opts.opt, 0, sizeof(ip_opts.opt));
+	ip_opts.opt.optlen = iph->ihl*4 - sizeof(struct iphdr);
+	memcpy(ip_opts.hdrdata, iph + 1, ip_opts.opt.optlen);
 
-		if (ip_options_rcv_srr(skb))
-			goto drop;
-	}
+	/* We only call this to validate iph options. */
+	if (ip_options_compile(dev_net(dev), &ip_opts.opt, NULL))
+		goto inhdr_error;
 
-	return 0;
+	return true;
 
 inhdr_error:
 	IP_INC_STATS_BH(dev_net(dev), IPSTATS_MIB_INHDRERRORS);
 drop:
-	return -1;
+	return false;
 }
 
 /* PF_BRIDGE/PRE_ROUTING *********************************************/
@@ -617,7 +613,7 @@ static unsigned int br_nf_pre_routing(const struct nf_hook_ops *ops,
 
 	nf_bridge_pull_encap_header_rcsum(skb);
 
-	if (br_parse_ip_options(skb))
+	if (!br_ip_input_valid(skb))
 		return NF_DROP;
 
 	nf_bridge_put(skb->nf_bridge);
-- 
2.0.4


^ permalink raw reply related

* Re: [PATCH nf next 0/3] bridge: netfilter: fix handling of ipv4 packets w. options
From: Herbert Xu @ 2014-10-04  3:56 UTC (permalink / raw)
  To: Florian Westphal
  Cc: netfilter-devel, bsd, stephen, netdev, eric.dumazet, davidn,
	Bandan Das
In-Reply-To: <1412384670-17794-1-git-send-email-fw@strlen.de>

On Sat, Oct 04, 2014 at 03:04:27AM +0200, Florian Westphal wrote:
> David Newall reported that bridge causes bad checksums:
> http://thread.gmane.org/gmane.linux.network/315705/focus=1706769
> 
> The proposal was to revert
> 462fb2af9788a82a5 (bridge : Sanitize skb before it enters the IP stack).
> 
> However, this has some other adverse effects since bridge netfilter
> and ip stack both use skb->cb (and we thus memset skb->cb whenever
> we hand skb off to the ip stack).
> 
> So, this series attempts to resolve this a bit differently.
> 
> First, let's add the inet_skb_parm padding that Eric suggested previously.
> This means that any earlier setup of IPCB will be preserved inside the
> bridge layer.
> 
> This is also useful for netfilter since it will preserve
> IPCB(skb)->frag_max_size set up by ip defrag.
> 
> Second, this gets rid of the option parsing/memset calls in
> the forward and output cases.
> 
> Third, the pre-routing path is changed to not mangle the packets
> but to only validate the ip options.
> 
> This patch series is vs. next instead of net/nf tree.
> 
> This has been broken for so long that I don't think we need
> to rush this.

I'm unsure whether this is the right approach.  So if I understand
this correctly your problem is coming from packets that are

	IP stack => bridge => IP stack

in which case preserving IP options may work.

But does your patch handle packets that are

	external => bridge => IP stack

The reason I asked for the IPCB to be built is to handle exactly
that case.

In fact, even preserving IPCB in the IP stack reentry case is
a hack since if we ever change the IP stack in future such that
on exit the IPCB is no longer valid for reentry your approach
will fail.

Now as to your original problem that ip_options_compile mangles
the packet, this is something I explicitly said we should fix
before we added br_parse_ip_options (point 2 in that email):

	https://lkml.org/lkml/2010/9/3/16

Unfortunately it looks like nobody actually did the audit.

So my suggestion would be to fix br_parse_ip_options so that
it never mangles the packet.

Does this fix your problem or are there other issues that I
have overlooked?

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* [PATCH net-next] net: skb_segment() provides list head and tail
From: Eric Dumazet @ 2014-10-04  3:59 UTC (permalink / raw)
  To: David Miller
  Cc: brouer, netdev, therbert, hannes, fw, dborkman, jhs,
	alexander.duyck, john.r.fastabend
In-Reply-To: <1412375467.17245.16.camel@edumazet-glaptop2.roam.corp.google.com>

From: Eric Dumazet <edumazet@google.com>

It's unfortunate we have to walk the skb list again to find the tail
after segmentation, even if the data is probably hot in cpu caches.

skb_segment() can store the tail of the list into segs->prev,
and validate_xmit_skb_list() can immediately get the tail.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
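Note (illustration only, not part of the patch): the idea of stashing the
list tail in the head's prev pointer can be sketched in user space.  The
node type and both helpers are hypothetical stand-ins for sk_buff,
skb_segment() and the loop in validate_xmit_skb_list().  They only show
why the append becomes O(1) once the producer records the tail.

```c
#include <stddef.h>

struct node {
	struct node *next;
	struct node *prev;
};

/* Stand-in for segmentation: link n nodes into a list and record the
 * tail in head->prev.  A single node ends up pointing to itself.
 */
struct node *chain_segments(struct node *nodes, size_t n)
{
	size_t i;

	if (n == 0)
		return NULL;
	for (i = 0; i + 1 < n; i++)
		nodes[i].next = &nodes[i + 1];
	nodes[n - 1].next = NULL;
	nodes[0].prev = &nodes[n - 1];
	return &nodes[0];
}

/* Append a chain to the list described by *head and *tail without
 * walking it: the new tail is read straight from seg->prev.
 */
void append_chain(struct node **head, struct node **tail, struct node *seg)
{
	if (!seg)
		return;
	if (!*head)
		*head = seg;
	else
		(*tail)->next = seg;
	*tail = seg->prev;
}
```

Without the recorded tail, each append would have to follow next pointers
to the end of the new chain, which is the walk the patch removes.
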
 net/core/dev.c    |   27 +++++++++++++++------------
 net/core/skbuff.c |    5 +++++
 2 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 1a90530f83ff..7d5691cc1f47 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2724,22 +2724,25 @@ struct sk_buff *validate_xmit_skb_list(struct sk_buff *skb, struct net_device *d
 {
 	struct sk_buff *next, *head = NULL, *tail;
 
-	while (skb) {
+	for (; skb != NULL; skb = next) {
 		next = skb->next;
 		skb->next = NULL;
+
+		/* in case skb wont be segmented, point to itself */
+		skb->prev = skb;
+
 		skb = validate_xmit_skb(skb, dev);
-		if (skb) {
-			struct sk_buff *end = skb;
+		if (!skb)
+			continue;
 
-			while (end->next)
-				end = end->next;
-			if (!head)
-				head = skb;
-			else
-				tail->next = skb;
-			tail = end;
-		}
-		skb = next;
+		if (!head)
+			head = skb;
+		else
+			tail->next = skb;
+		/* If skb was segmented, skb->prev points to
+		 * the last segment. If not, it still contains skb.
+		 */
+		tail = skb->prev;
 	}
 	return head;
 }
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a0b312fa3047..06b57ec91f32 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3083,6 +3083,11 @@ perform_csum_check:
 		}
 	} while ((offset += len) < head_skb->len);
 
+	/* Some callers want to get the end of the list.
+	 * Put it in segs->prev to avoid walking the list.
+	 * (see validate_xmit_skb_list() for example)
+	 */
+	segs->prev = tail;
 	return segs;
 
 err:

^ permalink raw reply related

