netfilter-devel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 nf-next 0/6] more bridge netfilter refactoring
@ 2015-03-12 17:05 Florian Westphal
  2015-03-12 17:05 ` [PATCH v2 nf-next 1/6] net: untangle ip_fragment and bridge netfilter Florian Westphal
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Florian Westphal @ 2015-03-12 17:05 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev

This is another batch towards the planned skb->nf_bridge removal.

These patches rid ip_fragment of the bridge netfilter mtu/ll hacks
and remove the nf_bridge_info->data area that is used to store
original mac address for refragmentation and neigh resolution.

The need to use it for the DNAT detection is already resolved
via

http://patchwork.ozlabs.org/patch/448342/

These patches go on top of this.

While at it, this series also replaces the 'mask' flags with
a tristate enum to separate those 'flags' that have to be visible
outside of bridge netfilter context and those that are internal.

Tested, on host connected to kvm-bridge:

ping -s $bignum $ip_behind_bridge

on bridge:
-j REDIRECT
-j DNAT --to-destination $ip_behind_bridge

 include/linux/netfilter_bridge.h          |   36 ----
 include/linux/skbuff.h                    |   16 +
 include/net/ip.h                          |    4 
 net/bridge/br_device.c                    |    2 
 net/bridge/br_netfilter.c                 |  246 +++++++++++++++++++++---------
 net/bridge/br_private.h                   |    2 
 net/ipv4/ip_output.c                      |   37 ++--
 net/ipv4/netfilter/nf_defrag_ipv4.c       |    2 
 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c |    2 
 9 files changed, 215 insertions(+), 132 deletions(-)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 nf-next 1/6] net: untangle ip_fragment and bridge netfilter
  2015-03-12 17:05 [PATCH v2 nf-next 0/6] more bridge netfilter refactoring Florian Westphal
@ 2015-03-12 17:05 ` Florian Westphal
  2015-03-13  0:38   ` Andy Zhou
  2015-03-16 22:55   ` Pablo Neira Ayuso
  2015-03-12 17:05 ` [PATCH v2 nf-next 2/6] netfilter: bridge: don't use nf_bridge_info to store mac header Florian Westphal
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 18+ messages in thread
From: Florian Westphal @ 2015-03-12 17:05 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, Florian Westphal, Andy Zhou

Long time ago it was possible for the netfilter ip_conntrack
core to call ip_fragment in POST_ROUTING hook.

This is no longer the case, so the only case where bridge netfilter
ends up calling ip_fragment is the direct call site in br_netfilter.c.

Add ll and mtu arguments for ip_fragment and then get rid of the bridge
netfilter specific helpers from ip_fragment.

Cc: Andy Zhou <azhou@nicira.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/linux/netfilter_bridge.h | 17 -----------------
 include/net/ip.h                 |  4 ++--
 net/bridge/br_netfilter.c        | 23 ++++++++++++++++++++---
 net/ipv4/ip_output.c             | 37 +++++++++++++++++++++----------------
 4 files changed, 43 insertions(+), 38 deletions(-)

diff --git a/include/linux/netfilter_bridge.h b/include/linux/netfilter_bridge.h
index ed0d3bf..fbbd5de 100644
--- a/include/linux/netfilter_bridge.h
+++ b/include/linux/netfilter_bridge.h
@@ -35,24 +35,8 @@ static inline unsigned int nf_bridge_encap_header_len(const struct sk_buff *skb)
 	}
 }
 
-static inline unsigned int nf_bridge_mtu_reduction(const struct sk_buff *skb)
-{
-	if (unlikely(skb->nf_bridge->mask & BRNF_PPPoE))
-		return PPPOE_SES_HLEN;
-	return 0;
-}
-
 int br_handle_frame_finish(struct sk_buff *skb);
 
-/* This is called by the IP fragmenting code and it ensures there is
- * enough room for the encapsulating header (if there is one). */
-static inline unsigned int nf_bridge_pad(const struct sk_buff *skb)
-{
-	if (skb->nf_bridge)
-		return nf_bridge_encap_header_len(skb);
-	return 0;
-}
-
 static inline void br_drop_fake_rtable(struct sk_buff *skb)
 {
 	struct dst_entry *dst = skb_dst(skb);
@@ -62,7 +46,6 @@ static inline void br_drop_fake_rtable(struct sk_buff *skb)
 }
 
 #else
-#define nf_bridge_pad(skb)			(0)
 #define br_drop_fake_rtable(skb)	        do { } while (0)
 #endif /* CONFIG_BRIDGE_NETFILTER */
 
diff --git a/include/net/ip.h b/include/net/ip.h
index 025c61c..2905a4b 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -108,8 +108,8 @@ int ip_local_deliver(struct sk_buff *skb);
 int ip_mr_input(struct sk_buff *skb);
 int ip_output(struct sock *sk, struct sk_buff *skb);
 int ip_mc_output(struct sock *sk, struct sk_buff *skb);
-int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *));
-int ip_do_nat(struct sk_buff *skb);
+int ip_fragment(struct sk_buff *skb, unsigned int mtu,
+		unsigned int ll_rs, int (*output)(struct sk_buff *));
 void ip_send_check(struct iphdr *ip);
 int __ip_local_out(struct sk_buff *skb);
 int ip_local_out_sk(struct sock *sk, struct sk_buff *skb);
diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index bd2d24d..550ee19 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -812,26 +812,43 @@ static int br_nf_push_frag_xmit(struct sk_buff *skb)
 	return br_dev_queue_push_xmit(skb);
 }
 
+static unsigned int nf_bridge_mtu_reduction(const struct sk_buff *skb)
+{
+	if (skb->nf_bridge->mask & BRNF_PPPoE)
+		return PPPOE_SES_HLEN;
+	return 0;
+}
+
 static int br_nf_dev_queue_xmit(struct sk_buff *skb)
 {
 	int ret;
 	int frag_max_size;
-	unsigned int mtu_reserved;
+	unsigned int mtu_reserved, mtu;
 
 	if (skb_is_gso(skb) || skb->protocol != htons(ETH_P_IP))
 		return br_dev_queue_push_xmit(skb);
 
 	mtu_reserved = nf_bridge_mtu_reduction(skb);
+	mtu = min(skb->dev->mtu, IP_MAX_MTU);
 	/* This is wrong! We should preserve the original fragment
 	 * boundaries by preserving frag_list rather than refragmenting.
 	 */
-	if (skb->len + mtu_reserved > skb->dev->mtu) {
+	if (skb->len + mtu_reserved > mtu) {
+		unsigned int llrs;
+
 		frag_max_size = BR_INPUT_SKB_CB(skb)->frag_max_size;
 		if (br_parse_ip_options(skb))
 			/* Drop invalid packet */
 			return NF_DROP;
 		IPCB(skb)->frag_max_size = frag_max_size;
-		ret = ip_fragment(skb, br_nf_push_frag_xmit);
+
+		/* for bridged IP traffic encapsulated inside f.e. a vlan header,
+		 * we need to make room for the encapsulating header
+		 */
+		llrs = nf_bridge_encap_header_len(skb);
+
+		mtu -= mtu_reserved;
+		ret = ip_fragment(skb, mtu, llrs, br_nf_push_frag_xmit);
 	} else
 		ret = br_dev_queue_push_xmit(skb);
 
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index a7aea20..fe5ec3f 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -216,6 +216,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
 	netdev_features_t features;
 	struct sk_buff *segs;
 	int ret = 0;
+	unsigned int mtu;
 
 	/* common case: locally created skb or seglen is <= mtu */
 	if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
@@ -236,6 +237,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
 		return -ENOMEM;
 	}
 
+	mtu = ip_skb_dst_mtu(skb);
 	consume_skb(skb);
 
 	do {
@@ -243,7 +245,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
 		int err;
 
 		segs->next = NULL;
-		err = ip_fragment(segs, ip_finish_output2);
+		err = ip_fragment(segs, mtu, 0, ip_finish_output2);
 
 		if (err && ret == 0)
 			ret = err;
@@ -255,6 +257,8 @@ static int ip_finish_output_gso(struct sk_buff *skb)
 
 static int ip_finish_output(struct sk_buff *skb)
 {
+	unsigned int mtu;
+
 #if defined(CONFIG_NETFILTER) && defined(CONFIG_XFRM)
 	/* Policy lookup after SNAT yielded a new policy */
 	if (skb_dst(skb)->xfrm != NULL) {
@@ -265,8 +269,9 @@ static int ip_finish_output(struct sk_buff *skb)
 	if (skb_is_gso(skb))
 		return ip_finish_output_gso(skb);
 
-	if (skb->len > ip_skb_dst_mtu(skb))
-		return ip_fragment(skb, ip_finish_output2);
+	mtu = ip_skb_dst_mtu(skb);
+	if (skb->len > mtu)
+		return ip_fragment(skb, mtu, 0, ip_finish_output2);
 
 	return ip_finish_output2(skb);
 }
@@ -472,20 +477,28 @@ static void ip_copy_metadata(struct sk_buff *to, struct sk_buff *from)
 	skb_copy_secmark(to, from);
 }
 
-/*
+/**
+ *	ip_fragment - fragment IP datagram or send ICMP error
+ *
+ *	@skb: the skb to fragment
+ *	@mtu: mtu to use for fragmentation
+ *	@ll_rs: extra linklayer space required
+ *	@output: transmit function used to send fragments
+ *
  *	This IP datagram is too large to be sent in one piece.  Break it up into
  *	smaller pieces (each of size equal to IP header plus
  *	a block of the data of the original IP data part) that will yet fit in a
  *	single device frame, and queue such a frame for sending.
  */
-
-int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
+int ip_fragment(struct sk_buff *skb,
+		unsigned int mtu, unsigned int ll_rs,
+		int (*output)(struct sk_buff *))
 {
 	struct iphdr *iph;
 	int ptr;
 	struct net_device *dev;
 	struct sk_buff *skb2;
-	unsigned int mtu, hlen, left, len, ll_rs;
+	unsigned int hlen, left, len;
 	int offset;
 	__be16 not_last_frag;
 	struct rtable *rt = skb_rtable(skb);
@@ -499,7 +512,6 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 
 	iph = ip_hdr(skb);
 
-	mtu = ip_skb_dst_mtu(skb);
 	if (unlikely(((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) ||
 		     (IPCB(skb)->frag_max_size &&
 		      IPCB(skb)->frag_max_size > mtu))) {
@@ -516,10 +528,6 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 
 	hlen = iph->ihl * 4;
 	mtu = mtu - hlen;	/* Size of data space */
-#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-	if (skb->nf_bridge)
-		mtu -= nf_bridge_mtu_reduction(skb);
-#endif
 	IPCB(skb)->flags |= IPSKB_FRAG_COMPLETE;
 
 	/* When frag_list is given, use it. First, check its validity:
@@ -636,10 +644,7 @@ slow_path:
 	left = skb->len - hlen;		/* Space per frame */
 	ptr = hlen;		/* Where to start from */
 
-	/* for bridged IP traffic encapsulated inside f.e. a vlan header,
-	 * we need to make room for the encapsulating header
-	 */
-	ll_rs = LL_RESERVED_SPACE_EXTRA(rt->dst.dev, nf_bridge_pad(skb));
+	ll_rs = LL_RESERVED_SPACE_EXTRA(rt->dst.dev, ll_rs);
 
 	/*
 	 *	Fragment the datagram.
-- 
2.0.5


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 nf-next 2/6] netfilter: bridge: don't use nf_bridge_info to store mac header
  2015-03-12 17:05 [PATCH v2 nf-next 0/6] more bridge netfilter refactoring Florian Westphal
  2015-03-12 17:05 ` [PATCH v2 nf-next 1/6] net: untangle ip_fragment and bridge netfilter Florian Westphal
@ 2015-03-12 17:05 ` Florian Westphal
  2015-03-12 17:05 ` [PATCH v2 nf-next 3/6] netfilter: bridge: use skb->cb to track otherhost mangling Florian Westphal
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Florian Westphal @ 2015-03-12 17:05 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, Florian Westphal

Currently br_netfilter maintains an extra state, nf_bridge_info,
which is attached to skb via skb->nf_bridge pointer.  Amongst
other things we use skb->nf_bridge->data to store the original
mac header for every processed skb.

This is required for ip refragmentation when using conntrack
on top of bridge, because ip_fragment doesn't copy it from original skb.

However there is no need anymore to do this unconditionally.

Move this to the one place where its needed -- when br_netfilter calls
ip_fragment().

Also switch to percpu storage for this, there is no need to use
skb->nf_bridge->data.

After this change, only one user of skb->nf_bridge->data is left.
It will be removed by followup patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/linux/netfilter_bridge.h | 12 --------
 net/bridge/br_netfilter.c        | 66 +++++++++++++++++++++++++---------------
 2 files changed, 42 insertions(+), 36 deletions(-)

diff --git a/include/linux/netfilter_bridge.h b/include/linux/netfilter_bridge.h
index fbbd5de..b131613 100644
--- a/include/linux/netfilter_bridge.h
+++ b/include/linux/netfilter_bridge.h
@@ -23,18 +23,6 @@ enum nf_br_hook_priorities {
 #define BRNF_8021Q			0x10
 #define BRNF_PPPoE			0x20
 
-static inline unsigned int nf_bridge_encap_header_len(const struct sk_buff *skb)
-{
-	switch (skb->protocol) {
-	case __cpu_to_be16(ETH_P_8021Q):
-		return VLAN_HLEN;
-	case __cpu_to_be16(ETH_P_PPP_SES):
-		return PPPOE_SES_HLEN;
-	default:
-		return 0;
-	}
-}
-
 int br_handle_frame_finish(struct sk_buff *skb);
 
 static inline void br_drop_fake_rtable(struct sk_buff *skb)
diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index 550ee19..669b4fa 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -111,6 +111,27 @@ static inline __be16 pppoe_proto(const struct sk_buff *skb)
 	 pppoe_proto(skb) == htons(PPP_IPV6) && \
 	 brnf_filter_pppoe_tagged)
 
+static unsigned int nf_bridge_encap_header_len(const struct sk_buff *skb)
+{
+	switch (skb->protocol) {
+	case __cpu_to_be16(ETH_P_8021Q):
+		return VLAN_HLEN;
+	case __cpu_to_be16(ETH_P_PPP_SES):
+		return PPPOE_SES_HLEN;
+	default:
+		break;
+	}
+	return 0;
+}
+
+/* largest possible L2 header, see br_nf_dev_queue_xmit() */
+#define NF_BRIDGE_MAX_MAC_HEADER_LENGTH (PPPOE_SES_HLEN + ETH_HLEN)
+
+#if IS_ENABLED(CONFIG_NF_DEFRAG_IPV4)
+static DEFINE_PER_CPU(__u8 [NF_BRIDGE_MAX_MAC_HEADER_LENGTH],
+		      brnf_mac_header_storage);
+#endif
+
 static inline struct rtable *bridge_parent_rtable(const struct net_device *dev)
 {
 	struct net_bridge_port *port;
@@ -177,14 +198,6 @@ static inline void nf_bridge_pull_encap_header_rcsum(struct sk_buff *skb)
 	skb->network_header += len;
 }
 
-static inline void nf_bridge_save_header(struct sk_buff *skb)
-{
-	int header_size = ETH_HLEN + nf_bridge_encap_header_len(skb);
-
-	skb_copy_from_linear_data_offset(skb, -header_size,
-					 skb->nf_bridge->data, header_size);
-}
-
 /* When handing a packet over to the IP layer
  * check whether we have a skb that is in the
  * expected format
@@ -785,30 +798,28 @@ static unsigned int br_nf_forward_arp(const struct nf_hook_ops *ops,
 }
 
 #if IS_ENABLED(CONFIG_NF_DEFRAG_IPV4)
-static bool nf_bridge_copy_header(struct sk_buff *skb)
+static int br_nf_push_frag_xmit(struct sk_buff *skb)
 {
-	int err;
 	unsigned int header_size;
+	unsigned int encap_size;
+	char *mac;
+	int err;
 
 	nf_bridge_update_protocol(skb);
-	header_size = ETH_HLEN + nf_bridge_encap_header_len(skb);
-	err = skb_cow_head(skb, header_size);
-	if (err)
-		return false;
 
-	skb_copy_to_linear_data_offset(skb, -header_size,
-				       skb->nf_bridge->data, header_size);
-	__skb_push(skb, nf_bridge_encap_header_len(skb));
-	return true;
-}
+	encap_size = nf_bridge_encap_header_len(skb);
+	header_size = ETH_HLEN + encap_size;
 
-static int br_nf_push_frag_xmit(struct sk_buff *skb)
-{
-	if (!nf_bridge_copy_header(skb)) {
+	err = skb_cow_head(skb, header_size);
+	if (err) {
 		kfree_skb(skb);
 		return 0;
 	}
 
+	mac = this_cpu_ptr(brnf_mac_header_storage);
+	skb_copy_to_linear_data_offset(skb, -header_size, mac, header_size);
+	__skb_push(skb, encap_size);
+
 	return br_dev_queue_push_xmit(skb);
 }
 
@@ -834,7 +845,8 @@ static int br_nf_dev_queue_xmit(struct sk_buff *skb)
 	 * boundaries by preserving frag_list rather than refragmenting.
 	 */
 	if (skb->len + mtu_reserved > mtu) {
-		unsigned int llrs;
+		unsigned int llrs, header_size;
+		char *mac;
 
 		frag_max_size = BR_INPUT_SKB_CB(skb)->frag_max_size;
 		if (br_parse_ip_options(skb))
@@ -847,6 +859,13 @@ static int br_nf_dev_queue_xmit(struct sk_buff *skb)
 		 */
 		llrs = nf_bridge_encap_header_len(skb);
 
+		mac = this_cpu_ptr(brnf_mac_header_storage);
+
+		header_size = ETH_HLEN + nf_bridge_encap_header_len(skb);
+
+		skb_copy_from_linear_data_offset(skb, -header_size, mac,
+						 header_size);
+
 		mtu -= mtu_reserved;
 		ret = ip_fragment(skb, mtu, llrs, br_nf_push_frag_xmit);
 	} else
@@ -898,7 +917,6 @@ static unsigned int br_nf_post_routing(const struct nf_hook_ops *ops,
 	}
 
 	nf_bridge_pull_encap_header(skb);
-	nf_bridge_save_header(skb);
 	if (pf == NFPROTO_IPV4)
 		skb->protocol = htons(ETH_P_IP);
 	else
-- 
2.0.5


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 nf-next 3/6] netfilter: bridge: use skb->cb to track otherhost mangling
  2015-03-12 17:05 [PATCH v2 nf-next 0/6] more bridge netfilter refactoring Florian Westphal
  2015-03-12 17:05 ` [PATCH v2 nf-next 1/6] net: untangle ip_fragment and bridge netfilter Florian Westphal
  2015-03-12 17:05 ` [PATCH v2 nf-next 2/6] netfilter: bridge: don't use nf_bridge_info to store mac header Florian Westphal
@ 2015-03-12 17:05 ` Florian Westphal
  2015-03-12 18:02   ` Oliver Hartkopp
  2015-03-12 17:05 ` [PATCH v2 nf-next 4/6] netfilter: bridge: don't use nf_bridge_info to store proto value Florian Westphal
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Florian Westphal @ 2015-03-12 17:05 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, Florian Westphal

nf_bridge_info->mask is used for several things, for example to remember
if skb->pkt_type was set to OTHER_HOST.

For a bridge, OTHER_HOST is expected case. For ip forward its a
non-starter though -- routing expects PACKET_HOST.

Bridge netfilter thus changes OTHER_HOST to PACKET_HOST
before hook invocation and then un-does it after hook traversal.

For this, cb[] can be used since the skb will never be used outside
(fake inet) bridge forwarding while in 'fake PACKET_HOST' state.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/linux/netfilter_bridge.h |  1 -
 net/bridge/br_netfilter.c        | 71 ++++++++++++++++++++++++++--------------
 2 files changed, 47 insertions(+), 25 deletions(-)

diff --git a/include/linux/netfilter_bridge.h b/include/linux/netfilter_bridge.h
index b131613..05437f8 100644
--- a/include/linux/netfilter_bridge.h
+++ b/include/linux/netfilter_bridge.h
@@ -17,7 +17,6 @@ enum nf_br_hook_priorities {
 
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
 
-#define BRNF_PKT_TYPE			0x01
 #define BRNF_BRIDGED_DNAT		0x02
 #define BRNF_NF_BRIDGE_PREROUTING	0x08
 #define BRNF_8021Q			0x10
diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index 669b4fa..8649ef5 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -47,6 +47,17 @@
 #include <linux/sysctl.h>
 #endif
 
+struct nf_bridge_skb_cb {
+	union {
+		struct inet_skb_parm i;
+		struct inet6_skb_parm i6;
+	} u;
+
+	u8 packet_otherhost:1;
+};
+
+#define BRNF_CB(skb) ((struct nf_bridge_skb_cb *)(skb)->cb)
+
 #ifdef CONFIG_SYSCTL
 static struct ctl_table_header *brnf_sysctl_header;
 static int brnf_call_iptables __read_mostly = 1;
@@ -259,6 +270,29 @@ static void nf_bridge_update_protocol(struct sk_buff *skb)
 		skb->protocol = htons(ETH_P_PPP_SES);
 }
 
+static void nf_bridge_restore_otherhost(struct sk_buff *skb)
+{
+	struct nf_bridge_skb_cb *cb = BRNF_CB(skb);
+
+	if (cb->packet_otherhost) {
+		cb->packet_otherhost = 0;
+		skb->pkt_type = PACKET_OTHERHOST;
+	}
+}
+
+static void nf_bridge_save_otherhost(struct sk_buff *skb)
+{
+	struct nf_bridge_skb_cb *cb;
+
+	if (skb->pkt_type != PACKET_OTHERHOST)
+		return;
+
+	cb = BRNF_CB(skb);
+
+	cb->packet_otherhost = 1;
+	skb->pkt_type = PACKET_HOST;
+}
+
 /* PF_BRIDGE/PRE_ROUTING *********************************************/
 /* Undo the changes made for ip6tables PREROUTING and continue the
  * bridge PRE_ROUTING hook. */
@@ -267,10 +301,8 @@ static int br_nf_pre_routing_finish_ipv6(struct sk_buff *skb)
 	struct nf_bridge_info *nf_bridge = skb->nf_bridge;
 	struct rtable *rt;
 
-	if (nf_bridge->mask & BRNF_PKT_TYPE) {
-		skb->pkt_type = PACKET_OTHERHOST;
-		nf_bridge->mask ^= BRNF_PKT_TYPE;
-	}
+	nf_bridge_restore_otherhost(skb);
+
 	nf_bridge->mask ^= BRNF_NF_BRIDGE_PREROUTING;
 
 	rt = bridge_parent_rtable(nf_bridge->physindev);
@@ -400,10 +432,7 @@ static int br_nf_pre_routing_finish(struct sk_buff *skb)
 	frag_max_size = IPCB(skb)->frag_max_size;
 	BR_INPUT_SKB_CB(skb)->frag_max_size = frag_max_size;
 
-	if (nf_bridge->mask & BRNF_PKT_TYPE) {
-		skb->pkt_type = PACKET_OTHERHOST;
-		nf_bridge->mask ^= BRNF_PKT_TYPE;
-	}
+	nf_bridge_restore_otherhost(skb);
 	nf_bridge->mask ^= BRNF_NF_BRIDGE_PREROUTING;
 	if (dnat_took_place(skb)) {
 		if ((err = ip_route_input(skb, iph->daddr, iph->saddr, iph->tos, dev))) {
@@ -485,11 +514,10 @@ static struct net_device *brnf_get_logical_dev(struct sk_buff *skb, const struct
 static struct net_device *setup_pre_routing(struct sk_buff *skb)
 {
 	struct nf_bridge_info *nf_bridge = skb->nf_bridge;
+	struct nf_bridge_skb_cb *cb = BRNF_CB(skb);
 
-	if (skb->pkt_type == PACKET_OTHERHOST) {
-		skb->pkt_type = PACKET_HOST;
-		nf_bridge->mask |= BRNF_PKT_TYPE;
-	}
+	cb->packet_otherhost = 0;
+	nf_bridge_save_otherhost(skb);
 
 	nf_bridge->mask |= BRNF_NF_BRIDGE_PREROUTING;
 	nf_bridge->physindev = skb->dev;
@@ -687,11 +715,8 @@ static int br_nf_forward_finish(struct sk_buff *skb)
 	struct net_device *in;
 
 	if (!IS_ARP(skb) && !IS_VLAN_ARP(skb)) {
+		nf_bridge_restore_otherhost(skb);
 		in = nf_bridge->physindev;
-		if (nf_bridge->mask & BRNF_PKT_TYPE) {
-			skb->pkt_type = PACKET_OTHERHOST;
-			nf_bridge->mask ^= BRNF_PKT_TYPE;
-		}
 		nf_bridge_update_protocol(skb);
 	} else {
 		in = *((struct net_device **)(skb->cb));
@@ -741,10 +766,8 @@ static unsigned int br_nf_forward_ip(const struct nf_hook_ops *ops,
 	nf_bridge_pull_encap_header(skb);
 
 	nf_bridge = skb->nf_bridge;
-	if (skb->pkt_type == PACKET_OTHERHOST) {
-		skb->pkt_type = PACKET_HOST;
-		nf_bridge->mask |= BRNF_PKT_TYPE;
-	}
+
+	nf_bridge_save_otherhost(skb);
 
 	if (pf == NFPROTO_IPV4 && br_parse_ip_options(skb))
 		return NF_DROP;
@@ -911,10 +934,8 @@ static unsigned int br_nf_post_routing(const struct nf_hook_ops *ops,
 
 	/* We assume any code from br_dev_queue_push_xmit onwards doesn't care
 	 * about the value of skb->pkt_type. */
-	if (skb->pkt_type == PACKET_OTHERHOST) {
-		skb->pkt_type = PACKET_HOST;
-		nf_bridge->mask |= BRNF_PKT_TYPE;
-	}
+
+	nf_bridge_save_otherhost(skb);
 
 	nf_bridge_pull_encap_header(skb);
 	if (pf == NFPROTO_IPV4)
@@ -1104,6 +1125,8 @@ static int __init br_netfilter_init(void)
 {
 	int ret;
 
+	BUILD_BUG_ON(sizeof(struct nf_bridge_skb_cb) > FIELD_SIZEOF(struct sk_buff, cb));
+
 	ret = nf_register_hooks(br_nf_ops, ARRAY_SIZE(br_nf_ops));
 	if (ret < 0)
 		return ret;
-- 
2.0.5


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 nf-next 4/6] netfilter: bridge: don't use nf_bridge_info to store proto value
  2015-03-12 17:05 [PATCH v2 nf-next 0/6] more bridge netfilter refactoring Florian Westphal
                   ` (2 preceding siblings ...)
  2015-03-12 17:05 ` [PATCH v2 nf-next 3/6] netfilter: bridge: use skb->cb to track otherhost mangling Florian Westphal
@ 2015-03-12 17:05 ` Florian Westphal
  2015-03-12 17:05 ` [PATCH v2 nf-next 5/6] netfilter: bridge: replace remaining flags with state enum Florian Westphal
  2015-03-12 17:05 ` [PATCH nf-next 6/6] netfilter: bridge: don't use nf_bridge storage during neigh resolution Florian Westphal
  5 siblings, 0 replies; 18+ messages in thread
From: Florian Westphal @ 2015-03-12 17:05 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, Florian Westphal

There is not need to use any external storage for this -- we only need
to know about original skb->protocol value inside bridge netfilter
itself.  skb->cb can be used to store this information.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/linux/netfilter_bridge.h |  2 --
 net/bridge/br_netfilter.c        | 28 +++++++++++++++++++++++-----
 2 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/include/linux/netfilter_bridge.h b/include/linux/netfilter_bridge.h
index 05437f8..66245b5 100644
--- a/include/linux/netfilter_bridge.h
+++ b/include/linux/netfilter_bridge.h
@@ -19,8 +19,6 @@ enum nf_br_hook_priorities {
 
 #define BRNF_BRIDGED_DNAT		0x02
 #define BRNF_NF_BRIDGE_PREROUTING	0x08
-#define BRNF_8021Q			0x10
-#define BRNF_PPPoE			0x20
 
 int br_handle_frame_finish(struct sk_buff *skb);
 
diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index 8649ef5..215ec3f 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -47,6 +47,12 @@
 #include <linux/sysctl.h>
 #endif
 
+enum brnf_orig_skb_proto {
+	BRNF_PROTO_UNCHANGED,
+	BRNF_PROTO_8021Q,
+	BRNF_PROTO_PPPOE,
+};
+
 struct nf_bridge_skb_cb {
 	union {
 		struct inet_skb_parm i;
@@ -54,6 +60,7 @@ struct nf_bridge_skb_cb {
 	} u;
 
 	u8 packet_otherhost:1;
+	enum brnf_orig_skb_proto origproto:2;
 };
 
 #define BRNF_CB(skb) ((struct nf_bridge_skb_cb *)(skb)->cb)
@@ -264,10 +271,16 @@ drop:
 
 static void nf_bridge_update_protocol(struct sk_buff *skb)
 {
-	if (skb->nf_bridge->mask & BRNF_8021Q)
+	struct nf_bridge_skb_cb *cb = BRNF_CB(skb);
+
+	switch (cb->origproto) {
+	case BRNF_PROTO_8021Q:
 		skb->protocol = htons(ETH_P_8021Q);
-	else if (skb->nf_bridge->mask & BRNF_PPPoE)
+		break;
+	case BRNF_PROTO_PPPOE:
 		skb->protocol = htons(ETH_P_PPP_SES);
+		break;
+	}
 }
 
 static void nf_bridge_restore_otherhost(struct sk_buff *skb)
@@ -522,10 +535,13 @@ static struct net_device *setup_pre_routing(struct sk_buff *skb)
 	nf_bridge->mask |= BRNF_NF_BRIDGE_PREROUTING;
 	nf_bridge->physindev = skb->dev;
 	skb->dev = brnf_get_logical_dev(skb, skb->dev);
+
 	if (skb->protocol == htons(ETH_P_8021Q))
-		nf_bridge->mask |= BRNF_8021Q;
+		cb->origproto = BRNF_PROTO_8021Q;
 	else if (skb->protocol == htons(ETH_P_PPP_SES))
-		nf_bridge->mask |= BRNF_PPPoE;
+		cb->origproto = BRNF_PROTO_PPPOE;
+	else
+		cb->origproto = BRNF_PROTO_UNCHANGED;
 
 	/* Must drop socket now because of tproxy. */
 	skb_orphan(skb);
@@ -848,7 +864,9 @@ static int br_nf_push_frag_xmit(struct sk_buff *skb)
 
 static unsigned int nf_bridge_mtu_reduction(const struct sk_buff *skb)
 {
-	if (skb->nf_bridge->mask & BRNF_PPPoE)
+	struct nf_bridge_skb_cb *cb = BRNF_CB(skb);
+
+	if (cb->origproto == BRNF_PROTO_PPPOE)
 		return PPPOE_SES_HLEN;
 	return 0;
 }
-- 
2.0.5

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 nf-next 5/6] netfilter: bridge: replace remaining flags with state enum
  2015-03-12 17:05 [PATCH v2 nf-next 0/6] more bridge netfilter refactoring Florian Westphal
                   ` (3 preceding siblings ...)
  2015-03-12 17:05 ` [PATCH v2 nf-next 4/6] netfilter: bridge: don't use nf_bridge_info to store proto value Florian Westphal
@ 2015-03-12 17:05 ` Florian Westphal
  2015-03-12 17:05 ` [PATCH nf-next 6/6] netfilter: bridge: don't use nf_bridge storage during neigh resolution Florian Westphal
  5 siblings, 0 replies; 18+ messages in thread
From: Florian Westphal @ 2015-03-12 17:05 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, Florian Westphal

The two remaining flags are mutually exclusive, use a state enum for
telling in which stage of bridge netfilter processing we are.

BRNF_STATE_PREROUTING: used so that netfilter PRE_ROUTING is not
traversed twice and to put such skbs into different defragmentation
queues.

Could also be indicated via skb->cb, but this would make it necessary to
expose such flag in ipv4 IPCB so that physdev match can use this reliably,
and we want less IP stack entanglements).

BRNF_STATE_BRIDGED_DNAT means that we don't know the new destination
mac address after a IP DNAT took place and that we'll have to push the
skb through negh resolution.

Cannot be stored in skb->cb either since such skb leaves bridge ownership and
can e.g. be enqueued in qdisc.

The SEEN state is only so we know when skb is neither in nefarious
DNAT reinject path nor travelling through the PRE_ROUTING netfilter hooks.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/linux/netfilter_bridge.h          |  4 ----
 include/linux/skbuff.h                    | 15 ++++++++++++++-
 net/bridge/br_netfilter.c                 | 19 +++++++++----------
 net/ipv4/netfilter/nf_defrag_ipv4.c       |  2 +-
 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c |  2 +-
 5 files changed, 25 insertions(+), 17 deletions(-)

diff --git a/include/linux/netfilter_bridge.h b/include/linux/netfilter_bridge.h
index 66245b5..2070623 100644
--- a/include/linux/netfilter_bridge.h
+++ b/include/linux/netfilter_bridge.h
@@ -16,10 +16,6 @@ enum nf_br_hook_priorities {
 };
 
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-
-#define BRNF_BRIDGED_DNAT		0x02
-#define BRNF_NF_BRIDGE_PREROUTING	0x08
-
 int br_handle_frame_finish(struct sk_buff *skb);
 
 static inline void br_drop_fake_rtable(struct sk_buff *skb)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index bba1330..364c4a8 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -163,10 +163,23 @@ struct nf_conntrack {
 };
 #endif
 
+enum brnf_state {
+	BRNF_STATE_SEEN,
+
+	/* IPV4/IPV6 PRE_ROUTING called from bridge netfilter */
+	BRNF_STATE_PREROUTING,
+
+	/* skb that is 'transmitted' via bridge must to be injected
+	 * back into br forwarding for delivery to the correct bridge output
+	 * port due to DNAT to a destination on the same (bridged) network.
+	 */
+	BRNF_STATE_BRIDGED_DNAT,
+};
+
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
 struct nf_bridge_info {
 	atomic_t		use;
-	unsigned int		mask;
+	enum brnf_state	brnf_state;
 	struct net_device	*physindev;
 	struct net_device	*physoutdev;
 	unsigned long		data[32 / sizeof(unsigned long)];
diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index 215ec3f..342081e 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -315,8 +315,7 @@ static int br_nf_pre_routing_finish_ipv6(struct sk_buff *skb)
 	struct rtable *rt;
 
 	nf_bridge_restore_otherhost(skb);
-
-	nf_bridge->mask ^= BRNF_NF_BRIDGE_PREROUTING;
+	nf_bridge->brnf_state = BRNF_STATE_SEEN;
 
 	rt = bridge_parent_rtable(nf_bridge->physindev);
 	if (!rt) {
@@ -367,8 +366,7 @@ static int br_nf_pre_routing_finish_bridge(struct sk_buff *skb)
 							 skb->nf_bridge->data,
 							 ETH_HLEN-ETH_ALEN);
 			/* tell br_dev_xmit to continue with forwarding */
-			nf_bridge->mask |= BRNF_BRIDGED_DNAT;
-			/* FIXME Need to refragment */
+			nf_bridge->brnf_state = BRNF_STATE_BRIDGED_DNAT;
 			ret = neigh->output(neigh, skb);
 		}
 		neigh_release(neigh);
@@ -446,7 +444,8 @@ static int br_nf_pre_routing_finish(struct sk_buff *skb)
 	BR_INPUT_SKB_CB(skb)->frag_max_size = frag_max_size;
 
 	nf_bridge_restore_otherhost(skb);
-	nf_bridge->mask ^= BRNF_NF_BRIDGE_PREROUTING;
+	nf_bridge->brnf_state = BRNF_STATE_SEEN;
+
 	if (dnat_took_place(skb)) {
 		if ((err = ip_route_input(skb, iph->daddr, iph->saddr, iph->tos, dev))) {
 			struct in_device *in_dev = __in_dev_get_rcu(dev);
@@ -532,7 +531,7 @@ static struct net_device *setup_pre_routing(struct sk_buff *skb)
 	cb->packet_otherhost = 0;
 	nf_bridge_save_otherhost(skb);
 
-	nf_bridge->mask |= BRNF_NF_BRIDGE_PREROUTING;
+	nf_bridge->brnf_state = BRNF_STATE_PREROUTING;
 	nf_bridge->physindev = skb->dev;
 	skb->dev = brnf_get_logical_dev(skb, skb->dev);
 
@@ -977,9 +976,8 @@ static unsigned int ip_sabotage_in(const struct nf_hook_ops *ops,
 				   int (*okfn)(struct sk_buff *))
 {
 	if (skb->nf_bridge &&
-	    !(skb->nf_bridge->mask & BRNF_NF_BRIDGE_PREROUTING)) {
+	    skb->nf_bridge->brnf_state != BRNF_STATE_PREROUTING)
 		return NF_STOP;
-	}
 
 	return NF_ACCEPT;
 }
@@ -998,7 +996,7 @@ static void br_nf_pre_routing_finish_bridge_slow(struct sk_buff *skb)
 	struct nf_bridge_info *nf_bridge = skb->nf_bridge;
 
 	skb_pull(skb, ETH_HLEN);
-	nf_bridge->mask &= ~BRNF_BRIDGED_DNAT;
+	nf_bridge->brnf_state = BRNF_STATE_SEEN;
 
 	skb_copy_to_linear_data_offset(skb, -(ETH_HLEN-ETH_ALEN),
 				       skb->nf_bridge->data, ETH_HLEN-ETH_ALEN);
@@ -1008,7 +1006,8 @@ static void br_nf_pre_routing_finish_bridge_slow(struct sk_buff *skb)
 
 static int br_nf_dev_xmit(struct sk_buff *skb)
 {
-	if (skb->nf_bridge && (skb->nf_bridge->mask & BRNF_BRIDGED_DNAT)) {
+	if (skb->nf_bridge &&
+	    skb->nf_bridge->brnf_state == BRNF_STATE_BRIDGED_DNAT) {
 		br_nf_pre_routing_finish_bridge_slow(skb);
 		return 1;
 	}
diff --git a/net/ipv4/netfilter/nf_defrag_ipv4.c b/net/ipv4/netfilter/nf_defrag_ipv4.c
index 7e5ca6f..8280816 100644
--- a/net/ipv4/netfilter/nf_defrag_ipv4.c
+++ b/net/ipv4/netfilter/nf_defrag_ipv4.c
@@ -52,7 +52,7 @@ static enum ip_defrag_users nf_ct_defrag_user(unsigned int hooknum,
 
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
 	if (skb->nf_bridge &&
-	    skb->nf_bridge->mask & BRNF_NF_BRIDGE_PREROUTING)
+	    skb->nf_bridge->brnf_state == BRNF_STATE_PREROUTING)
 		return IP_DEFRAG_CONNTRACK_BRIDGE_IN + zone;
 #endif
 	if (hooknum == NF_INET_PRE_ROUTING)
diff --git a/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c b/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
index e70382e..3d89c92 100644
--- a/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
+++ b/net/ipv6/netfilter/nf_defrag_ipv6_hooks.c
@@ -42,7 +42,7 @@ static enum ip6_defrag_users nf_ct6_defrag_user(unsigned int hooknum,
 
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
 	if (skb->nf_bridge &&
-	    skb->nf_bridge->mask & BRNF_NF_BRIDGE_PREROUTING)
+	    skb->nf_bridge->brnf_state == BRNF_STATE_PREROUTING)
 		return IP6_DEFRAG_CONNTRACK_BRIDGE_IN + zone;
 #endif
 	if (hooknum == NF_INET_PRE_ROUTING)
-- 
2.0.5


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH nf-next 6/6] netfilter: bridge: don't use nf_bridge storage during neigh resolution
  2015-03-12 17:05 [PATCH v2 nf-next 0/6] more bridge netfilter refactoring Florian Westphal
                   ` (4 preceding siblings ...)
  2015-03-12 17:05 ` [PATCH v2 nf-next 5/6] netfilter: bridge: replace remaining flags with state enum Florian Westphal
@ 2015-03-12 17:05 ` Florian Westphal
  5 siblings, 0 replies; 18+ messages in thread
From: Florian Westphal @ 2015-03-12 17:05 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, Florian Westphal

The last user of the nf_bridge_info data area is the rare case where
we have to save original source MAC address when resolving the new
destination MAC address after ip dnat rewrite.  The neigh resolution
(re)builds new MAC header, using the bridge device source mac.

We can use skb->cb for this: We just need to make sure that we're
adding enough padding so qdiscs won't interfere with it.

This also uses the interface index of the physical dev instead of a
pointer.

Since we do not hold any reference to the physindev, so we should
at least make sure that we're not passing a stale address upon
reinject.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/linux/skbuff.h    |  1 -
 net/bridge/br_device.c    |  2 +-
 net/bridge/br_netfilter.c | 39 +++++++++++++++++++++++++++++++++------
 net/bridge/br_private.h   |  2 +-
 4 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 364c4a8..4049239 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -182,7 +182,6 @@ struct nf_bridge_info {
 	enum brnf_state	brnf_state;
 	struct net_device	*physindev;
 	struct net_device	*physoutdev;
-	unsigned long		data[32 / sizeof(unsigned long)];
 };
 #endif
 
diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index 4ff77a1..c3ec7fe 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -41,7 +41,7 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	rcu_read_lock();
 	nf_ops = rcu_dereference(nf_br_ops);
-	if (nf_ops && nf_ops->br_dev_xmit_hook(skb)) {
+	if (nf_ops && nf_ops->br_dev_xmit_hook(skb, dev)) {
 		rcu_read_unlock();
 		return NETDEV_TX_OK;
 	}
diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index 342081e..0ba697f 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -36,6 +36,7 @@
 #include <net/ipv6.h>
 #include <net/route.h>
 #include <net/netfilter/br_netfilter.h>
+#include <net/sch_generic.h>
 
 #if IS_ENABLED(CONFIG_NF_CONNTRACK)
 #include <net/netfilter/nf_conntrack.h>
@@ -65,6 +66,15 @@ struct nf_bridge_skb_cb {
 
 #define BRNF_CB(skb) ((struct nf_bridge_skb_cb *)(skb)->cb)
 
+/* bridge netfilter dnat mac resolution via neigh cache */
+struct nf_bridge_dnat_skb_cb {
+	struct qdisc_skb_cb qdisc;
+	int physindev_index;
+	char dnat_orig_mac[ETH_HLEN - ETH_ALEN];
+};
+
+#define BRNF_DNAT_SKB_CB(__skb)  ((struct nf_bridge_dnat_skb_cb *)(__skb)->cb)
+
 #ifdef CONFIG_SYSCTL
 static struct ctl_table_header *brnf_sysctl_header;
 static int brnf_call_iptables __read_mostly = 1;
@@ -357,14 +367,22 @@ static int br_nf_pre_routing_finish_bridge(struct sk_buff *skb)
 			skb->dev = nf_bridge->physindev;
 			ret = br_handle_frame_finish(skb);
 		} else {
+			struct nf_bridge_dnat_skb_cb *cb;
+
+			cb = BRNF_DNAT_SKB_CB(skb);
+
+			memset(cb, 0, sizeof(*cb));
+
 			/* the neighbour function below overwrites the complete
 			 * MAC header, so we save the Ethernet source address and
 			 * protocol number.
 			 */
 			skb_copy_from_linear_data_offset(skb,
 							 -(ETH_HLEN-ETH_ALEN),
-							 skb->nf_bridge->data,
+							 cb->dnat_orig_mac,
 							 ETH_HLEN-ETH_ALEN);
+
+			cb->physindev_index = nf_bridge->physindev->ifindex;
 			/* tell br_dev_xmit to continue with forwarding */
 			nf_bridge->brnf_state = BRNF_STATE_BRIDGED_DNAT;
 			ret = neigh->output(neigh, skb);
@@ -991,24 +1009,32 @@ static unsigned int ip_sabotage_in(const struct nf_hook_ops *ops,
  * This restores the original MAC saddr of the bridged packet
  * before invoking bridge forward logic to transmit the packet.
  */
-static void br_nf_pre_routing_finish_bridge_slow(struct sk_buff *skb)
+static void br_nf_pre_routing_finish_bridge_slow(struct sk_buff *skb,
+						 struct net_device *dev)
 {
 	struct nf_bridge_info *nf_bridge = skb->nf_bridge;
+	struct nf_bridge_dnat_skb_cb *cb = BRNF_DNAT_SKB_CB(skb);
 
 	skb_pull(skb, ETH_HLEN);
 	nf_bridge->brnf_state = BRNF_STATE_SEEN;
 
 	skb_copy_to_linear_data_offset(skb, -(ETH_HLEN-ETH_ALEN),
-				       skb->nf_bridge->data, ETH_HLEN-ETH_ALEN);
-	skb->dev = nf_bridge->physindev;
+				       cb->dnat_orig_mac, ETH_HLEN - ETH_ALEN);
+
+	skb->dev = dev_get_by_index_rcu(dev_net(dev), cb->physindev_index);
+	if (!skb->dev) {
+		kfree_skb(skb);
+		return;
+	}
+
 	br_handle_frame_finish(skb);
 }
 
-static int br_nf_dev_xmit(struct sk_buff *skb)
+static int br_nf_dev_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	if (skb->nf_bridge &&
 	    skb->nf_bridge->brnf_state == BRNF_STATE_BRIDGED_DNAT) {
-		br_nf_pre_routing_finish_bridge_slow(skb);
+		br_nf_pre_routing_finish_bridge_slow(skb, dev);
 		return 1;
 	}
 	return 0;
@@ -1143,6 +1169,7 @@ static int __init br_netfilter_init(void)
 	int ret;
 
 	BUILD_BUG_ON(sizeof(struct nf_bridge_skb_cb) > FIELD_SIZEOF(struct sk_buff, cb));
+	BUILD_BUG_ON(sizeof(struct nf_bridge_dnat_skb_cb) > FIELD_SIZEOF(struct sk_buff, cb));
 
 	ret = nf_register_hooks(br_nf_ops, ARRAY_SIZE(br_nf_ops));
 	if (ret < 0)
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index b46fa0c..6664141 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -764,7 +764,7 @@ static inline int br_vlan_enabled(struct net_bridge *br)
 #endif
 
 struct nf_br_ops {
-	int (*br_dev_xmit_hook)(struct sk_buff *skb);
+	int (*br_dev_xmit_hook)(struct sk_buff *skb, struct net_device *dev);
 };
 extern const struct nf_br_ops __rcu *nf_br_ops;
 
-- 
2.0.5


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 nf-next 3/6] netfilter: bridge: use skb->cb to track otherhost mangling
  2015-03-12 17:05 ` [PATCH v2 nf-next 3/6] netfilter: bridge: use skb->cb to track otherhost mangling Florian Westphal
@ 2015-03-12 18:02   ` Oliver Hartkopp
  2015-03-12 18:31     ` Florian Westphal
  0 siblings, 1 reply; 18+ messages in thread
From: Oliver Hartkopp @ 2015-03-12 18:02 UTC (permalink / raw)
  To: Florian Westphal, netfilter-devel; +Cc: netdev, Eyal Birger



On 03/12/2015 06:05 PM, Florian Westphal wrote:
> nf_bridge_info->mask is used for several things, for example to remember
> if skb->pkt_type was set to OTHER_HOST.
> 
> For a bridge, OTHER_HOST is expected case. For ip forward its a
> non-starter though -- routing expects PACKET_HOST.
> 
> Bridge netfilter thus changes OTHER_HOST to PACKET_HOST
> before hook invocation and then un-does it after hook traversal.
> 
> For this, cb[] can be used since the skb will never be used outside
> (fake inet) bridge forwarding while in 'fake PACKET_HOST' state.
> 
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---
>  include/linux/netfilter_bridge.h |  1 -
>  net/bridge/br_netfilter.c        | 71 ++++++++++++++++++++++++++--------------


> @@ -1104,6 +1125,8 @@ static int __init br_netfilter_init(void)
>  {
>  	int ret;
>  
> +	BUILD_BUG_ON(sizeof(struct nf_bridge_skb_cb) > FIELD_SIZEOF(struct sk_buff, cb));
> +

Please use sock_skb_cb_check_size(size) for cb size checking which is the
'new' check for cb sizes for netdev.

See:

http://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git/commit?id=b4772ef879a8f7d8c56118c2ae5a296fcf6f81d2

>  	ret = nf_register_hooks(br_nf_ops, ARRAY_SIZE(br_nf_ops));
>  	if (ret < 0)
>  		return ret;
> 

Regards,
Oliver

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 nf-next 3/6] netfilter: bridge: use skb->cb to track otherhost mangling
  2015-03-12 18:02   ` Oliver Hartkopp
@ 2015-03-12 18:31     ` Florian Westphal
  2015-03-12 18:35       ` Florian Westphal
  0 siblings, 1 reply; 18+ messages in thread
From: Florian Westphal @ 2015-03-12 18:31 UTC (permalink / raw)
  To: Oliver Hartkopp; +Cc: Florian Westphal, netfilter-devel, netdev, Eyal Birger

Oliver Hartkopp <socketcan@hartkopp.net> wrote:
> On 03/12/2015 06:05 PM, Florian Westphal wrote:
> > nf_bridge_info->mask is used for several things, for example to remember
> > if skb->pkt_type was set to OTHER_HOST.
> > 
> > For a bridge, OTHER_HOST is expected case. For ip forward its a
> > non-starter though -- routing expects PACKET_HOST.
> > 
> > Bridge netfilter thus changes OTHER_HOST to PACKET_HOST
> > before hook invocation and then un-does it after hook traversal.
> > 
> > For this, cb[] can be used since the skb will never be used outside
> > (fake inet) bridge forwarding while in 'fake PACKET_HOST' state.
> > 
> > Signed-off-by: Florian Westphal <fw@strlen.de>
> > ---
> >  include/linux/netfilter_bridge.h |  1 -
> >  net/bridge/br_netfilter.c        | 71 ++++++++++++++++++++++++++--------------
> 
> 
> > @@ -1104,6 +1125,8 @@ static int __init br_netfilter_init(void)
> >  {
> >  	int ret;
> >  
> > +	BUILD_BUG_ON(sizeof(struct nf_bridge_skb_cb) > FIELD_SIZEOF(struct sk_buff, cb));
> > +
> 
> Please use sock_skb_cb_check_size(size) for cb size checking which is the
> 'new' check for cb sizes for netdev.

I can do this for conistency, but its technically not needed here.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 nf-next 3/6] netfilter: bridge: use skb->cb to track otherhost mangling
  2015-03-12 18:31     ` Florian Westphal
@ 2015-03-12 18:35       ` Florian Westphal
  2015-03-12 18:40         ` Oliver Hartkopp
  0 siblings, 1 reply; 18+ messages in thread
From: Florian Westphal @ 2015-03-12 18:35 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Oliver Hartkopp, netfilter-devel, netdev, Eyal Birger

Florian Westphal <fw@strlen.de> wrote:
> > > +	BUILD_BUG_ON(sizeof(struct nf_bridge_skb_cb) > FIELD_SIZEOF(struct sk_buff, cb));
> > > +
> > 
> > Please use sock_skb_cb_check_size(size) for cb size checking which is the
> > 'new' check for cb sizes for netdev.
> 
> I can do this for conistency, but its technically not needed here.

I take that back, it would actually be inconsistent to use
sock_skb_cb_check_size since nf_bridge skbs will not be associated with
sockets.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 nf-next 3/6] netfilter: bridge: use skb->cb to track otherhost mangling
  2015-03-12 18:35       ` Florian Westphal
@ 2015-03-12 18:40         ` Oliver Hartkopp
  0 siblings, 0 replies; 18+ messages in thread
From: Oliver Hartkopp @ 2015-03-12 18:40 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel, netdev, Eyal Birger

On 03/12/2015 07:35 PM, Florian Westphal wrote:
> Florian Westphal <fw@strlen.de> wrote:
>>>> +	BUILD_BUG_ON(sizeof(struct nf_bridge_skb_cb) > FIELD_SIZEOF(struct sk_buff, cb));
>>>> +
>>>
>>> Please use sock_skb_cb_check_size(size) for cb size checking which is the
>>> 'new' check for cb sizes for netdev.
>>
>> I can do this for conistency, but its technically not needed here.
> 
> I take that back, it would actually be inconsistent to use
> sock_skb_cb_check_size since nf_bridge skbs will not be associated with
> sockets.
> 

Ok. I just wanted to make sure that the use of the cb[]-space doesn't clash
with the new space that's acquired by the dropcount stuff.

Sorry for the noise :-)

Best regards,
Oliver

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 nf-next 1/6] net: untangle ip_fragment and bridge netfilter
  2015-03-12 17:05 ` [PATCH v2 nf-next 1/6] net: untangle ip_fragment and bridge netfilter Florian Westphal
@ 2015-03-13  0:38   ` Andy Zhou
  2015-03-16 22:55   ` Pablo Neira Ayuso
  1 sibling, 0 replies; 18+ messages in thread
From: Andy Zhou @ 2015-03-13  0:38 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel, netdev@vger.kernel.org

On Thu, Mar 12, 2015 at 10:05 AM, Florian Westphal <fw@strlen.de> wrote:
> Long time ago it was possible for the netfilter ip_conntrack
> core to call ip_fragment in POST_ROUTING hook.
>
> This is no longer the case, so the only case where bridge netfilter
> ends up calling ip_fragment is the direct call site in br_netfilter.c.
>
> Add ll and mtu arguments for ip_fragment and then get rid of the bridge
> netfilter specific helpers from ip_fragment.
>
> Cc: Andy Zhou <azhou@nicira.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---
>  include/linux/netfilter_bridge.h | 17 -----------------
>  include/net/ip.h                 |  4 ++--
>  net/bridge/br_netfilter.c        | 23 ++++++++++++++++++++---
>  net/ipv4/ip_output.c             | 37 +++++++++++++++++++++----------------
>  4 files changed, 43 insertions(+), 38 deletions(-)
>
> diff --git a/include/linux/netfilter_bridge.h b/include/linux/netfilter_bridge.h
> index ed0d3bf..fbbd5de 100644
> --- a/include/linux/netfilter_bridge.h
> +++ b/include/linux/netfilter_bridge.h
I like this patch a lot.  The nf_brdige was confusing to me when I
looked into this area. I am happen to it is going away.

With this patch, it seems we don't need the 'dev' variable anymore,
all we need is 'net' and we can move it into the 'if' block for
sending icmp.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 nf-next 1/6] net: untangle ip_fragment and bridge netfilter
  2015-03-12 17:05 ` [PATCH v2 nf-next 1/6] net: untangle ip_fragment and bridge netfilter Florian Westphal
  2015-03-13  0:38   ` Andy Zhou
@ 2015-03-16 22:55   ` Pablo Neira Ayuso
  2015-03-17  4:42     ` David Miller
  1 sibling, 1 reply; 18+ messages in thread
From: Pablo Neira Ayuso @ 2015-03-16 22:55 UTC (permalink / raw)
  To: David Miller; +Cc: Florian Westphal, netfilter-devel, netdev, Andy Zhou

Hi David,

This patch changes the interface ip_fragment(). If this sounds
reasonable to you, please ack it and I'll route it to you in the next
nf-next batch.

Thanks.

On Thu, Mar 12, 2015 at 06:05:20PM +0100, Florian Westphal wrote:
> Long time ago it was possible for the netfilter ip_conntrack
> core to call ip_fragment in POST_ROUTING hook.
> 
> This is no longer the case, so the only case where bridge netfilter
> ends up calling ip_fragment is the direct call site in br_netfilter.c.
> 
> Add ll and mtu arguments for ip_fragment and then get rid of the bridge
> netfilter specific helpers from ip_fragment.
> 
> Cc: Andy Zhou <azhou@nicira.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---
>  include/linux/netfilter_bridge.h | 17 -----------------
>  include/net/ip.h                 |  4 ++--
>  net/bridge/br_netfilter.c        | 23 ++++++++++++++++++++---
>  net/ipv4/ip_output.c             | 37 +++++++++++++++++++++----------------
>  4 files changed, 43 insertions(+), 38 deletions(-)
> 
> diff --git a/include/linux/netfilter_bridge.h b/include/linux/netfilter_bridge.h
> index ed0d3bf..fbbd5de 100644
> --- a/include/linux/netfilter_bridge.h
> +++ b/include/linux/netfilter_bridge.h
> @@ -35,24 +35,8 @@ static inline unsigned int nf_bridge_encap_header_len(const struct sk_buff *skb)
>  	}
>  }
>  
> -static inline unsigned int nf_bridge_mtu_reduction(const struct sk_buff *skb)
> -{
> -	if (unlikely(skb->nf_bridge->mask & BRNF_PPPoE))
> -		return PPPOE_SES_HLEN;
> -	return 0;
> -}
> -
>  int br_handle_frame_finish(struct sk_buff *skb);
>  
> -/* This is called by the IP fragmenting code and it ensures there is
> - * enough room for the encapsulating header (if there is one). */
> -static inline unsigned int nf_bridge_pad(const struct sk_buff *skb)
> -{
> -	if (skb->nf_bridge)
> -		return nf_bridge_encap_header_len(skb);
> -	return 0;
> -}
> -
>  static inline void br_drop_fake_rtable(struct sk_buff *skb)
>  {
>  	struct dst_entry *dst = skb_dst(skb);
> @@ -62,7 +46,6 @@ static inline void br_drop_fake_rtable(struct sk_buff *skb)
>  }
>  
>  #else
> -#define nf_bridge_pad(skb)			(0)
>  #define br_drop_fake_rtable(skb)	        do { } while (0)
>  #endif /* CONFIG_BRIDGE_NETFILTER */
>  
> diff --git a/include/net/ip.h b/include/net/ip.h
> index 025c61c..2905a4b 100644
> --- a/include/net/ip.h
> +++ b/include/net/ip.h
> @@ -108,8 +108,8 @@ int ip_local_deliver(struct sk_buff *skb);
>  int ip_mr_input(struct sk_buff *skb);
>  int ip_output(struct sock *sk, struct sk_buff *skb);
>  int ip_mc_output(struct sock *sk, struct sk_buff *skb);
> -int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *));
> -int ip_do_nat(struct sk_buff *skb);
> +int ip_fragment(struct sk_buff *skb, unsigned int mtu,
> +		unsigned int ll_rs, int (*output)(struct sk_buff *));
>  void ip_send_check(struct iphdr *ip);
>  int __ip_local_out(struct sk_buff *skb);
>  int ip_local_out_sk(struct sock *sk, struct sk_buff *skb);
> diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
> index bd2d24d..550ee19 100644
> --- a/net/bridge/br_netfilter.c
> +++ b/net/bridge/br_netfilter.c
> @@ -812,26 +812,43 @@ static int br_nf_push_frag_xmit(struct sk_buff *skb)
>  	return br_dev_queue_push_xmit(skb);
>  }
>  
> +static unsigned int nf_bridge_mtu_reduction(const struct sk_buff *skb)
> +{
> +	if (skb->nf_bridge->mask & BRNF_PPPoE)
> +		return PPPOE_SES_HLEN;
> +	return 0;
> +}
> +
>  static int br_nf_dev_queue_xmit(struct sk_buff *skb)
>  {
>  	int ret;
>  	int frag_max_size;
> -	unsigned int mtu_reserved;
> +	unsigned int mtu_reserved, mtu;
>  
>  	if (skb_is_gso(skb) || skb->protocol != htons(ETH_P_IP))
>  		return br_dev_queue_push_xmit(skb);
>  
>  	mtu_reserved = nf_bridge_mtu_reduction(skb);
> +	mtu = min(skb->dev->mtu, IP_MAX_MTU);
>  	/* This is wrong! We should preserve the original fragment
>  	 * boundaries by preserving frag_list rather than refragmenting.
>  	 */
> -	if (skb->len + mtu_reserved > skb->dev->mtu) {
> +	if (skb->len + mtu_reserved > mtu) {
> +		unsigned int llrs;
> +
>  		frag_max_size = BR_INPUT_SKB_CB(skb)->frag_max_size;
>  		if (br_parse_ip_options(skb))
>  			/* Drop invalid packet */
>  			return NF_DROP;
>  		IPCB(skb)->frag_max_size = frag_max_size;
> -		ret = ip_fragment(skb, br_nf_push_frag_xmit);
> +
> +		/* for bridged IP traffic encapsulated inside f.e. a vlan header,
> +		 * we need to make room for the encapsulating header
> +		 */
> +		llrs = nf_bridge_encap_header_len(skb);
> +
> +		mtu -= mtu_reserved;
> +		ret = ip_fragment(skb, mtu, llrs, br_nf_push_frag_xmit);
>  	} else
>  		ret = br_dev_queue_push_xmit(skb);
>  
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index a7aea20..fe5ec3f 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -216,6 +216,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
>  	netdev_features_t features;
>  	struct sk_buff *segs;
>  	int ret = 0;
> +	unsigned int mtu;
>  
>  	/* common case: locally created skb or seglen is <= mtu */
>  	if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
> @@ -236,6 +237,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
>  		return -ENOMEM;
>  	}
>  
> +	mtu = ip_skb_dst_mtu(skb);
>  	consume_skb(skb);
>  
>  	do {
> @@ -243,7 +245,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
>  		int err;
>  
>  		segs->next = NULL;
> -		err = ip_fragment(segs, ip_finish_output2);
> +		err = ip_fragment(segs, mtu, 0, ip_finish_output2);
>  
>  		if (err && ret == 0)
>  			ret = err;
> @@ -255,6 +257,8 @@ static int ip_finish_output_gso(struct sk_buff *skb)
>  
>  static int ip_finish_output(struct sk_buff *skb)
>  {
> +	unsigned int mtu;
> +
>  #if defined(CONFIG_NETFILTER) && defined(CONFIG_XFRM)
>  	/* Policy lookup after SNAT yielded a new policy */
>  	if (skb_dst(skb)->xfrm != NULL) {
> @@ -265,8 +269,9 @@ static int ip_finish_output(struct sk_buff *skb)
>  	if (skb_is_gso(skb))
>  		return ip_finish_output_gso(skb);
>  
> -	if (skb->len > ip_skb_dst_mtu(skb))
> -		return ip_fragment(skb, ip_finish_output2);
> +	mtu = ip_skb_dst_mtu(skb);
> +	if (skb->len > mtu)
> +		return ip_fragment(skb, mtu, 0, ip_finish_output2);
>  
>  	return ip_finish_output2(skb);
>  }
> @@ -472,20 +477,28 @@ static void ip_copy_metadata(struct sk_buff *to, struct sk_buff *from)
>  	skb_copy_secmark(to, from);
>  }
>  
> -/*
> +/**
> + *	ip_fragment - fragment IP datagram or send ICMP error
> + *
> + *	@skb: the skb to fragment
> + *	@mtu: mtu to use for fragmentation
> + *	@ll_rs: extra linklayer space required
> + *	@output: transmit function used to send fragments
> + *
>   *	This IP datagram is too large to be sent in one piece.  Break it up into
>   *	smaller pieces (each of size equal to IP header plus
>   *	a block of the data of the original IP data part) that will yet fit in a
>   *	single device frame, and queue such a frame for sending.
>   */
> -
> -int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
> +int ip_fragment(struct sk_buff *skb,
> +		unsigned int mtu, unsigned int ll_rs,
> +		int (*output)(struct sk_buff *))
>  {
>  	struct iphdr *iph;
>  	int ptr;
>  	struct net_device *dev;
>  	struct sk_buff *skb2;
> -	unsigned int mtu, hlen, left, len, ll_rs;
> +	unsigned int hlen, left, len;
>  	int offset;
>  	__be16 not_last_frag;
>  	struct rtable *rt = skb_rtable(skb);
> @@ -499,7 +512,6 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
>  
>  	iph = ip_hdr(skb);
>  
> -	mtu = ip_skb_dst_mtu(skb);
>  	if (unlikely(((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) ||
>  		     (IPCB(skb)->frag_max_size &&
>  		      IPCB(skb)->frag_max_size > mtu))) {
> @@ -516,10 +528,6 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
>  
>  	hlen = iph->ihl * 4;
>  	mtu = mtu - hlen;	/* Size of data space */
> -#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
> -	if (skb->nf_bridge)
> -		mtu -= nf_bridge_mtu_reduction(skb);
> -#endif
>  	IPCB(skb)->flags |= IPSKB_FRAG_COMPLETE;
>  
>  	/* When frag_list is given, use it. First, check its validity:
> @@ -636,10 +644,7 @@ slow_path:
>  	left = skb->len - hlen;		/* Space per frame */
>  	ptr = hlen;		/* Where to start from */
>  
> -	/* for bridged IP traffic encapsulated inside f.e. a vlan header,
> -	 * we need to make room for the encapsulating header
> -	 */
> -	ll_rs = LL_RESERVED_SPACE_EXTRA(rt->dst.dev, nf_bridge_pad(skb));
> +	ll_rs = LL_RESERVED_SPACE_EXTRA(rt->dst.dev, ll_rs);
>  
>  	/*
>  	 *	Fragment the datagram.
> -- 
> 2.0.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 nf-next 1/6] net: untangle ip_fragment and bridge netfilter
  2015-03-16 22:55   ` Pablo Neira Ayuso
@ 2015-03-17  4:42     ` David Miller
  2015-03-17 10:11       ` Florian Westphal
  0 siblings, 1 reply; 18+ messages in thread
From: David Miller @ 2015-03-17  4:42 UTC (permalink / raw)
  To: pablo; +Cc: fw, netfilter-devel, netdev, azhou

From: Pablo Neira Ayuso <pablo@netfilter.org>
Date: Mon, 16 Mar 2015 23:55:45 +0100

> This patch changes the interface ip_fragment(). If this sounds
> reasonable to you, please ack it and I'll route it to you in the next
> nf-next batch.

I really think bridging netfilter needs to stop pretending.

Specifically it needs to stop pretending it can do full on IP
operations like fragmentation without the full necessary context.

That full necessary context being a physical destination device,
and a proper IP route.

It means that all of the MTU calculations miss everything done
by the ipv4 routing layer, all of the settings made by the user
via sysctl_ip_fwd_use_pmtu, etc.

So I think bridge netfilter needs to seriously look up a real
route and do things properly like the rest of the networking
stack does when it wants to fragment ipv4 packets.

And this just creates another hole for the openvswitch folks to
crawl through instead of fixing their architectural mistakes.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 nf-next 1/6] net: untangle ip_fragment and bridge netfilter
  2015-03-17  4:42     ` David Miller
@ 2015-03-17 10:11       ` Florian Westphal
  2015-03-17 17:12         ` David Miller
  0 siblings, 1 reply; 18+ messages in thread
From: Florian Westphal @ 2015-03-17 10:11 UTC (permalink / raw)
  To: David Miller; +Cc: pablo, fw, netfilter-devel, netdev, azhou

David Miller <davem@davemloft.net> wrote:
> Specifically it needs to stop pretending it can do full on IP
> operations like fragmentation without the full necessary context.
> 
> That full necessary context being a physical destination device,
> and a proper IP route.
> 
> It means that all of the MTU calculations miss everything done
> by the ipv4 routing layer, all of the settings made by the user
> via sysctl_ip_fwd_use_pmtu, etc.

Perhaps, but I have a hard time defining wheter a bridge should
use something like sysctl_ip_fwd_use_pmtu or not.

And doing route lookups will break things for some people, we have zero
guarantee that a bridge has the needed routing information,
its valid to not even configure a default gateway on a bridge.

We could alter defragmentation to provide the size of the largest
fragment seen unconditionally, and use that.

But I honestly think this patch is the best we can do to at least
don't have the IP stack deal with this crap.

> So I think bridge netfilter needs to seriously look up a real
> route and do things properly like the rest of the networking
> stack does when it wants to fragment ipv4 packets.

Sure, I can investigate doing this.

However, I don't believe that this is fixable given that we might not
have any routing tables; also; we allowed things like transparent PPPOE
and VLAN header stripping.

ip_fragment shouldn't have to deal with increased LL space, as it does now,
and I don't see any way to fix that except adding that extra ll size argument
and having br_netfilter set it.

If you disagree, whats your suggested solution to get rid
of the br_netfilter inline helpers?

Kill support for vlan/pppoe header stripping?
Add route lookup but keep current behaviour as fallback in case we don't
find route?

I wouldn't object to doing that, but I'm reasonably sure it will break
existing setups.

Thanks!

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 nf-next 1/6] net: untangle ip_fragment and bridge netfilter
  2015-03-17 10:11       ` Florian Westphal
@ 2015-03-17 17:12         ` David Miller
  2015-03-17 20:40           ` Florian Westphal
  0 siblings, 1 reply; 18+ messages in thread
From: David Miller @ 2015-03-17 17:12 UTC (permalink / raw)
  To: fw; +Cc: pablo, netfilter-devel, netdev, azhou

From: Florian Westphal <fw@strlen.de>
Date: Tue, 17 Mar 2015 11:11:52 +0100

> And doing route lookups will break things for some people, we have zero
> guarantee that a bridge has the needed routing information,
> its valid to not even configure a default gateway on a bridge.

Then without a proper route you absolutely cannot choose an
appropriate MTU from which to perform fragmentation.

Just accept that basic fact.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 nf-next 1/6] net: untangle ip_fragment and bridge netfilter
  2015-03-17 17:12         ` David Miller
@ 2015-03-17 20:40           ` Florian Westphal
  2015-03-17 21:38             ` David Miller
  0 siblings, 1 reply; 18+ messages in thread
From: Florian Westphal @ 2015-03-17 20:40 UTC (permalink / raw)
  To: David Miller; +Cc: fw, pablo, netfilter-devel, netdev, azhou

David Miller <davem@davemloft.net> wrote:
> From: Florian Westphal <fw@strlen.de>
> Date: Tue, 17 Mar 2015 11:11:52 +0100
> 
> > And doing route lookups will break things for some people, we have zero
> > guarantee that a bridge has the needed routing information,
> > its valid to not even configure a default gateway on a bridge.
> 
> Then without a proper route you absolutely cannot choose an
> appropriate MTU from which to perform fragmentation.

Just to clarify, this ip_fragment call is done only for frames that
are forwarded by the bridge, i.e. not routed.

All interfaces have the same MTU.

So why would we need to chose an MTU different than the device
mtu, or larger than the largest reassembled packet?

Ideally, the bridge would re-create the original fragments it received
on 1:1 basis to make it fully transparent, and to make the bridge behave
as if it would not do the defrag layering violations in the first place.

> Just accept that basic fact.

For a router I'd agree, but, then again, we're a bridge.  Normally we
would not fragment at all.  The bridge defragmentation should not be
observable by external entity.  No increase, no decrease of mtu, 1:1
fragment passthrough illusion.

I can leave ip_fragment alone and, when skb->nf_bridge goes away,
just replace

#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
        if (skb->nf_bridge)
	      mtu -= nf_bridge_mtu_reduction(skb);
#endif
and

ll_rs = LL_RESERVED_SPACE_EXTRA(rt->dst.dev, nf_bridge_pad(skb));

With a functionally equivalent "solution".
But I'd really prefer to move these kludges out of the ip stack for good.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 nf-next 1/6] net: untangle ip_fragment and bridge netfilter
  2015-03-17 20:40           ` Florian Westphal
@ 2015-03-17 21:38             ` David Miller
  0 siblings, 0 replies; 18+ messages in thread
From: David Miller @ 2015-03-17 21:38 UTC (permalink / raw)
  To: fw; +Cc: pablo, netfilter-devel, netdev, azhou

From: Florian Westphal <fw@strlen.de>
Date: Tue, 17 Mar 2015 21:40:28 +0100

> Ideally, the bridge would re-create the original fragments it received
> on 1:1 basis to make it fully transparent, and to make the bridge behave
> as if it would not do the defrag layering violations in the first place.

And we have infrastructure to do exactly this, via GRO.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2015-03-17 21:38 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-03-12 17:05 [PATCH v2 nf-next 0/6] more bridge netfilter refactoring Florian Westphal
2015-03-12 17:05 ` [PATCH v2 nf-next 1/6] net: untangle ip_fragment and bridge netfilter Florian Westphal
2015-03-13  0:38   ` Andy Zhou
2015-03-16 22:55   ` Pablo Neira Ayuso
2015-03-17  4:42     ` David Miller
2015-03-17 10:11       ` Florian Westphal
2015-03-17 17:12         ` David Miller
2015-03-17 20:40           ` Florian Westphal
2015-03-17 21:38             ` David Miller
2015-03-12 17:05 ` [PATCH v2 nf-next 2/6] netfilter: bridge: don't use nf_bridge_info to store mac header Florian Westphal
2015-03-12 17:05 ` [PATCH v2 nf-next 3/6] netfilter: bridge: use skb->cb to track otherhost mangling Florian Westphal
2015-03-12 18:02   ` Oliver Hartkopp
2015-03-12 18:31     ` Florian Westphal
2015-03-12 18:35       ` Florian Westphal
2015-03-12 18:40         ` Oliver Hartkopp
2015-03-12 17:05 ` [PATCH v2 nf-next 4/6] netfilter: bridge: don't use nf_bridge_info to store proto value Florian Westphal
2015-03-12 17:05 ` [PATCH v2 nf-next 5/6] netfilter: bridge: replace remaining flags with state enum Florian Westphal
2015-03-12 17:05 ` [PATCH nf-next 6/6] netfilter: bridge: don't use nf_bridge storage during neigh resolution Florian Westphal

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).