[PATCH nf-next 0/8] netfilter: untangle bridge and bridge netfilter

netfilter-devel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH nf-next 0/8] netfilter: untangle bridge and bridge netfilter
@ 2015-03-04 23:52 Florian Westphal
  2015-03-04 23:52 ` [PATCH nf-next 1/8] bridge: move mac header copying into br_netfilter Florian Westphal
                   ` (8 more replies)
  0 siblings, 9 replies; 22+ messages in thread
From: Florian Westphal @ 2015-03-04 23:52 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev

bridge_netfilter.h contains various helpers, some only used by br_netfilter,
others however are also called in bridge or even ip stack.

Lets start untangling bridge, bridge netfilter, and the
rest of the ip stack (esp. ip_fragment).

This changes ip_fragment() so that bridge netfilter
can pass in the required information as arguments instead
of using skb->nf_bridge to pass some extra information to it.

Another problem with br_netfilter and the way its plumbed to
ip/ip6-tables (physdev match) is skb->nf_bridge.

nf_bridge is kmalloced blob with some extra information, including
the bridge in and outports (mainly for iptables' physdev match).
It also has various state bits so we know what manipulations
have been performed by bridge netfilter on the skb (e.g.
ppp header stripping).

nf_bridge also provides scratch space where br_netfilter saves
(and later restores) various things, e.g. ipv4 address for
dnat detection, mac address to fix up ip fragmented skbs, etc.

But in almost all cases we can avoid using ->data completely.

Tested, on host connected to kvm-bridge:
ping -s $bignum $ip_behind_bridge
-j REDIRECT
-j DNAT --to-destination $ip_behind_bridge

 include/linux/netfilter_bridge.h |   86 -------------------------
 include/linux/skbuff.h           |    2 
 include/net/ip.h                 |    8 +-
 net/bridge/br_device.c           |   27 +++++++-
 net/bridge/br_forward.c          |    4 -
 net/bridge/br_netfilter.c        |  131 ++++++++++++++++++++++++++++++---------
 net/ipv4/ip_output.c             |   47 ++++++++-----
 7 files changed, 161 insertions(+), 144 deletions(-)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH nf-next 1/8] bridge: move mac header copying into br_netfilter
  2015-03-04 23:52 [PATCH nf-next 0/8] netfilter: untangle bridge and bridge netfilter Florian Westphal
@ 2015-03-04 23:52 ` Florian Westphal
  2015-03-04 23:52 ` [PATCH nf-next 2/8] netfilter: bridge: move nf_bridge_update_protocol to where its used Florian Westphal
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Florian Westphal @ 2015-03-04 23:52 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, Florian Westphal

The mac header only has to be copied back into the skb for
fragments generated by ip_fragment(), which only happens
for bridge forwarded packets with nf-call-iptables=1 && active nf_defrag.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/linux/netfilter_bridge.h | 31 -------------------------------
 net/bridge/br_forward.c          |  4 +---
 net/bridge/br_netfilter.c        | 29 ++++++++++++++++++++++++++++-
 3 files changed, 29 insertions(+), 35 deletions(-)

diff --git a/include/linux/netfilter_bridge.h b/include/linux/netfilter_bridge.h
index c755e49..332ef8a 100644
--- a/include/linux/netfilter_bridge.h
+++ b/include/linux/netfilter_bridge.h
@@ -44,36 +44,6 @@ static inline void nf_bridge_update_protocol(struct sk_buff *skb)
 		skb->protocol = htons(ETH_P_PPP_SES);
 }
 
-/* Fill in the header for fragmented IP packets handled by
- * the IPv4 connection tracking code.
- *
- * Only used in br_forward.c
- */
-static inline int nf_bridge_copy_header(struct sk_buff *skb)
-{
-	int err;
-	unsigned int header_size;
-
-	nf_bridge_update_protocol(skb);
-	header_size = ETH_HLEN + nf_bridge_encap_header_len(skb);
-	err = skb_cow_head(skb, header_size);
-	if (err)
-		return err;
-
-	skb_copy_to_linear_data_offset(skb, -header_size,
-				       skb->nf_bridge->data, header_size);
-	__skb_push(skb, nf_bridge_encap_header_len(skb));
-	return 0;
-}
-
-static inline int nf_bridge_maybe_copy_header(struct sk_buff *skb)
-{
-	if (skb->nf_bridge &&
-	    skb->nf_bridge->mask & (BRNF_BRIDGED | BRNF_BRIDGED_DNAT))
-		return nf_bridge_copy_header(skb);
-  	return 0;
-}
-
 static inline unsigned int nf_bridge_mtu_reduction(const struct sk_buff *skb)
 {
 	if (unlikely(skb->nf_bridge->mask & BRNF_PPPoE))
@@ -119,7 +89,6 @@ static inline void br_drop_fake_rtable(struct sk_buff *skb)
 }
 
 #else
-#define nf_bridge_maybe_copy_header(skb)	(0)
 #define nf_bridge_pad(skb)			(0)
 #define br_drop_fake_rtable(skb)	        do { } while (0)
 #endif /* CONFIG_BRIDGE_NETFILTER */
diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index f96933a..32541d4 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -37,9 +37,7 @@ static inline int should_deliver(const struct net_bridge_port *p,
 
 int br_dev_queue_push_xmit(struct sk_buff *skb)
 {
-	/* ip_fragment doesn't copy the MAC header */
-	if (nf_bridge_maybe_copy_header(skb) ||
-	    !is_skb_forwardable(skb->dev, skb)) {
+	if (!is_skb_forwardable(skb->dev, skb)) {
 		kfree_skb(skb);
 	} else {
 		skb_push(skb, ETH_HLEN);
diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index 0ee453f..e547911 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -764,6 +764,33 @@ static unsigned int br_nf_forward_arp(const struct nf_hook_ops *ops,
 }
 
 #if IS_ENABLED(CONFIG_NF_DEFRAG_IPV4)
+static bool nf_bridge_copy_header(struct sk_buff *skb)
+{
+	int err;
+	unsigned int header_size;
+
+	nf_bridge_update_protocol(skb);
+	header_size = ETH_HLEN + nf_bridge_encap_header_len(skb);
+	err = skb_cow_head(skb, header_size);
+	if (err)
+		return false;
+
+	skb_copy_to_linear_data_offset(skb, -header_size,
+				       skb->nf_bridge->data, header_size);
+	__skb_push(skb, nf_bridge_encap_header_len(skb));
+	return true;
+}
+
+static int br_nf_push_frag_xmit(struct sk_buff *skb)
+{
+	if (!nf_bridge_copy_header(skb)) {
+		kfree_skb(skb);
+		return 0;
+	}
+
+	return br_dev_queue_push_xmit(skb);
+}
+
 static int br_nf_dev_queue_xmit(struct sk_buff *skb)
 {
 	int ret;
@@ -780,7 +807,7 @@ static int br_nf_dev_queue_xmit(struct sk_buff *skb)
 			/* Drop invalid packet */
 			return NF_DROP;
 		IPCB(skb)->frag_max_size = frag_max_size;
-		ret = ip_fragment(skb, br_dev_queue_push_xmit);
+		ret = ip_fragment(skb, br_nf_push_frag_xmit);
 	} else
 		ret = br_dev_queue_push_xmit(skb);
 
-- 
2.0.5

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH nf-next 2/8] netfilter: bridge: move nf_bridge_update_protocol to where its used
  2015-03-04 23:52 [PATCH nf-next 0/8] netfilter: untangle bridge and bridge netfilter Florian Westphal
  2015-03-04 23:52 ` [PATCH nf-next 1/8] bridge: move mac header copying into br_netfilter Florian Westphal
@ 2015-03-04 23:52 ` Florian Westphal
  2015-03-04 23:52 ` [PATCH nf-next 3/8] netfilter: brige: move DNAT helper " Florian Westphal
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Florian Westphal @ 2015-03-04 23:52 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, Florian Westphal

no need to keep it in a header file.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/linux/netfilter_bridge.h | 8 --------
 net/bridge/br_netfilter.c        | 8 ++++++++
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/include/linux/netfilter_bridge.h b/include/linux/netfilter_bridge.h
index 332ef8a..dd580a9 100644
--- a/include/linux/netfilter_bridge.h
+++ b/include/linux/netfilter_bridge.h
@@ -36,14 +36,6 @@ static inline unsigned int nf_bridge_encap_header_len(const struct sk_buff *skb)
 	}
 }
 
-static inline void nf_bridge_update_protocol(struct sk_buff *skb)
-{
-	if (skb->nf_bridge->mask & BRNF_8021Q)
-		skb->protocol = htons(ETH_P_8021Q);
-	else if (skb->nf_bridge->mask & BRNF_PPPoE)
-		skb->protocol = htons(ETH_P_PPP_SES);
-}
-
 static inline unsigned int nf_bridge_mtu_reduction(const struct sk_buff *skb)
 {
 	if (unlikely(skb->nf_bridge->mask & BRNF_PPPoE))
diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index e547911..5b3bceb 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -239,6 +239,14 @@ drop:
 	return -1;
 }
 
+static void nf_bridge_update_protocol(struct sk_buff *skb)
+{
+	if (skb->nf_bridge->mask & BRNF_8021Q)
+		skb->protocol = htons(ETH_P_8021Q);
+	else if (skb->nf_bridge->mask & BRNF_PPPoE)
+		skb->protocol = htons(ETH_P_PPP_SES);
+}
+
 /* PF_BRIDGE/PRE_ROUTING *********************************************/
 /* Undo the changes made for ip6tables PREROUTING and continue the
  * bridge PRE_ROUTING hook. */
-- 
2.0.5

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH nf-next 3/8] netfilter: brige: move DNAT helper to where its used
  2015-03-04 23:52 [PATCH nf-next 0/8] netfilter: untangle bridge and bridge netfilter Florian Westphal
  2015-03-04 23:52 ` [PATCH nf-next 1/8] bridge: move mac header copying into br_netfilter Florian Westphal
  2015-03-04 23:52 ` [PATCH nf-next 2/8] netfilter: bridge: move nf_bridge_update_protocol to where its used Florian Westphal
@ 2015-03-04 23:52 ` Florian Westphal
  2015-03-04 23:52 ` [PATCH nf-next 4/8] netfilter: bridge: refactor conditional in br_nf_dev_queue_xmit Florian Westphal
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Florian Westphal @ 2015-03-04 23:52 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, Florian Westphal

Only one caller, there is no need to keep this in a header.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/linux/netfilter_bridge.h | 12 ------------
 net/bridge/br_device.c           | 24 ++++++++++++++++++++++++
 2 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/include/linux/netfilter_bridge.h b/include/linux/netfilter_bridge.h
index dd580a9..bb39113 100644
--- a/include/linux/netfilter_bridge.h
+++ b/include/linux/netfilter_bridge.h
@@ -44,18 +44,6 @@ static inline unsigned int nf_bridge_mtu_reduction(const struct sk_buff *skb)
 }
 
 int br_handle_frame_finish(struct sk_buff *skb);
-/* Only used in br_device.c */
-static inline int br_nf_pre_routing_finish_bridge_slow(struct sk_buff *skb)
-{
-	struct nf_bridge_info *nf_bridge = skb->nf_bridge;
-
-	skb_pull(skb, ETH_HLEN);
-	nf_bridge->mask ^= BRNF_BRIDGED_DNAT;
-	skb_copy_to_linear_data_offset(skb, -(ETH_HLEN-ETH_ALEN),
-				       skb->nf_bridge->data, ETH_HLEN-ETH_ALEN);
-	skb->dev = nf_bridge->physindev;
-	return br_handle_frame_finish(skb);
-}
 
 /* This is called by the IP fragmenting code and it ensures there is
  * enough room for the encapsulating header (if there is one). */
diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index ffd379d..7e39b8d 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -25,6 +25,30 @@
 #define COMMON_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HIGHDMA | \
 			 NETIF_F_GSO_MASK | NETIF_F_HW_CSUM)
 
+#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
+/* This is called when br_netfilter has called into iptables/netfilter,
+ * and DNAT has taken place on a bridge-forwarded packet.
+ *
+ * neigh->output has created a new MAC header, with local br0 MAC
+ * as saddr.
+ *
+ * This restores the original MAC saddr of the bridged packet
+ * before invoking bridge forward logic to transmit the packet.
+ */
+static void br_nf_pre_routing_finish_bridge_slow(struct sk_buff *skb)
+{
+	struct nf_bridge_info *nf_bridge = skb->nf_bridge;
+
+	skb_pull(skb, ETH_HLEN);
+	nf_bridge->mask &= ~BRNF_BRIDGED_DNAT;
+
+	skb_copy_to_linear_data_offset(skb, -(ETH_HLEN-ETH_ALEN),
+				       skb->nf_bridge->data, ETH_HLEN-ETH_ALEN);
+	skb->dev = nf_bridge->physindev;
+	br_handle_frame_finish(skb);
+}
+#endif
+
 /* net device transmit always called with BH disabled */
 netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
 {
-- 
2.0.5


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH nf-next 4/8] netfilter: bridge: refactor conditional in br_nf_dev_queue_xmit
  2015-03-04 23:52 [PATCH nf-next 0/8] netfilter: untangle bridge and bridge netfilter Florian Westphal
                   ` (2 preceding siblings ...)
  2015-03-04 23:52 ` [PATCH nf-next 3/8] netfilter: brige: move DNAT helper " Florian Westphal
@ 2015-03-04 23:52 ` Florian Westphal
  2015-03-04 23:52 ` [PATCH nf-next 5/8] net: untangle ip_fragment and bridge netfilter Florian Westphal
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Florian Westphal @ 2015-03-04 23:52 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, Florian Westphal

simpilifies followup patch that re-works brnf ip_fragment handling.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/bridge/br_netfilter.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index 5b3bceb..ef1fe28 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -803,13 +803,16 @@ static int br_nf_dev_queue_xmit(struct sk_buff *skb)
 {
 	int ret;
 	int frag_max_size;
+	unsigned int mtu_reserved;
 
+	if (skb_is_gso(skb) || skb->protocol != htons(ETH_P_IP))
+		return br_dev_queue_push_xmit(skb);
+
+	mtu_reserved = nf_bridge_mtu_reduction(skb);
 	/* This is wrong! We should preserve the original fragment
 	 * boundaries by preserving frag_list rather than refragmenting.
 	 */
-	if (skb->protocol == htons(ETH_P_IP) &&
-	    skb->len + nf_bridge_mtu_reduction(skb) > skb->dev->mtu &&
-	    !skb_is_gso(skb)) {
+	if (skb->len + mtu_reserved > skb->dev->mtu) {
 		frag_max_size = BR_INPUT_SKB_CB(skb)->frag_max_size;
 		if (br_parse_ip_options(skb))
 			/* Drop invalid packet */
-- 
2.0.5


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH nf-next 5/8] net: untangle ip_fragment and bridge netfilter
  2015-03-04 23:52 [PATCH nf-next 0/8] netfilter: untangle bridge and bridge netfilter Florian Westphal
                   ` (3 preceding siblings ...)
  2015-03-04 23:52 ` [PATCH nf-next 4/8] netfilter: bridge: refactor conditional in br_nf_dev_queue_xmit Florian Westphal
@ 2015-03-04 23:52 ` Florian Westphal
  2015-03-04 23:52 ` [PATCH nf-next 6/8] netfilter: bridge: query conntrack about skb dnat Florian Westphal
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Florian Westphal @ 2015-03-04 23:52 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, Florian Westphal

Long time ago it was possible for the netfilter ip_conntrack
core to call ip_fragment in POST_ROUTING hook.

This is no longer the case, so the only case where bridge netfilter
ends up calling ip_fragment is the direct call site in br_netfilter.c.

Add ll and mtu arguments for ip_fragment and then get rid of the bridge
netfilter specific helpers from ip_fragment.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/linux/netfilter_bridge.h | 17 -----------------
 include/net/ip.h                 |  4 ++--
 net/bridge/br_netfilter.c        | 12 +++++++++++-
 net/ipv4/ip_output.c             | 28 ++++++++++++++++------------
 4 files changed, 29 insertions(+), 32 deletions(-)

diff --git a/include/linux/netfilter_bridge.h b/include/linux/netfilter_bridge.h
index bb39113..3b5e539 100644
--- a/include/linux/netfilter_bridge.h
+++ b/include/linux/netfilter_bridge.h
@@ -36,24 +36,8 @@ static inline unsigned int nf_bridge_encap_header_len(const struct sk_buff *skb)
 	}
 }
 
-static inline unsigned int nf_bridge_mtu_reduction(const struct sk_buff *skb)
-{
-	if (unlikely(skb->nf_bridge->mask & BRNF_PPPoE))
-		return PPPOE_SES_HLEN;
-	return 0;
-}
-
 int br_handle_frame_finish(struct sk_buff *skb);
 
-/* This is called by the IP fragmenting code and it ensures there is
- * enough room for the encapsulating header (if there is one). */
-static inline unsigned int nf_bridge_pad(const struct sk_buff *skb)
-{
-	if (skb->nf_bridge)
-		return nf_bridge_encap_header_len(skb);
-	return 0;
-}
-
 struct bridge_skb_cb {
 	union {
 		__be32 ipv4;
@@ -69,7 +53,6 @@ static inline void br_drop_fake_rtable(struct sk_buff *skb)
 }
 
 #else
-#define nf_bridge_pad(skb)			(0)
 #define br_drop_fake_rtable(skb)	        do { } while (0)
 #endif /* CONFIG_BRIDGE_NETFILTER */
 
diff --git a/include/net/ip.h b/include/net/ip.h
index 025c61c..9c34441 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -108,8 +108,8 @@ int ip_local_deliver(struct sk_buff *skb);
 int ip_mr_input(struct sk_buff *skb);
 int ip_output(struct sock *sk, struct sk_buff *skb);
 int ip_mc_output(struct sock *sk, struct sk_buff *skb);
-int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *));
-int ip_do_nat(struct sk_buff *skb);
+int ip_fragment(struct sk_buff *skb, unsigned int mtu_reserved,
+		unsigned int ll_reserved, int (*output)(struct sk_buff *));
 void ip_send_check(struct iphdr *ip);
 int __ip_local_out(struct sk_buff *skb);
 int ip_local_out_sk(struct sock *sk, struct sk_buff *skb);
diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index ef1fe28..cf4e93f 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -799,6 +799,13 @@ static int br_nf_push_frag_xmit(struct sk_buff *skb)
 	return br_dev_queue_push_xmit(skb);
 }
 
+static unsigned int nf_bridge_mtu_reduction(const struct sk_buff *skb)
+{
+	if (skb->nf_bridge->mask & BRNF_PPPoE)
+		return PPPOE_SES_HLEN;
+         return 0;
+}
+
 static int br_nf_dev_queue_xmit(struct sk_buff *skb)
 {
 	int ret;
@@ -818,7 +825,10 @@ static int br_nf_dev_queue_xmit(struct sk_buff *skb)
 			/* Drop invalid packet */
 			return NF_DROP;
 		IPCB(skb)->frag_max_size = frag_max_size;
-		ret = ip_fragment(skb, br_nf_push_frag_xmit);
+
+		ret = ip_fragment(skb, mtu_reserved,
+				  nf_bridge_encap_header_len(skb),
+				  br_nf_push_frag_xmit);
 	} else
 		ret = br_dev_queue_push_xmit(skb);
 
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index a7aea20..1b284eb 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -243,7 +243,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
 		int err;
 
 		segs->next = NULL;
-		err = ip_fragment(segs, ip_finish_output2);
+		err = ip_fragment(segs, 0, 0, ip_finish_output2);
 
 		if (err && ret == 0)
 			ret = err;
@@ -266,7 +266,7 @@ static int ip_finish_output(struct sk_buff *skb)
 		return ip_finish_output_gso(skb);
 
 	if (skb->len > ip_skb_dst_mtu(skb))
-		return ip_fragment(skb, ip_finish_output2);
+		return ip_fragment(skb, 0, 0, ip_finish_output2);
 
 	return ip_finish_output2(skb);
 }
@@ -472,20 +472,28 @@ static void ip_copy_metadata(struct sk_buff *to, struct sk_buff *from)
 	skb_copy_secmark(to, from);
 }
 
-/*
+/**
+ *	ip_fragment - fragment IP datagram or send ICMP error
+ *
+ *	@skb: the skb to fragment
+ *	@mtu_reserved: extra MTU space required (used by bridge netfilter)
+ *	@ll_rs: extra linklayer space required (used by bridge netfilter)
+ *	@output: transmit function used to send fragments
+ *
  *	This IP datagram is too large to be sent in one piece.  Break it up into
  *	smaller pieces (each of size equal to IP header plus
  *	a block of the data of the original IP data part) that will yet fit in a
  *	single device frame, and queue such a frame for sending.
  */
-
-int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
+int ip_fragment(struct sk_buff *skb,
+		unsigned int mtu_reserved, unsigned int ll_rs,
+		int (*output)(struct sk_buff *))
 {
 	struct iphdr *iph;
 	int ptr;
 	struct net_device *dev;
 	struct sk_buff *skb2;
-	unsigned int mtu, hlen, left, len, ll_rs;
+	unsigned int mtu, hlen, left, len;
 	int offset;
 	__be16 not_last_frag;
 	struct rtable *rt = skb_rtable(skb);
@@ -515,11 +523,7 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 	 */
 
 	hlen = iph->ihl * 4;
-	mtu = mtu - hlen;	/* Size of data space */
-#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-	if (skb->nf_bridge)
-		mtu -= nf_bridge_mtu_reduction(skb);
-#endif
+	mtu = mtu - hlen - mtu_reserved;	/* Size of data space */
 	IPCB(skb)->flags |= IPSKB_FRAG_COMPLETE;
 
 	/* When frag_list is given, use it. First, check its validity:
@@ -639,7 +643,7 @@ slow_path:
 	/* for bridged IP traffic encapsulated inside f.e. a vlan header,
 	 * we need to make room for the encapsulating header
 	 */
-	ll_rs = LL_RESERVED_SPACE_EXTRA(rt->dst.dev, nf_bridge_pad(skb));
+	ll_rs = LL_RESERVED_SPACE_EXTRA(rt->dst.dev, ll_rs);
 
 	/*
 	 *	Fragment the datagram.
-- 
2.0.5

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH nf-next 6/8] netfilter: bridge: query conntrack about skb dnat
  2015-03-04 23:52 [PATCH nf-next 0/8] netfilter: untangle bridge and bridge netfilter Florian Westphal
                   ` (4 preceding siblings ...)
  2015-03-04 23:52 ` [PATCH nf-next 5/8] net: untangle ip_fragment and bridge netfilter Florian Westphal
@ 2015-03-04 23:52 ` Florian Westphal
  2015-03-04 23:52 ` [PATCH nf-next 7/8] netfilter: bridge: don't use nf_bridge_info data to store mac header Florian Westphal
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 22+ messages in thread
From: Florian Westphal @ 2015-03-04 23:52 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, Florian Westphal

ask conntrack instead of remembering ipv4 address in nf_bridge_info->data.

Ths avoids the need to use ->data during NF_PRE_ROUTING.
Only two use cases that need ->data remain.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/linux/netfilter_bridge.h |  6 ------
 net/bridge/br_netfilter.c        | 20 ++++++++++++++------
 2 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/include/linux/netfilter_bridge.h b/include/linux/netfilter_bridge.h
index 3b5e539..ab06213 100644
--- a/include/linux/netfilter_bridge.h
+++ b/include/linux/netfilter_bridge.h
@@ -38,12 +38,6 @@ static inline unsigned int nf_bridge_encap_header_len(const struct sk_buff *skb)
 
 int br_handle_frame_finish(struct sk_buff *skb);
 
-struct bridge_skb_cb {
-	union {
-		__be32 ipv4;
-	} daddr;
-};
-
 static inline void br_drop_fake_rtable(struct sk_buff *skb)
 {
 	struct dst_entry *dst = skb_dst(skb);
diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index cf4e93f..6ff7ed5 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -36,6 +36,7 @@
 #include <net/ipv6.h>
 #include <net/route.h>
 #include <net/netfilter/br_netfilter.h>
+#include <net/netfilter/nf_conntrack.h>
 
 #include <asm/uaccess.h>
 #include "br_private.h"
@@ -43,11 +44,6 @@
 #include <linux/sysctl.h>
 #endif
 
-#define skb_origaddr(skb)	 (((struct bridge_skb_cb *) \
-				 (skb->nf_bridge->data))->daddr.ipv4)
-#define store_orig_dstaddr(skb)	 (skb_origaddr(skb) = ip_hdr(skb)->daddr)
-#define dnat_took_place(skb)	 (skb_origaddr(skb) != ip_hdr(skb)->daddr)
-
 #ifdef CONFIG_SYSCTL
 static struct ctl_table_header *brnf_sysctl_header;
 static int brnf_call_iptables __read_mostly = 1;
@@ -322,6 +318,18 @@ free_skb:
 	return 0;
 }
 
+static bool dnat_took_place(const struct sk_buff *skb)
+{
+	enum ip_conntrack_info ctinfo;
+	struct nf_conn *ct;
+
+	ct = nf_ct_get(skb, &ctinfo);
+	if (!ct || nf_ct_is_untracked(ct))
+		return false;
+
+	return test_bit(IPS_DST_NAT_BIT, &ct->status);
+}
+
 /* This requires some explaining. If DNAT has taken place,
  * we will need to fix up the destination Ethernet address.
  *
@@ -625,7 +633,7 @@ static unsigned int br_nf_pre_routing(const struct nf_hook_ops *ops,
 		return NF_DROP;
 	if (!setup_pre_routing(skb))
 		return NF_DROP;
-	store_orig_dstaddr(skb);
+
 	skb->protocol = htons(ETH_P_IP);
 
 	NF_HOOK(NFPROTO_IPV4, NF_INET_PRE_ROUTING, skb, skb->dev, NULL,
-- 
2.0.5


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH nf-next 7/8] netfilter: bridge: don't use nf_bridge_info data to store mac header
  2015-03-04 23:52 [PATCH nf-next 0/8] netfilter: untangle bridge and bridge netfilter Florian Westphal
                   ` (5 preceding siblings ...)
  2015-03-04 23:52 ` [PATCH nf-next 6/8] netfilter: bridge: query conntrack about skb dnat Florian Westphal
@ 2015-03-04 23:52 ` Florian Westphal
  2015-03-04 23:52 ` [PATCH nf-next 8/8] netfilter: bridge: rename nf_bridge_info->data to dnat_orig_mac Florian Westphal
  2015-03-09 13:02 ` [PATCH nf-next 0/8] netfilter: untangle bridge and bridge netfilter Pablo Neira Ayuso
  8 siblings, 0 replies; 22+ messages in thread
From: Florian Westphal @ 2015-03-04 23:52 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, Florian Westphal, Andy Zhou

Currently br_netfilter maintains an extra state, nf_bridge_info,
which is attached to skb via skb->nf_bridge pointer.

For every packet handed to POST_ROUTING ipv4/ipv6 netfilter we save
original mac header in nf_bridge_info->data space.

However, there appears to be no technical reason anymore.

In ancient times, netfilter had an ip_refrag() hook, invoked before
NF_POST_ROUTING.  It no longer exists, ip(6) netfilter hooks should not
be mangling the layer 2 headers.

Remove this unconditional saving of mac header and only do this when needed --
when br_netfilter has to fragment skb that was previously defragmented by
nf_defrag.  ip_fragment doesn't copy the mac header from the
to-be-fragmented skb.

Save a copy on the stack and extend ip_fragment to pass that to the output
function.

The ip_fragment changes are based on an earlier version from Andy Zhou.

Cc: Andy Zhou <azhou@nicira.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/linux/netfilter_bridge.h | 12 ----------
 include/net/ip.h                 |  4 +++-
 net/bridge/br_netfilter.c        | 48 ++++++++++++++++++++++++++--------------
 net/ipv4/ip_output.c             | 19 +++++++++-------
 4 files changed, 45 insertions(+), 38 deletions(-)

diff --git a/include/linux/netfilter_bridge.h b/include/linux/netfilter_bridge.h
index ab06213..20089bb 100644
--- a/include/linux/netfilter_bridge.h
+++ b/include/linux/netfilter_bridge.h
@@ -24,18 +24,6 @@ enum nf_br_hook_priorities {
 #define BRNF_8021Q			0x10
 #define BRNF_PPPoE			0x20
 
-static inline unsigned int nf_bridge_encap_header_len(const struct sk_buff *skb)
-{
-	switch (skb->protocol) {
-	case __cpu_to_be16(ETH_P_8021Q):
-		return VLAN_HLEN;
-	case __cpu_to_be16(ETH_P_PPP_SES):
-		return PPPOE_SES_HLEN;
-	default:
-		return 0;
-	}
-}
-
 int br_handle_frame_finish(struct sk_buff *skb);
 
 static inline void br_drop_fake_rtable(struct sk_buff *skb)
diff --git a/include/net/ip.h b/include/net/ip.h
index 9c34441..4cf6bd1 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -109,7 +109,9 @@ int ip_mr_input(struct sk_buff *skb);
 int ip_output(struct sock *sk, struct sk_buff *skb);
 int ip_mc_output(struct sock *sk, struct sk_buff *skb);
 int ip_fragment(struct sk_buff *skb, unsigned int mtu_reserved,
-		unsigned int ll_reserved, int (*output)(struct sk_buff *));
+		unsigned int ll_reserved,
+		int (*output)(struct sk_buff *, const void *output_arg),
+		const void *output_arg);
 void ip_send_check(struct iphdr *ip);
 int __ip_local_out(struct sk_buff *skb);
 int ip_local_out_sk(struct sock *sk, struct sk_buff *skb);
diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index 6ff7ed5..88e7656 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -150,6 +150,22 @@ static inline struct nf_bridge_info *nf_bridge_unshare(struct sk_buff *skb)
 	return nf_bridge;
 }
 
+#define NF_BRDIGE_MAX_MAC_HEADER_LENGTH (PPPOE_SES_HLEN + ETH_HLEN)
+
+static unsigned int nf_bridge_encap_header_len(const struct sk_buff *skb)
+{
+	switch (skb->protocol) {
+	case __cpu_to_be16(ETH_P_8021Q):
+		return VLAN_HLEN;
+	case __cpu_to_be16(ETH_P_PPP_SES):
+		return PPPOE_SES_HLEN;
+	default:
+		break;
+	}
+	return 0;
+}
+
+
 static inline void nf_bridge_push_encap_header(struct sk_buff *skb)
 {
 	unsigned int len = nf_bridge_encap_header_len(skb);
@@ -174,14 +190,6 @@ static inline void nf_bridge_pull_encap_header_rcsum(struct sk_buff *skb)
 	skb->network_header += len;
 }
 
-static inline void nf_bridge_save_header(struct sk_buff *skb)
-{
-	int header_size = ETH_HLEN + nf_bridge_encap_header_len(skb);
-
-	skb_copy_from_linear_data_offset(skb, -header_size,
-					 skb->nf_bridge->data, header_size);
-}
-
 /* When handing a packet over to the IP layer
  * check whether we have a skb that is in the
  * expected format
@@ -780,7 +788,7 @@ static unsigned int br_nf_forward_arp(const struct nf_hook_ops *ops,
 }
 
 #if IS_ENABLED(CONFIG_NF_DEFRAG_IPV4)
-static bool nf_bridge_copy_header(struct sk_buff *skb)
+static bool nf_bridge_copy_header(struct sk_buff *skb, const char *machdr)
 {
 	int err;
 	unsigned int header_size;
@@ -791,15 +799,14 @@ static bool nf_bridge_copy_header(struct sk_buff *skb)
 	if (err)
 		return false;
 
-	skb_copy_to_linear_data_offset(skb, -header_size,
-				       skb->nf_bridge->data, header_size);
+	skb_copy_to_linear_data_offset(skb, -header_size, machdr, header_size);
 	__skb_push(skb, nf_bridge_encap_header_len(skb));
 	return true;
 }
 
-static int br_nf_push_frag_xmit(struct sk_buff *skb)
+static int br_nf_push_frag_xmit(struct sk_buff *skb, const void *data)
 {
-	if (!nf_bridge_copy_header(skb)) {
+	if (!nf_bridge_copy_header(skb, data)) {
 		kfree_skb(skb);
 		return 0;
 	}
@@ -828,15 +835,23 @@ static int br_nf_dev_queue_xmit(struct sk_buff *skb)
 	 * boundaries by preserving frag_list rather than refragmenting.
 	 */
 	if (skb->len + mtu_reserved > skb->dev->mtu) {
+		char brnf_mac_header[NF_BRDIGE_MAX_MAC_HEADER_LENGTH];
+		int headerlen, encaplen;
+
 		frag_max_size = BR_INPUT_SKB_CB(skb)->frag_max_size;
 		if (br_parse_ip_options(skb))
 			/* Drop invalid packet */
 			return NF_DROP;
 		IPCB(skb)->frag_max_size = frag_max_size;
 
-		ret = ip_fragment(skb, mtu_reserved,
-				  nf_bridge_encap_header_len(skb),
-				  br_nf_push_frag_xmit);
+		encaplen = nf_bridge_encap_header_len(skb);
+		headerlen = ETH_HLEN + encaplen;
+
+		skb_copy_from_linear_data_offset(skb, -headerlen,
+						 brnf_mac_header, headerlen);
+
+		ret = ip_fragment(skb, mtu_reserved, encaplen,
+				  br_nf_push_frag_xmit, brnf_mac_header);
 	} else
 		ret = br_dev_queue_push_xmit(skb);
 
@@ -881,7 +896,6 @@ static unsigned int br_nf_post_routing(const struct nf_hook_ops *ops,
 	}
 
 	nf_bridge_pull_encap_header(skb);
-	nf_bridge_save_header(skb);
 	if (pf == NFPROTO_IPV4)
 		skb->protocol = htons(ETH_P_IP);
 	else
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 1b284eb..2d0cf84 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -163,7 +163,8 @@ int ip_build_and_send_pkt(struct sk_buff *skb, struct sock *sk,
 }
 EXPORT_SYMBOL_GPL(ip_build_and_send_pkt);
 
-static inline int ip_finish_output2(struct sk_buff *skb)
+static int ip_finish_output2(struct sk_buff *skb,
+			     const void *unused __always_unused)
 {
 	struct dst_entry *dst = skb_dst(skb);
 	struct rtable *rt = (struct rtable *)dst;
@@ -220,7 +221,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
 	/* common case: locally created skb or seglen is <= mtu */
 	if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
 	      skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
-		return ip_finish_output2(skb);
+		return ip_finish_output2(skb, NULL);
 
 	/* Slowpath -  GSO segment length is exceeding the dst MTU.
 	 *
@@ -243,7 +244,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
 		int err;
 
 		segs->next = NULL;
-		err = ip_fragment(segs, 0, 0, ip_finish_output2);
+		err = ip_fragment(segs, 0, 0, ip_finish_output2, NULL);
 
 		if (err && ret == 0)
 			ret = err;
@@ -266,9 +267,9 @@ static int ip_finish_output(struct sk_buff *skb)
 		return ip_finish_output_gso(skb);
 
 	if (skb->len > ip_skb_dst_mtu(skb))
-		return ip_fragment(skb, 0, 0, ip_finish_output2);
+		return ip_fragment(skb, 0, 0, ip_finish_output2, NULL);
 
-	return ip_finish_output2(skb);
+	return ip_finish_output2(skb, NULL);
 }
 
 int ip_mc_output(struct sock *sk, struct sk_buff *skb)
@@ -479,6 +480,7 @@ static void ip_copy_metadata(struct sk_buff *to, struct sk_buff *from)
  *	@mtu_reserved: extra MTU space required (used by bridge netfilter)
  *	@ll_rs: extra linklayer space required (used by bridge netfilter)
  *	@output: transmit function used to send fragments
+ *	@output_arg: pointer passed to transmit function as argument
  *
  *	This IP datagram is too large to be sent in one piece.  Break it up into
  *	smaller pieces (each of size equal to IP header plus
@@ -487,7 +489,8 @@ static void ip_copy_metadata(struct sk_buff *to, struct sk_buff *from)
  */
 int ip_fragment(struct sk_buff *skb,
 		unsigned int mtu_reserved, unsigned int ll_rs,
-		int (*output)(struct sk_buff *))
+		int (*output)(struct sk_buff *, const void *output_arg),
+		const void *output_arg)
 {
 	struct iphdr *iph;
 	int ptr;
@@ -596,7 +599,7 @@ int ip_fragment(struct sk_buff *skb,
 				ip_send_check(iph);
 			}
 
-			err = output(skb);
+			err = output(skb, output_arg);
 
 			if (!err)
 				IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGCREATES);
@@ -736,7 +739,7 @@ slow_path:
 
 		ip_send_check(iph);
 
-		err = output(skb2);
+		err = output(skb2, output_arg);
 		if (err)
 			goto fail;
 
-- 
2.0.5

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH nf-next 8/8] netfilter: bridge: rename nf_bridge_info->data to dnat_orig_mac
  2015-03-04 23:52 [PATCH nf-next 0/8] netfilter: untangle bridge and bridge netfilter Florian Westphal
                   ` (6 preceding siblings ...)
  2015-03-04 23:52 ` [PATCH nf-next 7/8] netfilter: bridge: don't use nf_bridge_info data to store mac header Florian Westphal
@ 2015-03-04 23:52 ` Florian Westphal
  2015-03-09 13:02 ` [PATCH nf-next 0/8] netfilter: untangle bridge and bridge netfilter Pablo Neira Ayuso
  8 siblings, 0 replies; 22+ messages in thread
From: Florian Westphal @ 2015-03-04 23:52 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, Florian Westphal

This is no generic scratch space anymore, we removed all users
so far except the rare case of DNAT-ON-BRIDE.

That one is not trivial to remove, neigh->output queues skb
and eventually invokes dev_queue_xmit and we only get the packet
back into the bridge via bride_devices' ndo_start_xmit function.

The packet will re-enter bridge stack via br_dev_xmit(), and we pretend
it entered bridge forward logic (we inject it back into stack right after
br_netfilter invoked iptables PREROUTING).

This also means we don't need to care about fragmentation here --
it will be fragmented later when br_netfilter has finished with
POST_ROUTING traversal.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/linux/skbuff.h    | 2 +-
 net/bridge/br_device.c    | 3 ++-
 net/bridge/br_netfilter.c | 5 +++--
 3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index bba1330..b7d15da 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -169,7 +169,7 @@ struct nf_bridge_info {
 	unsigned int		mask;
 	struct net_device	*physindev;
 	struct net_device	*physoutdev;
-	unsigned long		data[32 / sizeof(unsigned long)];
+	char dnat_orig_mac[8]; /* mac src + type */
 };
 #endif
 
diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index 7e39b8d..d9768e0 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -43,7 +43,8 @@ static void br_nf_pre_routing_finish_bridge_slow(struct sk_buff *skb)
 	nf_bridge->mask &= ~BRNF_BRIDGED_DNAT;
 
 	skb_copy_to_linear_data_offset(skb, -(ETH_HLEN-ETH_ALEN),
-				       skb->nf_bridge->data, ETH_HLEN-ETH_ALEN);
+				       skb->nf_bridge->dnat_orig_mac,
+				       ETH_HLEN-ETH_ALEN);
 	skb->dev = nf_bridge->physindev;
 	br_handle_frame_finish(skb);
 }
diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index 88e7656..825c5d7 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -305,17 +305,18 @@ static int br_nf_pre_routing_finish_bridge(struct sk_buff *skb)
 			skb->dev = nf_bridge->physindev;
 			ret = br_handle_frame_finish(skb);
 		} else {
+			BUILD_BUG_ON(sizeof(nf_bridge->dnat_orig_mac) != (ETH_HLEN-ETH_ALEN));
+
 			/* the neighbour function below overwrites the complete
 			 * MAC header, so we save the Ethernet source address and
 			 * protocol number.
 			 */
 			skb_copy_from_linear_data_offset(skb,
 							 -(ETH_HLEN-ETH_ALEN),
-							 skb->nf_bridge->data,
+							 nf_bridge->dnat_orig_mac,
 							 ETH_HLEN-ETH_ALEN);
 			/* tell br_dev_xmit to continue with forwarding */
 			nf_bridge->mask |= BRNF_BRIDGED_DNAT;
-			/* FIXME Need to refragment */
 			ret = neigh->output(neigh, skb);
 		}
 		neigh_release(neigh);
-- 
2.0.5


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH nf-next 0/8] netfilter: untangle bridge and bridge netfilter
  2015-03-04 23:52 [PATCH nf-next 0/8] netfilter: untangle bridge and bridge netfilter Florian Westphal
                   ` (7 preceding siblings ...)
  2015-03-04 23:52 ` [PATCH nf-next 8/8] netfilter: bridge: rename nf_bridge_info->data to dnat_orig_mac Florian Westphal
@ 2015-03-09 13:02 ` Pablo Neira Ayuso
  2015-03-09 13:13   ` Florian Westphal
  2015-03-09 13:59   ` Florian Westphal
  8 siblings, 2 replies; 22+ messages in thread
From: Pablo Neira Ayuso @ 2015-03-09 13:02 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel, netdev

[-- Attachment #1: Type: text/plain, Size: 2490 bytes --]

Hi Florian,

On Thu, Mar 05, 2015 at 12:52:32AM +0100, Florian Westphal wrote:
> bridge_netfilter.h contains various helpers, some only used by br_netfilter,
> others however are also called in bridge or even ip stack.
> 
> Lets start untangling bridge, bridge netfilter, and the
> rest of the ip stack (esp. ip_fragment).
> 
> This changes ip_fragment() so that bridge netfilter
> can pass in the required information as arguments instead
> of using skb->nf_bridge to pass some extra information to it.
> 
> Another problem with br_netfilter and the way its plumbed to
> ip/ip6-tables (physdev match) is skb->nf_bridge.
> 
> nf_bridge is kmalloced blob with some extra information, including
> the bridge in and outports (mainly for iptables' physdev match).
> It also has various state bits so we know what manipulations
> have been performed by bridge netfilter on the skb (e.g.
> ppp header stripping).
>
> nf_bridge also provides scratch space where br_netfilter saves
> (and later restores) various things, e.g. ipv4 address for
> dnat detection, mac address to fix up ip fragmented skbs, etc.
> 
> But in almost all cases we can avoid using ->data completely.

I think one of the goals of this patchset is to prepare the removal of
that nf_bridge pointer from sk_buff which sounds as a good idea to me.

Did you consider to implement this scratchpad area using a per-cpu
area? I mean, something similar to what we used to have in
ip_conntrack for events:

http://www.cs.fsu.edu/~baker/devices/lxr/http/source/linux/net/netfilter/nf_conntrack_ecache.c?v=2.6.25#L65

Following that approach, I think we could avoid changing the
ip_fragment() interface. My concern is that these changes are too
specific of br_netfilter, which is not a good reference client of it
given all its design problems.

I'm preparing a new net-next batch, I can quickly take these three if
you agree:

1/8 bridge: move mac header copying into br_netfilter
2/8 netfilter: bridge: move nf_bridge_update_protocol to where its used
4/8 netfilter: bridge: refactor conditional in br_nf_dev_queue_xmit

Regarding 3/8, I think we can move that code to br_netfilter.c so we
reduce ifdef pollution a bit and keep as much of this monster code
fenced under that file. Please, see patch attached.

BTW, 6/8 I think needs some ifdef to make sure NF_CONNTRACK is in
placed. 7/8 needs NF_BRDIGE_MAX_MAC_HEADER_LENGTH which seems to have
a small typo in it, well these are dependent of 4/8 anyway.

Let me know, thanks a lot!

[-- Attachment #2: 0001-netfilter-brige-move-DNAT-helper-to-br_netfilter.patch --]
[-- Type: text/x-diff, Size: 4237 bytes --]

>From 57a8e6c3ac46601615f0df0ccc98e6481c8017a9 Mon Sep 17 00:00:00 2001
From: Florian Westphal <fw@strlen.de>
Date: Thu, 5 Mar 2015 00:52:35 +0100
Subject: [PATCH] netfilter: brige: move DNAT helper to br_netfilter

Only one caller, there is no need to keep this in a header.
Move it to br_netfilter.c where this belongs to.

Based on patch from Florian Westphal.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/linux/netfilter_bridge.h |   12 ------------
 net/bridge/br_device.c           |    5 +----
 net/bridge/br_netfilter.c        |   32 ++++++++++++++++++++++++++++++++
 net/bridge/br_private.h          |    5 +++++
 4 files changed, 38 insertions(+), 16 deletions(-)

diff --git a/include/linux/netfilter_bridge.h b/include/linux/netfilter_bridge.h
index dd580a9..bb39113 100644
--- a/include/linux/netfilter_bridge.h
+++ b/include/linux/netfilter_bridge.h
@@ -44,18 +44,6 @@ static inline unsigned int nf_bridge_mtu_reduction(const struct sk_buff *skb)
 }
 
 int br_handle_frame_finish(struct sk_buff *skb);
-/* Only used in br_device.c */
-static inline int br_nf_pre_routing_finish_bridge_slow(struct sk_buff *skb)
-{
-	struct nf_bridge_info *nf_bridge = skb->nf_bridge;
-
-	skb_pull(skb, ETH_HLEN);
-	nf_bridge->mask ^= BRNF_BRIDGED_DNAT;
-	skb_copy_to_linear_data_offset(skb, -(ETH_HLEN-ETH_ALEN),
-				       skb->nf_bridge->data, ETH_HLEN-ETH_ALEN);
-	skb->dev = nf_bridge->physindev;
-	return br_handle_frame_finish(skb);
-}
 
 /* This is called by the IP fragmenting code and it ensures there is
  * enough room for the encapsulating header (if there is one). */
diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index ffd379d..294cbcc 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -36,13 +36,10 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
 	u16 vid = 0;
 
 	rcu_read_lock();
-#if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
-	if (skb->nf_bridge && (skb->nf_bridge->mask & BRNF_BRIDGED_DNAT)) {
-		br_nf_pre_routing_finish_bridge_slow(skb);
+	if (br_nf_prerouting_finish_bridge(skb)) {
 		rcu_read_unlock();
 		return NETDEV_TX_OK;
 	}
-#endif
 
 	u64_stats_update_begin(&brstats->syncp);
 	brstats->tx_packets++;
diff --git a/net/bridge/br_netfilter.c b/net/bridge/br_netfilter.c
index ef1fe28..a8361c7 100644
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -892,6 +892,38 @@ static unsigned int ip_sabotage_in(const struct nf_hook_ops *ops,
 	return NF_ACCEPT;
 }
 
+/* This is called when br_netfilter has called into iptables/netfilter,
+ * and DNAT has taken place on a bridge-forwarded packet.
+ *
+ * neigh->output has created a new MAC header, with local br0 MAC
+ * as saddr.
+ *
+ * This restores the original MAC saddr of the bridged packet
+ * before invoking bridge forward logic to transmit the packet.
+ */
+static void br_nf_pre_routing_finish_bridge_slow(struct sk_buff *skb)
+{
+	struct nf_bridge_info *nf_bridge = skb->nf_bridge;
+
+	skb_pull(skb, ETH_HLEN);
+	nf_bridge->mask &= ~BRNF_BRIDGED_DNAT;
+
+	skb_copy_to_linear_data_offset(skb, -(ETH_HLEN-ETH_ALEN),
+				       skb->nf_bridge->data, ETH_HLEN-ETH_ALEN);
+	skb->dev = nf_bridge->physindev;
+	br_handle_frame_finish(skb);
+}
+
+int br_nf_prerouting_finish_bridge(struct sk_buff *skb)
+{
+	if (skb->nf_bridge && (skb->nf_bridge->mask & BRNF_BRIDGED_DNAT)) {
+		br_nf_pre_routing_finish_bridge_slow(skb);
+		return 1;
+	}
+	return 0;
+}
+EXPORT_SYMBOL_GPL(br_nf_prerouting_finish_bridge);
+
 void br_netfilter_enable(void)
 {
 }
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index de09199..d63fc17 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -764,10 +764,15 @@ static inline int br_vlan_enabled(struct net_bridge *br)
 
 /* br_netfilter.c */
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
+int br_nf_prerouting_finish_bridge(struct sk_buff *skb);
 int br_nf_core_init(void);
 void br_nf_core_fini(void);
 void br_netfilter_rtable_init(struct net_bridge *);
 #else
+static inline int br_nf_prerouting_finish_bridge(struct sk_buff *skb)
+{
+        return 0;
+}
 static inline int br_nf_core_init(void) { return 0; }
 static inline void br_nf_core_fini(void) {}
 #define br_netfilter_rtable_init(x)
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH nf-next 0/8] netfilter: untangle bridge and bridge netfilter
  2015-03-09 13:02 ` [PATCH nf-next 0/8] netfilter: untangle bridge and bridge netfilter Pablo Neira Ayuso
@ 2015-03-09 13:13   ` Florian Westphal
  2015-03-09 16:47     ` Pablo Neira Ayuso
  2015-03-09 17:16     ` David Miller
  2015-03-09 13:59   ` Florian Westphal
  1 sibling, 2 replies; 22+ messages in thread
From: Florian Westphal @ 2015-03-09 13:13 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Florian Westphal, netfilter-devel, netdev, azhou

Pablo Neira Ayuso <pablo@netfilter.org> wrote:

[ CC Andy, see below for ip_fragment api discussion ]

> > Lets start untangling bridge, bridge netfilter, and the
> > rest of the ip stack (esp. ip_fragment).
> > 
> > This changes ip_fragment() so that bridge netfilter
> > can pass in the required information as arguments instead
> > of using skb->nf_bridge to pass some extra information to it.
> > 
> > Another problem with br_netfilter and the way its plumbed to
> > ip/ip6-tables (physdev match) is skb->nf_bridge.
> > 
> > nf_bridge is kmalloced blob with some extra information, including
> > the bridge in and outports (mainly for iptables' physdev match).
> > It also has various state bits so we know what manipulations
> > have been performed by bridge netfilter on the skb (e.g.
> > ppp header stripping).
> >
> > nf_bridge also provides scratch space where br_netfilter saves
> > (and later restores) various things, e.g. ipv4 address for
> > dnat detection, mac address to fix up ip fragmented skbs, etc.
> > 
> > But in almost all cases we can avoid using ->data completely.
> 
> I think one of the goals of this patchset is to prepare the removal of
> that nf_bridge pointer from sk_buff which sounds as a good idea to me.

The 2nd (and last half) of the set folds nf_bridge_info into skbuff,
at the end (where its not initialized at allocation time).

The struct is a lot smaller by then (16 bytes on 64bit, 12 on 32bit)
so we'd increase skbuff size only by 8.

> Did you consider to implement this scratchpad area using a per-cpu
> area? I mean, something similar to what we used to have in
> ip_conntrack for events:
> 
> http://www.cs.fsu.edu/~baker/devices/lxr/http/source/linux/net/netfilter/nf_conntrack_ecache.c?v=2.6.25#L65
> 
> Following that approach, I think we could avoid changing the
> ip_fragment() interface.

Yes, I had a patch that used a percpu area.  BUT, please see conflicting
patch:

http://patchwork.ozlabs.org/patch/447939/

So if OVS needs such an interface I thought it makes no sense to work
around current API using percpu data.

Andy, it looks like I could either wait for your patches to hit
net-next, or you could refactor your changes based on my ip_fragment
changes; AFAICS the use cases are pretty much the same so both
ovs and bridge netfilter could use one unified ip_fragment().

> My concern is that these changes are too
> specific of br_netfilter, which is not a good reference client of it
> given all its design problems.

I agree that change just for br_netfilter sake sucks but please also
keep in mind that right now we have 'extra signalling' into ip_frament
via skb->nf_bridge pointer which I dislike even more.

> I'm preparing a new net-next batch, I can quickly take these three if
> you agree:
> 
> 1/8 bridge: move mac header copying into br_netfilter
> 2/8 netfilter: bridge: move nf_bridge_update_protocol to where its used
> 4/8 netfilter: bridge: refactor conditional in br_nf_dev_queue_xmit

Sure.  I can then submit the entire series (i.e. part1 without these 3
plus part2).  I have no further br netfilter changes after that.

> Regarding 3/8, I think we can move that code to br_netfilter.c so we
> reduce ifdef pollution a bit and keep as much of this monster code
> fenced under that file. Please, see patch attached.

Sure.

> BTW, 6/8 I think needs some ifdef to make sure NF_CONNTRACK is in
> placed. 7/8 needs NF_BRDIGE_MAX_MAC_HEADER_LENGTH which seems to have
> a small typo in it, well these are dependent of 4/8 anyway.

Seems you're right, I'll look at it, thanks Pablo!

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH nf-next 0/8] netfilter: untangle bridge and bridge netfilter
  2015-03-09 13:13   ` Florian Westphal
@ 2015-03-09 16:47     ` Pablo Neira Ayuso
  2015-03-09 17:16     ` David Miller
  1 sibling, 0 replies; 22+ messages in thread
From: Pablo Neira Ayuso @ 2015-03-09 16:47 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel, netdev, azhou

On Mon, Mar 09, 2015 at 02:13:56PM +0100, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> 
> > I'm preparing a new net-next batch, I can quickly take these three if
> > you agree:
> > 
> > 1/8 bridge: move mac header copying into br_netfilter
> > 2/8 netfilter: bridge: move nf_bridge_update_protocol to where its used
> > 4/8 netfilter: bridge: refactor conditional in br_nf_dev_queue_xmit
> 
> Sure.  I can then submit the entire series (i.e. part1 without these 3
> plus part2).  I have no further br netfilter changes after that.
> 
> > Regarding 3/8, I think we can move that code to br_netfilter.c so we
> > reduce ifdef pollution a bit and keep as much of this monster code
> > fenced under that file. Please, see patch attached.
> 
> Sure.

OK, I'll push these four small changes in the follow nf-next pull
request then, so you and Andry can sort out this. Thanks Florian.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH nf-next 0/8] netfilter: untangle bridge and bridge netfilter
  2015-03-09 13:13   ` Florian Westphal
  2015-03-09 16:47     ` Pablo Neira Ayuso
@ 2015-03-09 17:16     ` David Miller
  2015-03-09 17:35       ` Florian Westphal
  1 sibling, 1 reply; 22+ messages in thread
From: David Miller @ 2015-03-09 17:16 UTC (permalink / raw)
  To: fw; +Cc: pablo, netfilter-devel, netdev, azhou

From: Florian Westphal <fw@strlen.de>
Date: Mon, 9 Mar 2015 14:13:56 +0100

> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> 
> [ CC Andy, see below for ip_fragment api discussion ]
> 
>> > Lets start untangling bridge, bridge netfilter, and the
>> > rest of the ip stack (esp. ip_fragment).
>> > 
>> > This changes ip_fragment() so that bridge netfilter
>> > can pass in the required information as arguments instead
>> > of using skb->nf_bridge to pass some extra information to it.
>> > 
>> > Another problem with br_netfilter and the way its plumbed to
>> > ip/ip6-tables (physdev match) is skb->nf_bridge.
>> > 
>> > nf_bridge is kmalloced blob with some extra information, including
>> > the bridge in and outports (mainly for iptables' physdev match).
>> > It also has various state bits so we know what manipulations
>> > have been performed by bridge netfilter on the skb (e.g.
>> > ppp header stripping).
>> >
>> > nf_bridge also provides scratch space where br_netfilter saves
>> > (and later restores) various things, e.g. ipv4 address for
>> > dnat detection, mac address to fix up ip fragmented skbs, etc.
>> > 
>> > But in almost all cases we can avoid using ->data completely.
>> 
>> I think one of the goals of this patchset is to prepare the removal of
>> that nf_bridge pointer from sk_buff which sounds as a good idea to me.
> 
> The 2nd (and last half) of the set folds nf_bridge_info into skbuff,
> at the end (where its not initialized at allocation time).
> 
> The struct is a lot smaller by then (16 bytes on 64bit, 12 on 32bit)
> so we'd increase skbuff size only by 8.

Sorry, increases in size of any kind of sk_buff are strongly discouraged.

Especially for something like nfbridge.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH nf-next 0/8] netfilter: untangle bridge and bridge netfilter
  2015-03-09 17:16     ` David Miller
@ 2015-03-09 17:35       ` Florian Westphal
  2015-03-09 19:20         ` David Miller
  0 siblings, 1 reply; 22+ messages in thread
From: Florian Westphal @ 2015-03-09 17:35 UTC (permalink / raw)
  To: David Miller; +Cc: fw, pablo, netfilter-devel, netdev, azhou

David Miller <davem@redhat.com> wrote:
> > The 2nd (and last half) of the set folds nf_bridge_info into skbuff,
> > at the end (where its not initialized at allocation time).
> > 
> > The struct is a lot smaller by then (16 bytes on 64bit, 12 on 32bit)
> > so we'd increase skbuff size only by 8.
> 
> Sorry, increases in size of any kind of sk_buff are strongly discouraged.

No need to be sorry, its not a show-stopper.
Its the last patch in the series and nothing depends on this change.

> Especially for something like nfbridge.

Its not initialized, ever, if bridge is not used, so it shouldn't be a big
deal.  The goal of the entire exercise is to remove the need to zero
nf_bridge pointer at skb allocation time, so one of the patches in that
series moves it to the end of the skb.

The only reason why I added the 'fold it' patch on top is that after the
refactoring nf_bridge_info struct is just 16 bytes so it seemed
preferrable to me to add cost of 8 more bytes and get rid of the
nf_bridge refcounting at same time.

I'd just send it along with the series so everyone can look at it,
and, if you (or others) still prefer to keep the pointer then Pablo can
just drop it.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH nf-next 0/8] netfilter: untangle bridge and bridge netfilter
  2015-03-09 17:35       ` Florian Westphal
@ 2015-03-09 19:20         ` David Miller
  0 siblings, 0 replies; 22+ messages in thread
From: David Miller @ 2015-03-09 19:20 UTC (permalink / raw)
  To: fw; +Cc: pablo, netfilter-devel, netdev, azhou

From: Florian Westphal <fw@strlen.de>
Date: Mon, 9 Mar 2015 18:35:15 +0100

> David Miller <davem@redhat.com> wrote:
>> > The 2nd (and last half) of the set folds nf_bridge_info into skbuff,
>> > at the end (where its not initialized at allocation time).
>> > 
>> > The struct is a lot smaller by then (16 bytes on 64bit, 12 on 32bit)
>> > so we'd increase skbuff size only by 8.
>> 
>> Sorry, increases in size of any kind of sk_buff are strongly discouraged.
> 
> No need to be sorry, its not a show-stopper.
> Its the last patch in the series and nothing depends on this change.
> 
>> Especially for something like nfbridge.
> 
> Its not initialized, ever, if bridge is not used, so it shouldn't be a big
> deal.

It's about the overall size of the structure, not whether the member is
ever used or not.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH nf-next 0/8] netfilter: untangle bridge and bridge netfilter
  2015-03-09 13:02 ` [PATCH nf-next 0/8] netfilter: untangle bridge and bridge netfilter Pablo Neira Ayuso
  2015-03-09 13:13   ` Florian Westphal
@ 2015-03-09 13:59   ` Florian Westphal
  2015-03-14  9:00     ` Pablo Neira Ayuso
  1 sibling, 1 reply; 22+ messages in thread
From: Florian Westphal @ 2015-03-09 13:59 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Florian Westphal, netfilter-devel, netdev

Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > nf_bridge is kmalloced blob with some extra information, including
> > the bridge in and outports (mainly for iptables' physdev match).
> > It also has various state bits so we know what manipulations
> > have been performed by bridge netfilter on the skb (e.g.
> > ppp header stripping).
> >
> > nf_bridge also provides scratch space where br_netfilter saves
> > (and later restores) various things, e.g. ipv4 address for
> > dnat detection, mac address to fix up ip fragmented skbs, etc.
> > 
> > But in almost all cases we can avoid using ->data completely.
> 
> I think one of the goals of this patchset is to prepare the removal of
> that nf_bridge pointer from sk_buff which sounds as a good idea to me.
> 
> Did you consider to implement this scratchpad area using a per-cpu
> area? I mean, something similar to what we used to have in
> ip_conntrack for events:

I see that I misread part of what you wrote.

We cannot use percpu data for nf_bridge_info; at least its not trivial
to do (I tried).

Main issues are:
- nfqueue (we have to save/restore info)
- local delivery (skb is enqueued in backlog)
- skb is dropped (we need to reset scratch data)
- DNAT mac hack (skb can be queued in qdisc).

We _can_ use percpu data for passing the original mac
header to the ip_fragment output function.  But, as mentioned,
there is a patch in netdev patchwork that adds extra output
parameter for use with OVS so it seems cleaner to (re-) use common
api in both OVS and br netfilter.

> BTW, 6/8 I think needs some ifdef to make sure NF_CONNTRACK is in
> placed.

Right, thanks for catching this.

> 7/8 needs NF_BRDIGE_MAX_MAC_HEADER_LENGTH which seems to have
> a small typo in it, well these are dependent of 4/8 anyway.

Sorry -- I'm dense -- what typo? :-)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH nf-next 0/8] netfilter: untangle bridge and bridge netfilter
  2015-03-09 13:59   ` Florian Westphal
@ 2015-03-14  9:00     ` Pablo Neira Ayuso
  2015-03-14 11:13       ` Florian Westphal
  0 siblings, 1 reply; 22+ messages in thread
From: Pablo Neira Ayuso @ 2015-03-14  9:00 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel, netdev

On Mon, Mar 09, 2015 at 02:59:10PM +0100, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > nf_bridge is kmalloced blob with some extra information, including
> > > the bridge in and outports (mainly for iptables' physdev match).
> > > It also has various state bits so we know what manipulations
> > > have been performed by bridge netfilter on the skb (e.g.
> > > ppp header stripping).
> > >
> > > nf_bridge also provides scratch space where br_netfilter saves
> > > (and later restores) various things, e.g. ipv4 address for
> > > dnat detection, mac address to fix up ip fragmented skbs, etc.
> > > 
> > > But in almost all cases we can avoid using ->data completely.
> > 
> > I think one of the goals of this patchset is to prepare the removal of
> > that nf_bridge pointer from sk_buff which sounds as a good idea to me.
> > 
> > Did you consider to implement this scratchpad area using a per-cpu
> > area? I mean, something similar to what we used to have in
> > ip_conntrack for events:
> 
> I see that I misread part of what you wrote.
> 
> We cannot use percpu data for nf_bridge_info; at least its not trivial
> to do (I tried).
> 
> Main issues are:
> - nfqueue (we have to save/restore info)
> - local delivery (skb is enqueued in backlog)
> - skb is dropped (we need to reset scratch data)
> - DNAT mac hack (skb can be queued in qdisc).

What about something like this?

1) We keep a nf_bridge table that contains the nf_bridge structures
   that are currently in used, so you can look it up for them every
   time we need it. This can be implemented as a per-cpu list of
   nf_bridge structures.

2) We have another per-cpu cache to hold a pointer to the current
   nf_bridge.

3) We only need one bit from sk_buff to indicate that this sk_buff has
   nf_bridge info attached.

Then, we can use the sk_buff pointer as an unique way to identify
the owner of the nf_bridge structure.

   struct nf_bridge {
        struct list_head        head;
        const struct sk_buff    *owner;
        ...
   }

so if we need the scratchpad area, we first have to look up for it:

   struct nf_bridge *nf_bridge_find(struct sk_buff *skb)
   {
        ...

        /* First, check if we have it in the per_cpu cache. */
        nf_bridge = per_cpu_ptr(nf_bridge_pcpu);
        if (nf_bridge->owner == skb)
            return nf_bridge;

        /* Otherwise, slow path: find this scratchpad area in the list. */
        ...
        nf_bridge_list = per_cpu_ptr(nf_bridge_pcpu_list);
        list_for_each_entry(n, &nf_bridge_list, head) {
            if (nf_bridge->owner == skb) {
                /* Update the per_cpu cache. */
                rcu_assign_pointer(nf_bridge, n);
                return nf_bridge;
            }
        }

        return NULL;
   }

>From skb_release_head_state() we'll have to:

    if (skb->nf_bridge_present) {
         struct nf_bridge *nf_bridge = nf_bridge_find(skb);

         /* Remove it from the per-cpu nf_bridge cache. */
         list_del(&nf_bridge->head);
         kfree(nf_bridge);
    }

If nfqueue or other situation where we enqueue the packet, we'll enter
the slow path of nf_bridge_find(). This comes with some overhead in
that case, but one of the goal is to get rid of this pointer from
sk_buff which is unused for most people outthere as you already
mentioned on some patch.

I'm proposing this because I have concerns with the approach to place
nf_bridge in skb->cb. This br_netfilter thing makes the skb go back
and forth from different layers while expecting to find the right
right data in it, this seems fragile to me.

What do you think?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH nf-next 0/8] netfilter: untangle bridge and bridge netfilter
  2015-03-14  9:00     ` Pablo Neira Ayuso
@ 2015-03-14 11:13       ` Florian Westphal
  2015-03-16 12:38         ` Pablo Neira Ayuso
  0 siblings, 1 reply; 22+ messages in thread
From: Florian Westphal @ 2015-03-14 11:13 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Florian Westphal, netfilter-devel, netdev

Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > We cannot use percpu data for nf_bridge_info; at least its not trivial
> > to do (I tried).
> > 
> > Main issues are:
> > - nfqueue (we have to save/restore info)
> > - local delivery (skb is enqueued in backlog)
> > - skb is dropped (we need to reset scratch data)
> > - DNAT mac hack (skb can be queued in qdisc).

The DNAT mac hack is no longer an issue, can be handled via skb->cb.

> What about something like this?
> 
> 1) We keep a nf_bridge table that contains the nf_bridge structures
>    that are currently in used, so you can look it up for them every
>    time we need it. This can be implemented as a per-cpu list of
>    nf_bridge structures.

Not sure if it can be percpu.  In particular, when we pass frame up
skb can be enqueud in backlog, also we might encounter ingress scheduler
so I am not sure that skb cannot move to another cpu.

> 2) We have another per-cpu cache to hold a pointer to the current
>    nf_bridge.
> 
> 3) We only need one bit from sk_buff to indicate that this sk_buff has
>    nf_bridge info attached.

Yes, the "put it in skb->cb" also uses this approach, it puts 2bit
"state" information into skb and then removes the pointer.

> Then, we can use the sk_buff pointer as an unique way to identify
> the owner of the nf_bridge structure.
> 
>    struct nf_bridge {
>         struct list_head        head;
>         const struct sk_buff    *owner;
>         ...
>    }
> 
> so if we need the scratchpad area, we first have to look up for it:
> 
>    struct nf_bridge *nf_bridge_find(struct sk_buff *skb)
>    {
>         ...
> 
>         /* First, check if we have it in the per_cpu cache. */
>         nf_bridge = per_cpu_ptr(nf_bridge_pcpu);
>         if (nf_bridge->owner == skb)
>             return nf_bridge;
> 
>         /* Otherwise, slow path: find this scratchpad area in the list. */
>         ...
>         nf_bridge_list = per_cpu_ptr(nf_bridge_pcpu_list);
>         list_for_each_entry(n, &nf_bridge_list, head) {
>             if (nf_bridge->owner == skb) {
>                 /* Update the per_cpu cache. */
>                 rcu_assign_pointer(nf_bridge, n);
>                 return nf_bridge;
>             }
>         }
> 
>         return NULL;
>    }

Ok, I see.  This would work, but I'd prefer to just use the skb control
buffer.

> I'm proposing this because I have concerns with the approach to place
> nf_bridge in skb->cb. This br_netfilter thing makes the skb go back
> and forth from different layers while expecting to find the right
> right data in it, this seems fragile to me.

Mhhh, for the *usual* case of "skb is forwarded by bridge" skb->cb
should be safe since we only move skb between bridge and inet(6).

The only "problematic" case is where skb is DNAT'd, then things get
hairy (e.e.g dnat-to-host-on-other device).

> What do you think?

I think that currently it doesn't matter what solution we'll pick in the
end, I'd first like to send all my refactoring patches.

With those patches, it reduces the nf_bridge_info content that we need
to the physin and physout devices, plus a two-bit state in skb.

Then, if you still think that ->cb is too fragile I'd be happy to add
the lookup table method.  For the normal case of bridge forwarding, it
shouldn't be needed though, perhaps we could also use a hybrid approach,
cb-based fastpath for forwarding, lookup-table slowpath for dnat and
local delivery cases.

One other solution would be to use skb->cb but not store bridge device
pointers but the ifindexes instead (we currently don't hold any
device refcounts either so the current method isn't 100% safe either since
such device can be removed ...).

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH nf-next 0/8] netfilter: untangle bridge and bridge netfilter
  2015-03-14 11:13       ` Florian Westphal
@ 2015-03-16 12:38         ` Pablo Neira Ayuso
  2015-03-16 13:01           ` Florian Westphal
  2015-03-16 13:41           ` Florian Westphal
  0 siblings, 2 replies; 22+ messages in thread
From: Pablo Neira Ayuso @ 2015-03-16 12:38 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel, netdev

Hi Florian,

On Sat, Mar 14, 2015 at 12:13:25PM +0100, Florian Westphal wrote:
> I think that currently it doesn't matter what solution we'll pick in the
> end, I'd first like to send all my refactoring patches.
> 
> With those patches, it reduces the nf_bridge_info content that we need
> to the physin and physout devices, plus a two-bit state in skb.
> 
> Then, if you still think that ->cb is too fragile I'd be happy to add
> the lookup table method.  For the normal case of bridge forwarding, it
> shouldn't be needed though, perhaps we could also use a hybrid approach,
> cb-based fastpath for forwarding, lookup-table slowpath for dnat and
> local delivery cases.

I think the use of skb->cb introduces another dependency between
br_netfilter and the core, which is what we should avoid.  I still
have concerns regarding future extensibility issues that this may
introduce:

1) If we want to shrink the skb->cb in size, br_netfilter would limit
   this shrink since we'll have to keep enough room for our internal
   private information, even if most people will not need this.

2) If inet or qdisc private info size needs to be increased for
   whatever reason, we may run out of space to keep the br_netfilter data
   after it. And we cannot ask David to allow us to allocate more bytes
   for skb->cb to resolve this.

I think we have to find some permanent solution to isolate this
br_netfilter code as much as possible. Then, focus on providing a
replacement including the minimal set of features for what most users
are using br_netfilter, I would say:

* nfqueue support
* transparent proxying
* some simple way to perform stateless NAT (mangle mac and IP address
  to place the packet in the right interface).

so we can point them to nf_tables to progressively migrate to a better
environment, the overall goal to me is to deprecate this br_netfilter
thing.

I think we can use the per-cpu cache + global hashtable to look up for
nf_bridge data (using skbuff as index). The forwarded traffic will
find its nf_bridge info most of the time in the cache without having
to follow the slow lookup path.

> One other solution would be to use skb->cb but not store bridge device
> pointers but the ifindexes instead (we currently don't hold any
> device refcounts either so the current method isn't 100% safe either since
> such device can be removed ...).

Good catch, I think it would be great to start by fixing this problem
first.

Let me know,
Thanks!

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH nf-next 0/8] netfilter: untangle bridge and bridge netfilter
  2015-03-16 12:38         ` Pablo Neira Ayuso
@ 2015-03-16 13:01           ` Florian Westphal
  2015-03-16 13:47             ` Pablo Neira Ayuso
  2015-03-16 13:41           ` Florian Westphal
  1 sibling, 1 reply; 22+ messages in thread
From: Florian Westphal @ 2015-03-16 13:01 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Florian Westphal, netfilter-devel, netdev

Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Sat, Mar 14, 2015 at 12:13:25PM +0100, Florian Westphal wrote:
> > I think that currently it doesn't matter what solution we'll pick in the
> > end, I'd first like to send all my refactoring patches.
> > 
> > With those patches, it reduces the nf_bridge_info content that we need
> > to the physin and physout devices, plus a two-bit state in skb.
> > 
> > Then, if you still think that ->cb is too fragile I'd be happy to add
> > the lookup table method.  For the normal case of bridge forwarding, it
> > shouldn't be needed though, perhaps we could also use a hybrid approach,
> > cb-based fastpath for forwarding, lookup-table slowpath for dnat and
> > local delivery cases.
> 
> I think the use of skb->cb introduces another dependency between
> br_netfilter and the core, which is what we should avoid.  I still
> have concerns regarding future extensibility issues that this may
> introduce:
> 
> 1) If we want to shrink the skb->cb in size, br_netfilter would limit
>    this shrink since we'll have to keep enough room for our internal
>    private information, even if most people will not need this.

Given the size of tcp and gro this shouldn't be an issue.
We'd still fit into 40 bytes cb, and, if that would be a problem we
could still go with (more complex) external storage scheme.

> 2) If inet or qdisc private info size needs to be increased for
>    whatever reason, we may run out of space to keep the br_netfilter data
>    after it. And we cannot ask David to allow us to allocate more bytes
>    for skb->cb to resolve this.

Sure, absolutely.  skb->cb increase is out of question.

I don't think its a problem though, qdisc private info could even be
reduced to 16 (2 pointers on 64bit which I think is a reasonable size to
provide for qdiscs).  Infiniband has similar hack.

And tcp cb has to embed  inet/inet6 as well.

We wouldn't even need to pull of this stunt actually, we could just zap
the skb instead of reinjection... But that would induce e.g. a syn
retransmit which I want to avoid.

> I think we have to find some permanent solution to isolate this
> br_netfilter code as much as possible. Then, focus on providing a
> replacement including the minimal set of features for what most users
> are using br_netfilter, I would say:
> 
> * nfqueue support
> * transparent proxying
> * some simple way to perform stateless NAT (mangle mac and IP address
>   to place the packet in the right interface).

Please not stateless (IP) nat, it usually bites users due to lack
of helpers...  I've no problem with mac mangling.

> so we can point them to nf_tables to progressively migrate to a better
> environment, the overall goal to me is to deprecate this br_netfilter
> thing.

Yes, but it will be a long time before we can remove this thing.

> I think we can use the per-cpu cache + global hashtable to look up for
> nf_bridge data (using skbuff as index). The forwarded traffic will
> find its nf_bridge info most of the time in the cache without having
> to follow the slow lookup path.

I'm still not convinced we even need lookups in the forward case since
the skb can never leave br_netfilter context.

And we only need to store the physin/out ports, nothing else, so its a
bit sad that we need to add such lookup stuff in the first place :-/

I agree that it would still be better than the skb->nf_bridge
indirection, though.

> > One other solution would be to use skb->cb but not store bridge device
> > pointers but the ifindexes instead (we currently don't hold any
> > device refcounts either so the current method isn't 100% safe either since
> > such device can be removed ...).
> 
> Good catch, I think it would be great to start by fixing this problem
> first.

I don't see how this is easily fixable, sorry -- it would mean we'd have
to dev_hold right after allocation of nf_bridge_info struct...

We can store ifindexes of physin/physout and re-lookup, sure.

I'll start on the central storage infrastructure.
Do you want me to base it on the current patches i have or on plain
nf-next?

Thanks.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH nf-next 0/8] netfilter: untangle bridge and bridge netfilter
  2015-03-16 13:01           ` Florian Westphal
@ 2015-03-16 13:47             ` Pablo Neira Ayuso
  0 siblings, 0 replies; 22+ messages in thread
From: Pablo Neira Ayuso @ 2015-03-16 13:47 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel, netdev

On Mon, Mar 16, 2015 at 02:01:10PM +0100, Florian Westphal wrote:
> We can store ifindexes of physin/physout and re-lookup, sure.
> 
> I'll start on the central storage infrastructure.
> Do you want me to base it on the current patches i have or on plain
> nf-next?

Let me pick these first:

[-next,v2] netfilter: bridge: query conntrack about skb dnat
[-next,resend] netfilter: bridge: remove BRNF_STATE_BRIDGED flag

This, I'll ping David to ack it:

[v2,nf-next,1/6] net: untangle ip_fragment and bridge netfilter

The follow up patches, you please rebase upon the central storage
idea. I like your small state machine idea instead of playing with
flags and the new functions that encapsulate the save/restore mangling
to keep IP/bridge happy.

Thanks.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH nf-next 0/8] netfilter: untangle bridge and bridge netfilter
  2015-03-16 12:38         ` Pablo Neira Ayuso
  2015-03-16 13:01           ` Florian Westphal
@ 2015-03-16 13:41           ` Florian Westphal
  1 sibling, 0 replies; 22+ messages in thread
From: Florian Westphal @ 2015-03-16 13:41 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Florian Westphal, netfilter-devel, netdev

Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> I think the use of skb->cb introduces another dependency between
> br_netfilter and the core, which is what we should avoid.  I still
> have concerns regarding future extensibility issues that this may
> introduce:

One more thing (sorry, forgot to mention this).
We CANNOT avoid dependency between br_netfilter and the core.

Its just a matter of what this dependency looks like, and how
much of a buren it is wrt. to maintenance and debugability.

Currently the dependency is skb->nf_bridge pointer.

I proposed to replace this pointer with smaller state + place
rest of info in skb->cb, with all the problems that this entails.

Using your 'external table' will also require some interaction,
I don't see how you'd remove entries from such table except some
callback function sitting in kfree_skb path, so we'd have to replace
current nf_bridge_put() call with something like

if (skb->nf_bridge_state && nf_bridge_info_destructor_hook)
	nf_bridge_info_destructor_hook(skb);

in kfree_skb.

We'd also still need the refcounting and figure out a way to handle
skb_clone properly (although I *think* we only have to care about those
spots where the bridge clones skbs and don't need a generic solution).

I admit that your proposal has less 'hidden' dependencies compared
to skb->cb based solutions, though.

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2015-03-16 13:43 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-03-04 23:52 [PATCH nf-next 0/8] netfilter: untangle bridge and bridge netfilter Florian Westphal
2015-03-04 23:52 ` [PATCH nf-next 1/8] bridge: move mac header copying into br_netfilter Florian Westphal
2015-03-04 23:52 ` [PATCH nf-next 2/8] netfilter: bridge: move nf_bridge_update_protocol to where its used Florian Westphal
2015-03-04 23:52 ` [PATCH nf-next 3/8] netfilter: brige: move DNAT helper " Florian Westphal
2015-03-04 23:52 ` [PATCH nf-next 4/8] netfilter: bridge: refactor conditional in br_nf_dev_queue_xmit Florian Westphal
2015-03-04 23:52 ` [PATCH nf-next 5/8] net: untangle ip_fragment and bridge netfilter Florian Westphal
2015-03-04 23:52 ` [PATCH nf-next 6/8] netfilter: bridge: query conntrack about skb dnat Florian Westphal
2015-03-04 23:52 ` [PATCH nf-next 7/8] netfilter: bridge: don't use nf_bridge_info data to store mac header Florian Westphal
2015-03-04 23:52 ` [PATCH nf-next 8/8] netfilter: bridge: rename nf_bridge_info->data to dnat_orig_mac Florian Westphal
2015-03-09 13:02 ` [PATCH nf-next 0/8] netfilter: untangle bridge and bridge netfilter Pablo Neira Ayuso
2015-03-09 13:13   ` Florian Westphal
2015-03-09 16:47     ` Pablo Neira Ayuso
2015-03-09 17:16     ` David Miller
2015-03-09 17:35       ` Florian Westphal
2015-03-09 19:20         ` David Miller
2015-03-09 13:59   ` Florian Westphal
2015-03-14  9:00     ` Pablo Neira Ayuso
2015-03-14 11:13       ` Florian Westphal
2015-03-16 12:38         ` Pablo Neira Ayuso
2015-03-16 13:01           ` Florian Westphal
2015-03-16 13:47             ` Pablo Neira Ayuso
2015-03-16 13:41           ` Florian Westphal

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).