[PATCH V2 -next 0/5] don't exceed original maximum fragment size when refragmenting

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH V2 -next 0/5] don't exceed original maximum fragment size when refragmenting
@ 2015-05-04 20:54 Florian Westphal
  2015-05-04 20:54 ` [PATCH V2 -next 1/5] ip: reject too-big defragmented DF-skb when forwarding Florian Westphal
                   ` (5 more replies)
  0 siblings, 6 replies; 8+ messages in thread
From: Florian Westphal @ 2015-05-04 20:54 UTC (permalink / raw)
  To: netdev; +Cc: hannes, jesse

Hello,

We would like to propose this patchset again. Only minor details
changed since the last version, we incorporated the suggestion from
Jesse to always store the size of the largest fragment received,
regardless of the DF bit.

Thus we never generate bigger fragments as originally received
regardless if DF is set ot not.

We would like to summarize the current discussion on this topic and
again would like you to consider applying this patchset to net-next:

Several proposals were suggested:

#1 employ GRO engine
	- Reassembly would only work within one napi poll run. But
	  reassembly must happen even independently of the interface
	  the frame gets received. Delays cause single fragments to
	  arrive in different napi runs, which wouldn't be aggregated.

	- We would have to kill the 1:1 correspondence between
          aggregation and segmentation: within the TCP protocol we can
          stop aggregating frames at any point without any harm
          because of it being a streaming protocol. Fragmentation is
          different in the way that we need to reassemble the complete
          packet before processing, we cannot make sense of 'half skbs'.

#2 keep fragments attached to reassembled

The idea is to attach the original skbs to the reassembled one, so the
networking stack can choose which ones to use depending on the use
case. Forwarding would operate on the original ones while code dealing
with PACKET_HOST frames would use the reassembled one.

	- We have the overhead to carry more skbs around, which we
          currently don't do.

	- This information cannot be stored in any of the currently
          available fields in the skb or shared_info. That said, a new
          pointer would be necessary in every skb, independently if it
          is fragmented or not. This change does impact fast path and
          skb size.

	- sometimes using reassembled skb or the original ones could
          lead to TOCTTOU attacks in some situations, like packet is
          split in the TCP header, core stacks sees complete
          reassembled TCP packet but netfilter only part of the
          header, so different decisions might be done

	- it does impact fast path in netfilter for every packet:
	  pskb_may_pull is not enough to check if we can eat enough of
	  the header, actually because of overlapping or duplicate
	  fragments we have to touch all those fragments, thus
	  creating new slow paths in netfilter

	- all netfilter helpers would need to adapt in case e.g. a
          udp packet containing a sip message is fragmented.

	- in case we change fragment size, we don't have clear
          semantics and the only behaviour which makes sense is what
          this patchset does (i.e., refragment).

	- still, even such complex change does not allow us to act as
          transparent router/bridge: we still have to queue up
          fragments; in case we cannot reassemble we have to drop
          them (else firewall bypass is possible).

#3 max_frag_size vector

As it is based on the idea of keep fragments attached to reassembly it
inherits a lot of the problems stated in section #2.

	- Still needs an additional way to store this information in
          the skb, thus enlarging a structure we try to shrink.

	- TOCTTOU attacks are not possible because we do inspect the
          same data all the time

	- ... but at the same time, we cannot deal with overlapping or
          duplicated fragments (without making this complex again)

For years the linux kernel never correctly handled fragmented packets
in forwarding L3 or L2 cases. We never heard any complaints. These
patches try to make Linux a better internet citizen, correctly
handling some edge cases, without harming core code and affecting
performance.

Thus we consider our proposed patches superior in all aspects. We are
happy to discuss any ideas how to solve this otherwise.

We investigated alternate approaches to allow transparent refragmentation
for the common case of "well-formed" (i.e., non-overlapping, no duplicates, ..)
fragments.  Unfortunately it involves removing an ip defragmentation
optimization in case netfilter conntrack is active.

The two patches that enable this are included as [RFC] as part of this series
so they can be discussed.

Thanks,
Hannes, Florian

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH V2 -next 1/5] ip: reject too-big defragmented DF-skb when forwarding
  2015-05-04 20:54 [PATCH V2 -next 0/5] don't exceed original maximum fragment size when refragmenting Florian Westphal
@ 2015-05-04 20:54 ` Florian Westphal
  2015-05-04 20:54 ` [PATCH V2 -next 2/5] ipv6: don't increase size when refragmenting forwarded ipv6 skbs Florian Westphal
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Florian Westphal @ 2015-05-04 20:54 UTC (permalink / raw)
  To: netdev; +Cc: hannes, jesse, Florian Westphal

Send icmp pmtu error if we find that the largest fragment of df-skb
exceeded the output path mtu.

The ip output path will still catch this later on but we can avoid the
forward/postrouting hook traversal by rejecting right away.

This is what ipv6 already does.

Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 no change since v1.

 net/ipv4/ip_forward.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index 3674484..2d3aa40 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -39,17 +39,21 @@
 #include <net/route.h>
 #include <net/xfrm.h>
 
-static bool ip_may_fragment(const struct sk_buff *skb)
-{
-	return unlikely((ip_hdr(skb)->frag_off & htons(IP_DF)) == 0) ||
-		skb->ignore_df;
-}
-
 static bool ip_exceeds_mtu(const struct sk_buff *skb, unsigned int mtu)
 {
 	if (skb->len <= mtu)
 		return false;
 
+	if (unlikely((ip_hdr(skb)->frag_off & htons(IP_DF)) == 0))
+		return false;
+
+	/* original fragment exceeds mtu and DF is set */
+	if (unlikely(IPCB(skb)->frag_max_size > mtu))
+		return true;
+
+	if (skb->ignore_df)
+		return false;
+
 	if (skb_is_gso(skb) && skb_gso_network_seglen(skb) <= mtu)
 		return false;
 
@@ -114,7 +118,7 @@ int ip_forward(struct sk_buff *skb)
 
 	IPCB(skb)->flags |= IPSKB_FORWARDED;
 	mtu = ip_dst_mtu_maybe_forward(&rt->dst, true);
-	if (!ip_may_fragment(skb) && ip_exceeds_mtu(skb, mtu)) {
+	if (ip_exceeds_mtu(skb, mtu)) {
 		IP_INC_STATS(dev_net(rt->dst.dev), IPSTATS_MIB_FRAGFAILS);
 		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
 			  htonl(mtu));
-- 
2.0.5

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH V2 -next 2/5] ipv6: don't increase size when refragmenting forwarded ipv6 skbs
  2015-05-04 20:54 [PATCH V2 -next 0/5] don't exceed original maximum fragment size when refragmenting Florian Westphal
  2015-05-04 20:54 ` [PATCH V2 -next 1/5] ip: reject too-big defragmented DF-skb when forwarding Florian Westphal
@ 2015-05-04 20:54 ` Florian Westphal
  2015-05-04 20:54 ` [PATCH V2 -next 3/5] ip_fragment: don't remove df bit from defragmented packets Florian Westphal
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Florian Westphal @ 2015-05-04 20:54 UTC (permalink / raw)
  To: netdev; +Cc: hannes, jesse, Florian Westphal

since commit 6aafeef03b9d ("netfilter: push reasm skb through instead of
original frag skbs") we will end up sometimes re-fragmenting skbs
that we've reassembled.

In this case, we must not use the device mtu, we might increase the
fragment size, possibly breaking connectivity as later router
might send packet-too-big errors even if sender never sent fragments
exceeding the reported mtu:

mtu 1500 - 1500:1400 - 1400:1280 - 1280
     A         R1         R2        B

- A sends to B, fragment size 1500
- R1 sends pkttoobig error for 1400
- A sends to B, fragment size 1400
- R2 sends pkttoobig error for 1280
- A sends to B, fragment size 1280
- R2 sends pkttoobig error for 1280 again because it sees
fragments of size 1400.

This doesn't usually occur in practice because we use fraglist to
store and use the individual fragments.

Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 No change since v1.

 net/ipv6/ip6_output.c | 29 ++++++++++++++++++-----------
 1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 7fde1f2..604c99d 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -564,18 +564,17 @@ int ip6_fragment(struct sock *sk, struct sk_buff *skb,
 	/* We must not fragment if the socket is set to force MTU discovery
 	 * or if the skb it not generated by a local socket.
 	 */
-	if (unlikely(!skb->ignore_df && skb->len > mtu) ||
-		     (IP6CB(skb)->frag_max_size &&
-		      IP6CB(skb)->frag_max_size > mtu)) {
-		if (skb->sk && dst_allfrag(skb_dst(skb)))
-			sk_nocaps_add(skb->sk, NETIF_F_GSO_MASK);
+	if (unlikely(!skb->ignore_df && skb->len > mtu))
+		goto fail_toobig;
 
-		skb->dev = skb_dst(skb)->dev;
-		icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu);
-		IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
-			      IPSTATS_MIB_FRAGFAILS);
-		kfree_skb(skb);
-		return -EMSGSIZE;
+	if (IP6CB(skb)->frag_max_size) {
+		if (IP6CB(skb)->frag_max_size > mtu)
+			goto fail_toobig;
+
+		/* don't send larger fragments than what we received */
+		mtu = IP6CB(skb)->frag_max_size;
+		if (mtu < IPV6_MIN_MTU)
+			mtu = IPV6_MIN_MTU;
 	}
 
 	if (np && np->frag_size < mtu) {
@@ -815,6 +814,14 @@ slow_path:
 	consume_skb(skb);
 	return err;
 
+fail_toobig:
+	if (skb->sk && dst_allfrag(skb_dst(skb)))
+		sk_nocaps_add(skb->sk, NETIF_F_GSO_MASK);
+
+	skb->dev = skb_dst(skb)->dev;
+	icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu);
+	err = -EMSGSIZE;
+
 fail:
 	IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
 		      IPSTATS_MIB_FRAGFAILS);
-- 
2.0.5

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH V2 -next 3/5] ip_fragment: don't remove df bit from defragmented packets
  2015-05-04 20:54 [PATCH V2 -next 0/5] don't exceed original maximum fragment size when refragmenting Florian Westphal
  2015-05-04 20:54 ` [PATCH V2 -next 1/5] ip: reject too-big defragmented DF-skb when forwarding Florian Westphal
  2015-05-04 20:54 ` [PATCH V2 -next 2/5] ipv6: don't increase size when refragmenting forwarded ipv6 skbs Florian Westphal
@ 2015-05-04 20:54 ` Florian Westphal
  2015-05-04 20:54 ` [PATCH RFC 4/5] net: ip_fragment: attempt to preserve frag sizes for netfilter defragmented skbs Florian Westphal
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: Florian Westphal @ 2015-05-04 20:54 UTC (permalink / raw)
  To: netdev; +Cc: hannes, jesse, Florian Westphal

We always send fragments without DF bit set.

Thus, given following setup:

mtu1500 - mtu1500:1400 - mtu1400:1280 - mtu1280
   A           R1               R2         B

Where R1 and R2 run linux with netfilter defragmentation/conntrack
enabled, then if Host A sent a fragmented packet _with_ DF set to B, R1
will respond with icmp too big error if one of these fragments exceeded
1400 bytes.  So far, so good.

However, the host A will never learn about the lower 1280 link.
The next packet presumably sent by A would use 1400 as the new fragment
size, but R1 will strip DF bit when refragmenting.

Whats worse: If R1 receives fragment sizes 1200 and 100, it would
forward the reassembled packet without refragmenting, i.e.
R2 would send an icmp error in response to a packet that was never sent,
citing a mtu that the original sender never exceeded.

In order to 'replay' the original fragments to preserve their semantics,
the simple solution is to

 1. set DF bit on the new fragments if it was set on original ones.
 2. set the size of the new fragments generated by R1 during
    refragmentation to the largest size seen when defragmenting.

R2 will then notice the problem and will send the expected
'too big, use 1280' icmp error, and further fragments of this size
would not grow anymore to 1400 link mtu when R1 refragments.

There is however, one important caveat. We cannot just use existing
IPCB(skb)->frag_max_size as upper boundary for refragmentation.

We have to consider a case where we receive a large fragment without DF,
followed by a small fragment with DF set.

In such scenario we must not generate a large spew of small DF-fragments
(else we induce packet/traffic amplification).

This modifies ip_fragment so that we track largest fragment size seen
both for DF and non-DF packets, and set frag_max_size to the largest
value.

If the DF fragment value is not smaller than the largest with DF set,
we will let the output path know that it needs to refragment
using frag_max_size sized packets, also setting DF bit on each fragment.

If frag_max_size exceeds the mtu, we already drop the skb and send the
needed icmp error.

Joint work with Hannes Frederic Sowa.

Reported-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 Changes since v1:
  - always store frag_max_size and use it to avoid increasing
    the size of the fragments when refragmenting.

 include/net/inet_frag.h |  2 +-
 include/net/ip.h        |  1 +
 net/ipv4/ip_fragment.c  | 31 ++++++++++++++++++++++++++-----
 net/ipv4/ip_output.c    | 31 ++++++++++++++++++++++---------
 4 files changed, 50 insertions(+), 15 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 8d17655..e1300b3 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -43,7 +43,7 @@ enum {
  * @len: total length of the original datagram
  * @meat: length of received fragments so far
  * @flags: fragment queue flags
- * @max_size: (ipv4 only) maximum received fragment size with IP_DF set
+ * @max_size: maximum received fragment size
  * @net: namespace that this frag belongs to
  */
 struct inet_frag_queue {
diff --git a/include/net/ip.h b/include/net/ip.h
index d14af7e..428e694 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -45,6 +45,7 @@ struct inet_skb_parm {
 #define IPSKB_FRAG_COMPLETE	BIT(3)
 #define IPSKB_REROUTED		BIT(4)
 #define IPSKB_DOREDIRECT	BIT(5)
+#define IPSKB_FRAG_PMTU		BIT(6)
 
 	u16			frag_max_size;
 };
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index cc1da6d..ad2404f 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -75,6 +75,7 @@ struct ipq {
 	__be16		id;
 	u8		protocol;
 	u8		ecn; /* RFC3168 support */
+	u16		max_df_size; /* largest frag with DF set seen */
 	int             iif;
 	unsigned int    rid;
 	struct inet_peer *peer;
@@ -319,6 +320,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 {
 	struct sk_buff *prev, *next;
 	struct net_device *dev;
+	unsigned int fragsize;
 	int flags, offset;
 	int ihl, end;
 	int err = -ENOENT;
@@ -474,9 +476,14 @@ found:
 	if (offset == 0)
 		qp->q.flags |= INET_FRAG_FIRST_IN;
 
+	fragsize = skb->len + ihl;
+
+	if (fragsize > qp->q.max_size)
+		qp->q.max_size = fragsize;
+
 	if (ip_hdr(skb)->frag_off & htons(IP_DF) &&
-	    skb->len + ihl > qp->q.max_size)
-		qp->q.max_size = skb->len + ihl;
+	    fragsize > qp->max_df_size)
+		qp->max_df_size = fragsize;
 
 	if (qp->q.flags == (INET_FRAG_FIRST_IN | INET_FRAG_LAST_IN) &&
 	    qp->q.meat == qp->q.len) {
@@ -606,13 +613,27 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 	head->next = NULL;
 	head->dev = dev;
 	head->tstamp = qp->q.stamp;
-	IPCB(head)->frag_max_size = qp->q.max_size;
+	IPCB(head)->frag_max_size = max(qp->max_df_size, qp->q.max_size);
 
 	iph = ip_hdr(head);
-	/* max_size != 0 implies at least one fragment had IP_DF set */
-	iph->frag_off = qp->q.max_size ? htons(IP_DF) : 0;
 	iph->tot_len = htons(len);
 	iph->tos |= ecn;
+
+	/* if largest size with df is also largest seen we should also
+	 * call ip_fragment & set DF on the new fragments when forwarding
+	 * to not break path mtu discovery.
+	 *
+	 * This check exists because we don't want to send tiny packets if
+	 * skb was built from small df-fragment and one huge fragment
+	 * without df.
+	 */
+	if (qp->max_df_size == qp->q.max_size) {
+		IPCB(head)->flags |= IPSKB_FRAG_PMTU;
+		iph->frag_off = htons(IP_DF);
+	} else {
+		iph->frag_off = 0;
+	}
+
 	IP_INC_STATS_BH(net, IPSTATS_MIB_REASMOKS);
 	qp->q.fragments = NULL;
 	qp->q.fragments_tail = NULL;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index c65b93a..26847e4 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -270,7 +270,8 @@ static int ip_finish_output(struct sock *sk, struct sk_buff *skb)
 	if (skb_is_gso(skb))
 		return ip_finish_output_gso(sk, skb);
 
-	if (skb->len > ip_skb_dst_mtu(skb))
+	if (skb->len > ip_skb_dst_mtu(skb) ||
+	    (IPCB(skb)->flags & IPSKB_FRAG_PMTU))
 		return ip_fragment(sk, skb, ip_finish_output2);
 
 	return ip_finish_output2(sk, skb);
@@ -507,16 +508,19 @@ int ip_fragment(struct sock *sk, struct sk_buff *skb,
 	iph = ip_hdr(skb);
 
 	mtu = ip_skb_dst_mtu(skb);
-	if (unlikely(((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) ||
-		     (IPCB(skb)->frag_max_size &&
-		      IPCB(skb)->frag_max_size > mtu))) {
-		IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
-		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
-			  htonl(mtu));
-		kfree_skb(skb);
-		return -EMSGSIZE;
+
+	if (unlikely((iph->frag_off & htons(IP_DF)))) {
+		if (!skb->ignore_df)
+			goto fail_toobig;
+
+		if (IPCB(skb)->frag_max_size &&
+		    IPCB(skb)->frag_max_size > mtu)
+			goto fail_toobig;
 	}
 
+	if (IPCB(skb)->frag_max_size && IPCB(skb)->frag_max_size < mtu)
+		mtu = IPCB(skb)->frag_max_size;
+
 	/*
 	 *	Setup starting values.
 	 */
@@ -711,6 +715,9 @@ slow_path:
 		iph = ip_hdr(skb2);
 		iph->frag_off = htons((offset >> 3));
 
+		if (IPCB(skb)->flags & IPSKB_FRAG_PMTU)
+			iph->frag_off |= htons(IP_DF);
+
 		/* ANK: dirty, but effective trick. Upgrade options only if
 		 * the segment to be fragmented was THE FIRST (otherwise,
 		 * options are already fixed) and make it ONCE
@@ -750,6 +757,12 @@ fail:
 	kfree_skb(skb);
 	IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
 	return err;
+fail_toobig:
+	IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
+	icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu));
+	kfree_skb(skb);
+	return -EMSGSIZE;
+
 }
 EXPORT_SYMBOL(ip_fragment);
 
-- 
2.0.5

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH RFC 4/5] net: ip_fragment: attempt to preserve frag sizes for netfilter defragmented skbs
  2015-05-04 20:54 [PATCH V2 -next 0/5] don't exceed original maximum fragment size when refragmenting Florian Westphal
                   ` (2 preceding siblings ...)
  2015-05-04 20:54 ` [PATCH V2 -next 3/5] ip_fragment: don't remove df bit from defragmented packets Florian Westphal
@ 2015-05-04 20:54 ` Florian Westphal
  2015-05-04 20:54 ` [PATCH RFC 5/5] net: set DF bit on fragment list skbs if DF was set earlier Florian Westphal
  2015-05-04 23:29 ` [PATCH V2 -next 0/5] don't exceed original maximum fragment size when refragmenting David Miller
  5 siblings, 0 replies; 8+ messages in thread
From: Florian Westphal @ 2015-05-04 20:54 UTC (permalink / raw)
  To: netdev; +Cc: hannes, jesse, Florian Westphal, Eric Dumazet

There was interest in allowing us to record the original fragment sizes
in more detail, i.e. preserve length of all individual fragments.

This (re)enables this capability. Caveats are:

1. - this disables the optimizations made in
commit 3cc4949269e01f39443d0 ("ipv4: use skb coalescing in defragmentation")
for everyone as soon as nf_defrag_ipv4 module is loaded (it hooks earlier
than ipv4 stacks own defragmentation for local delivery).

Unfortunately there is no (easy) way to determine if we will forward the skb
at that stage.

2. - it doesn't work when skb_linearize() and friends are invoked later.

3. - we still call ip_fragment() when skbs are forwarded to (re-)create
the fragment headers.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 I don't think this patch (and 5/5) is needed, but there was interest
 in allowing to 'replay' original fragment geometry in more detail
 when refragmenting, and this is one way of allowing this at least in
 some cases.

 net/ipv4/ip_fragment.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index ad2404f..2326ae8 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -94,7 +94,7 @@ int ip_frag_mem(struct net *net)
 }
 
 static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
-			 struct net_device *dev);
+			 struct net_device *dev, bool preserve_frags);
 
 struct ip4_create_arg {
 	struct iphdr *iph;
@@ -316,7 +316,8 @@ static int ip_frag_reinit(struct ipq *qp)
 }
 
 /* Add new segment to existing queue. */
-static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
+static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb,
+			 bool preserve_frags)
 {
 	struct sk_buff *prev, *next;
 	struct net_device *dev;
@@ -490,7 +491,7 @@ found:
 		unsigned long orefdst = skb->_skb_refdst;
 
 		skb->_skb_refdst = 0UL;
-		err = ip_frag_reasm(qp, prev, dev);
+		err = ip_frag_reasm(qp, prev, dev, preserve_frags);
 		skb->_skb_refdst = orefdst;
 		return err;
 	}
@@ -507,7 +508,7 @@ err:
 /* Build a new IP datagram from all its fragments. */
 
 static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
-			 struct net_device *dev)
+			 struct net_device *dev, bool preserve_frags)
 {
 	struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
 	struct iphdr *iph;
@@ -597,7 +598,8 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 		else if (head->ip_summed == CHECKSUM_COMPLETE)
 			head->csum = csum_add(head->csum, fp->csum);
 
-		if (skb_try_coalesce(head, fp, &headstolen, &delta)) {
+		if (!preserve_frags &&
+		    skb_try_coalesce(head, fp, &headstolen, &delta)) {
 			kfree_skb_partial(fp, headstolen);
 		} else {
 			if (!skb_shinfo(head)->frag_list)
@@ -650,6 +652,11 @@ out_fail:
 	return err;
 }
 
+static bool preserve_fraglist(u32 user)
+{
+	return user != IP_DEFRAG_LOCAL_DELIVER;
+}
+
 /* Process an incoming IP datagram fragment. */
 int ip_defrag(struct sk_buff *skb, u32 user)
 {
@@ -666,7 +673,7 @@ int ip_defrag(struct sk_buff *skb, u32 user)
 
 		spin_lock(&qp->q.lock);
 
-		ret = ip_frag_queue(qp, skb);
+		ret = ip_frag_queue(qp, skb, preserve_fraglist(user));
 
 		spin_unlock(&qp->q.lock);
 		ipq_put(qp);
-- 
2.0.5

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH RFC 5/5] net: set DF bit on fragment list skbs if DF was set earlier
  2015-05-04 20:54 [PATCH V2 -next 0/5] don't exceed original maximum fragment size when refragmenting Florian Westphal
                   ` (3 preceding siblings ...)
  2015-05-04 20:54 ` [PATCH RFC 4/5] net: ip_fragment: attempt to preserve frag sizes for netfilter defragmented skbs Florian Westphal
@ 2015-05-04 20:54 ` Florian Westphal
  2015-05-04 23:29 ` [PATCH V2 -next 0/5] don't exceed original maximum fragment size when refragmenting David Miller
  5 siblings, 0 replies; 8+ messages in thread
From: Florian Westphal @ 2015-05-04 20:54 UTC (permalink / raw)
  To: netdev; +Cc: hannes, jesse, Florian Westphal

This extends the previous change to also set the DF bit in the
fragment list skb, i.e. if the original fragments had DF set,
then also set it when 'replaying' the fraglist skbs.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/ipv4/ip_output.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 26847e4..3a41844 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -543,6 +543,7 @@ int ip_fragment(struct sock *sk, struct sk_buff *skb,
 	if (skb_has_frag_list(skb)) {
 		struct sk_buff *frag, *frag2;
 		int first_len = skb_pagelen(skb);
+		bool is_df_skb;
 
 		if (first_len - hlen > mtu ||
 		    ((first_len - hlen) & 7) ||
@@ -579,6 +580,9 @@ int ip_fragment(struct sock *sk, struct sk_buff *skb,
 		skb->len = first_len;
 		iph->tot_len = htons(first_len);
 		iph->frag_off = htons(IP_MF);
+		is_df_skb = IPCB(skb)->flags & IPSKB_FRAG_PMTU;
+		if (is_df_skb)
+			iph->frag_off |= htons(IP_DF);
 		ip_send_check(iph);
 
 		for (;;) {
@@ -599,6 +603,8 @@ int ip_fragment(struct sock *sk, struct sk_buff *skb,
 				iph->frag_off = htons(offset>>3);
 				if (frag->next)
 					iph->frag_off |= htons(IP_MF);
+				if (is_df_skb)
+					iph->frag_off |= htons(IP_DF);
 				/* Ready, complete checksum */
 				ip_send_check(iph);
 			}
-- 
2.0.5

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH V2 -next 0/5] don't exceed original maximum fragment size when refragmenting
  2015-05-04 20:54 [PATCH V2 -next 0/5] don't exceed original maximum fragment size when refragmenting Florian Westphal
                   ` (4 preceding siblings ...)
  2015-05-04 20:54 ` [PATCH RFC 5/5] net: set DF bit on fragment list skbs if DF was set earlier Florian Westphal
@ 2015-05-04 23:29 ` David Miller
  2015-05-04 23:48   ` Florian Westphal
  5 siblings, 1 reply; 8+ messages in thread
From: David Miller @ 2015-05-04 23:29 UTC (permalink / raw)
  To: fw; +Cc: netdev, hannes, jesse, herbert

From: Florian Westphal <fw@strlen.de>
Date: Mon,  4 May 2015 22:54:43 +0200

> #2 keep fragments attached to reassembled
> 
> The idea is to attach the original skbs to the reassembled one, so the
> networking stack can choose which ones to use depending on the use
> case. Forwarding would operate on the original ones while code dealing
> with PACKET_HOST frames would use the reassembled one.
> 
> 	- We have the overhead to carry more skbs around, which we
>           currently don't do.

I disagree.  It is much more cheaper to save around a chain of smaller
than PAGE_SIZE SKB fragments, than have to allocate multi-order linear
SKB to hold the whole thing.

Furthermore, the allocation of the incoming SKB fragments has by
definition _ALREADY_ suceeded.  Therefore it is more likely to result
in successful passing of the frames through the host.

And I do not think you need to sets of packets.  We have SKB
interfaces that can pull headers out of the SKB even if it crosses
a frag boundary, yet returns a pointer directly to the object inside
of skb->data if it is fully contained inside of the linear area which
is the common case.

All of this infrastructure is there and optimized for handling
spaghetti fragged SKBs if that's what we end up receiving, use it.

All of these overlapping frag etc. issues are just details, and I am
still not convinced these cannot be handled properly.  Please try
harder.

> 	- This information cannot be stored in any of the currently
>           available fields in the skb or shared_info. That said, a new
>           pointer would be necessary in every skb, independently if it
>           is fragmented or not. This change does impact fast path and
>           skb size.

You could use the existing frag_list, or make a new member (but not
in sk_buff, but rather in the shinfo).

Just imposing a size limit does not preserve the geometry.

I consider this a show stopper, and I believe people like Herbert
Xu will agree with me.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH V2 -next 0/5] don't exceed original maximum fragment size when refragmenting
  2015-05-04 23:29 ` [PATCH V2 -next 0/5] don't exceed original maximum fragment size when refragmenting David Miller
@ 2015-05-04 23:48   ` Florian Westphal
  0 siblings, 0 replies; 8+ messages in thread
From: Florian Westphal @ 2015-05-04 23:48 UTC (permalink / raw)
  To: David Miller; +Cc: fw, netdev, hannes, jesse, herbert

David Miller <davem@davemloft.net> wrote:
> > #2 keep fragments attached to reassembled
> > 
> > The idea is to attach the original skbs to the reassembled one, so the
> > networking stack can choose which ones to use depending on the use
> > case. Forwarding would operate on the original ones while code dealing
> > with PACKET_HOST frames would use the reassembled one.
> > 
> > 	- We have the overhead to carry more skbs around, which we
> >           currently don't do.
> 
> I disagree.  It is much more cheaper to save around a chain of smaller
> than PAGE_SIZE SKB fragments, than have to allocate multi-order linear
> SKB to hold the whole thing.
> 
> Furthermore, the allocation of the incoming SKB fragments has by
> definition _ALREADY_ suceeded.  Therefore it is more likely to result
> in successful passing of the frames through the host.

Sorry, but I am very confused.

The existing ipv4 defrag engine that we have attempts to use the
frag_list, we just don't push the individual skbs though and rather
the 'refragmented' one which will be non-linear and refer to the data
area of the fragments instead of realloactions etc.

Its just that this will 'normalize' the fragments to eliminate overlaps.
It also destroys arrival ordering (timing) because we need to put it
into correct 'logical' order.

> All of these overlapping frag etc. issues are just details, and I am
> still not convinced these cannot be handled properly.  Please try
> harder.
> 
> > 	- This information cannot be stored in any of the currently
> >           available fields in the skb or shared_info. That said, a new
> >           pointer would be necessary in every skb, independently if it
> >           is fragmented or not. This change does impact fast path and
> >           skb size.
> 
> You could use the existing frag_list, or make a new member (but not
> in sk_buff, but rather in the shinfo).
> 
> Just imposing a size limit does not preserve the geometry.

Yes, but there is no functionality loss. Nevertheless, patch #4
will preserve geometry for all fragments (discarding overlaps),
provided that no packet linearization took place (or any other
operation that destroys frag lists).

[ I am refering to http://patchwork.ozlabs.org/patch/467834/ ]

Is that not sufficient, i.e. what functionality is missing/which
property needs to be preserved?
Do you want arrival ordering, overlaps etc. to be kept?

Thank you.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2015-05-04 23:48 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-05-04 20:54 [PATCH V2 -next 0/5] don't exceed original maximum fragment size when refragmenting Florian Westphal
2015-05-04 20:54 ` [PATCH V2 -next 1/5] ip: reject too-big defragmented DF-skb when forwarding Florian Westphal
2015-05-04 20:54 ` [PATCH V2 -next 2/5] ipv6: don't increase size when refragmenting forwarded ipv6 skbs Florian Westphal
2015-05-04 20:54 ` [PATCH V2 -next 3/5] ip_fragment: don't remove df bit from defragmented packets Florian Westphal
2015-05-04 20:54 ` [PATCH RFC 4/5] net: ip_fragment: attempt to preserve frag sizes for netfilter defragmented skbs Florian Westphal
2015-05-04 20:54 ` [PATCH RFC 5/5] net: set DF bit on fragment list skbs if DF was set earlier Florian Westphal
2015-05-04 23:29 ` [PATCH V2 -next 0/5] don't exceed original maximum fragment size when refragmenting David Miller
2015-05-04 23:48   ` Florian Westphal

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).