* [PATCH V2 -next 1/5] ip: reject too-big defragmented DF-skb when forwarding
2015-05-04 20:54 [PATCH V2 -next 0/5] don't exceed original maximum fragment size when refragmenting Florian Westphal
@ 2015-05-04 20:54 ` Florian Westphal
2015-05-04 20:54 ` [PATCH V2 -next 2/5] ipv6: don't increase size when refragmenting forwarded ipv6 skbs Florian Westphal
` (4 subsequent siblings)
5 siblings, 0 replies; 8+ messages in thread
From: Florian Westphal @ 2015-05-04 20:54 UTC (permalink / raw)
To: netdev; +Cc: hannes, jesse, Florian Westphal
Send icmp pmtu error if we find that the largest fragment of df-skb
exceeded the output path mtu.
The ip output path will still catch this later on but we can avoid the
forward/postrouting hook traversal by rejecting right away.
This is what ipv6 already does.
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
no change since v1.
net/ipv4/ip_forward.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)
diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index 3674484..2d3aa40 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -39,17 +39,21 @@
#include <net/route.h>
#include <net/xfrm.h>
-static bool ip_may_fragment(const struct sk_buff *skb)
-{
- return unlikely((ip_hdr(skb)->frag_off & htons(IP_DF)) == 0) ||
- skb->ignore_df;
-}
-
static bool ip_exceeds_mtu(const struct sk_buff *skb, unsigned int mtu)
{
if (skb->len <= mtu)
return false;
+ if (unlikely((ip_hdr(skb)->frag_off & htons(IP_DF)) == 0))
+ return false;
+
+ /* original fragment exceeds mtu and DF is set */
+ if (unlikely(IPCB(skb)->frag_max_size > mtu))
+ return true;
+
+ if (skb->ignore_df)
+ return false;
+
if (skb_is_gso(skb) && skb_gso_network_seglen(skb) <= mtu)
return false;
@@ -114,7 +118,7 @@ int ip_forward(struct sk_buff *skb)
IPCB(skb)->flags |= IPSKB_FORWARDED;
mtu = ip_dst_mtu_maybe_forward(&rt->dst, true);
- if (!ip_may_fragment(skb) && ip_exceeds_mtu(skb, mtu)) {
+ if (ip_exceeds_mtu(skb, mtu)) {
IP_INC_STATS(dev_net(rt->dst.dev), IPSTATS_MIB_FRAGFAILS);
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
htonl(mtu));
--
2.0.5
^ permalink raw reply related [flat|nested] 8+ messages in thread* [PATCH V2 -next 2/5] ipv6: don't increase size when refragmenting forwarded ipv6 skbs
2015-05-04 20:54 [PATCH V2 -next 0/5] don't exceed original maximum fragment size when refragmenting Florian Westphal
2015-05-04 20:54 ` [PATCH V2 -next 1/5] ip: reject too-big defragmented DF-skb when forwarding Florian Westphal
@ 2015-05-04 20:54 ` Florian Westphal
2015-05-04 20:54 ` [PATCH V2 -next 3/5] ip_fragment: don't remove df bit from defragmented packets Florian Westphal
` (3 subsequent siblings)
5 siblings, 0 replies; 8+ messages in thread
From: Florian Westphal @ 2015-05-04 20:54 UTC (permalink / raw)
To: netdev; +Cc: hannes, jesse, Florian Westphal
since commit 6aafeef03b9d ("netfilter: push reasm skb through instead of
original frag skbs") we will end up sometimes re-fragmenting skbs
that we've reassembled.
In this case, we must not use the device mtu, we might increase the
fragment size, possibly breaking connectivity as later router
might send packet-too-big errors even if sender never sent fragments
exceeding the reported mtu:
mtu 1500 - 1500:1400 - 1400:1280 - 1280
A R1 R2 B
- A sends to B, fragment size 1500
- R1 sends pkttoobig error for 1400
- A sends to B, fragment size 1400
- R2 sends pkttoobig error for 1280
- A sends to B, fragment size 1280
- R2 sends pkttoobig error for 1280 again because it sees
fragments of size 1400.
This doesn't usually occur in practice because we use fraglist to
store and use the individual fragments.
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
No change since v1.
net/ipv6/ip6_output.c | 29 ++++++++++++++++++-----------
1 file changed, 18 insertions(+), 11 deletions(-)
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 7fde1f2..604c99d 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -564,18 +564,17 @@ int ip6_fragment(struct sock *sk, struct sk_buff *skb,
/* We must not fragment if the socket is set to force MTU discovery
* or if the skb it not generated by a local socket.
*/
- if (unlikely(!skb->ignore_df && skb->len > mtu) ||
- (IP6CB(skb)->frag_max_size &&
- IP6CB(skb)->frag_max_size > mtu)) {
- if (skb->sk && dst_allfrag(skb_dst(skb)))
- sk_nocaps_add(skb->sk, NETIF_F_GSO_MASK);
+ if (unlikely(!skb->ignore_df && skb->len > mtu))
+ goto fail_toobig;
- skb->dev = skb_dst(skb)->dev;
- icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu);
- IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
- IPSTATS_MIB_FRAGFAILS);
- kfree_skb(skb);
- return -EMSGSIZE;
+ if (IP6CB(skb)->frag_max_size) {
+ if (IP6CB(skb)->frag_max_size > mtu)
+ goto fail_toobig;
+
+ /* don't send larger fragments than what we received */
+ mtu = IP6CB(skb)->frag_max_size;
+ if (mtu < IPV6_MIN_MTU)
+ mtu = IPV6_MIN_MTU;
}
if (np && np->frag_size < mtu) {
@@ -815,6 +814,14 @@ slow_path:
consume_skb(skb);
return err;
+fail_toobig:
+ if (skb->sk && dst_allfrag(skb_dst(skb)))
+ sk_nocaps_add(skb->sk, NETIF_F_GSO_MASK);
+
+ skb->dev = skb_dst(skb)->dev;
+ icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu);
+ err = -EMSGSIZE;
+
fail:
IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
IPSTATS_MIB_FRAGFAILS);
--
2.0.5
^ permalink raw reply related [flat|nested] 8+ messages in thread* [PATCH V2 -next 3/5] ip_fragment: don't remove df bit from defragmented packets
2015-05-04 20:54 [PATCH V2 -next 0/5] don't exceed original maximum fragment size when refragmenting Florian Westphal
2015-05-04 20:54 ` [PATCH V2 -next 1/5] ip: reject too-big defragmented DF-skb when forwarding Florian Westphal
2015-05-04 20:54 ` [PATCH V2 -next 2/5] ipv6: don't increase size when refragmenting forwarded ipv6 skbs Florian Westphal
@ 2015-05-04 20:54 ` Florian Westphal
2015-05-04 20:54 ` [PATCH RFC 4/5] net: ip_fragment: attempt to preserve frag sizes for netfilter defragmented skbs Florian Westphal
` (2 subsequent siblings)
5 siblings, 0 replies; 8+ messages in thread
From: Florian Westphal @ 2015-05-04 20:54 UTC (permalink / raw)
To: netdev; +Cc: hannes, jesse, Florian Westphal
We always send fragments without DF bit set.
Thus, given following setup:
mtu1500 - mtu1500:1400 - mtu1400:1280 - mtu1280
A R1 R2 B
Where R1 and R2 run linux with netfilter defragmentation/conntrack
enabled, then if Host A sent a fragmented packet _with_ DF set to B, R1
will respond with icmp too big error if one of these fragments exceeded
1400 bytes. So far, so good.
However, the host A will never learn about the lower 1280 link.
The next packet presumably sent by A would use 1400 as the new fragment
size, but R1 will strip DF bit when refragmenting.
Whats worse: If R1 receives fragment sizes 1200 and 100, it would
forward the reassembled packet without refragmenting, i.e.
R2 would send an icmp error in response to a packet that was never sent,
citing a mtu that the original sender never exceeded.
In order to 'replay' the original fragments to preserve their semantics,
the simple solution is to
1. set DF bit on the new fragments if it was set on original ones.
2. set the size of the new fragments generated by R1 during
refragmentation to the largest size seen when defragmenting.
R2 will then notice the problem and will send the expected
'too big, use 1280' icmp error, and further fragments of this size
would not grow anymore to 1400 link mtu when R1 refragments.
There is however, one important caveat. We cannot just use existing
IPCB(skb)->frag_max_size as upper boundary for refragmentation.
We have to consider a case where we receive a large fragment without DF,
followed by a small fragment with DF set.
In such scenario we must not generate a large spew of small DF-fragments
(else we induce packet/traffic amplification).
This modifies ip_fragment so that we track largest fragment size seen
both for DF and non-DF packets, and set frag_max_size to the largest
value.
If the DF fragment value is not smaller than the largest with DF set,
we will let the output path know that it needs to refragment
using frag_max_size sized packets, also setting DF bit on each fragment.
If frag_max_size exceeds the mtu, we already drop the skb and send the
needed icmp error.
Joint work with Hannes Frederic Sowa.
Reported-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
Changes since v1:
- always store frag_max_size and use it to avoid increasing
the size of the fragments when refragmenting.
include/net/inet_frag.h | 2 +-
include/net/ip.h | 1 +
net/ipv4/ip_fragment.c | 31 ++++++++++++++++++++++++++-----
net/ipv4/ip_output.c | 31 ++++++++++++++++++++++---------
4 files changed, 50 insertions(+), 15 deletions(-)
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 8d17655..e1300b3 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -43,7 +43,7 @@ enum {
* @len: total length of the original datagram
* @meat: length of received fragments so far
* @flags: fragment queue flags
- * @max_size: (ipv4 only) maximum received fragment size with IP_DF set
+ * @max_size: maximum received fragment size
* @net: namespace that this frag belongs to
*/
struct inet_frag_queue {
diff --git a/include/net/ip.h b/include/net/ip.h
index d14af7e..428e694 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -45,6 +45,7 @@ struct inet_skb_parm {
#define IPSKB_FRAG_COMPLETE BIT(3)
#define IPSKB_REROUTED BIT(4)
#define IPSKB_DOREDIRECT BIT(5)
+#define IPSKB_FRAG_PMTU BIT(6)
u16 frag_max_size;
};
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index cc1da6d..ad2404f 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -75,6 +75,7 @@ struct ipq {
__be16 id;
u8 protocol;
u8 ecn; /* RFC3168 support */
+ u16 max_df_size; /* largest frag with DF set seen */
int iif;
unsigned int rid;
struct inet_peer *peer;
@@ -319,6 +320,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
{
struct sk_buff *prev, *next;
struct net_device *dev;
+ unsigned int fragsize;
int flags, offset;
int ihl, end;
int err = -ENOENT;
@@ -474,9 +476,14 @@ found:
if (offset == 0)
qp->q.flags |= INET_FRAG_FIRST_IN;
+ fragsize = skb->len + ihl;
+
+ if (fragsize > qp->q.max_size)
+ qp->q.max_size = fragsize;
+
if (ip_hdr(skb)->frag_off & htons(IP_DF) &&
- skb->len + ihl > qp->q.max_size)
- qp->q.max_size = skb->len + ihl;
+ fragsize > qp->max_df_size)
+ qp->max_df_size = fragsize;
if (qp->q.flags == (INET_FRAG_FIRST_IN | INET_FRAG_LAST_IN) &&
qp->q.meat == qp->q.len) {
@@ -606,13 +613,27 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
head->next = NULL;
head->dev = dev;
head->tstamp = qp->q.stamp;
- IPCB(head)->frag_max_size = qp->q.max_size;
+ IPCB(head)->frag_max_size = max(qp->max_df_size, qp->q.max_size);
iph = ip_hdr(head);
- /* max_size != 0 implies at least one fragment had IP_DF set */
- iph->frag_off = qp->q.max_size ? htons(IP_DF) : 0;
iph->tot_len = htons(len);
iph->tos |= ecn;
+
+ /* if largest size with df is also largest seen we should also
+ * call ip_fragment & set DF on the new fragments when forwarding
+ * to not break path mtu discovery.
+ *
+ * This check exists because we don't want to send tiny packets if
+ * skb was built from small df-fragment and one huge fragment
+ * without df.
+ */
+ if (qp->max_df_size == qp->q.max_size) {
+ IPCB(head)->flags |= IPSKB_FRAG_PMTU;
+ iph->frag_off = htons(IP_DF);
+ } else {
+ iph->frag_off = 0;
+ }
+
IP_INC_STATS_BH(net, IPSTATS_MIB_REASMOKS);
qp->q.fragments = NULL;
qp->q.fragments_tail = NULL;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index c65b93a..26847e4 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -270,7 +270,8 @@ static int ip_finish_output(struct sock *sk, struct sk_buff *skb)
if (skb_is_gso(skb))
return ip_finish_output_gso(sk, skb);
- if (skb->len > ip_skb_dst_mtu(skb))
+ if (skb->len > ip_skb_dst_mtu(skb) ||
+ (IPCB(skb)->flags & IPSKB_FRAG_PMTU))
return ip_fragment(sk, skb, ip_finish_output2);
return ip_finish_output2(sk, skb);
@@ -507,16 +508,19 @@ int ip_fragment(struct sock *sk, struct sk_buff *skb,
iph = ip_hdr(skb);
mtu = ip_skb_dst_mtu(skb);
- if (unlikely(((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) ||
- (IPCB(skb)->frag_max_size &&
- IPCB(skb)->frag_max_size > mtu))) {
- IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
- icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
- htonl(mtu));
- kfree_skb(skb);
- return -EMSGSIZE;
+
+ if (unlikely((iph->frag_off & htons(IP_DF)))) {
+ if (!skb->ignore_df)
+ goto fail_toobig;
+
+ if (IPCB(skb)->frag_max_size &&
+ IPCB(skb)->frag_max_size > mtu)
+ goto fail_toobig;
}
+ if (IPCB(skb)->frag_max_size && IPCB(skb)->frag_max_size < mtu)
+ mtu = IPCB(skb)->frag_max_size;
+
/*
* Setup starting values.
*/
@@ -711,6 +715,9 @@ slow_path:
iph = ip_hdr(skb2);
iph->frag_off = htons((offset >> 3));
+ if (IPCB(skb)->flags & IPSKB_FRAG_PMTU)
+ iph->frag_off |= htons(IP_DF);
+
/* ANK: dirty, but effective trick. Upgrade options only if
* the segment to be fragmented was THE FIRST (otherwise,
* options are already fixed) and make it ONCE
@@ -750,6 +757,12 @@ fail:
kfree_skb(skb);
IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
return err;
+fail_toobig:
+ IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
+ icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu));
+ kfree_skb(skb);
+ return -EMSGSIZE;
+
}
EXPORT_SYMBOL(ip_fragment);
--
2.0.5
^ permalink raw reply related [flat|nested] 8+ messages in thread* [PATCH RFC 4/5] net: ip_fragment: attempt to preserve frag sizes for netfilter defragmented skbs
2015-05-04 20:54 [PATCH V2 -next 0/5] don't exceed original maximum fragment size when refragmenting Florian Westphal
` (2 preceding siblings ...)
2015-05-04 20:54 ` [PATCH V2 -next 3/5] ip_fragment: don't remove df bit from defragmented packets Florian Westphal
@ 2015-05-04 20:54 ` Florian Westphal
2015-05-04 20:54 ` [PATCH RFC 5/5] net: set DF bit on fragment list skbs if DF was set earlier Florian Westphal
2015-05-04 23:29 ` [PATCH V2 -next 0/5] don't exceed original maximum fragment size when refragmenting David Miller
5 siblings, 0 replies; 8+ messages in thread
From: Florian Westphal @ 2015-05-04 20:54 UTC (permalink / raw)
To: netdev; +Cc: hannes, jesse, Florian Westphal, Eric Dumazet
There was interest in allowing us to record the original fragment sizes
in more detail, i.e. preserve length of all individual fragments.
This (re)enables this capability. Caveats are:
1. - this disables the optimizations made in
commit 3cc4949269e01f39443d0 ("ipv4: use skb coalescing in defragmentation")
for everyone as soon as nf_defrag_ipv4 module is loaded (it hooks earlier
than ipv4 stacks own defragmentation for local delivery).
Unfortunately there is no (easy) way to determine if we will forward the skb
at that stage.
2. - it doesn't work when skb_linearize() and friends are invoked later.
3. - we still call ip_fragment() when skbs are forwarded to (re-)create
the fragment headers.
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
I don't think this patch (and 5/5) is needed, but there was interest
in allowing to 'replay' original fragment geometry in more detail
when refragmenting, and this is one way of allowing this at least in
some cases.
net/ipv4/ip_fragment.c | 19 +++++++++++++------
1 file changed, 13 insertions(+), 6 deletions(-)
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index ad2404f..2326ae8 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -94,7 +94,7 @@ int ip_frag_mem(struct net *net)
}
static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
- struct net_device *dev);
+ struct net_device *dev, bool preserve_frags);
struct ip4_create_arg {
struct iphdr *iph;
@@ -316,7 +316,8 @@ static int ip_frag_reinit(struct ipq *qp)
}
/* Add new segment to existing queue. */
-static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
+static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb,
+ bool preserve_frags)
{
struct sk_buff *prev, *next;
struct net_device *dev;
@@ -490,7 +491,7 @@ found:
unsigned long orefdst = skb->_skb_refdst;
skb->_skb_refdst = 0UL;
- err = ip_frag_reasm(qp, prev, dev);
+ err = ip_frag_reasm(qp, prev, dev, preserve_frags);
skb->_skb_refdst = orefdst;
return err;
}
@@ -507,7 +508,7 @@ err:
/* Build a new IP datagram from all its fragments. */
static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
- struct net_device *dev)
+ struct net_device *dev, bool preserve_frags)
{
struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
struct iphdr *iph;
@@ -597,7 +598,8 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
else if (head->ip_summed == CHECKSUM_COMPLETE)
head->csum = csum_add(head->csum, fp->csum);
- if (skb_try_coalesce(head, fp, &headstolen, &delta)) {
+ if (!preserve_frags &&
+ skb_try_coalesce(head, fp, &headstolen, &delta)) {
kfree_skb_partial(fp, headstolen);
} else {
if (!skb_shinfo(head)->frag_list)
@@ -650,6 +652,11 @@ out_fail:
return err;
}
+static bool preserve_fraglist(u32 user)
+{
+ return user != IP_DEFRAG_LOCAL_DELIVER;
+}
+
/* Process an incoming IP datagram fragment. */
int ip_defrag(struct sk_buff *skb, u32 user)
{
@@ -666,7 +673,7 @@ int ip_defrag(struct sk_buff *skb, u32 user)
spin_lock(&qp->q.lock);
- ret = ip_frag_queue(qp, skb);
+ ret = ip_frag_queue(qp, skb, preserve_fraglist(user));
spin_unlock(&qp->q.lock);
ipq_put(qp);
--
2.0.5
^ permalink raw reply related [flat|nested] 8+ messages in thread* [PATCH RFC 5/5] net: set DF bit on fragment list skbs if DF was set earlier
2015-05-04 20:54 [PATCH V2 -next 0/5] don't exceed original maximum fragment size when refragmenting Florian Westphal
` (3 preceding siblings ...)
2015-05-04 20:54 ` [PATCH RFC 4/5] net: ip_fragment: attempt to preserve frag sizes for netfilter defragmented skbs Florian Westphal
@ 2015-05-04 20:54 ` Florian Westphal
2015-05-04 23:29 ` [PATCH V2 -next 0/5] don't exceed original maximum fragment size when refragmenting David Miller
5 siblings, 0 replies; 8+ messages in thread
From: Florian Westphal @ 2015-05-04 20:54 UTC (permalink / raw)
To: netdev; +Cc: hannes, jesse, Florian Westphal
This extends the previous change to also set the DF bit in the
fragment list skb, i.e. if the original fragments had DF set,
then also set it when 'replaying' the fraglist skbs.
Signed-off-by: Florian Westphal <fw@strlen.de>
---
net/ipv4/ip_output.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 26847e4..3a41844 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -543,6 +543,7 @@ int ip_fragment(struct sock *sk, struct sk_buff *skb,
if (skb_has_frag_list(skb)) {
struct sk_buff *frag, *frag2;
int first_len = skb_pagelen(skb);
+ bool is_df_skb;
if (first_len - hlen > mtu ||
((first_len - hlen) & 7) ||
@@ -579,6 +580,9 @@ int ip_fragment(struct sock *sk, struct sk_buff *skb,
skb->len = first_len;
iph->tot_len = htons(first_len);
iph->frag_off = htons(IP_MF);
+ is_df_skb = IPCB(skb)->flags & IPSKB_FRAG_PMTU;
+ if (is_df_skb)
+ iph->frag_off |= htons(IP_DF);
ip_send_check(iph);
for (;;) {
@@ -599,6 +603,8 @@ int ip_fragment(struct sock *sk, struct sk_buff *skb,
iph->frag_off = htons(offset>>3);
if (frag->next)
iph->frag_off |= htons(IP_MF);
+ if (is_df_skb)
+ iph->frag_off |= htons(IP_DF);
/* Ready, complete checksum */
ip_send_check(iph);
}
--
2.0.5
^ permalink raw reply related [flat|nested] 8+ messages in thread* Re: [PATCH V2 -next 0/5] don't exceed original maximum fragment size when refragmenting
2015-05-04 20:54 [PATCH V2 -next 0/5] don't exceed original maximum fragment size when refragmenting Florian Westphal
` (4 preceding siblings ...)
2015-05-04 20:54 ` [PATCH RFC 5/5] net: set DF bit on fragment list skbs if DF was set earlier Florian Westphal
@ 2015-05-04 23:29 ` David Miller
2015-05-04 23:48 ` Florian Westphal
5 siblings, 1 reply; 8+ messages in thread
From: David Miller @ 2015-05-04 23:29 UTC (permalink / raw)
To: fw; +Cc: netdev, hannes, jesse, herbert
From: Florian Westphal <fw@strlen.de>
Date: Mon, 4 May 2015 22:54:43 +0200
> #2 keep fragments attached to reassembled
>
> The idea is to attach the original skbs to the reassembled one, so the
> networking stack can choose which ones to use depending on the use
> case. Forwarding would operate on the original ones while code dealing
> with PACKET_HOST frames would use the reassembled one.
>
> - We have the overhead to carry more skbs around, which we
> currently don't do.
I disagree. It is much more cheaper to save around a chain of smaller
than PAGE_SIZE SKB fragments, than have to allocate multi-order linear
SKB to hold the whole thing.
Furthermore, the allocation of the incoming SKB fragments has by
definition _ALREADY_ suceeded. Therefore it is more likely to result
in successful passing of the frames through the host.
And I do not think you need to sets of packets. We have SKB
interfaces that can pull headers out of the SKB even if it crosses
a frag boundary, yet returns a pointer directly to the object inside
of skb->data if it is fully contained inside of the linear area which
is the common case.
All of this infrastructure is there and optimized for handling
spaghetti fragged SKBs if that's what we end up receiving, use it.
All of these overlapping frag etc. issues are just details, and I am
still not convinced these cannot be handled properly. Please try
harder.
> - This information cannot be stored in any of the currently
> available fields in the skb or shared_info. That said, a new
> pointer would be necessary in every skb, independently if it
> is fragmented or not. This change does impact fast path and
> skb size.
You could use the existing frag_list, or make a new member (but not
in sk_buff, but rather in the shinfo).
Just imposing a size limit does not preserve the geometry.
I consider this a show stopper, and I believe people like Herbert
Xu will agree with me.
^ permalink raw reply [flat|nested] 8+ messages in thread* Re: [PATCH V2 -next 0/5] don't exceed original maximum fragment size when refragmenting
2015-05-04 23:29 ` [PATCH V2 -next 0/5] don't exceed original maximum fragment size when refragmenting David Miller
@ 2015-05-04 23:48 ` Florian Westphal
0 siblings, 0 replies; 8+ messages in thread
From: Florian Westphal @ 2015-05-04 23:48 UTC (permalink / raw)
To: David Miller; +Cc: fw, netdev, hannes, jesse, herbert
David Miller <davem@davemloft.net> wrote:
> > #2 keep fragments attached to reassembled
> >
> > The idea is to attach the original skbs to the reassembled one, so the
> > networking stack can choose which ones to use depending on the use
> > case. Forwarding would operate on the original ones while code dealing
> > with PACKET_HOST frames would use the reassembled one.
> >
> > - We have the overhead to carry more skbs around, which we
> > currently don't do.
>
> I disagree. It is much more cheaper to save around a chain of smaller
> than PAGE_SIZE SKB fragments, than have to allocate multi-order linear
> SKB to hold the whole thing.
>
> Furthermore, the allocation of the incoming SKB fragments has by
> definition _ALREADY_ suceeded. Therefore it is more likely to result
> in successful passing of the frames through the host.
Sorry, but I am very confused.
The existing ipv4 defrag engine that we have attempts to use the
frag_list, we just don't push the individual skbs though and rather
the 'refragmented' one which will be non-linear and refer to the data
area of the fragments instead of realloactions etc.
Its just that this will 'normalize' the fragments to eliminate overlaps.
It also destroys arrival ordering (timing) because we need to put it
into correct 'logical' order.
> All of these overlapping frag etc. issues are just details, and I am
> still not convinced these cannot be handled properly. Please try
> harder.
>
> > - This information cannot be stored in any of the currently
> > available fields in the skb or shared_info. That said, a new
> > pointer would be necessary in every skb, independently if it
> > is fragmented or not. This change does impact fast path and
> > skb size.
>
> You could use the existing frag_list, or make a new member (but not
> in sk_buff, but rather in the shinfo).
>
> Just imposing a size limit does not preserve the geometry.
Yes, but there is no functionality loss. Nevertheless, patch #4
will preserve geometry for all fragments (discarding overlaps),
provided that no packet linearization took place (or any other
operation that destroys frag lists).
[ I am refering to http://patchwork.ozlabs.org/patch/467834/ ]
Is that not sufficient, i.e. what functionality is missing/which
property needs to be preserved?
Do you want arrival ordering, overlaps etc. to be kept?
Thank you.
^ permalink raw reply [flat|nested] 8+ messages in thread