* [RFC PATCH] net: ip_finish_output_gso: Attempt gso_size clamping if segments exceed mtu
@ 2016-08-22 12:06 Shmulik Ladkani
2016-08-22 12:58 ` Florian Westphal
0 siblings, 1 reply; 8+ messages in thread
From: Shmulik Ladkani @ 2016-08-22 12:06 UTC (permalink / raw)
To: David S. Miller
Cc: netdev, Shmulik Ladkani, Hannes Frederic Sowa, Eric Dumazet,
Florian Westphal
There are cases where gso skbs (which originate from an ingress
interface) have a gso_size value that exceeds the output dst mtu:
- ipv4 forwarding middlebox having in/out interfaces with different mtus
addressed by fe6cc55f3a 'net: ip, ipv6: handle gso skbs in forwarding path'
- bridge having a tunnel member interface stacked over a device with small mtu
addressed by b8247f095e 'net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs'
In both cases, such skbs are identified, then go through early software
segmentation+fragmentation as part of ip_finish_output_gso.
Another approach is to shrink the gso_size to a value small enough that
the resulting segments fit the dst mtu, as suggested by Eric Dumazet (as
part of [1]) and Florian Westphal (as part of [2]).
This avoids the need for software segmentation/fragmentation at
ip_finish_output_gso, thus significantly improving throughput and
lowering cpu load.
This RFC patch attempts to implement this gso_size clamping.
[1] https://patchwork.ozlabs.org/patch/314327/
[2] https://patchwork.ozlabs.org/patch/644724/
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
---
Comments welcome.
Few questions embedded in the patch.
Florian, in fe6cc55f you described a BUG due to gso_size decrease.
I've tested both bridged and routed cases, but in my setups failed to
hit the issue; Appreciate if you can provide some hints.
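To make the intended arithmetic concrete, here is a tiny user-space
illustration of the clamping math (numbers chosen for a 1500 -> 1280
forwarding path; illustrative only, not part of the patch):

  #include <stdio.h>

  int main(void)
  {
  	unsigned int gso_size = 1460;	/* ingress MSS (GRO on a 1500 mtu link) */
  	unsigned int hdr_len = 40;	/* IPv4 + TCP headers */
  	unsigned int seglen = gso_size + hdr_len;	/* skb_gso_network_seglen() -> 1500 */
  	unsigned int mtu = 1280;	/* egress dst mtu */
  	unsigned int clamped;

  	/* Same arithmetic as skb_gso_clamp() below: shrink gso_size by
  	 * exactly the amount seglen overshoots the egress mtu.
  	 */
  	clamped = gso_size - (seglen - mtu);
  	printf("clamped gso_size: %u\n", clamped);	/* prints 1240 */
  	return 0;
  }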
net/ipv4/ip_output.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 44 insertions(+)
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index dde37fb..b911b43 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -216,6 +216,40 @@ static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *s
 	return -EINVAL;
 }
 
+static inline int skb_gso_clamp(struct sk_buff *skb, unsigned int network_len)
+{
+	struct skb_shared_info *shinfo = skb_shinfo(skb);
+	unsigned short gso_size;
+	unsigned int seglen;
+
+	if (shinfo->gso_size == GSO_BY_FRAGS)
+		return -EINVAL;
+
+	seglen = skb_gso_network_seglen(skb);
+
+	/* Decrease gso_size to fit specified network length */
+	gso_size = shinfo->gso_size - (seglen - network_len);
+	if (shinfo->gso_type & SKB_GSO_UDP)
+		gso_size &= ~7;
+
+	if (!(shinfo->gso_type & SKB_GSO_DODGY)) {
+		/* Recalculate gso_segs for skbs of trusted sources.
+		 * Untrusted skbs will have their gso_segs calculated by
+		 * skb_gso_segment.
+		 */
+		unsigned int hdr_len, payload_len;
+
+		hdr_len = seglen - shinfo->gso_size;
+		payload_len = skb->len - hdr_len;
+		shinfo->gso_segs = DIV_ROUND_UP(payload_len, gso_size);
+
+		/* Q: need to verify gso_segs <= dev->gso_max_segs?
+		 * Seems to be protected by netif_skb_features. */
+	}
+	shinfo->gso_size = gso_size;
+	return 0;
+}
+
 static int ip_finish_output_gso(struct net *net, struct sock *sk,
 				struct sk_buff *skb, unsigned int mtu)
 {
@@ -237,6 +271,16 @@ static int ip_finish_output_gso(struct net *net, struct sock *sk,
 	 * 2) skb arrived via virtio-net, we thus get TSO/GSO skbs directly
 	 * from host network stack.
 	 */
+
+	/* Attempt to clamp gso_size to avoid segmenting and fragmenting.
+	 *
+	 * Q: policy needed? per device?
+	 */
+	if (sysctl_ip_output_gso_clamp) {
+		if (!skb_gso_clamp(skb, mtu))
+			return ip_finish_output2(net, sk, skb);
+	}
+
 	features = netif_skb_features(skb);
 	BUILD_BUG_ON(sizeof(*IPCB(skb)) > SKB_SGO_CB_OFFSET);
 	segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
--
1.9.1
* Re: [RFC PATCH] net: ip_finish_output_gso: Attempt gso_size clamping if segments exceed mtu
2016-08-22 12:06 [RFC PATCH] net: ip_finish_output_gso: Attempt gso_size clamping if segments exceed mtu Shmulik Ladkani
@ 2016-08-22 12:58 ` Florian Westphal
2016-08-22 13:05 ` Shmulik Ladkani
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Florian Westphal @ 2016-08-22 12:58 UTC (permalink / raw)
To: Shmulik Ladkani
Cc: David S. Miller, netdev, Hannes Frederic Sowa, Eric Dumazet,
Florian Westphal
Shmulik Ladkani <shmulik.ladkani@gmail.com> wrote:
> There are cases where gso skbs (which originate from an ingress
> interface) have a gso_size value that exceeds the output dst mtu:
>
> - ipv4 forwarding middlebox having in/out interfaces with different mtus
> addressed by fe6cc55f3a 'net: ip, ipv6: handle gso skbs in forwarding path'
> - bridge having a tunnel member interface stacked over a device with small mtu
> addressed by b8247f095e 'net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs'
>
> In both cases, such skbs are identified, then go through early software
> segmentation+fragmentation as part of ip_finish_output_gso.
>
> Another approach is to shrink the gso_size to a value small enough that
> the resulting segments fit the dst mtu, as suggested by Eric Dumazet (as
> part of [1]) and Florian Westphal (as part of [2]).
>
> This avoids the need for software segmentation/fragmentation at
> ip_finish_output_gso, thus significantly improving throughput and
> lowering cpu load.
>
> This RFC patch attempts to implement this gso_size clamping.
>
> [1] https://patchwork.ozlabs.org/patch/314327/
> [2] https://patchwork.ozlabs.org/patch/644724/
>
> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Florian Westphal <fw@strlen.de>
>
> Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
> ---
>
> Comments welcome.
>
> Few questions embedded in the patch.
>
> Florian, in fe6cc55f you described a BUG due to gso_size decrease.
> I've tested both bridged and routed cases, but in my setups failed to
> hit the issue; Appreciate if you can provide some hints.
Still get the BUG; I applied this patch on top of net-next.
On hypervisor:
10.0.0.2 via 192.168.7.10 dev tap0 mtu lock 1500
ssh root@10.0.0.2 'cat > /dev/null' < /dev/zero
On vm1 (which dies instantly, see below):
eth0 mtu 1500 (192.168.7.10)
eth1 mtu 1280 (10.0.0.1)
On vm2
eth0 mtu 1280 (10.0.0.2)
Normal ipv4 routing via vm1, no iptables etc. present, so
we have hypervisor 1500 -> 1500 VM1 1280 -> 1280 VM2
Turning off gro avoids this problem.
------------[ cut here ]------------
kernel BUG at net-next/net/core/skbuff.c:3210!
invalid opcode: 0000 [#1] SMP
CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.8.0-rc2+ #1842
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
task: ffff88013b100000 task.stack: ffff88013b0fc000
RIP: 0010:[<ffffffff8135ab44>] [<ffffffff8135ab44>] skb_segment+0x964/0xb20
RSP: 0018:ffff88013fd838d0 EFLAGS: 00010212
RAX: 00000000000005a8 RBX: ffff88013a9f9900 RCX: ffff88013b1cf500
RDX: 0000000000006612 RSI: 0000000000000494 RDI: 0000000000000114
RBP: ffff88013fd839a8 R08: 00000000000069ca R09: ffff88013b1cf400
R10: 0000000000000011 R11: 0000000000006612 R12: 00000000000064fe
R13: ffff8801394c7300 R14: ffff88013937ad80 R15: 0000000000000011
FS: 0000000000000000(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f059fc3b2b0 CR3: 0000000001806000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack:
000000000000003b ffffffffffffffbe fffffff400000000 ffff88013b1cf400
0000000000000000 0000000000000042 0000000000000040 0000000000000001
0000000000000042 ffff88013b1cf600 0000000000000000 ffff8801000004cc
Call Trace:
<IRQ>
[<ffffffff8123bacf>] ? swiotlb_map_page+0x5f/0x120
[<ffffffff813eda00>] tcp_gso_segment+0x100/0x480
[<ffffffff813eddb3>] tcp4_gso_segment+0x33/0x90
[<ffffffff813fda7a>] inet_gso_segment+0x12a/0x3b0
[<ffffffff81368c00>] ? dev_hard_start_xmit+0x20/0x110
[<ffffffff813684f0>] skb_mac_gso_segment+0x90/0xf0
[<ffffffff81368601>] __skb_gso_segment+0xb1/0x140
[<ffffffff81368a7f>] validate_xmit_skb+0x14f/0x2b0
[<ffffffff81368d2e>] validate_xmit_skb_list+0x3e/0x60
[<ffffffff8138cb6a>] sch_direct_xmit+0x10a/0x1a0
[<ffffffff81369199>] __dev_queue_xmit+0x369/0x5d0
[<ffffffff8136940b>] dev_queue_xmit+0xb/0x10
[<ffffffff813c8f47>] ip_finish_output2+0x247/0x310
[<ffffffff813cac10>] ip_finish_output+0x1c0/0x250
[<ffffffff813cadea>] ip_output+0x3a/0x40
[<ffffffff813c751c>] ip_forward+0x36c/0x410
[<ffffffff813c5b06>] ip_rcv+0x2e6/0x630
[<ffffffff81364d5f>] __netif_receive_skb_core+0x2cf/0x940
[<ffffffff813189bd>] ? e1000_alloc_rx_buffers+0x1bd/0x490
[<ffffffff813653e8>] __netif_receive_skb+0x18/0x60
[<ffffffff81365728>] netif_receive_skb_internal+0x28/0x90
[<ffffffff813ee3b0>] ? tcp4_gro_complete+0x80/0x90
[<ffffffff8136580a>] napi_gro_complete+0x7a/0xa0
[<ffffffff813697e5>] napi_gro_flush+0x55/0x70
[<ffffffff81369d06>] napi_complete_done+0x66/0xb0
[<ffffffff81319810>] e1000_clean+0x380/0x900
[<ffffffff81368c65>] ? dev_hard_start_xmit+0x85/0x110
[<ffffffff81369ef3>] net_rx_action+0x1a3/0x2b0
[<ffffffff81049c22>] __do_softirq+0xe2/0x1d0
[<ffffffff81049f09>] irq_exit+0x89/0x90
[<ffffffff810199bf>] do_IRQ+0x4f/0xd0
[<ffffffff81498882>] common_interrupt+0x82/0x82
<EOI>
[<ffffffff81035bd6>] ? native_safe_halt+0x6/0x10
[<ffffffff8101ff49>] default_idle+0x9/0x10
[<ffffffff8102052a>] arch_cpu_idle+0xa/0x10
[<ffffffff810791ce>] default_idle_call+0x2e/0x30
[<ffffffff8107933f>] cpu_startup_entry+0x16f/0x220
[<ffffffff8102d6f5>] start_secondary+0x105/0x130
Code: 00 08 02 48 89 df 44 89 44 24 18 83 e6 c0 e8 04 c7 ff ff 85 c0 0f 85 02 01 00 00 8b 83 b8 00 00 00 44 8b 44 24 18 e9 cc fe ff ff <0f> 0b 0f 0b 0f 0b 8b 4b 74 85 c9 0f 85 ce 00 00 00 48 8b 83 c0
RIP [<ffffffff8135ab44>] skb_segment+0x964/0xb20
RSP <ffff88013fd838d0>
---[ end trace 924612451efe8dce ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception in interrupt
* Re: [RFC PATCH] net: ip_finish_output_gso: Attempt gso_size clamping if segments exceed mtu
2016-08-22 12:58 ` Florian Westphal
@ 2016-08-22 13:05 ` Shmulik Ladkani
2016-08-24 14:53 ` Shmulik Ladkani
2016-08-25 9:05 ` Shmulik Ladkani
2 siblings, 0 replies; 8+ messages in thread
From: Shmulik Ladkani @ 2016-08-22 13:05 UTC (permalink / raw)
To: Florian Westphal
Cc: David S. Miller, netdev, Hannes Frederic Sowa, Eric Dumazet
Hi,
On Mon, 22 Aug 2016 14:58:42 +0200, fw@strlen.de wrote:
> >
> > Florian, in fe6cc55f you described a BUG due to gso_size decrease.
> > I've tested both bridged and routed cases, but in my setups failed to
> > hit the issue; Appreciate if you can provide some hints.
>
> Still get the BUG; I applied this patch on top of net-next.
>
> On hypervisor:
> 10.0.0.2 via 192.168.7.10 dev tap0 mtu lock 1500
> ssh root@10.0.0.2 'cat > /dev/null' < /dev/zero
>
> On vm1 (which dies instantly, see below):
> eth0 mtu 1500 (192.168.7.10)
> eth1 mtu 1280 (10.0.0.1)
>
> On vm2
> eth0 mtu 1280 (10.0.0.2)
>
> Normal ipv4 routing via vm1, no iptables etc. present, so
>
> we have hypervisor 1500 -> 1500 VM1 1280 -> 1280 VM2
>
> Turning off gro avoids this problem.
Thanks Florian, will dive into this.
* Re: [RFC PATCH] net: ip_finish_output_gso: Attempt gso_size clamping if segments exceed mtu
2016-08-22 12:58 ` Florian Westphal
2016-08-22 13:05 ` Shmulik Ladkani
@ 2016-08-24 14:53 ` Shmulik Ladkani
2016-08-24 14:59 ` Florian Westphal
2016-08-25 9:05 ` Shmulik Ladkani
2 siblings, 1 reply; 8+ messages in thread
From: Shmulik Ladkani @ 2016-08-24 14:53 UTC (permalink / raw)
To: Florian Westphal
Cc: David S. Miller, netdev, Hannes Frederic Sowa, Eric Dumazet
Hi,
On Mon, 22 Aug 2016 14:58:42 +0200, fw@strlen.de wrote:
> > Florian, in fe6cc55f you described a BUG due to gso_size decrease.
> > I've tested both bridged and routed cases, but in my setups failed to
> > hit the issue; Appreciate if you can provide some hints.
>
> Still get the BUG; I applied this patch on top of net-next.
>
> On hypervisor:
> 10.0.0.2 via 192.168.7.10 dev tap0 mtu lock 1500
> ssh root@10.0.0.2 'cat > /dev/null' < /dev/zero
>
> On vm1 (which dies instantly, see below):
> eth0 mtu 1500 (192.168.7.10)
> eth1 mtu 1280 (10.0.0.1)
>
> On vm2
> eth0 mtu 1280 (10.0.0.2)
>
> Normal ipv4 routing via vm1, no iptables etc. present, so
>
> we have hypervisor 1500 -> 1500 VM1 1280 -> 1280 VM2
>
> Turning off gro avoids this problem.
I hit the BUG only when VM2's mtu is not set to 1280 (kept to the 1500
default).
Otherwise, the hypervisor's TCP stack (sender) uses the TCP MSS advertised by
VM2 (which is 1240 when VM2's mtu is properly configured), so the GRO taking
place on VM1's eth0 is based on the arriving segments (sized 1240).
Meaning, "ingress" gso_size is actually 1240, and no "gso clamping"
occurs.
Only if VM2 has an mtu of 1500 is the MSS seen by the hypervisor during the
handshake 1460, so the GRO acting on VM1's eth0 is based on 1460-byte segments.
This leads to "gso clamping" taking place, with the BUG in skb_segment
(which, btw, seems sensitive to a change in gso_size only if GRO was
merging into the frag_list).
Can you please confirm that our setups and reproduction steps are aligned?
Thanks,
Shmulik
* Re: [RFC PATCH] net: ip_finish_output_gso: Attempt gso_size clamping if segments exceed mtu
2016-08-24 14:53 ` Shmulik Ladkani
@ 2016-08-24 14:59 ` Florian Westphal
0 siblings, 0 replies; 8+ messages in thread
From: Florian Westphal @ 2016-08-24 14:59 UTC (permalink / raw)
To: Shmulik Ladkani
Cc: Florian Westphal, David S. Miller, netdev, Hannes Frederic Sowa,
Eric Dumazet
Shmulik Ladkani <shmulik.ladkani@gmail.com> wrote:
> > Normal ipv4 routing via vm1, no iptables etc. present, so
> >
> > we have hypervisor 1500 -> 1500 VM1 1280 -> 1280 VM2
> >
> > Turning off gro avoids this problem.
>
> I hit the BUG only when VM2's mtu is not set to 1280 (kept to the 1500
> default).
Right,
> Otherwise, the hypervisor's TCP stack (sender) uses the TCP MSS advertised by
> VM2 (which is 1240 when VM2's mtu is properly configured), so the GRO taking
> place on VM1's eth0 is based on the arriving segments (sized 1240).
True.
> Only if VM2 has an mtu of 1500 is the MSS seen by the hypervisor during the
> handshake 1460, so the GRO acting on VM1's eth0 is based on 1460-byte segments.
> This leads to "gso clamping" taking place, with the BUG in skb_segment
> (which, btw, seems sensitive to a change in gso_size only if GRO was
> merging into the frag_list).
>
> Can you please confirm that our setups and reproduction steps are aligned?
Yes, seems setups are identical.
* Re: [RFC PATCH] net: ip_finish_output_gso: Attempt gso_size clamping if segments exceed mtu
2016-08-22 12:58 ` Florian Westphal
2016-08-22 13:05 ` Shmulik Ladkani
2016-08-24 14:53 ` Shmulik Ladkani
@ 2016-08-25 9:05 ` Shmulik Ladkani
2016-08-26 11:19 ` Herbert Xu
2016-09-09 5:48 ` Shmulik Ladkani
2 siblings, 2 replies; 8+ messages in thread
From: Shmulik Ladkani @ 2016-08-25 9:05 UTC (permalink / raw)
To: Florian Westphal, David S. Miller, Hannes Frederic Sowa,
Eric Dumazet, Herbert Xu
Cc: netdev
Hi,
On Mon, 22 Aug 2016 14:58:42 +0200, fw@strlen.de wrote:
> Shmulik Ladkani <shmulik.ladkani@gmail.com> wrote:
> > There are cases where gso skbs (which originate from an ingress
> > interface) have a gso_size value that exceeds the output dst mtu:
> >
> > - ipv4 forwarding middlebox having in/out interfaces with different mtus
> > addressed by fe6cc55f3a 'net: ip, ipv6: handle gso skbs in forwarding path'
> > - bridge having a tunnel member interface stacked over a device with small mtu
> > addressed by b8247f095e 'net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs'
> >
> > In both cases, such skbs are identified, then go through early software
> > segmentation+fragmentation as part of ip_finish_output_gso.
> >
> > Another approach is to shrink the gso_size to a value small enough that
> > the resulting segments fit the dst mtu, as suggested by Eric Dumazet (as
> > part of [1]) and Florian Westphal (as part of [2]).
> >
> > This avoids the need for software segmentation/fragmentation at
> > ip_finish_output_gso, thus significantly improving throughput and
> > lowering cpu load.
> >
> > This RFC patch attempts to implement this gso_size clamping.
> >
> > [1] https://patchwork.ozlabs.org/patch/314327/
> > [2] https://patchwork.ozlabs.org/patch/644724/
> >
> > Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
> > ---
> >
> > Florian, in fe6cc55f you described a BUG due to gso_size decrease.
> > I've tested both bridged and routed cases, but in my setups failed to
> > hit the issue; Appreciate if you can provide some hints.
>
> Still get the BUG; I applied this patch on top of net-next.
The BUG occurs when GRO takes place on the ingress, and only if GRO merges
skbs into the frag_list (OTOH when segments are only placed into frags[]
of a single skb, skb_segment succeeds even if gso_size was altered).
This is due to an assumption that the frag_list members terminate on
exact MSS boundaries (which no longer holds during gso_size clamping).
The assumption dates back to the original frag_list support in
skb_segment:
89319d3801 net: Add frag_list support to skb_segment
This patch adds limited support for handling frag_list packets in
skb_segment. The intention is to support GRO (Generic Receive Offload)
packets which will be constructed by chaining normal packets using
frag_list.
As such we require all frag_list members terminate on exact MSS
boundaries. This is checked using BUG_ON.
As there should only be one producer in the kernel of such packets,
namely GRO, this requirement should not be difficult to maintain.
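To see why gso_size clamping violates this requirement, consider a small
stand-alone illustration (user-space arithmetic only, numbers taken from
the repro above; not kernel code):

  #include <stdio.h>

  int main(void)
  {
  	/* GRO chained frag_list members out of full 1460-byte MSS
  	 * segments, so each member's payload is a multiple of 1460.
  	 */
  	unsigned int frag_payload = 1460;
  	unsigned int original_mss = 1460;
  	unsigned int clamped_mss = 1240;	/* after clamping for the 1280 mtu path */

  	/* skb_segment() assumes every frag_list member ends exactly on
  	 * an MSS boundary: true before clamping, false afterwards.
  	 */
  	printf("remainder vs original mss: %u\n", frag_payload % original_mss);	/* 0 */
  	printf("remainder vs clamped mss:  %u\n", frag_payload % clamped_mss);	/* 220 -> BUG_ON */
  	return 0;
  }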
We have a few alternatives for gso_size clamping:
1 Fix 'skb_segment' arithmetic to support inputs that do not match
the "frag_list members terminate on exact MSS" assumption.
2 Perform gso_size clamping in 'ip_finish_output_gso' for non-GROed skbs.
Other use cases will still benefit: (a) packets arriving from
virtualized interfaces, e.g. tap and friends; (b) packets arriving from
veth of other namespaces (packets are locally generated by TCP stack
on a different netns).
3 Ditch the idea, again ;)
We can pursue (1), especially if there are other benefits in doing so.
OTOH, due to the current complexity of 'skb_segment', this is a bit risky.
Going with (2) could be reasonable too, as it brings value for
the virtualized environments that need gso_size clamping, while
presenting minimal risk.
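For the record, a minimal sketch of how the gate for (2) might look,
assuming skb_has_frag_list() is used as a crude "GRO chained via
frag_list" indicator (untested, and it does not reliably identify GROed
skbs in general):

  	/* Sketch only: clamp solely for skbs carrying no frag_list, so
  	 * the exact-MSS frag_list assumption in skb_segment() can never
  	 * be violated.  An approximation, not a real GRO-origin test.
  	 */
  	if (!skb_has_frag_list(skb)) {
  		if (!skb_gso_clamp(skb, mtu))
  			return ip_finish_output2(net, sk, skb);
  	}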
Appreciate your opinions.
Regards,
Shmulik
* Re: [RFC PATCH] net: ip_finish_output_gso: Attempt gso_size clamping if segments exceed mtu
2016-08-25 9:05 ` Shmulik Ladkani
@ 2016-08-26 11:19 ` Herbert Xu
2016-09-09 5:48 ` Shmulik Ladkani
1 sibling, 0 replies; 8+ messages in thread
From: Herbert Xu @ 2016-08-26 11:19 UTC (permalink / raw)
To: Shmulik Ladkani
Cc: Florian Westphal, David S. Miller, Hannes Frederic Sowa,
Eric Dumazet, netdev
On Thu, Aug 25, 2016 at 12:05:33PM +0300, Shmulik Ladkani wrote:
>
> We have a few alternatives for gso_size clamping:
>
> 1 Fix 'skb_segment' arithmetic to support inputs that do not match
> the "frag_list members terminate on exact MSS" assumption.
>
> 2 Perform gso_size clamping in 'ip_finish_output_gso' for non-GROed skbs.
> Other use cases will still benefit: (a) packets arriving from
> virtualized interfaces, e.g. tap and friends; (b) packets arriving from
> veth of other namespaces (packets are locally generated by TCP stack
> on a different netns).
>
> 3 Ditch the idea, again ;)
>
> We can pursue (1), especially if there are other benefits in doing so.
> OTOH, due to the current complexity of 'skb_segment', this is a bit risky.
>
> Going with (2) could be reasonable too, as it brings value for
> the virtualized environments that need gso_size clamping, while
> presenting minimal risk.
Please remember the overarching rule for GRO: it must be completely
transparent, i.e., the output must be as if GRO didn't exist.
So if this is a DF packet, then we must generate ICMP packets
rather than downsizing the packets. If it's not DF, then we must
fragment only within each frag_list skb.
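For reference, the DF case above is roughly what the ipv4 forwarding path
already does since fe6cc55f3a; a simplified sketch modeled on
ip_forward()/ip_exceeds_mtu(), not a verbatim copy:

  	/* A forwarded GSO skb whose segments would exceed the egress mtu
  	 * and which carries the DF bit must trigger PMTU discovery, not
  	 * be silently reshaped.
  	 */
  	if (skb_is_gso(skb) && skb_gso_network_seglen(skb) > mtu &&
  	    (ip_hdr(skb)->frag_off & htons(IP_DF)) && !skb->ignore_df) {
  		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu));
  		return -EMSGSIZE;	/* drop; sender will lower its PMTU */
  	}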
Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [RFC PATCH] net: ip_finish_output_gso: Attempt gso_size clamping if segments exceed mtu
2016-08-25 9:05 ` Shmulik Ladkani
2016-08-26 11:19 ` Herbert Xu
@ 2016-09-09 5:48 ` Shmulik Ladkani
1 sibling, 0 replies; 8+ messages in thread
From: Shmulik Ladkani @ 2016-09-09 5:48 UTC (permalink / raw)
To: netdev
Cc: Florian Westphal, David S. Miller, Hannes Frederic Sowa,
Eric Dumazet, Herbert Xu, Alexander Duyck
On Thu, 25 Aug 2016 12:05:33 +0300 Shmulik Ladkani <shmulik.ladkani@gmail.com> wrote:
> The BUG occurs when GRO takes place on the ingress, and only if GRO merges
> skbs into the frag_list (OTOH when segments are only placed into frags[]
> of a single skb, skb_segment succeeds even if gso_size was altered).
>
> This is due to an assumption that the frag_list members terminate on
> exact MSS boundaries (which no longer holds during gso_size clamping).
>
> We have a few alternatives for gso_size clamping:
>
> 1 Fix 'skb_segment' arithmetic to support inputs that do not match
> the "frag_list members terminate on exact MSS" assumption.
>
> 2 Perform gso_size clamping in 'ip_finish_output_gso' for non-GROed skbs.
> Other use cases will still benefit: (a) packets arriving from
> virtualized interfaces, e.g. tap and friends; (b) packets arriving from
> veth of other namespaces (packets are locally generated by TCP stack
> on a different netns).
>
> 3 Ditch the idea, again ;)
>
> We can pursue (1), especially if there are other benefits in doing so.
> OTOH, due to the current complexity of 'skb_segment', this is a bit risky.
>
> Going with (2) could be reasonable too, as it brings value for
> the virtualized environments that need gso_size clamping, while
> presenting minimal risk.
Summarizing the actions taken, in case someone refers to this thread.
- Re (1): Spent a short while massaging skb_segment().
The code is not prepared to support arbitrary gso_size inputs.
The main issue is that if nskb's frags[] get exhausted (but the original
frag_skb's frags[] are not yet fully traversed), there's no generation of
a new skb. The code expects iteration over both nskb's frags[] and
frag_skb's frags[] to terminate together; the next allocated new
skb is always a clone of the next frag_skb in the original head_skb.
Supporting arbitrary gso_size inputs would have required an intrusive rewrite.
- Re (2): There's no easy way for ip_finish_output_gso() to detect that
the skb is safe for "gso_size clamping" while preserving GSO/GRO
transparency:
We can know an skb is "gso_size clamping safe" PER SKB, but that doesn't
suffice; to preserve the GRO transparency rule, we must know the skb
arrived from a code flow that is ALWAYS safe for gso_size clamping.
So I ended up identifying the relevant code flow of the use case I'm
interested in, verified it is indeed safe for altering gso_size (while
taking a slight risk that this might not hold true in the future), and
marked skbs on that path. I've used that mark as the criterion for safe
"gso_size clamping" in 'ip_finish_output_gso'. Yep, not too elegant.
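Conceptually, the gate ended up looking like the sketch below; note that
'gso_clamp_safe' is an invented name used purely for illustration (the
actual marking depends on the specific code flow and is not shown here):

  	/* Hypothetical sketch: only skbs explicitly marked as clamp-safe
  	 * by the ingress path may have their gso_size reduced; anything
  	 * else falls back to the regular segmentation slow path.
  	 */
  	if (skb->gso_clamp_safe && skb_gso_network_seglen(skb) > mtu) {
  		if (!skb_gso_clamp(skb, mtu))
  			return ip_finish_output2(net, sk, skb);
  	}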
Regards,
Shmulik