* RE: 2.6.20.7 mss negotiation and path mtu discovery mostly broken?
2007-04-25 15:10 ` Andi Kleen
@ 2007-04-25 14:31 ` Ristuccia, Brian
0 siblings, 0 replies; 9+ messages in thread
From: Ristuccia, Brian @ 2007-04-25 14:31 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-kernel, netdev
> > 08:39:55.493029 IP 12.33.234.69.35026 > 10.2.10.254.22: S 2768979373:2768979373(0) win 5840 <mss 1460,sackOK,timestamp 3873837730 0,nop,wscale 2>
> > 08:39:55.493119 IP 10.2.10.254.22 > 12.33.234.69.35026: S 963242385:963242385(0) ack 2768979374 win 17896 <mss 8960,sackOK,timestamp 413751
>
> The MSS clamp for sending to 10.2.10.254.22 is 8960. MSS is
> only one way -- each uses what the other tells it.
>
Right - except that 10.2.10.254 keeps sending to 12.33.234.69 with an
increasingly large MSS, even though 12.33.234.69 has asked for no larger
than 1460.
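
As a rough sketch of the arithmetic behind "each uses what the other tells it": the payload a sender may put in one segment is bounded by the MSS the peer advertised, less any TCP option bytes carried in every segment. The function and constant names below are hypothetical, not kernel identifiers, and the 12-byte figure assumes the timestamp option padded with two NOPs, as the `<nop,nop,timestamp ...>` entries in the dumps suggest.

```python
# Illustrative only: names are hypothetical, not kernel identifiers.
TCP_TIMESTAMP_OPTION_BYTES = 12  # 10-byte timestamp option plus two NOP pads


def max_payload(peer_advertised_mss, timestamps=True):
    """Upper bound on TCP payload per segment sent toward a peer."""
    overhead = TCP_TIMESTAMP_OPTION_BYTES if timestamps else 0
    return peer_advertised_mss - overhead


# 12.33.234.69 advertised mss 1460, so 10.2.10.254 should send
# at most 1448 bytes of payload per segment.
to_client = max_payload(1460)

# 10.2.10.254 advertised mss 8960; this bounds what 12.33.234.69 may send,
# although its own 1500-byte MTU limits it further in practice.
to_server = max_payload(8960)

print(to_client, to_server)
```

The 1448-byte figure matches the size of the deliverable packets in the dump.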
> > In the following dump, the system eventually gets in a state where it
> > oscillates between sending undeliverable 2896 byte packets and
> > deliverable 1448 byte ones.
>
> This should only happen on PMTU expire, which is normally ~15mins.
> Perhaps you misconfigured it manually using sysctl.
>
This is /proc/sys/net/ipv4/route/mtu_expires? It's 600.
--
Brian Ristuccia
"This email message and any attachments are confidential information of Starent Networks, Corp. The information transmitted may not be used to create or change any contractual obligations of Starent Networks, Corp. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon this e-mail and its attachments by persons or entities other than the intended recipient is prohibited. If you are not the intended recipient, please notify the sender immediately -- by replying to this message or by sending an email to postmaster@starentnetworks.com -- and destroy all copies of this message and any attachments without reading or disclosing their contents. Thank you."
* Re: 2.6.20.7 mss negotiation and path mtu discovery mostly broken?
[not found] <7CCD07160348804497EF29E9EA5560D7020F6203@exchtewks2.starentnetworks.com>
@ 2007-04-25 15:10 ` Andi Kleen
2007-04-25 14:31 ` Ristuccia, Brian
0 siblings, 1 reply; 9+ messages in thread
From: Andi Kleen @ 2007-04-25 15:10 UTC (permalink / raw)
To: Ristuccia, Brian; +Cc: linux-kernel, netdev
"Ristuccia, Brian" <bristuccia@starentnetworks.com> writes:
> I'm seeing a problem where the kernel attempts to send packets with a
> MSS larger than the one negotiated when the TCP connection is
> established. Even after ICMP "can't fragment" messages arrive, the
> kernel still attempts to increase the MSS rather aggressively. The end
> result is extremely poor throughput when sending to a network with a
> smaller MTU.
>
> In /proc/sys/net/ipv4:
> ip_no_pmtu_disc:0
> tcp_mtu_probing:0
>
> The sending host (10.2.10.254) has an MTU of 9000. The destination host
> (12.33.234.69) has an MTU of 1500. There is one router between the hosts
> which will drop packets with the "DF" flag when they don't fit the
> destination interface's MTU and generates the required icmp can't
> fragment message.
>
> The dump shows the initial handshake with correct mss options sent:
>
> 08:39:55.493029 IP 12.33.234.69.35026 > 10.2.10.254.22: S 2768979373:2768979373(0) win 5840 <mss 1460,sackOK,timestamp 3873837730 0,nop,wscale 2>
> 08:39:55.493119 IP 10.2.10.254.22 > 12.33.234.69.35026: S 963242385:963242385(0) ack 2768979374 win 17896 <mss 8960,sackOK,timestamp 413751
The MSS clamp for sending to 10.2.10.254.22 is 8960. MSS is only
one way -- each uses what the other tells it.
> In the following dump, the system eventually gets in a state where it
> oscillates between sending undeliverable 2896 byte packets and
> deliverable 1448 byte ones.
This should only happen on PMTU expire, which is normally ~15mins.
Perhaps you misconfigured it manually using sysctl.
-Andi
* 2.6.20.7 mss negotiation and path mtu discovery mostly broken?
@ 2007-04-25 20:09 Ristuccia, Brian
2007-04-25 20:11 ` Ristuccia, Brian
2007-04-25 22:48 ` Herbert Xu
0 siblings, 2 replies; 9+ messages in thread
From: Ristuccia, Brian @ 2007-04-25 20:09 UTC (permalink / raw)
To: netdev
I had previously posted this message to linux-kernel, but David Miller
asked me to post here instead. Some replies to my message on l-k have
already been copied here. I'm seeing a problem where the kernel attempts
to send packets with a MSS larger than the one negotiated when the TCP
connection is established. Even after ICMP "can't fragment" messages
arrive, the kernel still attempts to increase the MSS rather
aggressively. The end result is extremely poor throughput when sending
to a network with a smaller MTU.
In /proc/sys/net/ipv4:
ip_no_pmtu_disc:0
tcp_mtu_probing:0
The sending host (10.2.10.254) has an MTU of 9000. The destination host
(12.33.234.69) has an MTU of 1500. There is one router between the hosts
which will drop packets with the "DF" flag when they don't fit the
destination interface's MTU and generates the required icmp can't
fragment message.
The dump shows the initial handshake with correct mss options sent:
08:39:55.493029 IP 12.33.234.69.35026 > 10.2.10.254.22: S 2768979373:2768979373(0) win 5840 <mss 1460,sackOK,timestamp 3873837730 0,nop,wscale 2>
08:39:55.493119 IP 10.2.10.254.22 > 12.33.234.69.35026: S 963242385:963242385(0) ack 2768979374 win 17896 <mss 8960,sackOK,timestamp 413751 3873837730,nop,wscale 5>
Then I see the system send larger packets (larger than the mss),
provoking a "can't fragment" from the router. Now I suppose it might be
reasonable to occasionally probe a larger MSS when the current MSS is a
result of reductions due to path mtu discovery. After all, the path
taken could change over time. But when the current MSS is at the value
negotiated by the MSS option during the TCP handshake, it seems like
there's no sense in trying to send with a larger MSS. Even if there were,
there's certainly no justification for making such an attempt every
other packet (2.6.18) or every fourth packet (2.6.20.7).
In the following dump, the system eventually gets in a state where it
oscillates between sending undeliverable 2896 byte packets and
deliverable 1448 byte ones.
08:39:55.649689 IP 10.2.10.254.22 > 12.33.234.69.35026: . 5906:10250(4344) ack 1794 win 674 <nop,nop,timestamp 413790 3873837887>
08:39:55.650532 IP 10.2.10.1 > 10.2.10.254: icmp 92: 12.33.234.69 unreachable - need to frag (mtu 1500)
08:39:55.689774 IP 12.33.234.69.35026 > 10.2.10.254.22: . ack 5906 win 4544 <nop,nop,timestamp 3873837927 413790>
08:39:55.689784 IP 10.2.10.254.22 > 12.33.234.69.35026: . 10250:13146(2896) ack 1794 win 674 <nop,nop,timestamp 413800 3873837927>
08:39:55.690497 IP 10.2.10.1 > 10.2.10.254: icmp 92: 12.33.234.69 unreachable - need to frag (mtu 1500)
08:39:55.902494 IP 10.2.10.254.22 > 12.33.234.69.35026: . 5906:7354(1448) ack 1794 win 674 <nop,nop,timestamp 413853 3873837927>
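
One way to read the payload lengths in this dump is to check whether each is a whole multiple of the 1448-byte segment payload (the 1460-byte MSS minus an assumed 12 bytes of timestamp option). The helper below is a hypothetical illustration, not part of any tool.

```python
SEGMENT_PAYLOAD = 1448  # 1460-byte MSS minus 12 bytes of timestamp option (assumed)


def segments_in(payload_len, seg=SEGMENT_PAYLOAD):
    """Return the number of full segments in payload_len,
    or None if it is not an exact multiple of seg."""
    count, rest = divmod(payload_len, seg)
    return count if rest == 0 else None


observed = [4344, 2896, 1448]  # payload lengths taken from the dump above
counts = [segments_in(length) for length in observed]

for length, count in zip(observed, counts):
    print(length, count)
```

Each observed length is an exact multiple of 1448 (three, two, and one segment respectively), so the oversized packets look like several MSS-sized segments grouped together rather than arbitrary sizes.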
Since any sane router will only generate can't fragment ICMP's at a
limited rate, for two hosts on gigabit ethernet, one on a MTU 1500
subnet and another on a MTU 9000 subnet, I can move only 40-50KB/sec
over an affected TCP connection.
I was unable to find any reference to this problem in the kernel
changelogs, or even any reports of anyone else having a similar problem.
The above dumps are from 2.6.19.7. I could also reproduce the problem in
2.6.18, although the dumps looked slightly different. I was unable to
reproduce this problem with the 2.6.9-42.0.10.ELsmp kernel which ships
in RHEL4.
I can send a pcap dump to anyone interested.
--
Brian Ristuccia
* RE: 2.6.20.7 mss negotiation and path mtu discovery mostly broken?
2007-04-25 20:09 2.6.20.7 mss negotiation and path mtu discovery mostly broken? Ristuccia, Brian
@ 2007-04-25 20:11 ` Ristuccia, Brian
2007-04-25 20:16 ` David Miller
2007-04-25 22:48 ` Herbert Xu
1 sibling, 1 reply; 9+ messages in thread
From: Ristuccia, Brian @ 2007-04-25 20:11 UTC (permalink / raw)
To: netdev
> I'm seeing a
> problem where the kernel attempts to send packets with a MSS
> larger than the one negotiated when the TCP connection is
> established. Even after ICMP "can't fragment" messages
> arrive, the kernel still attempts to increase the MSS rather
> aggressively. The end result is extremely poor throughput
> when sending to a network with a smaller MTU.
I've tracked this problem to the TSO feature in the bnx2 driver. Turning
off TSO with "ethtool -K eth1 tso off" seems to work around the problem.
It appears that the bnx2 device is not using the correct mss when
performing segmentation offload.
-Brian
* Re: 2.6.20.7 mss negotiation and path mtu discovery mostly broken?
2007-04-25 20:11 ` Ristuccia, Brian
@ 2007-04-25 20:16 ` David Miller
2007-05-03 23:24 ` Michael Chan
0 siblings, 1 reply; 9+ messages in thread
From: David Miller @ 2007-04-25 20:16 UTC (permalink / raw)
To: bristuccia; +Cc: netdev, mchan
From: "Ristuccia, Brian" <bristuccia@starentnetworks.com>
Date: Wed, 25 Apr 2007 16:11:51 -0400
> > I'm seeing a
> > problem where the kernel attempts to send packets with a MSS
> > larger than the one negotiated when the TCP connection is
> > established. Even after ICMP "can't fragment" messages
> > arrive, the kernel still attempts to increase the MSS rather
> > aggressively. The end result is extremely poor throughput
> > when sending to a network with a smaller MTU.
>
> I've tracked this problem to the TSO feature in the bnx2 driver. Turning
> off TSO with "ethtool -K eth1 tso off" seems to work around the problem.
> It appears that the bnx2 device is not using the correct mss when
> performing segmentation offload.
Thanks for narrowing it down like that.
Michael can you have a look? Is the bnx2 firmware using the MTU
setting in the device and ignoring the passed in MSS or something
like that?
Thanks.
* Re: 2.6.20.7 mss negotiation and path mtu discovery mostly broken?
2007-04-25 20:09 2.6.20.7 mss negotiation and path mtu discovery mostly broken? Ristuccia, Brian
2007-04-25 20:11 ` Ristuccia, Brian
@ 2007-04-25 22:48 ` Herbert Xu
2007-04-26 15:53 ` Ristuccia, Brian
1 sibling, 1 reply; 9+ messages in thread
From: Herbert Xu @ 2007-04-25 22:48 UTC (permalink / raw)
To: Ristuccia, Brian; +Cc: netdev
Ristuccia, Brian <bristuccia@starentnetworks.com> wrote:
>
> 08:39:55.649689 IP 10.2.10.254.22 > 12.33.234.69.35026: . 5906:10250(4344) ack 1794 win 674 <nop,nop,timestamp 413790 3873837887>
> 08:39:55.650532 IP 10.2.10.1 > 10.2.10.254: icmp 92: 12.33.234.69 unreachable - need to frag (mtu 1500)
Where was this dump taken, on 10.2.10.254?
If so could you either take the dump further down the route or show
the full contents (with tcpdump -x) of the ICMP error here so that
we can see what the actual packet size was?
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* RE: 2.6.20.7 mss negotiation and path mtu discovery mostly broken?
2007-04-25 22:48 ` Herbert Xu
@ 2007-04-26 15:53 ` Ristuccia, Brian
0 siblings, 0 replies; 9+ messages in thread
From: Ristuccia, Brian @ 2007-04-26 15:53 UTC (permalink / raw)
To: Herbert Xu; +Cc: netdev
> >
> > 08:39:55.649689 IP 10.2.10.254.22 > 12.33.234.69.35026: . 5906:10250(4344) ack 1794 win 674 <nop,nop,timestamp 413790 3873837887>
> > 08:39:55.650532 IP 10.2.10.1 > 10.2.10.254: icmp 92: 12.33.234.69 unreachable - need to frag (mtu 1500)
>
> Where was this dump taken, on 10.2.10.254?
>
Yes. It appears that if tcpdump is run on the sending system it sees the
packets before segmentation offload does its thing.
> If so could you either take the dump further down the route
> or show the full contents (with tcpdump -x) of the ICMP error
> here so that we can see what the actual packet size was?
>
Here's an example IP packet followed by the resulting ICMP as seen with
tcpdump on 10.2.10.254. ICMP is complaining about a 2896 byte packet,
which leads me to believe that the TSO is using the wrong mss or simply
isn't doing anything at all.
11:36:12.006330 IP (tos 0x8, ttl 64, id 20681, offset 0, flags [DF], length: 2948) 10.2.10.254.22 > 12.33.234.69.47703: . 91224:94120(2896) ack 49 win 674 <nop,nop,timestamp 24659048 3970819017>
0x0000: 4508 0b84 50c9 4000 4006 d33c 0a02 0afe  E...P.@.@..<....
0x0010: 0c21 ea45 0016 ba57 09f8 1838 7577 90cb  .!.E...W...8uw..
0x0020: 8010 02a2 16dd 0000 0101 080a 0178 4468  .............xDh
0x0030: ecad e3c9 96f8 ad0d e909 4e89 202a 9d85  ..........N..*..
0x0040: 4b58 a65f 53d9 07f4 0435 2557 970b 943d  KX._S....5%W...=
0x0050: 1eba                                     ..
11:36:12.007019 IP (tos 0x0, ttl 64, id 43865, offset 0, flags [DF], length: 112) 10.2.10.1 > 10.2.10.254: icmp 92: 12.33.234.69 unreachable - need to frag (mtu 1500) for IP (tos 0x8, ttl 64, id 20681, offset 0, flags [DF], length: 2948) 10.2.10.254.22 > 12.33.234.69.47703: . 91224:94120(2896) ack 49 win 674 <nop,nop,timestamp 24659048 3970819017>
0x0000: 4500 0070 ab59 4000 4001 6631 0a02 0a01  E..p.Y@.@.f1....
0x0010: 0a02 0afe 0304 7cc8 0000 05dc 4508 0b84  ......|.....E...
0x0020: 50c9 4000 4006 d33c 0a02 0afe 0c21 ea45  P.@.@..<.....!.E
0x0030: 0016 ba57 09f8 1838 7577 90cb 8010 02a2  ...W...8uw......
0x0040: 706d 0000 0101 080a 0178 4468 ecad e3c9  pm.......xDh....
0x0050: 96f8
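
For anyone who wants to check the numbers by hand, the following is a minimal sketch that decodes the first 64 bytes of the ICMP packet shown above. The byte string is copied from the hex dump; the variable names are only illustrative. It pulls out the ICMP type and code, the next-hop MTU reported by the router, and the total length of the embedded IP header of the packet that was dropped.

```python
import struct

# First 64 bytes of the ICMP packet, copied from the tcpdump -x output above.
hex_dump = (
    "4500 0070 ab59 4000 4001 6631 0a02 0a01 "
    "0a02 0afe 0304 7cc8 0000 05dc 4508 0b84 "
    "50c9 4000 4006 d33c 0a02 0afe 0c21 ea45 "
    "0016 ba57 09f8 1838 7577 90cb 8010 02a2"
)
packet = bytes.fromhex(hex_dump.replace(" ", ""))

# Outer IP header: low nibble of the first byte is the header length in words.
outer_ihl = (packet[0] & 0x0F) * 4
icmp = packet[outer_ihl:]

# ICMP destination unreachable: type, code, checksum, unused, next-hop MTU.
icmp_type, icmp_code, _checksum, _unused, next_hop_mtu = struct.unpack(
    "!BBHHH", icmp[:8]
)

# The ICMP payload starts with the IP header of the offending packet.
inner = icmp[8:]
inner_ihl = (inner[0] & 0x0F) * 4
inner_total_len = struct.unpack("!H", inner[2:4])[0]

# TCP data offset is the high nibble of byte 12 of the TCP header.
tcp_header_len = (inner[inner_ihl + 12] >> 4) * 4
tcp_payload_len = inner_total_len - inner_ihl - tcp_header_len

print(icmp_type, icmp_code, next_hop_mtu, inner_total_len, tcp_payload_len)
```

The decoded values are type 3, code 4 (fragmentation needed), a next-hop MTU of 1500, and an offending IP packet of 2948 bytes carrying 2896 bytes of TCP payload, which agrees with the tcpdump summary lines.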
-Brian
* Re: 2.6.20.7 mss negotiation and path mtu discovery mostly broken?
2007-04-25 20:16 ` David Miller
@ 2007-05-03 23:24 ` Michael Chan
2007-05-04 0:23 ` David Miller
0 siblings, 1 reply; 9+ messages in thread
From: Michael Chan @ 2007-05-03 23:24 UTC (permalink / raw)
To: David Miller; +Cc: bristuccia, netdev
On Wed, 2007-04-25 at 13:16 -0700, David Miller wrote:
> From: "Ristuccia, Brian" <bristuccia@starentnetworks.com>
> Date: Wed, 25 Apr 2007 16:11:51 -0400
>
> > > I'm seeing a
> > > problem where the kernel attempts to send packets with a MSS
> > > larger than the one negotiated when the TCP connection is
> > > established. Even after ICMP "can't fragment" messages
> > > arrive, the kernel still attempts to increase the MSS rather
> > > aggressively. The end result is extremely poor throughput
> > > when sending to a network with a smaller MTU.
> >
> > I've tracked this problem to the TSO feature in the bnx2 driver. Turning
> > off TSO with "ethtool -K eth1 tso off" seems to work around the problem.
> > It appears that the bnx2 device is not using the correct mss when
> > performing segmentation offload.
>
> Thanks for narrowing it down like that.
>
> Michael can you have a look? Is the bnx2 firmware using the MTU
> setting in the device and ignoring the passed in MSS or something
> like that?
>
This should fix it. I will send this and other selected BNX2 fixes to
stable as well. Thanks.
[BNX2]: Fix TSO problem with small MSS.
Remove the check for skb->len greater than MTU when doing TSO. When
the destination has a smaller MSS than the source, a TSO packet may
be smaller than the MTU at the source and we still need to process it
as a TSO packet.
Thanks to Brian Ristuccia <bristuccia@starentnetworks.com> for
reporting the problem.
Signed-off-by: Michael Chan <mchan@broadcom.com>
diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 605347f..88b33c6 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -4836,8 +4836,7 @@ bnx2_start_xmit(struct sk_buff *skb, struct net_device *dev)
vlan_tag_flags |=
(TX_BD_FLAGS_VLAN_TAG | (vlan_tx_tag_get(skb) << 16));
}
- if ((mss = skb_shinfo(skb)->gso_size) &&
- (skb->len > (bp->dev->mtu + ETH_HLEN))) {
+ if ((mss = skb_shinfo(skb)->gso_size)) {
u32 tcp_opt_len, ip_tcp_len;
struct iphdr *iph;
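
The effect of the removed check can be modelled outside the kernel. The sketch below is a hypothetical Python model of the two conditions in the hunk above, not driver code; it assumes skb->len counts the Ethernet header and that gso_size is the 1448-byte per-segment payload seen in the reporter's dumps.

```python
ETH_HLEN = 14  # Ethernet header length in bytes


def takes_tso_path_old(skb_len, gso_size, dev_mtu):
    """Model of the condition before the patch."""
    return gso_size != 0 and skb_len > dev_mtu + ETH_HLEN


def takes_tso_path_new(skb_len, gso_size, dev_mtu):
    """Model of the condition after the patch: gso_size alone decides."""
    return gso_size != 0


# Values from the thread: sender MTU 9000, a 2948-byte IP packet,
# and an assumed gso_size of 1448.
skb_len = 2948 + ETH_HLEN
old_result = takes_tso_path_old(skb_len, gso_size=1448, dev_mtu=9000)
new_result = takes_tso_path_new(skb_len, gso_size=1448, dev_mtu=9000)

print(old_result, new_result)
```

Under these assumptions the old condition is false, because the frame fits within the sender's 9000-byte MTU, so the packet would bypass the TSO setup and leave as a single 2948-byte IP packet. The new condition is true, so the packet is handled as TSO and split according to its gso_size.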
* Re: 2.6.20.7 mss negotiation and path mtu discovery mostly broken?
2007-05-03 23:24 ` Michael Chan
@ 2007-05-04 0:23 ` David Miller
0 siblings, 0 replies; 9+ messages in thread
From: David Miller @ 2007-05-04 0:23 UTC (permalink / raw)
To: mchan; +Cc: bristuccia, netdev
From: "Michael Chan" <mchan@broadcom.com>
Date: Thu, 03 May 2007 16:24:20 -0700
> [BNX2]: Fix TSO problem with small MSS.
>
> Remove the check for skb->len greater than MTU when doing TSO. When
> the destination has a smaller MSS than the source, a TSO packet may
> be smaller than the MTU at the source and we still need to process it
> as a TSO packet.
>
> Thanks to Brian Ristuccia <bristuccia@starentnetworks.com> for
> reporting the problem.
>
> Signed-off-by: Michael Chan <mchan@broadcom.com>
Applied, thanks Michael.