* Re: 2.6.17 regression: Very slow net transfer from some hosts
2006-04-11 19:30 2.6.17 regression: Very slow net transfer from some hosts Daniel Drake
@ 2006-04-11 19:21 ` Stephen Hemminger
2006-04-11 19:23 ` John Heffner
1 sibling, 0 replies; 13+ messages in thread
From: Stephen Hemminger @ 2006-04-11 19:21 UTC (permalink / raw)
To: Daniel Drake; +Cc: jheffner, netdev, linux-kernel
On Tue, 11 Apr 2006 20:30:46 +0100
Daniel Drake <dsd@gentoo.org> wrote:
> Hi,
>
> Since sometime after 2.6.16, some websites have been very slow to load.
> Examples include:
>
> http://zd1211.ath.cx
> http://developer.osdl.org/shemminger/blog/
> http://www.reactivated.net/weblog
>
> On a good kernel, "wget http://zd1211.ath.cx" says:
> 20:23:38 (90.44 KB/s) - `index.html' saved [20895/20895]
>
> On a bad kernel:
> 20:14:18 (327.01 B/s) - `index.html' saved [20895/20895]
>
> I reproduced this on two different internet connections (same ISP
> though). However I cannot reproduce it on my other system.
>
> git-bisect tracked it down to:
>
> 7b4f4b5ebceab67ce440a61081a69f0265e17c2a is first bad commit
> diff-tree 7b4f4b5ebceab67ce440a61081a69f0265e17c2a (from
> 2babf9daae4a3561f3264638a22ac7d0b14a6f52)
> Author: John Heffner <jheffner@psc.edu>
> Date: Sat Mar 25 01:34:07 2006 -0800
>
> [TCP]: Set default max buffers from memory pool size
>
> Indeed, reverting this patch from 2.6.17-rc1-git4 allows those sites to
> load again.
>
> Any ideas?
Get a tcpdump. There are tools to sanitize the file if you worry about
ip addresses, etc.
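A capture of the kind requested here might be taken roughly as sketched below. The interface name (eth0) and the output file name are assumptions made for illustration and do not come from the thread. The helper only prints the command line so it can be reviewed before being run as root.

```shell
# Print a tcpdump command line that records traffic to/from one host.
# Arguments: interface, remote host, output file (all illustrative).
capture_cmd() {
    iface=$1
    host=$2
    outfile=$3
    # -n: no name resolution, -s 0: capture whole packets,
    # -w: write a binary capture file instead of decoded text
    printf 'tcpdump -n -i %s -s 0 -w %s host %s\n' "$iface" "$outfile" "$host"
}

capture_cmd eth0 zd1211.ath.cx slow-transfer.pcap
```

Capturing the start of the connection matters most, since the window scale option is only carried in the SYN and SYN/ACK segments.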
* Re: 2.6.17 regression: Very slow net transfer from some hosts
2006-04-11 19:30 2.6.17 regression: Very slow net transfer from some hosts Daniel Drake
2006-04-11 19:21 ` Stephen Hemminger
@ 2006-04-11 19:23 ` John Heffner
2006-04-11 20:03 ` Daniel Drake
1 sibling, 1 reply; 13+ messages in thread
From: John Heffner @ 2006-04-11 19:23 UTC (permalink / raw)
To: Daniel Drake; +Cc: netdev, linux-kernel
Daniel Drake wrote:
> Hi,
>
> Since sometime after 2.6.16, some websites have been very slow to load.
> Examples include:
>
> http://zd1211.ath.cx
> http://developer.osdl.org/shemminger/blog/
> http://www.reactivated.net/weblog
>
> On a good kernel, "wget http://zd1211.ath.cx" says:
> 20:23:38 (90.44 KB/s) - `index.html' saved [20895/20895]
>
> On a bad kernel:
> 20:14:18 (327.01 B/s) - `index.html' saved [20895/20895]
>
> I reproduced this on two different internet connections (same ISP
> though). However I cannot reproduce it on my other system.
>
> git-bisect tracked it down to:
>
> 7b4f4b5ebceab67ce440a61081a69f0265e17c2a is first bad commit
> diff-tree 7b4f4b5ebceab67ce440a61081a69f0265e17c2a (from
> 2babf9daae4a3561f3264638a22ac7d0b14a6f52)
> Author: John Heffner <jheffner@psc.edu>
> Date: Sat Mar 25 01:34:07 2006 -0800
>
> [TCP]: Set default max buffers from memory pool size
>
> Indeed, reverting this patch from 2.6.17-rc1-git4 allows those sites to
> load again.
>
> Any ideas?
I'm not seeing this behavior myself. What are the values of
/proc/sys/net/ipv4/tcp_wmem, tcp_rmem, and tcp_mem? How much memory
does this system have? (A binary tcpdump might be good, too.)
Thanks,
-John
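The values asked for above can be collected in one go. This is a minimal sketch: the function name and its optional directory argument are hypothetical, and the argument exists only so the helper can be pointed somewhere other than the live procfs tree.

```shell
# Print the three TCP memory sysctls, one per line.
# Optional argument: directory holding the files (defaults to procfs).
show_tcp_mem() {
    dir=${1:-/proc/sys/net/ipv4}
    for name in tcp_wmem tcp_rmem tcp_mem; do
        if [ -r "$dir/$name" ]; then
            # Each file holds three numbers: min, default, max.
            read -r min def max < "$dir/$name"
            printf '%s: %s %s %s\n' "$name" "$min" "$def" "$max"
        else
            printf '%s: (not readable)\n' "$name"
        fi
    done
}

show_tcp_mem
```

The third number of tcp_wmem and tcp_rmem is the per-socket maximum, which is the value the patch under discussion changes.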
* 2.6.17 regression: Very slow net transfer from some hosts
@ 2006-04-11 19:30 Daniel Drake
2006-04-11 19:21 ` Stephen Hemminger
2006-04-11 19:23 ` John Heffner
0 siblings, 2 replies; 13+ messages in thread
From: Daniel Drake @ 2006-04-11 19:30 UTC (permalink / raw)
To: jheffner; +Cc: netdev, linux-kernel
Hi,
Since sometime after 2.6.16, some websites have been very slow to load.
Examples include:
http://zd1211.ath.cx
http://developer.osdl.org/shemminger/blog/
http://www.reactivated.net/weblog
On a good kernel, "wget http://zd1211.ath.cx" says:
20:23:38 (90.44 KB/s) - `index.html' saved [20895/20895]
On a bad kernel:
20:14:18 (327.01 B/s) - `index.html' saved [20895/20895]
I reproduced this on two different internet connections (same ISP
though). However I cannot reproduce it on my other system.
git-bisect tracked it down to:
7b4f4b5ebceab67ce440a61081a69f0265e17c2a is first bad commit
diff-tree 7b4f4b5ebceab67ce440a61081a69f0265e17c2a (from
2babf9daae4a3561f3264638a22ac7d0b14a6f52)
Author: John Heffner <jheffner@psc.edu>
Date: Sat Mar 25 01:34:07 2006 -0800
[TCP]: Set default max buffers from memory pool size
Indeed, reverting this patch from 2.6.17-rc1-git4 allows those sites to
load again.
Any ideas?
Thanks,
Daniel
* Re: 2.6.17 regression: Very slow net transfer from some hosts
2006-04-11 20:03 ` Daniel Drake
@ 2006-04-11 19:55 ` John Heffner
2006-04-11 20:53 ` Daniel Drake
0 siblings, 1 reply; 13+ messages in thread
From: John Heffner @ 2006-04-11 19:55 UTC (permalink / raw)
To: Daniel Drake; +Cc: netdev, linux-kernel
Daniel Drake wrote:
> John Heffner wrote:
>> I'm not seeing this behavior myself. What are the values of
>> /proc/sys/net/ipv4/tcp_wmem, tcp_rmem, and tcp_mem? How much memory
>> does this system have? (A binary tcpdump might be good, too.)
>
> tcp_wmem: 4096 16384 131072
> tcp_rmem: 4096 87380 174760
> tcp_mem: 98304 131072 196608
These are (I assume) with the patch reversed. What are the values with
the patch applied?
Thanks,
-John
* Re: 2.6.17 regression: Very slow net transfer from some hosts
2006-04-11 19:23 ` John Heffner
@ 2006-04-11 20:03 ` Daniel Drake
2006-04-11 19:55 ` John Heffner
0 siblings, 1 reply; 13+ messages in thread
From: Daniel Drake @ 2006-04-11 20:03 UTC (permalink / raw)
To: John Heffner; +Cc: netdev, linux-kernel
John Heffner wrote:
> I'm not seeing this behavior myself. What are the values of
> /proc/sys/net/ipv4/tcp_wmem, tcp_rmem, and tcp_mem? How much memory
> does this system have? (A binary tcpdump might be good, too.)
tcp_wmem: 4096 16384 131072
tcp_rmem: 4096 87380 174760
tcp_mem: 98304 131072 196608
tcpdumps coming later. This is on an x86_64 system with 1GB RAM. I
connect via the forcedeth driver but also reproduced this through ipw2200.
Thanks!
Daniel
* Re: 2.6.17 regression: Very slow net transfer from some hosts
2006-04-11 19:55 ` John Heffner
@ 2006-04-11 20:53 ` Daniel Drake
2006-04-11 20:54 ` John Heffner
0 siblings, 1 reply; 13+ messages in thread
From: Daniel Drake @ 2006-04-11 20:53 UTC (permalink / raw)
To: John Heffner; +Cc: netdev, linux-kernel
John Heffner wrote:
>> tcp_wmem: 4096 16384 131072
>> tcp_rmem: 4096 87380 174760
>> tcp_mem: 98304 131072 196608
>
> These are (I assume) with the patch reversed. What are the values with
> the patch applied?
Yes, that was on a good kernel, with the patch reversed.
On a bad kernel, with the patch applied (2.6.16-git16):
tcp_wmem: 4096 16384 4194304
tcp_rmem: 4096 87380 4194304
tcp_mem: 98304 131072 196608
They seem to be identical, which makes sense, since most websites work
just fine.
I am sending tcpdumps privately to you and Stephen. If anyone else
wants to see them, just ask.
Daniel
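The two sets of values quoted above can be compared column by column, which shows where they differ (Daniel corrects the "identical" remark later in the thread). The helper below is an illustrative sketch; its name is not from the thread.

```shell
# Report which whitespace-separated columns differ between two sysctl
# value strings (column numbers start at 1).
differing_columns() {
    old=$1
    new=$2
    col=1
    result=""
    for a in $old; do
        # Pick the matching column out of the second string.
        b=$(printf '%s\n' "$new" | awk -v c="$col" '{ print $c }')
        if [ "$a" != "$b" ]; then
            result="$result $col"
        fi
        col=$((col + 1))
    done
    # Trim the leading space before printing.
    printf '%s\n' "${result# }"
}

differing_columns "4096 87380 174760" "4096 87380 4194304"     # tcp_rmem
differing_columns "4096 16384 131072" "4096 16384 4194304"     # tcp_wmem
differing_columns "98304 131072 196608" "98304 131072 196608"  # tcp_mem
```

For the values in this message, only column 3 (the maximum) of tcp_rmem and tcp_wmem differs; tcp_mem is unchanged.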
* Re: 2.6.17 regression: Very slow net transfer from some hosts
2006-04-11 20:53 ` Daniel Drake
@ 2006-04-11 20:54 ` John Heffner
2006-04-11 22:20 ` Daniel Drake
0 siblings, 1 reply; 13+ messages in thread
From: John Heffner @ 2006-04-11 20:54 UTC (permalink / raw)
To: Daniel Drake; +Cc: netdev, linux-kernel
This is almost certainly due to a buggy firewall that doesn't understand
TCP window scaling. I've usually seen this in the past with OpenBSD
firewalls. Do you have one of these in your path?
Thanks,
-John
* Re: 2.6.17 regression: Very slow net transfer from some hosts
2006-04-11 20:54 ` John Heffner
@ 2006-04-11 22:20 ` Daniel Drake
2006-04-11 22:33 ` Stephen Hemminger
0 siblings, 1 reply; 13+ messages in thread
From: Daniel Drake @ 2006-04-11 22:20 UTC (permalink / raw)
To: John Heffner; +Cc: netdev, linux-kernel
John Heffner wrote:
> This is almost certainly due to a buggy firewall that doesn't understand
> TCP window scaling. I've usually seen this in the past with OpenBSD
> firewalls. Do you have one of these in your path?
At home I'm behind a Linux gateway box currently running 2.6.15-rc6 - I
am connected through ethernet to that.
At my student house I am connected wirelessly to a Linksys WRT54Gv5
router (the model that doesn't run Linux).
I have reproduced it at both those locations (same ISP).
This is very familiar, and I just found the article I was thinking of:
http://lwn.net/Articles/92727/
I was also hit by that bug, on the same collection of websites, but that
particular problem was fixed for 2.6.8 or so. So I guess it is extremely
likely that my ISP has broken routers. nmap isn't able to identify the
OS of any ISP routers in my path.
It's a huge ISP over here, so contacting them over technical matters is
not easy...
Thanks,
Daniel
* Re: 2.6.17 regression: Very slow net transfer from some hosts
2006-04-11 22:20 ` Daniel Drake
@ 2006-04-11 22:33 ` Stephen Hemminger
2006-04-12 0:06 ` Daniel Drake
0 siblings, 1 reply; 13+ messages in thread
From: Stephen Hemminger @ 2006-04-11 22:33 UTC (permalink / raw)
To: Daniel Drake; +Cc: John Heffner, netdev, linux-kernel
On Tue, 11 Apr 2006 23:20:42 +0100
Daniel Drake <dsd@gentoo.org> wrote:
> John Heffner wrote:
> > This is almost certainly due to a buggy firewall that doesn't understand
> > TCP window scaling. I've usually seen this in the past with OpenBSD
> > firewalls. Do you have one of these in your path?
>
> At home I'm behind a Linux gateway box currently running 2.6.15-rc6 - I
> am connected through ethernet to that.
>
> At my student house I am connected wirelessly to a Linksys WRT54Gv5
> router (the model that doesn't run Linux).
>
> I have reproduced it at both those locations (same ISP).
>
> This is very familiar, and I just found the article I was thinking of:
> http://lwn.net/Articles/92727/
>
> I was also hit by that bug, on the same collection of websites, but that
> particular problem was fixed for 2.6.8 or so. So I guess it is extremely
> likely that my ISP has broken routers. nmap isn't able to identify the
> OS of any ISP routers in my path.
We never fixed it; it's kind of hard to fix other people's equipment ;-)
>
> It's a huge ISP over here, so contacting them over technical matters is
> not easy...
>
> Thanks,
> Daniel
>
Turn off TCP window scaling. Your performance will be limited, but it will
be about as good as you can get with a corrupting firewall in between.
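How limited "limited" is can be estimated: without window scaling the TCP window field is 16 bits, so at most 65535 bytes can be in flight per round trip. The sketch below works out that ceiling; the round-trip times used are assumed examples, not measurements from this thread.

```shell
# Upper bound on throughput, in bytes per second, for an unscaled
# 16-bit TCP window (65535 bytes) at a given round-trip time in ms.
max_unscaled_throughput() {
    rtt_ms=$1
    echo $(( 65535 * 1000 / rtt_ms ))
}

max_unscaled_throughput 20    # assumed 20 ms RTT
max_unscaled_throughput 200   # assumed 200 ms RTT
```

That is roughly 3.3 MB/s at 20 ms and about 330 KB/s at 200 ms, which is far better than the few hundred bytes per second reported at the start of the thread.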
* Re: 2.6.17 regression: Very slow net transfer from some hosts
2006-04-12 0:06 ` Daniel Drake
@ 2006-04-11 23:59 ` Stephen Hemminger
2006-04-12 0:32 ` John Heffner
2006-04-12 0:42 ` John Heffner
2 siblings, 0 replies; 13+ messages in thread
From: Stephen Hemminger @ 2006-04-11 23:59 UTC (permalink / raw)
To: Daniel Drake; +Cc: John Heffner, netdev, linux-kernel
On Wed, 12 Apr 2006 01:06:09 +0100
Daniel Drake <dsd@gentoo.org> wrote:
> Stephen Hemminger wrote:
> >> This is very familiar, and I just found the article I was thinking of:
> >> http://lwn.net/Articles/92727/
> >>
> >> I was also hit by that bug, on the same collection of websites, but that
> >> particular problem was fixed for 2.6.8 or so. So I guess it is extremely
> >> likely that my ISP has broken routers. nmap isn't able to identify the
> >> OS of any ISP routers in my path.
> >
> > We never fixed it; it's kind of hard to fix other people's equipment ;-)
>
> Weird, things started working for me around 2.6.9 without having to
> modify any sysctl stuff.
What we did was make the default window scale match the maximum
possible memory usage. The normal value for tcp_wmem correlated to
a window scale of 2. If a corrupting middlebox lost the window
scale option, then the connection would proceed, but all windows would
be 1/4 of what was possible, and the connection would still limp along
at somewhat reduced bandwidth.
John's changes cause tcp_wmem to be bigger, so we ask for a bigger
window scale. If the "window scale lost in translation" problem gets
too bad, the sender will never send anything because it thinks
the receiver is doing silly-window-syndrome.
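The arithmetic behind this can be sketched as follows. It assumes the window scale is chosen by halving the maximum receive buffer (the third tcp_rmem value) until it fits in 16 bits, capped at 14; the real kernel logic may also take other limits into account, so treat this as an approximation rather than the kernel's exact code.

```shell
# Smallest shift that brings the maximum receive buffer under 64 KB,
# capped at 14 (the largest scale TCP allows).
wscale_for() {
    space=$1
    scale=0
    while [ "$space" -gt 65535 ] && [ "$scale" -lt 14 ]; do
        space=$((space >> 1))
        scale=$((scale + 1))
    done
    echo "$scale"
}

# Window a peer believes in when the scale option was stripped in transit:
# the 16-bit header field is read without the shift being applied.
# Arguments: real window in bytes, maximum receive buffer in bytes.
misread_window() {
    real_window=$1
    scale=$(wscale_for "$2")
    echo $(( real_window >> scale ))
}

wscale_for 174760             # tcp_rmem max on the good kernel
wscale_for 4194304            # tcp_rmem max with the patch applied
misread_window 65536 174760   # 64 KB window, scale lost
misread_window 65536 4194304  # 64 KB window, scale lost
```

With these inputs the scale goes from 2 to 7, so a window whose scale is lost is read as 1/128 of its real size instead of 1/4: a 64 KB window looks like 512 bytes, smaller than a typical segment.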
>
> > Turn off TCP window scaling, your performance will be limited but about
> > as good as you can get with a corrupting firewall in between.
>
> I was wrong in my previous mail where I said that the rmem/wmem output
> hasn't changed over the two kernels - it has, the 3rd column differs. I
> simply set those values back to what they were on 2.6.16 and now things
> work again - I presumably have window scale 2 (scale factor 4) again,
> which appears to be a decent compromise between having a window and
> things actually working.
>
> For anyone else interested, the ISP is NTL (UK). The fix:
>
> echo "4096 16384 131072 " > /proc/sys/net/ipv4/tcp_wmem
> echo "4096 87380 174760 " > /proc/sys/net/ipv4/tcp_rmem
>
>
> This issue is visible on my 1GB system but not on my laptop (256MB RAM).
> The key thing is that more memory means a higher window scale factor is
> used, which appears to trigger NTL's brokenness.
>
> Daniel
* Re: 2.6.17 regression: Very slow net transfer from some hosts
2006-04-11 22:33 ` Stephen Hemminger
@ 2006-04-12 0:06 ` Daniel Drake
2006-04-11 23:59 ` Stephen Hemminger
` (2 more replies)
0 siblings, 3 replies; 13+ messages in thread
From: Daniel Drake @ 2006-04-12 0:06 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: John Heffner, netdev, linux-kernel
Stephen Hemminger wrote:
>> This is very familiar, and I just found the article I was thinking of:
>> http://lwn.net/Articles/92727/
>>
>> I was also hit by that bug, on the same collection of websites, but that
>> particular problem was fixed for 2.6.8 or so. So I guess it is extremely
>> likely that my ISP has broken routers. nmap isn't able to identify the
>> OS of any ISP routers in my path.
>
> We never fixed it; it's kind of hard to fix other people's equipment ;-)
Weird, things started working for me around 2.6.9 without having to
modify any sysctl stuff.
> Turn off TCP window scaling, your performance will be limited but about
> as good as you can get with a corrupting firewall in between.
I was wrong in my previous mail where I said that the rmem/wmem output
hasn't changed over the two kernels - it has, the 3rd column differs. I
simply set those values back to what they were on 2.6.16 and now things
work again - I presumably have window scale 2 (scale factor 4) again,
which appears to be a decent compromise between having a window and
things actually working.
For anyone else interested, the ISP is NTL (UK). The fix:
echo "4096 16384 131072 " > /proc/sys/net/ipv4/tcp_wmem
echo "4096 87380 174760 " > /proc/sys/net/ipv4/tcp_rmem
This issue is visible on my 1GB system but not on my laptop (256MB RAM).
The key thing is that more memory means a higher window scale factor is
used, which appears to trigger NTL's brokenness.
Daniel
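The two echo commands above can be wrapped in a small helper, sketched here. The function name and the optional directory argument are hypothetical; writing to the real procfs location requires root.

```shell
# Write the 2.6.16 buffer limits quoted in the thread back into the sysctls.
# Optional argument: target directory (defaults to procfs).
restore_2616_buffers() {
    dir=${1:-/proc/sys/net/ipv4}
    echo "4096 16384 131072" > "$dir/tcp_wmem" &&
    echo "4096 87380 174760" > "$dir/tcp_rmem"
}
```

Run it as root with no arguments to apply the values to the running kernel. Values written this way do not persist across a reboot.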
* Re: 2.6.17 regression: Very slow net transfer from some hosts
2006-04-12 0:06 ` Daniel Drake
2006-04-11 23:59 ` Stephen Hemminger
@ 2006-04-12 0:32 ` John Heffner
2006-04-12 0:42 ` John Heffner
2 siblings, 0 replies; 13+ messages in thread
From: John Heffner @ 2006-04-12 0:32 UTC (permalink / raw)
To: Daniel Drake; +Cc: Stephen Hemminger, netdev, linux-kernel
Daniel Drake wrote:
> Stephen Hemminger wrote:
>> Turn off TCP window scaling, your performance will be limited but about
>> as good as you can get with a corrupting firewall in between.
[snip]
> For anyone else interested, the ISP is NTL (UK). The fix:
>
> echo "4096 16384 131072 " > /proc/sys/net/ipv4/tcp_wmem
> echo "4096 87380 174760 " > /proc/sys/net/ipv4/tcp_rmem
For the record, I think Stephen's suggested workaround is better:
echo 0 > /proc/sys/net/ipv4/tcp_window_scaling
It will prevent the other end of the connection from using a window
scale, so it "fixes" both directions of the connection, not just receiving.
Thanks,
-John
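Before and after applying this workaround, the current setting can be checked with something like the sketch below. The function name and the optional directory argument are illustrative only.

```shell
# Report whether TCP window scaling is enabled, based on the sysctl file.
# Optional argument: directory holding the file (defaults to procfs).
window_scaling_state() {
    dir=${1:-/proc/sys/net/ipv4}
    if [ ! -r "$dir/tcp_window_scaling" ]; then
        echo "unknown"
        return 0
    fi
    read -r value < "$dir/tcp_window_scaling"
    if [ "$value" = "0" ]; then
        echo "disabled"
    else
        echo "enabled"
    fi
}

window_scaling_state
```

The setting only affects connections opened after the change, because the window scale is negotiated once, during the handshake.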
* Re: 2.6.17 regression: Very slow net transfer from some hosts
2006-04-12 0:06 ` Daniel Drake
2006-04-11 23:59 ` Stephen Hemminger
2006-04-12 0:32 ` John Heffner
@ 2006-04-12 0:42 ` John Heffner
2 siblings, 0 replies; 13+ messages in thread
From: John Heffner @ 2006-04-12 0:42 UTC (permalink / raw)
To: Daniel Drake; +Cc: Stephen Hemminger, netdev, linux-kernel
Daniel Drake wrote:
> Stephen Hemminger wrote:
>>> This is very familiar, and I just found the article I was thinking
>>> of: http://lwn.net/Articles/92727/
>>>
>>> I was also hit by that bug, on the same collection of websites, but
>>> that particular problem was fixed for 2.6.8 or so. So I guess it is
>>> extremely likely that my ISP has broken routers. nmap isn't able to
>>> identify the OS of any ISP routers in my path.
>>
>> We never fixed it; it's kind of hard to fix other people's equipment ;-)
>
> Weird, things started working for me around 2.6.9 without having to
> modify any sysctl stuff.
Ah, I remember now. 2.6.7 introduced the tcp_default_win_scale
variable, and 2.6.9 got rid of it, doing the calculation based on the
max possible tcp rcvbuf. This is conceptually the right thing to do,
regardless of broken middleboxes, but had the side effect of hiding this
latent problem a bit longer.
Thanks,
-John
Thread overview: 13+ messages
2006-04-11 19:30 2.6.17 regression: Very slow net transfer from some hosts Daniel Drake
2006-04-11 19:21 ` Stephen Hemminger
2006-04-11 19:23 ` John Heffner
2006-04-11 20:03 ` Daniel Drake
2006-04-11 19:55 ` John Heffner
2006-04-11 20:53 ` Daniel Drake
2006-04-11 20:54 ` John Heffner
2006-04-11 22:20 ` Daniel Drake
2006-04-11 22:33 ` Stephen Hemminger
2006-04-12 0:06 ` Daniel Drake
2006-04-11 23:59 ` Stephen Hemminger
2006-04-12 0:32 ` John Heffner
2006-04-12 0:42 ` John Heffner