* Networking hangs when too many parallel requests are made at once
From: Luke Hutchison @ 2010-11-09 18:30 UTC (permalink / raw)
To: netdev
Since around Linux kernel 2.6.33 or so (but maybe as early as
2.6.31, not sure exactly what version), when restoring a crashed or
closed browser session of either Firefox or Chrome where lots of tabs
(say 10-40) open simultaneously, the networking stack is brought to
its knees -- most or all the tabs eventually time out without data, or
a few tabs might get some data and then display a partial web page.
This behavior occurs with either wifi or ethernet, and occurs when
booting from Fedora 14 on liveusb, so it does not appear to be a
configuration problem. I have a Toshiba Satellite Pro S300M-S2142
laptop with a Core 2 Duo P8600 CPU, Intel GM45 gfx, Intel 82567V
Gigabit Ethernet and Intel 5100 Wifi, running kernel
kernel-2.6.36-1.1.fc15.x86_64 on top of Fedora 14.
Sorry for the length of the following bug report, but it's quite hard
to describe the behavior succinctly.
Even after all tabs have timed out, it's impossible to get data by
opening a new tab -- nothing seems able to access the network
connection. Networking is broken for other processes too -- for
example, commandline tools like ping don't work either. The
connection still shows as up in NetworkManager, and sometimes after
5-10 minutes goes back to normal, but not always. "service network
restart" and/or "service NetworkManager restart" and/or "ifdown eth0 ;
ifup eth0" sometimes fixes the problem, but sometimes normal network
activity isn't restored for several minutes and may not act completely
normal again until a reboot.
DNS resolution is the most obviously affected by this. If I reopen a
browser session and wait a few seconds for networking to hang, I can't
usually ping by domain name but I can (usually) ping by IP address.
However new browser tabs will hang at either name resolution *or*
waiting for data, so I'm not convinced this is just a problem with DNS
resolution.
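A minimal sketch, assuming Python 3 is available on the affected machine, of one way to tell "name resolution is stuck" apart from "connections to known IP addresses are stuck". The function name `classify` and its injectable `resolve`/`connect` parameters are illustrative assumptions, not part of any existing tool.

```python
import socket


def classify(host, ip, port=80, timeout=5.0,
             resolve=socket.gethostbyname,
             connect=socket.create_connection):
    """Separate name-resolution failures from transport failures.

    `host` is looked up by name; `ip` is contacted directly, bypassing DNS.
    Note that the default resolver call has no timeout argument and may
    block for as long as the system resolver keeps retrying.
    """
    try:
        resolve(host)
        dns_ok = True
    except OSError:  # socket.gaierror is a subclass of OSError
        dns_ok = False

    try:
        connect((ip, port), timeout=timeout).close()
        ip_ok = True
    except OSError:  # covers timeouts and unreachable hosts
        ip_ok = False

    if dns_ok and ip_ok:
        return "ok"
    if ip_ok:
        return "dns-only failure"
    if dns_ok:
        return "transport failure"
    return "both failing"
```

Running something like `classify("google.com", "8.8.8.8", port=53)` once before and once during a hang would show which of the two paths breaks first; the host and address here are only placeholders.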
Also sometimes (but not always) whatever weird state the network stack
on my laptop gets into, things are funky enough to screw up my home
router (two different Motorola Surfboard cable modems/routers), and
the cable modem sometimes has to be reset to get the connection back
to full speed again. However it is not a router problem in general,
because:
(1) all these symptoms (except this last one where the router somehow
gets screwed up by the laptop's odd behavior) are present whether I
use a wired or wireless connection, and regardless of which network I
am connected to (home or anywhere else, or even when tethered to my
Nexus One), and in multiple countries I have been to in the last 6
months (Portugal, Germany, China).
also
(2) I used to be able to reopen a closed browser session with 40 tabs
and they would all load up just fine. Then at some point after a
Rawhide update, this broke.
I can't put my finger on exactly when this broke, because I was
dealing with worse breakage for a while since Fedora kernel 2.6.31.5,
as I reported at the following link:
https://bugzilla.redhat.com/show_bug.cgi?id=555213#c1
Synopsis of the above "worse" bug report:
Basically in the very same situation (opening lots of browser tabs),
the machine would lock up hard and the fan would immediately blow at
100% speed. It took a couple of months of Rawhide updates for this
bug to go away, but by the time this lockup bug was fixed around the
release of Fedora 13 at kernel version 2.6.33, the other network
issues I have described above became evident, and were triggered in
the same way -- thus I believe the two bugs may be related somehow.
My computer has been close to unusable for moderate browsing activity
for about 8 months of the year so far, across nearly two releases of
Fedora (F13 and F14 beta). I filed the above bug report but it was
never commented on by Red Hat engineers. I figured the bug was
probably visible enough that somebody else should notice it and I just
kept hoping the next update would contain a fix, but not yet. I
emailed one of the Red Hat kernel engineers and he suggested I ask
upstream.
Please advise me as to how to debug this problem further. (I haven't
seen anything that looks suspicious in dmesg output or
/var/log/messages, to start with.)
Thank you,
Luke Hutchison
* Re: Networking hangs when too many parallel requests are made at once
From: Luke Hutchison @ 2010-11-09 19:04 UTC (permalink / raw)
To: netdev
On Tue, Nov 9, 2010 at 1:30 PM, Luke Hutchison <luke.hutch@gmail.com> wrote:
> Since around Linux kernel 2.6.33 or so (but maybe as early as
> 2.6.31, not sure exactly what version), when restoring a crashed or
> closed browser session of either Firefox or Chrome where lots of tabs
> (say 10-40) open simultaneously, the networking stack is brought to
> its knees -- most or all the tabs eventually time out without data, or
> a few tabs might get some data and then display a partial web page.
I forgot to mention, I have glibc-2.12.90-18.x86_64.
Also, the following screenshot may be useful:
http://web.mit.edu/~luke_h/www/dns-hang-problem.png
Basically in the usage depicted by the screenshot, I had Chrome open
with probably 30-50 tabs across several windows, I then started an scp
transfer of a large file and waited for it to stabilize, then closed
the browser and re-opened it, restoring the tabs. Within a second or
two (after the first few lucky browser tabs got some content), DNS
hung, and pinging a domain name from the commandline no longer worked
(ruling out a bug in the browser itself). However the scp transfer
continued at the same rate, and pinging an IP address directly
continued to work fine (in this case; at other times network
connections to already-resolved IP addresses can seem flaky I think,
but I haven't been able to reproduce these problems as easily as with
DNS, which has 100% reproducibility). You can see that CPU usage
dropped from 100% to something like 50% when the browser tabs all
started blocking (but actually I'm surprised that CPU usage didn't
drop to zero). In this instance, as soon as I shut down the browser,
pinging a domain name worked immediately again (although, as I
mentioned previously, sometimes it can take a minute or more after
killing the browser for name resolution to jump back into working
mode). "ifdown eth0 ; ifup eth0" *usually* fixes the problem by
canceling all pending requests.
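To take the browser out of the picture entirely, the burst of simultaneous lookups could be approximated with a short script. This is a sketch under assumptions: the function names, the worker count of 40 and the use of `socket.getaddrinfo` are illustrative choices, and it only exercises name resolution, not the page loads that follow.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor


def timed_lookup(name, resolve=socket.getaddrinfo):
    """Resolve one name and report (name, succeeded, seconds_taken)."""
    start = time.monotonic()
    try:
        resolve(name, 80)
        ok = True
    except OSError:  # includes socket.gaierror
        ok = False
    return name, ok, time.monotonic() - start


def parallel_lookups(names, workers=40, resolve=socket.getaddrinfo):
    """Fire all lookups at once from a thread pool, roughly as a
    browser restoring many tabs would, and collect results in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda n: timed_lookup(n, resolve), names))
```

Feeding it the host names from a saved browser session and printing the entries where `succeeded` is false, or where the elapsed time is several seconds, would give a repeatable test case that does not depend on Firefox or Chrome.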
From one of the RH engineers:
> It's not a driver issue, since it occurs
> with two different devices... it's not a configuration issue since it
> occurs on a LiveCD... For the same reason it's unlikely to be a
> userspace issue... It's unlikely to be a local network issue since you
> say it happens in multiple locations...
>
> Absolutely bizarre. :/
Any help greatly appreciated.
Thanks,
Luke
* Re: Networking hangs when too many parallel requests are made at once
From: Ben Greear @ 2010-11-09 19:16 UTC (permalink / raw)
To: Luke Hutchison; +Cc: netdev
On 11/09/2010 11:04 AM, Luke Hutchison wrote:
> On Tue, Nov 9, 2010 at 1:30 PM, Luke Hutchison<luke.hutch@gmail.com> wrote:
>> Since around Linux kernel 2.6.33 or so (but maybe as early as
>> 2.6.31, not sure exactly what version), when restoring a crashed or
>> closed browser session of either Firefox or Chrome where lots of tabs
>> (say 10-40) open simultaneously, the networking stack is brought to
>> its knees -- most or all the tabs eventually time out without data, or
>> a few tabs might get some data and then display a partial web page.
Have you been able to reproduce this on any other machine? I suspect
it might be an issue with your specific NIC or other hardware.
At the least, it's not a general problem with opening lots
of TCP connections, as we routinely test with thousands...
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
* Re: Networking hangs when too many parallel requests are made at once
From: Luke Hutchison @ 2010-11-09 20:27 UTC (permalink / raw)
To: Ben Greear; +Cc: netdev
On Tue, Nov 9, 2010 at 2:16 PM, Ben Greear <greearb@candelatech.com> wrote:
> On 11/09/2010 11:04 AM, Luke Hutchison wrote:
>>
>> On Tue, Nov 9, 2010 at 1:30 PM, Luke Hutchison<luke.hutch@gmail.com>
>> wrote:
>>>
>>> Since around Linux kernel 2.6.33 or so (but maybe as early as
>>> 2.6.31, not sure exactly what version), when restoring a crashed or
>>> closed browser session of either Firefox or Chrome where lots of tabs
>>> (say 10-40) open simultaneously, the networking stack is brought to
>>> its knees -- most or all the tabs eventually time out without data, or
>>> a few tabs might get some data and then display a partial web page.
>
> Have you been able to reproduce this on any other machine? I suspect
> it might be an issue with your specific NIC or other hardware.
>
> At the least, it's not a general problem with opening lots
> of TCP connections, as we routinely test with thousands...
>
> Thanks,
> Ben
No, I haven't been able to reproduce on any other machine. But it
happens on both my wifi NIC and my ethernet NIC in this machine.
Thanks,
Luke
* Re: Networking hangs when too many parallel requests are made at once
From: Ben Greear @ 2010-11-09 20:35 UTC (permalink / raw)
To: Luke Hutchison; +Cc: netdev
On 11/09/2010 12:27 PM, Luke Hutchison wrote:
> On Tue, Nov 9, 2010 at 2:16 PM, Ben Greear<greearb@candelatech.com> wrote:
>> On 11/09/2010 11:04 AM, Luke Hutchison wrote:
>>>
>>> On Tue, Nov 9, 2010 at 1:30 PM, Luke Hutchison<luke.hutch@gmail.com>
>>> wrote:
>>>>
>>>> Since around Linux kernel 2.6.33 or so (but maybe as early as
>>>> 2.6.31, not sure exactly what version), when restoring a crashed or
>>>> closed browser session of either Firefox or Chrome where lots of tabs
>>>> (say 10-40) open simultaneously, the networking stack is brought to
>>>> its knees -- most or all the tabs eventually time out without data, or
>>>> a few tabs might get some data and then display a partial web page.
>>
>> Have you been able to reproduce this on any other machine? I suspect
>> it might be an issue with your specific NIC or other hardware.
>>
>> At the least, it's not a general problem with opening lots
>> of TCP connections, as we routinely test with thousands...
>>
>> Thanks,
>> Ben
>
> No, I haven't been able to reproduce on any other machine. But it
> happens on both my wifi NIC and my ethernet NIC in this machine.
Well, let us know what those are, at least.
And, a network capture of your system going into this state might
be useful. I'd try to disable your wireless NIC entirely and focus
on debugging the wired NIC as that is usually easier to debug.
Thanks,
Ben
>
> Thanks,
> Luke
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
* Re: Networking hangs when too many parallel requests are made at once
From: Luke Hutchison @ 2010-11-09 21:17 UTC (permalink / raw)
To: Ben Greear; +Cc: netdev
On Tue, Nov 9, 2010 at 3:35 PM, Ben Greear <greearb@candelatech.com> wrote:
> On 11/09/2010 12:27 PM, Luke Hutchison wrote:
>> No, I haven't been able to reproduce on any other machine. But it
>> happens on both my wifi NIC and my ethernet NIC in this machine.
>
> Well, let us know what those are, at least.
From my first email:
> I have a Toshiba Satellite Pro S300M-S2142
> laptop with a Core 2 Duo P8600 CPU, Intel GM45 gfx,
> Intel 82567V Gigabit Ethernet and Intel 5100 Wifi,
> running kernel kernel-2.6.36-1.1.fc15.x86_64 on top
> of Fedora 14.
On Tue, Nov 9, 2010 at 3:35 PM, Ben Greear <greearb@candelatech.com> wrote:
> And, a network capture of your system going into this state might
> be useful. I'd try to disable your wireless NIC entirely and focus
> on debugging the wired NIC as that is usually easier to debug.
Sure -- a wireshark trace is here: http://web.mit.edu/~luke_h/www/trace.bz2
In this particular trace, I opened about 20 browser tabs at once.
They all locked up after about 5 seconds. A few of them loaded some
more content after a minute or two. A minute or two later, I killed
them all. In this particular example, pinging to a specific domain
name continued to work (it doesn't always), although I couldn't get
content from the domains in question: e.g. I could ping google.com,
but opening a new tab and trying to visit google.com caused the new
tab to hang too.
Thanks,
Luke
* Re: Networking hangs when too many parallel requests are made at once
From: Ben Greear @ 2010-11-09 22:14 UTC (permalink / raw)
To: Luke Hutchison; +Cc: netdev
On 11/09/2010 01:17 PM, Luke Hutchison wrote:
> On Tue, Nov 9, 2010 at 3:35 PM, Ben Greear<greearb@candelatech.com> wrote:
>> And, a network capture of your system going into this state might
>> be useful. I'd try to disable your wireless NIC entirely and focus
>> on debugging the wired NIC as that is usually easier to debug.
>
> Sure -- a wireshark trace is here: http://web.mit.edu/~luke_h/www/trace.bz2
>
> In this particular trace, I opened about 20 browser tabs at once.
> They all locked up after about 5 seconds. A few of them loaded some
> more content after a minute or two. A minute or two later, I killed
> them all. In this particular example, pinging to a specific domain
> name continued to work (it doesn't always), although I couldn't get
> content from the domains in question: e.g. I could ping google.com,
> but opening a new tab and trying to visit google.com caused the new
> tab to hang too.
Have you tried using a different DNS server (OpenDNS?), or maybe a caching one on
your local machine? Maybe some part of your network is throwing away
some of your DNS requests since you send so many at once?
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
* Re: Networking hangs when too many parallel requests are made at once
From: Luke Hutchison @ 2010-11-09 22:20 UTC (permalink / raw)
To: Ben Greear; +Cc: netdev
On Tue, Nov 9, 2010 at 5:14 PM, Ben Greear <greearb@candelatech.com> wrote:
> Have you tried using a different DNS server (OpenDNS?), or maybe a caching
> one on your local machine? Maybe some part of your network is throwing away
> some of your DNS requests since you send so many at once?
I have tried Google's DNS as well as Comcast's, no difference in
effect. Also this has been a problem when I've been in the US,
Portugal, Germany and China, so I have probably used a range of DNS
servers. I have tried nscd (it's off by default) and it has the
expected behavior: if it resolves a name before re-opening the
browser, then that name can continue to be resolved after the network
gets flooded. If I ask it to resolve a domain name after the network
is flooded, the request times out, and subsequently nscd says the
domain name doesn't exist, even after the network link has been
restored to normal.
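The behavior described above is consistent with negative caching: a failed lookup is remembered for a while, so the name keeps "not existing" even after the network recovers. The following is only a toy model of that idea, not nscd's actual implementation; the class name and TTL values are assumptions chosen for illustration. (In nscd itself, the relevant setting is the `negative-time-to-live` directive in nscd.conf.)

```python
import time


class NegativeCache:
    """Toy resolver cache that remembers failures as well as successes."""

    def __init__(self, resolve, positive_ttl=3600.0, negative_ttl=20.0,
                 clock=time.monotonic):
        self._resolve = resolve
        self._positive_ttl = positive_ttl
        self._negative_ttl = negative_ttl
        self._clock = clock
        self._entries = {}  # name -> (expiry_time, address or None)

    def lookup(self, name):
        now = self._clock()
        hit = self._entries.get(name)
        if hit is not None and hit[0] > now:
            # May be None: a cached "not found" from an earlier failure.
            return hit[1]
        try:
            addr = self._resolve(name)
            ttl = self._positive_ttl
        except OSError:
            # A timeout during the hang is stored like a real "no such host".
            addr = None
            ttl = self._negative_ttl
        self._entries[name] = (now + ttl, addr)
        return addr
```

In this model, a name first requested during the hang stays unresolvable until its negative entry expires, while names resolved beforehand keep working, which matches what was observed with nscd.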
Thanks,
Luke
* Re: Networking hangs when too many parallel requests are made at once
From: Ben Greear @ 2010-11-09 22:29 UTC (permalink / raw)
To: Luke Hutchison; +Cc: netdev
On 11/09/2010 02:20 PM, Luke Hutchison wrote:
> On Tue, Nov 9, 2010 at 5:14 PM, Ben Greear<greearb@candelatech.com> wrote:
>> Have you tried using a different DNS server (OpenDNS?), or maybe a caching
>> one on your local machine? Maybe some part of your network is throwing away
>> some of your DNS requests since you send so many at once?
>
> I have tried Google's DNS as well as Comcast's, no difference in
> effect. Also this has been a problem when I've been in the US,
> Portugal, Germany and China, so I have probably used a range of DNS
> servers. I have tried nscd (it's off by default) and it has the
> expected behavior: if it resolves a name before re-opening the
> browser, then that name can continue to be resolved after the network
> gets flooded. If I ask it to resolve a domain name after the network
> is flooded, the request times out, and subsequently nscd says the
> domain name doesn't exist, even after the network link has been
> restored to normal.
If you get all names resolved with your caching name-server, can you then
open the browser tabs w/out problem?
Have you tried setting all your browser tabs to simple low-bandwidth pages (no ads being
served from various hosts, etc) to see if that works?
Maybe you are just flooding the network so hard that responses are being
dropped?
Thanks,
Ben
>
> Thanks,
> Luke
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com
* Re: Networking hangs when too many parallel requests are made at once
From: Luke Hutchison @ 2010-11-09 22:38 UTC (permalink / raw)
To: Ben Greear; +Cc: netdev
On Tue, Nov 9, 2010 at 5:29 PM, Ben Greear <greearb@candelatech.com> wrote:
> If you get all names resolved with your caching name-server, can you then
> open the browser tabs w/out problem?
This is hard to test, because to get all the same domain names
resolved for all resources on all pages, I have to successfully open
all the pages once first. Even opening the pages a few seconds apart
seems to break things quite frequently. And there is a period where
the connection starts acting up but is not hard locked up, and it's
hard to know at that point if it's the connection or the individual
website. The only way I can think of to trigger this reliably, 100%
of the time, is to open a bunch of browser tabs all at the same time,
and that hangs the DNS caching server's requests too.
> Have you tried setting all your browser tabs to simple low-bandwidth pages (no ads being
> served from various hosts, etc) to see if that works?
Not exactly, but I have one browser window with about 20 Wikipedia
articles open, and not all of them load (some get stalled until they
time out). I think this serves the same purpose as your suggested
test, because Wikipedia doesn't draw from many external domains.
> Maybe you are just flooding the network so hard that responses are being
> dropped?
Yes, but you pointed out earlier that you routinely test with
thousands of TCP connections, and we're only talking about 20-30
browser tabs here, maybe a few thousand HTTP requests at most. Also,
this used to work fine on old Fedora kernels and no longer works with
more recent kernels.
Thanks,
Luke
* Re: Networking hangs when too many parallel requests are made at once
From: Ben Greear @ 2010-11-09 22:49 UTC (permalink / raw)
To: Luke Hutchison; +Cc: netdev
On 11/09/2010 02:38 PM, Luke Hutchison wrote:
> On Tue, Nov 9, 2010 at 5:29 PM, Ben Greear<greearb@candelatech.com> wrote:
>> If you get all names resolved with your caching name-server, can you then
>> open the browser tabs w/out problem?
>
> This is hard to test, because to get all the same domain names
> resolved for all resources on all pages, I have to successfully open
> all the pages once first. Even opening the pages a few seconds apart
> seems to break things quite frequently. And there is a period where
> the connection starts acting up but is not hard locked up, and it's
> hard to know at that point if it's the connection or the individual
> website. The only way I can think of to trigger this reliably, 100%
> of the time, is to open a bunch of browser tabs all at the same time,
> and that hangs the DNS caching server's requests too.
>
>> Have you tried setting all your browser tabs to simple low-bandwidth pages (no ads being
>> served from various hosts, etc) to see if that works?
>
> Not exactly, but I have one browser window with about 20 Wikipedia
> articles open, and not all of them load (some get stalled until they
> time out). I think this serves the same purpose as your suggested
> test, because Wikipedia doesn't draw from many external domains.
>
>> Maybe you are just flooding the network so hard that responses are being
>> dropped?
>
> Yes, but you pointed out earlier that you routinely test with
> thousands of TCP connections, and we're only talking about 20-30
> browser tabs here, maybe a few thousand HTTP requests at most. Also,
> this used to work fine on old Fedora kernels and no longer works with
> more recent kernels.
Well, I'm low on ideas.
For our tests though, we are running across 1G Ethernet most of the time,
so bandwidth is not an issue. Also, we aren't dependent on external DNS for
this type of test.
From looking at your capture, you are not getting DNS responses back
reliably. On the great wild internet, there are lots of reasons why
that might be happening, so without a more controlled test case, I'm
not sure anyone can help you.
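One way to get closer to a controlled test case is to send a burst of DNS queries directly over UDP and count how many replies come back, bypassing glibc and any local cache. The sketch below assumes Python 3; the function names are hypothetical, the query format is the standard DNS wire format for a single A-record question, and the per-receive timeout of 3 seconds is an arbitrary choice.

```python
import socket
import struct


def build_query(name, txid):
    """Build a minimal DNS query for an A record (recursion desired)."""
    # Header: ID, flags (0x0100 = RD), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").encode("ascii").split(b".")
    qname = b"".join(bytes([len(label)]) + label for label in labels) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def count_answers(names, server, port=53, timeout=3.0):
    """Send one query per name back-to-back, then count replies by ID.

    Returns (number_answered, sorted list of names with no reply).
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)  # applies to each recvfrom() call
    pending = {}
    try:
        for txid, name in enumerate(names, start=1):
            sock.sendto(build_query(name, txid), (server, port))
            pending[txid] = name
        while pending:
            try:
                data, _ = sock.recvfrom(4096)
            except socket.timeout:
                break  # nothing more arrived within the timeout
            if len(data) >= 2:
                pending.pop(struct.unpack("!H", data[:2])[0], None)
    finally:
        sock.close()
    return len(names) - len(pending), sorted(pending.values())
```

Comparing the counts for a burst of, say, 5 names against a burst of 200, with the same server address, would show whether replies go missing only under load. Running a packet capture alongside would then show whether the missing replies never arrived or arrived and were not delivered to the socket.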
It wouldn't be quick, but if you were able to do a git-bisect to figure
out which kernel change affected you, then that might be a start.
If there were a way for you to tune your TCP stack to run slower, that
might help too. Maybe hard limit the max window size to something small like
8k?
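As a rough per-socket approximation of the "small window" idea, a test client can shrink its receive buffer before connecting. This is a sketch under assumptions: it affects only sockets created this way, not the browser, and the helper name is hypothetical. A system-wide limit would instead involve the `net.ipv4.tcp_rmem` sysctl, which is not shown here.

```python
import socket


def small_window_socket(rcvbuf=8192):
    """Create a TCP socket whose receive buffer is capped at about `rcvbuf`.

    The option is set before connect() so that it can influence the
    window advertised during the handshake. On Linux the kernel doubles
    the requested value to account for bookkeeping overhead, so
    getsockopt() reports a larger number than was asked for.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf)
    return sock
```

Downloading the same set of pages through sockets created with and without this cap would indicate whether slowing the receiver makes the hang less likely.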
Thanks,
Ben
--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc http://www.candelatech.com