network hang

* network hang
@ 2004-09-02  0:55 James Harper
  2004-09-02  2:23 ` Keir Fraser
  0 siblings, 1 reply; 10+ messages in thread
From: James Harper @ 2004-09-02  0:55 UTC
  To: xen-devel@lists.sourceforge.net

I'm playing with nbd+raid1, and am finding that during a resync the network in xenU is dying and simply not sending packets anymore. At first I thought this was a bridging problem, but in xen0 I have removed the vif from the bridge and given it its own ip address, and given the eth interface in xenU a similar ip address, but no traffic is passing anymore.

After a while though, it seemed to come good again and I was able to add the interface to the bridge again and it started working.

The only strange thing in the kernel logs was this in xenU:

eth0: full queue wasn't stopped!

but I'm not sure at what point this was logged.

James

* Re: network hang
  2004-09-02  0:55 network hang James Harper
@ 2004-09-02  2:23 ` Keir Fraser
  2004-09-09 21:43   ` More on networking hang Rob Gardner
  0 siblings, 1 reply; 10+ messages in thread
From: Keir Fraser @ 2004-09-02  2:23 UTC
  To: James Harper; +Cc: xen-devel@lists.sourceforge.net

"full queue wasn't stopped" should never be printed. If you can
reproduce this message then it is worth printing some more info at the
same point -- for example, np->tx->req_prod, np->tx->resp_prod,
np->tx_resp_cons. This will let us see whether the ring is indeed full.
If the network stack/driver has got itself into a state where it can print
this message, I'm not surprised it hangs for a while.

e.g., add this where that message gets printed in netfront.c:

 {
   unsigned long flags;
   /* Disable local interrupts so the driver's own handler cannot run mid-print. */
   local_irq_save(flags);
   printk(KERN_ALERT "full=%d req_prod=%08x rsp_prod=%08x "
          "rsp_cons=%08x\n", np->tx_full, np->tx->req_prod,
          np->tx->resp_prod, np->tx_resp_cons);
   local_irq_restore(flags);
 }
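
For reading the numbers back out of the log, a small shell sketch may help.
It assumes (this is not confirmed in the thread) that req_prod and rsp_cons
are free-running 32-bit counters, so the number of requests still outstanding
is their difference modulo 2^32. The function name ring_in_flight is made up
for illustration.

```shell
#!/bin/sh
# Hypothetical helper: number of TX requests still outstanding, given the
# req_prod and rsp_cons values (8 hex digits each, as printed by %08x).
# Assumption: both are free-running 32-bit counters, so the difference is
# taken modulo 2^32 to survive counter wrap-around.
ring_in_flight() {
    req_prod=$1
    rsp_cons=$2
    echo $(( (0x$req_prod - 0x$rsp_cons) & 0xffffffff ))
}
```

Pass it the req_prod and rsp_cons fields from the printed line. If the result
equals the driver's TX ring size, the ring really was full when the message
was logged.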

 -- Keir

> I'm playing with nbd+raid1, and am finding that during a resync the network in xenU is dying and simply not sending packets anymore. At first I thought this was a bridging problem, but in xen0 I have removed the vif from the bridge and given it its own ip address, and given the eth interface in xenU a similar ip address, but no traffic is passing anymore.
> 
> After a while though, it seemed to come good again and I was able to add the interface to the bridge again and it started working.
> 
> The only strange thing in the kernel logs was this in xenU:
> 
> eth0: full queue wasn't stopped!
> 
> but i'm not sure at what point this was logged though.
> 
> James

-------------------------------------------------------
This SF.Net email is sponsored by BEA Weblogic Workshop
FREE Java Enterprise J2EE developer tools!
Get your free copy of BEA WebLogic Workshop 8.1 today.
http://ads.osdn.com/?ad_id=5047&alloc_id=10808&op=click
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xen-devel

-------------------------------------------------------
This SF.Net email is sponsored by BEA Weblogic Workshop
FREE Java Enterprise J2EE developer tools!
Get your free copy of BEA WebLogic Workshop 8.1 today.
http://ads.osdn.com/?ad_id=5047&alloc_id=10808&op=click

^ permalink raw reply	[flat|nested] 10+ messages in thread

* More on networking hang
  2004-09-02  2:23 ` Keir Fraser
@ 2004-09-09 21:43   ` Rob Gardner
  2004-09-09 22:06     ` Ian Pratt
  0 siblings, 1 reply; 10+ messages in thread
From: Rob Gardner @ 2004-09-09 21:43 UTC
  To: xen-devel

I saw something go by on the list a week or so ago about network hangs, 
and I may be observing something similar.

The basic setup is: two guest domains running apache, and a different
box running httperf against them, 100 requests per second for the same
100kbyte file.

This runs ok for a time, then suddenly chokes and all traffic comes to a 
stop. Then a few seconds later traffic seems to pick up again.

This behavior is not observed with a workload of 40 requests/second. At 
80/second, the problem starts appearing, but not very frequently.

We can provide sufficient detail if anyone wants to try to reproduce this.

Have there been any fixes relating to this lately? We are using xen bits 
that are a few weeks old right now.

Rob Gardner
HP

-------------------------------------------------------
This SF.Net email is sponsored by: YOU BE THE JUDGE. Be one of 170
Project Admins to receive an Apple iPod Mini FREE for your judgement on
who ports your project to Linux PPC the best. Sponsored by IBM. 
Deadline: Sept. 13. Go here: http://sf.net/ppc_contest.php


* Re: More on networking hang
  2004-09-09 21:43   ` More on networking hang Rob Gardner
@ 2004-09-09 22:06     ` Ian Pratt
  2004-09-09 22:33       ` Rob Gardner
  2004-09-10 22:36       ` Rob Gardner
  0 siblings, 2 replies; 10+ messages in thread
From: Ian Pratt @ 2004-09-09 22:06 UTC
  To: Rob Gardner; +Cc: xen-devel, Ian.Pratt

> I saw something go by on the list a week or so ago about network hangs, 
> and I may be observing something similar.
> 
> The basic setup is: two guest domains running apache, and a different 
> box running httperf against them, 100 requests per second for the same
> 100kbyte file.
> 
> This runs ok for a time, then suddenly chokes and all traffic comes to a 
> stop. Then a few seconds later traffic seems to pick up again.
> 
> This behavior is not observed with a workload of 40 requests/second. At 
> 80/second, the problem starts appearing, but not very frequently.
> 
> We can provide sufficient detail if anyone wants to try to reproduce this.

Just to check I understand your setup: You have a domain 0
implementing bridging, then a domain 1 and a domain 2 each
running apache.

When the domain chokes, do you see any drops or errors in the
stats as reported by ifconfig?

It would be good to enable the debugging printf's in both the
netfront and netback drivers.

Can you repeat this with a single non-0 domain? Can you repeat it more
easily by generating a background network load in dom0?

BTW: Have you got CONNECTION_TRACKING compiled into the dom0
kernel?  This seems to cripple Linux performance, hence it was
recently made a module in our config.

Ian


* Re: More on networking hang
  2004-09-09 22:06     ` Ian Pratt
@ 2004-09-09 22:33       ` Rob Gardner
  2004-09-10 22:36       ` Rob Gardner
  1 sibling, 0 replies; 10+ messages in thread
From: Rob Gardner @ 2004-09-09 22:33 UTC
  To: xen-devel

Ian Pratt wrote:
> 
> Just to check I understand your setup: You have a domain 0
> implementing bridging, then a domain 1 and a domain 2 each
> running apache.

That's correct.

> When the domain chokes, do you see any drops or errors in the
> stats as reported by ifconfig?

In domain 0, ifconfig reports 0 errors, 0 dropped on eth0; 0 errors, 2 
dropped on vif1.0, and 0 errors, 5 dropped on vif2.0; 0 errors, 0 
dropped on xen-br0.

In domain 1, ifconfig reports 0 for errors and dropped.

For some reason I can't get a console onto the other domain right now, 
but I suspect it will report the same thing as domain 1.

> It would be good to enable the debugging printf's in both the
> netfront and netback drivers.

Can do.


> Can you repeat this with a single non-0 domain? Can you repeat it more
> easily by generating a background network load in dom0?

Can try these.


> BTW: Have you got CONNECTION_TRACKING compiled into the dom0
> kernel?  This seems to cripple Linux performance, hence it was
> recently made a module in our config.

We are using only default options.



Rob Gardner
HP





* RE: More on networking hang
@ 2004-09-10  3:54 James Harper
  0 siblings, 0 replies; 10+ messages in thread
From: James Harper @ 2004-09-10  3:54 UTC
  To: Rob Gardner, xen-devel

I have been seeing this in much the same circumstances: I was attempting
to use a raid1 array of nbd devices, but it wouldn't make it through the
sync most of the time. I was never able to prove one way or another whether
it was the bridge code or xen causing the problem.

I've gone back to iscsi, but haven't really tested it much as I'm
hacking the iscsitarget enough to get it to run on 2.6 (it compiles now
but oopses. Doh!)

James

> -----Original Message-----
> From: xen-devel-admin@lists.sourceforge.net
> [mailto:xen-devel-admin@lists.sourceforge.net] On Behalf Of Rob Gardner
> Sent: Friday, 10 September 2004 07:43
> To: xen-devel@lists.sourceforge.net
> Subject: [Xen-devel] More on networking hang
>
> I saw something go by on the list a week or so ago about network hangs,
> and I may be observing something similar.
>
> The basic setup is: two guest domains running apache, and a different
> box running httperf against them, 100 requests per second for the same
> 100kbyte file.
>
> This runs ok for a time, then suddenly chokes and all traffic comes to a
> stop. Then a few seconds later traffic seems to pick up again.
>
> This behavior is not observed with a workload of 40 requests/second. At
> 80/second, the problem starts appearing, but not very frequently.
>
> We can provide sufficient detail if anyone wants to try to reproduce this.
>
> Have there been any fixes relating to this lately? We are using xen bits
> that are a few weeks old right now.
>
> Rob Gardner
> HP

* Re: More on networking hang
  2004-09-09 22:06     ` Ian Pratt
  2004-09-09 22:33       ` Rob Gardner
@ 2004-09-10 22:36       ` Rob Gardner
  2004-09-11  3:59         ` Keir Fraser
  2004-09-11  7:45         ` Ian Pratt
  1 sibling, 2 replies; 10+ messages in thread
From: Rob Gardner @ 2004-09-10 22:36 UTC
  To: xen-devel

On Thu, 2004-09-09 at 16:06, Ian Pratt wrote:
> It would be good to enable the debugging printf's in both the
> netfront and netback drivers.
> 
> Can you repeat this with a single non-0 domain? Can you repeat it more
> easily by generating a background network load in dom0?

The network hang behavior does occur with just a single non-0 domain.

We got the following output in dmesg that looks interesting:

...
Freeing unused kernel memory: 116k freed
EXT3 FS 2.4-0.9.19, 19 August 2002 on ide0(3,1), internal journal
Adding Swap: 1020592k swap-space (priority -1)
ioperm not fully supported - set iopl to 3
ioperm not fully supported - set iopl to 3
ioperm not fully supported - set iopl to 3
ioperm not fully supported - set iopl to 3
device eth0 entered promiscuous mode
xen-br0: port 1(eth0) entering learning state
xen-br0: port 1(eth0) entering forwarding state
xen-br0: topology change detected, propagating
(file=interface.c, line=140) Successfully created netif
device vif1.0 entered promiscuous mode
xen-br0: port 2(vif1.0) entering learning state
xen-br0: port 2(vif1.0) entering forwarding state
xen-br0: topology change detected, propagating
(file=interface.c, line=140) Successfully created netif
device vif2.0 entered promiscuous mode
xen-br0: port 3(vif2.0) entering learning state
xen-br0: port 3(vif2.0) entering forwarding state
xen-br0: topology change detected, propagating
ip_conntrack: table full, dropping packet.
ip_conntrack: table full, dropping packet.
ip_conntrack: table full, dropping packet.
ip_conntrack: table full, dropping packet.
ip_conntrack: table full, dropping packet.
ip_conntrack: table full, dropping packet.
ip_conntrack: table full, dropping packet.
ip_conntrack: table full, dropping packet.
ip_conntrack: table full, dropping packet.
ip_conntrack: table full, dropping packet.
NET: 481 messages suppressed.
ip_conntrack: table full, dropping packet.
NET: 532 messages suppressed.
ip_conntrack: table full, dropping packet.
NET: 547 messages suppressed.
ip_conntrack: table full, dropping packet.
NET: 393 messages suppressed.
ip_conntrack: table full, dropping packet.
NET: 25 messages suppressed.
ip_conntrack: table full, dropping packet.
NET: 23 messages suppressed.
ip_conntrack: table full, dropping packet.
NET: 33 messages suppressed.
ip_conntrack: table full, dropping packet.


* Re: More on networking hang
  2004-09-10 22:36       ` Rob Gardner
@ 2004-09-11  3:59         ` Keir Fraser
  2004-09-11  4:29           ` Keir Fraser
  2004-09-11  7:45         ` Ian Pratt
  1 sibling, 1 reply; 10+ messages in thread
From: Keir Fraser @ 2004-09-11  3:59 UTC
  To: Rob Gardner; +Cc: xen-devel


Looks as though perhaps the connection-tracking table is full. :-)

If you are churning through a lot of TCP connections then the
conntrack table may be full of defunct TCBs in TIME WAIT (2MSL)
state. 

Not sure what the best solution is: when we were doing evaluation
tests for our paper we disabled connection tracking (which means that
things like NAT are unavailable). I haven't looked around, but there
may well be a way to tell Linux to reuse TCBs in TIME WAIT state.

This would explain networking drop-outs. No more connections can be
made until some old TCBs are garbage collected, after a 120s timeout.
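
A rough sanity check on that, as a sketch rather than a measurement: assuming
each HTTP request opens its own TCP connection and each closed connection
leaves one entry behind for the full timeout, the number of lingering entries
settles at about rate x timeout. The helper name is made up; the request rates
and the 120s figure are the ones quoted in this thread.

```shell
#!/bin/sh
# Hypothetical helper: approximate steady-state number of conntrack entries
# left behind by short-lived connections.
# Assumption: one entry per connection, held for the whole timeout.
#   $1 = new connections per second, $2 = entry lifetime in seconds
lingering_entries() {
    echo $(( $1 * $2 ))
}

lingering_entries 100 120   # prints 12000
lingering_entries 80 120    # prints 9600
lingering_entries 40 120    # prints 4800
```

These figures can be compared with the limit in
/proc/sys/net/ipv4/ip_conntrack_max on the dom0 in question. If the limit is
below the figure for a given rate, the table would be expected to fill at
that rate.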

 -- Keir

> ip_conntrack: table full, dropping packet.
> ip_conntrack: table full, dropping packet.
> ip_conntrack: table full, dropping packet.
> ip_conntrack: table full, dropping packet.
> ip_conntrack: table full, dropping packet.
> ip_conntrack: table full, dropping packet.
> ip_conntrack: table full, dropping packet.
> ip_conntrack: table full, dropping packet.
> ip_conntrack: table full, dropping packet.
> ip_conntrack: table full, dropping packet.
> NET: 481 messages suppressed.
> ip_conntrack: table full, dropping packet.
> NET: 532 messages suppressed.
> ip_conntrack: table full, dropping packet.
> NET: 547 messages suppressed.
> ip_conntrack: table full, dropping packet.
> NET: 393 messages suppressed.
> ip_conntrack: table full, dropping packet.
> NET: 25 messages suppressed.
> ip_conntrack: table full, dropping packet.
> NET: 23 messages suppressed.
> ip_conntrack: table full, dropping packet.
> NET: 33 messages suppressed.
> ip_conntrack: table full, dropping packet.



* Re: More on networking hang
  2004-09-11  3:59         ` Keir Fraser
@ 2004-09-11  4:29           ` Keir Fraser
  0 siblings, 0 replies; 10+ messages in thread
From: Keir Fraser @ 2004-09-11  4:29 UTC
  To: Keir Fraser; +Cc: Rob Gardner, xen-devel

> Not sure what the best solution is: when we were doing evaluation
> tests for our paper we disabled connection tracking (which means that
> things like NAT are unavailable). I haven't looked around, but there
> may well be a way to tell Linux to reuse TCBs in TIME WAIT state.

More detail:

Look in /proc/net/ip_conntrack. Most likely you'll see lots of
connections in TIME_WAIT.

You can adjust the maximum number of tracked connections by echoing to
/proc/sys/net/ipv4/ip_conntrack_max. A better solution, however, is
probably to modify the individual timeout values for each state. For
example:

 echo "5" >/proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_timeout_time_wait

[disclaimer: I haven't tried this myself, but google + src indicates
 this is the most promising approach.]
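
To see how the table breaks down by state, something like the following
sketch could be used. It assumes each TCP line in /proc/net/ip_conntrack has
the form "tcp 6 <seconds-left> <STATE> src=...", i.e. the state is the fourth
whitespace-separated field; the function name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count tracked TCP connections per state, busiest first.
# Assumption: the TCP state is the 4th whitespace-separated field of each
# "tcp" line. An alternative file can be given as $1 (useful for saved copies).
conntrack_tcp_states() {
    awk '$1 == "tcp" { n[$4]++ } END { for (s in n) print n[s], s }' \
        "${1:-/proc/net/ip_conntrack}" | sort -rn
}
```

Running conntrack_tcp_states in dom0 while the stall is happening, and
comparing the total with /proc/sys/net/ipv4/ip_conntrack_max, should show
whether TIME_WAIT entries are what fills the table.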

 -- Keir



* Re: More on networking hang
  2004-09-10 22:36       ` Rob Gardner
  2004-09-11  3:59         ` Keir Fraser
@ 2004-09-11  7:45         ` Ian Pratt
  1 sibling, 0 replies; 10+ messages in thread
From: Ian Pratt @ 2004-09-11  7:45 UTC
  To: Rob Gardner; +Cc: xen-devel, Ian.Pratt

> xen-br0: port 3(vif2.0) entering learning state
> xen-br0: port 3(vif2.0) entering forwarding state
> xen-br0: topology change detected, propagating
> ip_conntrack: table full, dropping packet.
> ip_conntrack: table full, dropping packet.

As I suspected in a previous email, you're using an old
configuration file where Linux's connection tracking was enabled
by default.

Linux's connection tracking code seems to remain active and slow
things down even if you're not using it. That's why I changed the
config option into 'module' rather than 'yes' several weeks back.
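
One way to check which case applies on a running dom0, sketched under the
assumption that /proc/net/ip_conntrack is present whenever connection
tracking is active and that a loaded module appears in /proc/modules as
ip_conntrack. The function name and its two optional arguments are made up
for illustration.

```shell
#!/bin/sh
# Hypothetical helper: report how connection tracking is present in the
# running kernel. The two optional file arguments exist only so the check
# can be run against saved copies; by default it reads the live /proc files.
conntrack_status() {
    table=${1:-/proc/net/ip_conntrack}
    modules=${2:-/proc/modules}
    if [ ! -e "$table" ]; then
        echo "inactive"      # no table: not built in and module not loaded
    elif grep -q '^ip_conntrack ' "$modules" 2>/dev/null; then
        echo "module"        # table present and the module is listed
    else
        echo "built-in"      # table present but no module entry
    fi
}
```

A result of "built-in" would match the older configuration described above.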

Ian


