* (repeatable) cross-domain networking failure
@ 2005-01-15  1:38 mukesh agrawal
  0 siblings, 0 replies; 17+ messages in thread
From: mukesh agrawal @ 2005-01-15  1:38 UTC (permalink / raw)
  To: xen-devel


Summary:

I'm running into a situation where, after sending some UDP traffic between 
two xen domains (Domain 0 and Domain 1) the networking between the 
domains fails. This failure is 100% repeatable.

In more detail:

I have two xen domains. They run the kernels from the 2.0.3 release. (I've 
run into the same problem with 2.0.1 as well.) Domain 0 has 5 physical 
ethernet interfaces, and a virtual interface to Domain 1. Domain 1 has 
just the virtual interface to Domain 0.

D0 is configured with IP address 192.168.0.1, and D1 with 192.168.1.1. The 
netmask is set to 255.255.0.0.

When I bring up D1, I can ping D1 from D0, ssh into D1, etc.

I then start a UDP server in D0, and a traffic generator in D1. After the 
traffic generator sends its 128-th packet, networking between the domains 
fails. The 128th packet is received successfully by the UDP server, but no 
later traffic arrives in D0. This includes UDP, TCP, ICMP, and ARP.

Looking at the interrupt counts in /proc/interrupts, I see that D0 no 
longer receives packets sent by D1. D1, however, does receive packets sent 
by D0. (To be clear, D0->D1 traffic is ICMP ping requests, unrelated to 
the UDP traffic. There is not UDP traffic sent from D0 to D1.)

(I suspect the stuff in this paragraph doesn't matter, but include it for 
completeness.) Eventually, D0's ARP cache entry for D1 expires. D0 ARPs 
for D1, and D1 replies. But D0 never receives these replies. And 
eventually, D1 stops replying to the ARPs entirely. (D1's sending behavior 
is observed via tcpdump running in the console connection to D1.)

Note that the networking failure only occurs if the UDP packets are 
delivered to a user-level process in D0. In particular, UDP traffic to 
D0's kernel NFS server does not induce the failure. Nor does traffic sent 
to D0 for which there is no user process to accept the packets. And 
neither does traffic which is forwarded on to other hosts via NAT. (I 
haven't tested the regular forwarding case.)

Also, for what it's worth, Domain 0's network connectivity on its other 
interfaces (which are connected to the world at large) is unaffected.

Looking through the mailing list archive, I saw a prior bug that seemed 
similar, but involved IP fragmentation. That is not the case here, as the 
UDP packets sent by D1 are small (<100 bytes).

Any suggestions for debugging this?

Thanks,
mukesh


-------------------------------------------------------
The SF.Net email is sponsored by: Beat the post-holiday blues
Get a FREE limited edition SourceForge.net t-shirt from ThinkGeek.
It's fun and FREE -- well, almost....http://www.thinkgeek.com/sfshirt

^ permalink raw reply	[flat|nested] 17+ messages in thread

* (repeatable) cross-domain networking failure
@ 2005-01-15 16:40 mukesh agrawal
  2005-01-15 17:04 ` Keir Fraser
  2005-01-15 21:14 ` Nivedita Singhvi
  0 siblings, 2 replies; 17+ messages in thread
From: mukesh agrawal @ 2005-01-15 16:40 UTC (permalink / raw)
  To: xen-devel


Summary:

After sending some UDP traffic between two xen domains (Domain 0 and 
Domain 1) the networking between the domains fails. This failure is 100% 
repeatable.

In more detail:

I have two xen domains. They run the kernels from the 2.0.3 release. (I've run 
into the same problem with 2.0.1 as well.) Domain 0 has 5 physical ethernet 
interfaces, and a virtual interface to Domain 1. Domain 1 has just the virtual 
interface to Domain 0.

D0 is configured with IP address 192.168.0.1, and D1 with 192.168.1.1. The 
netmask is set to 255.255.0.0.

When I bring up D1, I can ping D1 from D0, ssh into D1, etc.

I then start a UDP server in D0, and a traffic generator in D1. After the 
traffic generator sends its 128-th packet, networking between the domains 
fails. The 128th packet is received successfully by the UDP server, but no 
later traffic arrives in D0. This includes UDP, TCP, ICMP, and ARP.

Looking at the interrupt counts in /proc/interrupts, I see that D0 no longer 
receives packets sent by D1. D1, however, does receive packets sent by D0. (To 
be clear, D0->D1 traffic is ICMP ping requests, unrelated to the UDP traffic. 
There is not UDP traffic sent from D0 to D1.)
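
The /proc/interrupts comparison described above can be sketched roughly as follows. This is only an illustration: the two snapshots, the IRQ numbers, and the "vif1.0" and "eth0" labels are hypothetical sample data, not output captured from the machines in this report.

```python
def parse_interrupts(text: str) -> dict[str, int]:
    """Map each interrupt line's trailing label to its count summed over CPUs."""
    counts = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the CPU header row has no "NN:" prefix
        _, rest = line.split(":", 1)
        fields = rest.split()
        per_cpu = []
        for field in fields:
            if not field.isdigit():
                break
            per_cpu.append(int(field))
        # Keep only lines that have both counters and a trailing label.
        if per_cpu and len(fields) > len(per_cpu):
            counts[fields[-1]] = sum(per_cpu)
    return counts


def stalled(before: str, after: str) -> list[str]:
    """Return labels whose interrupt count did not change between snapshots."""
    old, new = parse_interrupts(before), parse_interrupts(after)
    return sorted(name for name in old if new.get(name) == old[name])


# Hypothetical snapshots taken a few seconds apart in D0.
BEFORE = """\
           CPU0
 12:       1500   Dynamic-irq  vif1.0
 14:       9000   Phys-irq     eth0
"""
AFTER = """\
           CPU0
 12:       1500   Dynamic-irq  vif1.0
 14:       9100   Phys-irq     eth0
"""

print(stalled(BEFORE, AFTER))
```

A label that shows up in the result while the peer is known to be sending is a hint that the receive side of that interface has stopped seeing events.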

(I suspect the stuff in this paragraph doesn't matter, but include it for 
completeness.) Eventually, D0's ARP cache entry for D1 expires. D0 ARPs for D1, 
and D1 replies. But D0 never receives these replies. And eventually, D1 stops 
replying to the ARPs entirely. (D1's sending behavior is observed via tcpdump 
running in the console connection to D1.)

Note that the networking failure only occurs if the UDP packets are delivered 
to a user-level process in D0. In particular, UDP traffic to D0's kernel NFS 
server does not induce the failure. Nor does traffic sent to D0 for which there 
is no user process to accept the packets. And neither does traffic which is 
forwarded on to other hosts via NAT. (I haven't tested the regular forwarding 
case.)

Also, for what it's worth, Domain 0's network connectivity on its other 
interfaces (which are connected to the world at large) is unaffected.

Looking through the mailing list archive, I saw a prior bug that seemed 
similar, but involved IP fragmentation. That is not the case here, as the UDP 
packets sent by D1 are small (<100 bytes).

Any suggestions for debugging this?

Thanks,
mukesh



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: (repeatable) cross-domain networking failure
  2005-01-15 16:40 mukesh agrawal
@ 2005-01-15 17:04 ` Keir Fraser
  2005-01-15 21:14 ` Nivedita Singhvi
  1 sibling, 0 replies; 17+ messages in thread
From: Keir Fraser @ 2005-01-15 17:04 UTC (permalink / raw)
  To: mukesh agrawal; +Cc: xen-devel


Maybe add some tracing to the backend driver -- it's possible the
backend isn't sending responses for those packets back to domU, and so
things seize up for a while. If no responses are being generated it is
because the backend thinks the packets are still in flight, so there
would be some bug-hunting to find out why that is.

 -- Keir





^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: (repeatable) cross-domain networking failure
  2005-01-15 16:40 mukesh agrawal
  2005-01-15 17:04 ` Keir Fraser
@ 2005-01-15 21:14 ` Nivedita Singhvi
       [not found]   ` <e15e04f905011611313312b9f4@mail.gmail.com>
  1 sibling, 1 reply; 17+ messages in thread
From: Nivedita Singhvi @ 2005-01-15 21:14 UTC (permalink / raw)
  To: mukesh agrawal; +Cc: xen-devel

mukesh agrawal wrote:
> 
> Summary:
> 
> After sending some UDP traffic between two xen domains (Domain 0 and 
> Domain 1) the networking between the domains fails. This failure is 100% 
> repeatable.

I don't have boxes at the moment and can't reproduce till
Monday, but can you show us the output of netstat -uan and
netstat -s on both domains? Is there stuff in the receive
or send queues? And was all the udp traffic going to the
same port? i.e. any successful udp traffic to another
endpoint?

> I then start a UDP server in D0, and a traffic generator in D1. After 
> the traffic generator sends its 128-th packet, networking between the 
> domains fails. The 128th packet is received successfully by the UDP 
> server, but no later traffic arrives in D0. This includes UDP, TCP, 
> ICMP, and ARP.

What does ifconfig on dom0 show?
Are there any error messages in /var/log/messages?

> Looking at the interrupt counts in /proc/interrupts, I see that D0 no 
> longer receives packets sent by D1. D1, however, does receive packets 
> sent by D0. (To be clear, D0->D1 traffic is ICMP ping requests, 
> unrelated to the UDP traffic. There is not UDP traffic sent from D0 to D1.)

Is there any other successful traffic from D0 -> D1 (tcp?)

thanks,
Nivedita





^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: (repeatable) cross-domain networking failure
       [not found]   ` <e15e04f905011611313312b9f4@mail.gmail.com>
@ 2005-01-16 20:49     ` mukesh agrawal
  2005-01-16 21:09       ` Keir Fraser
  0 siblings, 1 reply; 17+ messages in thread
From: mukesh agrawal @ 2005-01-16 20:49 UTC (permalink / raw)
  To: xen-devel, Nivedita Singhvi

[-- Attachment #1: Type: TEXT/PLAIN, Size: 5289 bytes --]


Nivedita Singhvi <niv@us.ibm.com> wrote:

> I don't have boxes at the moment and can't reproduce till
> Monday, but can you show us the output of netstat -uan and
> netstat -s on both domains? Is there stuff in the receive
> or send queues?

The detailed output of netstat follows. But there is nothing in the send 
queue on domU, and nothing in the receive queue on dom0. (The UDP server 
in question is running on port 2000.)

On dom0:

$ netstat -uan
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
udp        0      0 0.0.0.0:1024            0.0.0.0:*
udp        0      0 0.0.0.0:2049            0.0.0.0:*
udp        0      0 0.0.0.0:514             0.0.0.0:*
udp        0      0 0.0.0.0:1027            0.0.0.0:*
udp        0      0 155.98.36.34:1028       155.98.32.70:8509       ESTABLISHED
udp        0      0 0.0.0.0:775             0.0.0.0:*
udp        0      0 0.0.0.0:653             0.0.0.0:*
udp        0      0 192.168.0.1:2000        192.168.1.1:1024        ESTABLISHED
udp        0      0 224.4.0.1:2917          0.0.0.0:*
udp        0      0 224.4.0.1:2917          0.0.0.0:*
udp        0      0 224.4.0.1:2917          0.0.0.0:*
udp        0      0 0.0.0.0:111             0.0.0.0:*
udp        0      0 0.0.0.0:759             0.0.0.0:*

On domU:

# netstat -uan
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
udp        0      0 192.168.1.1:1024        192.168.0.1:2000        ESTABLISHED
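
The "nothing in either queue" reading of the output above can also be checked mechanically. The following is a minimal sketch, assuming the column layout shown in the netstat output above (Proto, Recv-Q, Send-Q, Local Address, ...); the function name is made up for illustration, and the sample lines are copied from the output above.

```python
def nonempty_udp_queues(netstat_output: str) -> list[tuple[str, int, int]]:
    """Return (local address, Recv-Q, Send-Q) for UDP sockets with queued data."""
    busy = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip the banner and column-header lines; only "udp" rows are parsed.
        if len(fields) < 5 or fields[0] != "udp":
            continue
        recv_q, send_q, local = int(fields[1]), int(fields[2]), fields[3]
        if recv_q or send_q:
            busy.append((local, recv_q, send_q))
    return busy


DOM0 = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
udp        0      0 0.0.0.0:2049            0.0.0.0:*
udp        0      0 192.168.0.1:2000        192.168.1.1:1024        ESTABLISHED
"""
DOMU = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
udp        0      0 192.168.1.1:1024        192.168.0.1:2000        ESTABLISHED
"""

print(nonempty_udp_queues(DOM0), nonempty_udp_queues(DOMU))
```

Both lists come back empty for the output above, which suggests the packets are not sitting in a socket queue on either side.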

The netstat -s output is a bit long, so I've attached those, instead of 
including them inline.

> And was all the udp traffic going to the same port? i.e. any successful 
> udp traffic to another endpoint?

All the traffic was going to port 2000. Trying to send UDP traffic from 
domU to a different port in dom0 (after the networking failure) does not 
succeed. (If you're asking whether traffic could be sent to multiple ports 
while the networking is functional, I believe the answer is yes, but I would 
need to double-check.)

> What does ifconfig on dom0 show?
> Are there any error messages in /var/log/messages?

$ ifconfig vif1.0
vif1.0    Link encap:Ethernet  HWaddr AA:00:01:7B:92:C2
           inet addr:192.168.0.1  Bcast:192.168.0.255  Mask:255.255.0.0
           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
           RX packets:134 errors:0 dropped:0 overruns:0 frame:0
           TX packets:16 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:0
           RX bytes:5884 (5.7 Kb)  TX bytes:676 (676.0 b)

$ sudo tail /var/log/messages
Jan 16 19:34:09 node1 ntpd[993]: kernel time sync disabled 0041
Jan 16 19:35:15 node1 ntpd[993]: kernel time sync enabled 0001
Jan 16 19:39:29 node1 ntpd[993]: synchronized to 155.98.33.74, stratum=2
Jan 16 19:49:07 node1 ntpd[993]: time correction of -18001 seconds exceeds sanity limit (1000); set clock manually to the correct UTC time.
Jan 16 19:59:15 node1 sshd(pam_unix)[1457]: session opened for user mukesh by (uid=30245)
Jan 16 19:59:18 node1 sshd(pam_unix)[1486]: session opened for user mukesh by (uid=30245)
Jan 16 19:59:30 node1 sshd(pam_unix)[1517]: session opened for user mukesh by (uid=30245)
Jan 16 20:09:29 node1 modprobe: modprobe: Can't open dependencies file /lib/modules/2.4.27-xen0/modules.dep (No such file or directory)
Jan 16 20:09:44 node1 last message repeated 2 times
Jan 16 20:16:02 node1 kernel: device vif1.0 entered promiscuous mode

>> Looking at the interrupt counts in /proc/interrupts, I see that D0 no
>> longer receives packets sent by D1. D1, however, does receive packets
>> sent by D0. (To be clear, D0->D1 traffic is ICMP ping requests,
>> unrelated to the UDP traffic. There is not UDP traffic sent from D0 to D1.)
>
> Is there any other successful traffic from D0 -> D1 (tcp?)

Any traffic is successful from D0->D1, even after the network stops 
working. This includes ICMP, UDP, and TCP. (Sorry if my comment about 
"There is not UDP traffic sent from D0 to D1" was confusing. What I meant 
was that I wasn't sending any UDP traffic from D0 to D1, not that such 
traffic fails.)

This is subject to the limitation mentioned in my first message. Namely, 
that dom0's ARP cache entry for domU eventually times out. At that point, 
dom0 attempts to ARP for domU's MAC. domU sees this, and replies (as seen 
by tcpdump on domU). But dom0 never gets the ARP replies, so eventually 
D0->D1 traffic fails as well. (E.g. "telnet 192.168.1.1" returns "No route 
to host".)

Also, let me add some more detail to my original report:

1. The networking fails after the 128th UDP packet received in dom0, even 
if I restart domU. Specifically:

        - If I send one UDP packet from domU to dom0, shut down domU, and
          start a fresh domU, then I can only send 127 (rather than
          128) UDP packets from the new domU before networking will fail.

        - If I shut down domU after the networking failure, and start a
          new domU, networking between the new domU and dom0 does not
          work.

2. The server run in dom0 is
 	nc -l -u -p 2000

3. The traffic generator run in domU is

 	i=0; while true; do
 		((++i)); echo $i
 		echo $i | nc -u -w 1 192.168.0.1 2000
 	done &

thanks,
mukesh

[-- Attachment #2: netstat -s for domain0 --]
[-- Type: TEXT/plain, Size: 2109 bytes --]

$ netstat -s
Ip:
    177642 total packets received
    0 forwarded
    0 incoming packets discarded
    177538 incoming packets delivered
    98742 requests sent out
Icmp:
    0 ICMP messages received
    0 input ICMP message failed.
    ICMP input histogram:
    0 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
Tcp:
    86 active connections openings
    10 passive connection openings
    0 failed connection attempts
    0 connection resets received
    20 connections established
    177122 segments received
    98563 segments send out
    0 segments retransmited
    0 bad segments received.
    0 resets sent
Udp:
    290 packets received
    0 packets to unknown port received.
    0 packet receive errors
    179 packets sent
TcpExt:
    ArpFilter: 0
    65 TCP sockets finished time wait in fast timer
    1011 delayed acks sent
    1 delayed acks further delayed because of locked socket
    94 packets directly queued to recvmsg prequeue.
    1644 packets directly received from backlog
    4038 packets directly received from prequeue
    170004 packets header predicted
    83 packets header predicted and directly queued to user
    TCPPureAcks: 2720
    TCPHPAcks: 1773
    TCPRenoRecovery: 0
    TCPSackRecovery: 0
    TCPSACKReneging: 0
    TCPFACKReorder: 0
    TCPSACKReorder: 0
    TCPRenoReorder: 0
    TCPTSReorder: 0
    TCPFullUndo: 0
    TCPPartialUndo: 0
    TCPDSACKUndo: 0
    TCPLossUndo: 0
    TCPLoss: 0
    TCPLostRetransmit: 0
    TCPRenoFailures: 0
    TCPSackFailures: 0
    TCPLossFailures: 0
    TCPFastRetrans: 0
    TCPForwardRetrans: 0
    TCPSlowStartRetrans: 0
    TCPTimeouts: 0
    TCPRenoRecoveryFail: 0
    TCPSackRecoveryFail: 0
    TCPSchedulerFailed: 0
    TCPRcvCollapsed: 0
    TCPDSACKOldSent: 0
    TCPDSACKOfoSent: 0
    TCPDSACKRecv: 0
    TCPDSACKOfoRecv: 0
    TCPAbortOnSyn: 0
    TCPAbortOnData: 0
    TCPAbortOnClose: 0
    TCPAbortOnMemory: 0
    TCPAbortOnTimeout: 0
    TCPAbortOnLinger: 0
    TCPAbortFailed: 0
    TCPMemoryPressures: 0


[-- Attachment #3: netstat -s for domain1 --]
[-- Type: TEXT/plain, Size: 1717 bytes --]

# netstat -s
Ip:
    4 total packets received
    0 forwarded
    0 incoming packets discarded
    4 incoming packets delivered
    275 requests sent out
Icmp:
    0 ICMP messages received
    0 input ICMP message failed.
    ICMP input histogram:
    0 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
Tcp:
    0 active connections openings
    0 passive connection openings
    0 failed connection attempts
    0 connection resets received
    0 connections established
    0 segments received
    0 segments send out
    0 segments retransmited
    0 bad segments received.
    0 resets sent
Udp:
    4 packets received
    0 packets to unknown port received.
    0 packet receive errors
    275 packets sent
TcpExt:
    ArpFilter: 0
    0 packets header predicted
    TCPPureAcks: 0
    TCPHPAcks: 0
    TCPRenoRecovery: 0
    TCPSackRecovery: 0
    TCPSACKReneging: 0
    TCPFACKReorder: 0
    TCPSACKReorder: 0
    TCPRenoReorder: 0
    TCPTSReorder: 0
    TCPFullUndo: 0
    TCPPartialUndo: 0
    TCPDSACKUndo: 0
    TCPLossUndo: 0
    TCPLoss: 0
    TCPLostRetransmit: 0
    TCPRenoFailures: 0
    TCPSackFailures: 0
    TCPLossFailures: 0
    TCPFastRetrans: 0
    TCPForwardRetrans: 0
    TCPSlowStartRetrans: 0
    TCPTimeouts: 0
    TCPRenoRecoveryFail: 0
    TCPSackRecoveryFail: 0
    TCPSchedulerFailed: 0
    TCPRcvCollapsed: 0
    TCPDSACKOldSent: 0
    TCPDSACKOfoSent: 0
    TCPDSACKRecv: 0
    TCPDSACKOfoRecv: 0
    TCPAbortOnSyn: 0
    TCPAbortOnData: 0
    TCPAbortOnClose: 0
    TCPAbortOnMemory: 0
    TCPAbortOnTimeout: 0
    TCPAbortOnLinger: 0
    TCPAbortFailed: 0
    TCPMemoryPressures: 0


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: (repeatable) cross-domain networking failure
  2005-01-16 20:49     ` mukesh agrawal
@ 2005-01-16 21:09       ` Keir Fraser
  2005-01-16 21:56         ` mukesh agrawal
  0 siblings, 1 reply; 17+ messages in thread
From: Keir Fraser @ 2005-01-16 21:09 UTC (permalink / raw)
  To: mukesh agrawal; +Cc: xen-devel, Nivedita Singhvi

> Also, let me add some more detail to my original report:
> 
> 1. The networking fails after the 128th UDP packet received in dom0, even 
> if I restart domU. Specifically:
> 
>  	- If I send one UDP packet from domU to dom0, shut down domU, and
>  	  start a fresh domU, then I can only send 127 (rather than
>  	  128) UDP packets from the new domU before networking will fail.
> 
>  	- If I shut down domU after the networking failure, and start a
>            new domU, networking between the new domU and dom0 does not
>            work.
> 

This corroborates my initial guess that the backend driver (in DOM0) is
sending the packets into the DOM0 networking layer, and never hearing
back when the packet is freed. Normally this would trigger a response
to be sent back to the domU and resources in the backend driver would
get freed up. This isn't happening and you eventually hit a limit on
the number of packets that the driver will simultaneously put in
flight. 

Either those UDP packets are queued up somewhere in the DOM0 network
stack, or the destructor callback is not getting called for some
reason or has got overwritten(!).
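
The behaviour described above can be illustrated with a toy model. This is a sketch under assumptions, not the netback code: the class and function names are hypothetical, and the limit of 128 is taken from the packet count in the report rather than from the driver source.

```python
class ToyBackend:
    """Toy stand-in for the dom0 backend driver; not the real netback code."""

    def __init__(self, max_in_flight: int = 128):
        # Assumed limit, chosen to match the 128 packets seen in the report.
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def receive_from_guest(self, destructor_runs: bool) -> bool:
        """Pass one guest packet to the dom0 stack; False means the path is stalled."""
        if self.in_flight >= self.max_in_flight:
            return False
        self.in_flight += 1
        if destructor_runs:
            # The free callback fires, so the slot is released again.
            self.in_flight -= 1
        return True


def packets_until_stall(backend: ToyBackend, destructor_runs: bool = False,
                        cap: int = 10_000) -> int:
    """Count packets accepted before the backend refuses one (or cap is hit)."""
    sent = 0
    while sent < cap and backend.receive_from_guest(destructor_runs):
        sent += 1
    return sent


backend = ToyBackend()
backend.receive_from_guest(destructor_runs=False)  # one packet from the first domU
# Restarting domU does not reset state that lives in the dom0 backend.
print(packets_until_stall(backend))
```

In this model a fresh domU gets 127 packets through after one leaked packet, matching the restart observation quoted above, while a backend whose destructor always runs never stalls.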

 -- Keir



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: (repeatable) cross-domain networking failure
  2005-01-16 21:09       ` Keir Fraser
@ 2005-01-16 21:56         ` mukesh agrawal
  0 siblings, 0 replies; 17+ messages in thread
From: mukesh agrawal @ 2005-01-16 21:56 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel, Nivedita Singhvi

On Sun, 16 Jan 2005, Keir Fraser wrote:

> This corroborates my initial guess that the backend driver (in DOM0) is
> sending the packets into the DOM0 networking layer, and never hearing
> back when the packet is freed. Normally this would trigger a response
> to be sent back to the domU and resources in the backend driver would
> get freed up. This isn't happening and you eventually hit a limit on
> the number of packets that the driver will simultaneously put in
> flight.

When you say "resources in the backend driver would get freed up", that's 
the domU (sender) backend driver?

> Either those UDP packets are queued up somewhere in the DOM0 network
> stack, or the destructor callback is not getting called for some
> reason or has got overwritten(!).

Well, the packets aren't stuck in the dom0 network stack... They get 
delivered all the way up to the application just fine (nc in the trivial 
test case). So I think it must be the latter... After delivering the UDP 
packet to the application, the destructor is not being called back.

Further, this seems to be specific to the receive path for packets 
delivered to userspace (since traffic to the kernel NFS server doesn't 
seem to trigger it, nor traffic to closed ports).

What (specific source files or documentation) would you suggest starting 
at, to see an example of how the destruction is supposed to be done? I 
guess the TCP receive code works properly, so maybe I should compare that 
to the UDP code?

Thanks,
mukesh




^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: (repeatable) cross-domain networking failure
@ 2005-01-16 22:52 Ian Pratt
  2005-01-16 22:57 ` mukesh agrawal
  0 siblings, 1 reply; 17+ messages in thread
From: Ian Pratt @ 2005-01-16 22:52 UTC (permalink / raw)
  To: mukesh agrawal, Keir Fraser; +Cc: xen-devel, Nivedita Singhvi

> What (specific source files or documentation) would you 
> suggest starting 
> at, to see an example of how the destruction is supposed to 
> be done? I 
> guess the TCP receive code works properly, so maybe I should 
> compare that 
> to the UDP code?

Have you modified the config of your kernel at all? Can you reproduce
with one of the kernels compiled by us?

To debug this, I'd start off by instrumenting calls to skb_dequeue in
netback's net_rx_action, along with calls to skb_free and __kfree_skb in
skbuff.c

Ian



^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: (repeatable) cross-domain networking failure
  2005-01-16 22:52 Ian Pratt
@ 2005-01-16 22:57 ` mukesh agrawal
  0 siblings, 0 replies; 17+ messages in thread
From: mukesh agrawal @ 2005-01-16 22:57 UTC (permalink / raw)
  To: Ian Pratt; +Cc: Keir Fraser, xen-devel, Nivedita Singhvi

On Sun, 16 Jan 2005, Ian Pratt wrote:

> Have you modified the config of your kernel at all? Can you reproduce
> with one of the kernels compiled by us?

Yep. I've experienced these hangs with the kernels and hypervisor from 
the Xen 2.0.3 release.

> To debug this, I'd start off by instrumenting calls to skb_dequeue in
> netback's net_rx_action, along with calls to skb_free and __kfree_skb in
> skbuff.c

Ok, will do.

Thanks,
mukesh




^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: (repeatable) cross-domain networking failure
@ 2005-01-17 23:14 Ian Pratt
  2005-01-18  2:06 ` Adam Heath
  2005-01-18 11:05 ` Keir Fraser
  0 siblings, 2 replies; 17+ messages in thread
From: Ian Pratt @ 2005-01-17 23:14 UTC (permalink / raw)
  To: Ian Pratt, mukesh agrawal, Keir Fraser; +Cc: xen-devel, Nivedita Singhvi


OK, I have a good handle on the problem with UDP hangs into user-space
of domain 0.

It's down to message size: if the UDP payload size is less than 24
bytes, the buffer is not freed properly. Bizarre, but it explains why
our regression tests weren't picking it up as they all use larger
message sizes.

Anyhow, now we can reproduce, a fix should be forthcoming.
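
As a quick sanity check of the size explanation against the reported reproducer (echo $i piped into nc -u), the payload sizes can be worked out as below. This is a sketch that assumes nc sends each echoed line as a single datagram consisting of the decimal digits plus a newline; the helper name is made up for illustration.

```python
THRESHOLD = 24  # bytes; payloads smaller than this are the ones that leak


def reproducer_payload(i: int) -> bytes:
    """Payload of `echo $i`: the decimal digits of i followed by a newline."""
    return f"{i}\n".encode("ascii")


# Packets 1..128 are the ones sent before networking stalls in the report.
sizes = [len(reproducer_payload(i)) for i in range(1, 129)]
print(max(sizes), all(size < THRESHOLD for size in sizes))
```

Under that assumption, every packet sent by the reproducer carries at most 4 bytes of payload, so all of them fall below the 24-byte threshold.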

Ian



^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: (repeatable) cross-domain networking failure
  2005-01-17 23:14 (repeatable) cross-domain networking failure Ian Pratt
@ 2005-01-18  2:06 ` Adam Heath
  2005-01-18 11:05 ` Keir Fraser
  1 sibling, 0 replies; 17+ messages in thread
From: Adam Heath @ 2005-01-18  2:06 UTC (permalink / raw)
  To: Ian Pratt
  Cc: mukesh agrawal, Keir Fraser, xen-devel@lists.sourceforge.net,
	Nivedita Singhvi

On Mon, 17 Jan 2005, Ian Pratt wrote:

>
> OK, I have a good handle on the problem with UDP hangs into user-space
> of domain 0.
>
> It's down to message size: if the UDP payload size is less than 24
> bytes, the buffer is not freed properly. Bizarre, but it explains why
> our regression tests weren't picking it up as they all use larger
> message sizes.
>
> Anyhow, now we can reproduce, a fix should be forthcoming.

Is it possible for an nfs request/response to be less than 24 bytes in size?



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: (repeatable) cross-domain networking failure
  2005-01-17 23:14 (repeatable) cross-domain networking failure Ian Pratt
  2005-01-18  2:06 ` Adam Heath
@ 2005-01-18 11:05 ` Keir Fraser
  2005-01-18 11:28   ` Keir Fraser
  2005-01-19 23:17   ` mukesh agrawal
  1 sibling, 2 replies; 17+ messages in thread
From: Keir Fraser @ 2005-01-18 11:05 UTC (permalink / raw)
  To: Ian Pratt; +Cc: mukesh agrawal, Keir Fraser, xen-devel, Nivedita Singhvi

> 
> OK, I have a good handle on the problem with UDP hangs into user-space
> of domain 0.
> 
> It's down to message size: if the UDP payload size is less than 24
> bytes, the buffer is not freed properly. Bizarre, but it explains why
> our regression tests weren't picking it up as they all use larger
> message sizes.
> 
> Anyhow, now we can reproduce, a fix should be forthcoming.
> 
> Ian
> 

This bug is now (hopefully) fixed in the testing and unstable trees.

 -- Keir



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: (repeatable) cross-domain networking failure
  2005-01-18 11:05 ` Keir Fraser
@ 2005-01-18 11:28   ` Keir Fraser
  2005-01-18 16:04     ` Nivedita Singhvi
  2005-01-20 19:11     ` Adam Heath
  2005-01-19 23:17   ` mukesh agrawal
  1 sibling, 2 replies; 17+ messages in thread
From: Keir Fraser @ 2005-01-18 11:28 UTC (permalink / raw)
  To: xen-devel


> OK, I have a good handle on the problem with UDP hangs into user-space
> of domain 0.
> 
> It's down to message size: if the UDP payload size is less than 24
> bytes, the buffer is not freed properly. Bizarre, but it explains why
> our regression tests weren't picking it up as they all use larger
> message sizes.
> 
> Anyhow, now we can reproduce, a fix should be forthcoming.
> 
> Ian

This bug is now (hopefully) fixed in the testing and unstable trees.
 
  -- Keir



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: (repeatable) cross-domain networking failure
  2005-01-18 11:28   ` Keir Fraser
@ 2005-01-18 16:04     ` Nivedita Singhvi
  2005-01-20 19:11     ` Adam Heath
  1 sibling, 0 replies; 17+ messages in thread
From: Nivedita Singhvi @ 2005-01-18 16:04 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

Keir Fraser wrote:

>>Anyhow, now we can reproduce, a fix should be forthcoming.
>>
>>Ian
> 
> 
> This bug is now (hopefully) fixed in the testing and unstable trees.

Many thanks, Ian and Keir!

I know this was recently mentioned on a thread, but I'm unable
to remember or locate it: are your regression tests
available publicly? I'm currently helping some engineers
put automated testing for this in place internally. The small
message test (a netperf run with the message size going from, say,
1 byte in steps to > ~64K) is very handy indeed; it has often
exposed problems. We'd be glad to throw some tests at you
as well.
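
The message-size sweep described above can be approximated without netperf. The following is only a minimal sketch under stated assumptions: it echoes datagrams over loopback, whereas a real cross-domain test would run the echo side in the other domain. The names `sweep_sizes`, `echo_server`, and `run_sweep`, and the step pattern (every size up to 64 bytes, then doubling), are illustrative choices and not part of any Xen test suite.

```python
import socket
import threading

# Largest UDP payload over IPv4: 65535 - 20 (IP header) - 8 (UDP header).
MAX_UDP_PAYLOAD = 65507


def sweep_sizes(limit=MAX_UDP_PAYLOAD):
    """Yield every size from 1 to 64 bytes, then doubling sizes, ending at limit."""
    size = 1
    last = 0
    while size <= limit:
        yield size
        last = size
        size = size + 1 if size < 64 else size * 2
    if last != limit:
        yield limit


def echo_server(sock, stop):
    """Echo each datagram back to its sender until asked to stop."""
    sock.settimeout(0.2)
    while not stop.is_set():
        try:
            data, addr = sock.recvfrom(65535)
        except socket.timeout:
            continue
        sock.sendto(data, addr)


def run_sweep(host="127.0.0.1", limit=MAX_UDP_PAYLOAD, timeout=1.0):
    """Send one datagram per size to a local echo server; return the sizes that failed."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind((host, 0))  # port 0: let the OS pick a free port
    stop = threading.Event()
    thread = threading.Thread(target=echo_server, args=(server, stop), daemon=True)
    thread.start()

    failed = []
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.settimeout(timeout)
    try:
        for size in sweep_sizes(limit):
            payload = bytes(size)
            try:
                client.sendto(payload, server.getsockname())
                reply, _ = client.recvfrom(65535)
                if reply != payload:
                    failed.append(size)
            except OSError:  # includes timeouts and "message too long"
                failed.append(size)
    finally:
        stop.set()
        thread.join()
        client.close()
        server.close()
    return failed


if __name__ == "__main__":
    print("failed sizes:", run_sweep())
```

Sweeping every size at the low end is deliberate: a threshold as small as the 24 bytes reported in this thread would be missed by a test that starts at a larger message size.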

thanks,
Nivedita





^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: (repeatable) cross-domain networking failure
  2005-01-18 11:05 ` Keir Fraser
  2005-01-18 11:28   ` Keir Fraser
@ 2005-01-19 23:17   ` mukesh agrawal
  1 sibling, 0 replies; 17+ messages in thread
From: mukesh agrawal @ 2005-01-19 23:17 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Ian Pratt, xen-devel

On Tue, 18 Jan 2005, Keir Fraser wrote:

> This bug is now (hopefully) fixed in the testing and unstable trees.

Yep, works for me now.

Thanks!




^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: (repeatable) cross-domain networking failure
  2005-01-18 11:28   ` Keir Fraser
  2005-01-18 16:04     ` Nivedita Singhvi
@ 2005-01-20 19:11     ` Adam Heath
  1 sibling, 0 replies; 17+ messages in thread
From: Adam Heath @ 2005-01-20 19:11 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel@lists.sourceforge.net

On Tue, 18 Jan 2005, Keir Fraser wrote:

>
> > OK, I have a good handle on the problem with UDP hangs into user-space
> > of domain 0.
> >
> > It's down to message size: if the UDP payload size is less than 24
> > bytes, the buffer is not freed properly. Bizarre, but it explains why
> > our regression tests weren't picking it up as they all use larger
> > message sizes.
> >
> > Anyhow, now we can reproduce, a fix should be forthcoming.
> >
> > Ian
>
> This bug is now (hopefully) fixed in the testing and unstable trees.

Does this bug exist in the stable(2.0) tree?



^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: (repeatable) cross-domain networking failure
@ 2005-01-20 22:08 Ian Pratt
  0 siblings, 0 replies; 17+ messages in thread
From: Ian Pratt @ 2005-01-20 22:08 UTC (permalink / raw)
  To: Adam Heath, Keir Fraser; +Cc: xen-devel

> > > It's down to message size: if the UDP payload size is less than 24
> > > bytes, the buffer is not freed properly. Bizarre, but it explains why
> > > our regression tests weren't picking it up as they all use larger
> > > message sizes.
> > >
> > > Anyhow, now we can reproduce, a fix should be forthcoming.
> > >
> > > Ian
> >
> > This bug is now (hopefully) fixed in the testing and unstable trees.
> 
> Does this bug exist in the stable(2.0) tree?

Yes - it will be fixed in 2.0.4. It was pretty obscure (having been in
there ever since 1.3) so we're not rushing headlong into doing a new
release.
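
For anyone wanting to check whether a given build is affected, the rough sketch below sends a burst of UDP datagrams with payloads under the 24-byte threshold to a user-level receiver and counts how many arrive. The original report stalled after the 128th packet, so the burst is longer than that. Everything here is an assumption for illustration: the 16-byte payload, the burst length of 200, the pacing interval, and the loopback demo. A real check would run the sender in domain 1 and the receiver in domain 0.

```python
import socket
import threading
import time

SMALL_PAYLOAD = 16  # assumed size, below the 24-byte threshold from the thread
BURST = 200         # assumed length, longer than the 128 packets in the report


def make_payload(seq, size=SMALL_PAYLOAD):
    """Build a payload of exactly `size` bytes carrying a 4-byte sequence number."""
    return seq.to_bytes(4, "big").ljust(size, b"\0")


def send_burst(dest, count=BURST, size=SMALL_PAYLOAD, interval=0.001):
    """Send `count` small datagrams to `dest`, lightly paced."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for seq in range(count):
            sock.sendto(make_payload(seq, size), dest)
            time.sleep(interval)


def count_received(sock, expected=BURST, timeout=1.0):
    """Receive until `expected` distinct datagrams arrive or the socket goes quiet."""
    sock.settimeout(timeout)
    seen = set()
    while len(seen) < expected:
        try:
            data, _ = sock.recvfrom(2048)
        except socket.timeout:
            break
        seen.add(int.from_bytes(data[:4], "big"))
    return len(seen)


def loopback_check(count=BURST):
    """Run sender and receiver on loopback; return how many datagrams arrived."""
    recv_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    recv_sock.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    result = []
    receiver = threading.Thread(
        target=lambda: result.append(count_received(recv_sock, count))
    )
    receiver.start()
    send_burst(recv_sock.getsockname(), count)
    receiver.join()
    recv_sock.close()
    return result[0]


if __name__ == "__main__":
    print("received", loopback_check(), "of", BURST)
```

On an affected setup one would expect the count to stop well short of the burst length; a full count does not prove the fix is present, only that this particular burst got through.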

Ian



^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2005-01-20 22:08 UTC | newest]

Thread overview: 17+ messages
2005-01-17 23:14 (repeatable) cross-domain networking failure Ian Pratt
2005-01-18  2:06 ` Adam Heath
2005-01-18 11:05 ` Keir Fraser
2005-01-18 11:28   ` Keir Fraser
2005-01-18 16:04     ` Nivedita Singhvi
2005-01-20 19:11     ` Adam Heath
2005-01-19 23:17   ` mukesh agrawal
  -- strict thread matches above, loose matches on Subject: below --
2005-01-20 22:08 Ian Pratt
2005-01-16 22:52 Ian Pratt
2005-01-16 22:57 ` mukesh agrawal
2005-01-15 16:40 mukesh agrawal
2005-01-15 17:04 ` Keir Fraser
2005-01-15 21:14 ` Nivedita Singhvi
     [not found]   ` <e15e04f905011611313312b9f4@mail.gmail.com>
2005-01-16 20:49     ` mukesh agrawal
2005-01-16 21:09       ` Keir Fraser
2005-01-16 21:56         ` mukesh agrawal
2005-01-15  1:38 mukesh agrawal
