* Re: kernel oops/IRQ exception when networking between many domUs
From: Birger Toedtmann @ 2005-06-06  6:42 UTC
To: xen-users, xen-devel

Re-post without attachments for list readers.

Keir Fraser wrote on Sun, Jun 05, 2005 at 05:52:13PM +0100:
>
> On 4 Jun 2005, at 18:05, Birger Tödtmann wrote:
>
> > Funnily, nothing happened after starting the first 10-12 nodes, but
> > after "xm create"ing one or two more nodes, the system oopsed with
> > at least some info, but sysrq was gone as well. So I wrote it down
> > on a piece of paper ;-) , hopefully someone can make sense of it:
>
> Do you have the vmlinux file? It would be useful to know where in
> net_rx_action the crash is happening.

Apparently it is happening somewhere here:

[...]
0xc028cbe5 <net_rx_action+1135>:        test   %eax,%eax
0xc028cbe7 <net_rx_action+1137>:        je     0xc028ca82 <net_rx_action+780>
0xc028cbed <net_rx_action+1143>:        mov    %esi,%eax
0xc028cbef <net_rx_action+1145>:        shr    $0xc,%eax
0xc028cbf2 <net_rx_action+1148>:        mov    %eax,(%esp)
0xc028cbf5 <net_rx_action+1151>:        call   0xc028c4c4 <free_mfn>
0xc028cbfa <net_rx_action+1156>:        mov    $0xffffffff,%ecx
^^^^^^^^^^
0xc028cbff <net_rx_action+1161>:        jmp    0xc028ca82 <net_rx_action+780>
0xc028cc04 <net_rx_action+1166>:        call   0xc02c59fe <net_ratelimit>
0xc028cc09 <net_rx_action+1171>:        test   %eax,%eax
0xc028cc0b <net_rx_action+1173>:        jne    0xc028cc47 <net_rx_action+1233>
0xc028cc0d <net_rx_action+1175>:        mov    0xc0378b60,%eax
[...]

which is, I presume, reflected by this section within net_rx_action():

[...]
        /* Check the reassignment error code. */
        status = NETIF_RSP_OKAY;
        if ( unlikely(mcl[1].args[5] != 0) )
        {
            DPRINTK("Failed MMU update transferring to DOM%u\n",
                    netif->domid);
            free_mfn(mdata >> PAGE_SHIFT);
            status = NETIF_RSP_ERROR;
        }
[...]

Kernel image and System.map attached.

Regards,
-- 
Birger Tödtmann
Technik der Rechnernetze,
Institut für Experimentelle Mathematik und
Institut für Informatik und Wirtschaftsinformatik,
Universität Duisburg-Essen
email:btoedtmann@iem.uni-due.de  skype:birger.toedtmann
pgp:0x6FB166C9  icq:294947817
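
As a cross-check of the disassembly above, here is a minimal sketch (Python; the helper names are illustrative and not part of the kernel or Xen sources). It derives the start address of net_rx_action from the quoted address/offset pairs and mirrors the `shr $0xc` step, which appears to correspond to `mdata >> PAGE_SHIFT`. PAGE_SHIFT = 12 is an assumption (4 KiB pages on 32-bit x86), and the address passed to frame_number() is made up.

```python
# Addresses and decimal offsets copied from the gdb disassembly quoted above.
DISASSEMBLY = {
    0xc028cbe5: 1135,  # test %eax,%eax
    0xc028cbf5: 1151,  # call free_mfn
    0xc028cbfa: 1156,  # mov $0xffffffff,%ecx (the marked instruction)
}

PAGE_SHIFT = 12  # assumption: 4 KiB pages, matching "shr $0xc"


def function_base(listing):
    """Return the function start implied by every (address, offset) pair."""
    bases = {addr - off for addr, off in listing.items()}
    if len(bases) != 1:
        raise ValueError("inconsistent listing: %r" % sorted(bases))
    return bases.pop()


def frame_number(machine_addr):
    """Mirror of `mdata >> PAGE_SHIFT`: drop the in-page offset bits."""
    return machine_addr >> PAGE_SHIFT


base = function_base(DISASSEMBLY)
print(hex(base))                      # start of net_rx_action in this build
print(hex(base + 780))                # target of the jumps to net_rx_action+780
print(hex(frame_number(0x12345678)))  # made-up address, for illustration only
```

The marked address 0xc028cbfa is the instruction right after the call to free_mfn, so it reads as a return address; that is consistent with the fault being raised inside free_mfn rather than at the mov itself.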
[parent not found: <20050605165716.GA1231@exp-math.uni-essen.de>]
* Re: [Xen-devel] kernel oops/IRQ exception when networking between many domUs
From: Keir Fraser @ 2005-06-06  8:23 UTC
To: Birger Toedtmann
Cc: doll, xen-devel, xen-users

On 5 Jun 2005, at 17:57, Birger Toedtmann wrote:

> Apparently it is happening somewhere here:
>
> [...]
> 0xc028cbe5 <net_rx_action+1135>:        test   %eax,%eax
> 0xc028cbe7 <net_rx_action+1137>:        je     0xc028ca82 <net_rx_action+780>
> 0xc028cbed <net_rx_action+1143>:        mov    %esi,%eax
> 0xc028cbef <net_rx_action+1145>:        shr    $0xc,%eax
> 0xc028cbf2 <net_rx_action+1148>:        mov    %eax,(%esp)
> 0xc028cbf5 <net_rx_action+1151>:        call   0xc028c4c4 <free_mfn>
> 0xc028cbfa <net_rx_action+1156>:        mov    $0xffffffff,%ecx
> ^^^^^^^^^^

Most likely the driver has tried to send a bogus page to a domU.
Because it's bogus the transfer fails. The driver then tries to free
the page back to Xen, but that also fails because the page is bogus.
This confuses the driver, which then BUG()s out.

It's not at all clear where the bogus address comes from: the driver
basically just reads the address out of an skbuff, and converts it from
virtual to physical address. But something is obviously going wrong,
perhaps under memory pressure. :-(

  -- Keir
* Re: kernel oops/IRQ exception when networking between many domUs
From: Birger Tödtmann @ 2005-06-06  8:52 UTC
To: Keir Fraser
Cc: xen-devel, xen-users

On Monday, 06.06.2005, at 09:23 +0100, Keir Fraser wrote:
> On 5 Jun 2005, at 17:57, Birger Toedtmann wrote:
>
> > Apparently it is happening somewhere here:
> >
> > [...]
> > 0xc028cbe5 <net_rx_action+1135>:        test   %eax,%eax
> > 0xc028cbe7 <net_rx_action+1137>:        je     0xc028ca82 <net_rx_action+780>
> > 0xc028cbed <net_rx_action+1143>:        mov    %esi,%eax
> > 0xc028cbef <net_rx_action+1145>:        shr    $0xc,%eax
> > 0xc028cbf2 <net_rx_action+1148>:        mov    %eax,(%esp)
> > 0xc028cbf5 <net_rx_action+1151>:        call   0xc028c4c4 <free_mfn>
> > 0xc028cbfa <net_rx_action+1156>:        mov    $0xffffffff,%ecx
> > ^^^^^^^^^^
>
> Most likely the driver has tried to send a bogus page to a domU.
> Because it's bogus the transfer fails. The driver then tries to free
> the page back to Xen, but that also fails because the page is bogus.
> This confuses the driver, which then BUG()s out.

I commented out the free_mfn() and status= lines: the kernel now reports
the following after it configured the 10th domU and ~80th vif, with
approx. 20-25 bridges up. Just an idea: the number of vifs + bridges is
somewhere around the magic 128 (NR_IRQS problem in 2.0.x!) when the
crash happens - could this hint at something?

[...]
Jun 6 10:12:14 lomin kernel: 10.2.23.8: port 2(vif10.3) entering forwarding state
Jun 6 10:12:14 lomin kernel: 10.2.35.16: topology change detected, propagating
Jun 6 10:12:14 lomin kernel: 10.2.35.16: port 2(vif10.4) entering forwarding state
Jun 6 10:12:14 lomin kernel: 10.2.35.20: topology change detected, propagating
Jun 6 10:12:14 lomin kernel: 10.2.35.20: port 2(vif10.5) entering forwarding state
Jun 6 10:12:20 lomin kernel: c014cea4
Jun 6 10:12:20 lomin kernel: [do_page_fault+643/1665] do_page_fault+0x469/0x738
Jun 6 10:12:20 lomin kernel: [<c0115720>] do_page_fault+0x469/0x738
Jun 6 10:12:20 lomin kernel: [fixup_4gb_segment+2/12] page_fault+0x2e/0x34
Jun 6 10:12:20 lomin kernel: [<c0109a7e>] page_fault+0x2e/0x34
Jun 6 10:12:20 lomin kernel: [do_page_fault+49/1665] do_page_fault+0x217/0x738
Jun 6 10:12:20 lomin kernel: [<c01154ce>] do_page_fault+0x217/0x738
Jun 6 10:12:20 lomin kernel: [fixup_4gb_segment+2/12] page_fault+0x2e/0x34
Jun 6 10:12:20 lomin kernel: [<c0109a7e>] page_fault+0x2e/0x34
Jun 6 10:12:20 lomin kernel: PREEMPT
Jun 6 10:12:20 lomin kernel: Modules linked in: dm_snapshot pcmcia bridge ipt_REJECT ipt_state iptable_filter ipt_MASQUERADE iptable_nat ip_conntrack ip_tables autofs4 snd_seq snd_seq_device evdev usbhid rfcomm l2cap bluetooth dm_mod cryptoloop snd_pcm_oss snd_mixer_oss snd_intel8x0 snd_ac97_codec snd_pcm snd_timer snd soundcore snd_page_alloc tun uhci_hcd usb_storage usbcore irtty_sir sir_dev ircomm_tty ircomm irda yenta_socket rsrc_nonstatic pcmcia_core 3c59x
Jun 6 10:12:20 lomin kernel: CPU: 0
Jun 6 10:12:20 lomin kernel: EIP: 0061:[do_wp_page+622/1175] Not tainted VLI
Jun 6 10:12:20 lomin kernel: EIP: 0061:[<c014cea4>] Not tainted VLI
Jun 6 10:12:20 lomin kernel: EFLAGS: 00010206 (2.6.11.11-xen0)
Jun 6 10:12:20 lomin kernel: EIP is at handle_mm_fault+0x5d/0x222
Jun 6 10:12:20 lomin kernel: eax: 15555b18 ebx: d8788000 ecx: 00000b18 edx: 15555b18
Jun 6 10:12:20 lomin kernel: esi: dcfc3b4c edi: dcaf5580 ebp: d8789ee4 esp: d8789ebc
Jun 6 10:12:20 lomin kernel: ds: 0069 es: 0069 ss: 0069
Jun 6 10:12:20 lomin kernel: Process python (pid: 4670, threadinfo=d8788000 task=de1a1520)
Jun 6 10:12:20 lomin kernel: Stack: 00000040 00000001 d40e687c d40e6874 00000006 d40e685c d8789f14 dcaf5580
Jun 6 10:12:20 lomin kernel: dcaf55ac d40e6b1c d8789fbc c01154ce dcaf5580 d40e6b1c b4ec6ff0 00000001
Jun 6 10:12:20 lomin kernel: 00000001 de1a1520 b4ec6ff0 00000006 d8789fc4 d8789fc4 c03405b0 00000006
Jun 6 10:12:20 lomin kernel: Call Trace:
Jun 6 10:12:20 lomin kernel: [dump_stack+16/32] show_stack+0x80/0x96
Jun 6 10:12:20 lomin kernel: [<c0109c51>] show_stack+0x80/0x96
Jun 6 10:12:20 lomin kernel: [show_registers+384/457] show_registers+0x15a/0x1d1
Jun 6 10:12:20 lomin kernel: [<c0109de1>] show_registers+0x15a/0x1d1
Jun 6 10:12:20 lomin kernel: [die+301/458] die+0x106/0x1c4
Jun 6 10:12:20 lomin kernel: [<c010a001>] die+0x106/0x1c4
Jun 6 10:12:20 lomin kernel: [do_page_fault+675/1665] do_page_fault+0x489/0x738
Jun 6 10:12:20 lomin kernel: [<c0115740>] do_page_fault+0x489/0x738
Jun 6 10:12:20 lomin kernel: [fixup_4gb_segment+2/12] page_fault+0x2e/0x34
Jun 6 10:12:20 lomin kernel: [<c0109a7e>] page_fault+0x2e/0x34
Jun 6 10:12:20 lomin kernel: [do_page_fault+49/1665] do_page_fault+0x217/0x738
Jun 6 10:12:20 lomin kernel: [<c01154ce>] do_page_fault+0x217/0x738
Jun 6 10:12:20 lomin kernel: [fixup_4gb_segment+2/12] page_fault+0x2e/0x34
Jun 6 10:12:20 lomin kernel: [<c0109a7e>] page_fault+0x2e/0x34
Jun 6 10:12:20 lomin kernel: Code: 8b 47 1c c1 ea 16 83 43 14 01 8d 34 90 85 f6 0f 84 52 01 00 00 89 f2 8b 4d 10 89 f8 e8 4a d1 ff ff 85 c0 89 c2 0f 84 3c 01 00 00 <8b> 00 a8 81 75 3d 85 c0 0f 84 01 01 00 00 a8 40 0f 84 a4 00 00

> It's not at all clear where the bogus address comes from: the driver
> basically just reads the address out of an skbuff, and converts it from
> virtual to physical address. But something is obviously going wrong,
> perhaps under memory pressure. :-(

Where, within the domUs or dom0? The latter has plenty of memory at
hand, while the domUs are quite short of memory. I'll try to find out...

Regards,
-- 
Birger Tödtmann
Technik der Rechnernetze,
Institut für Experimentelle Mathematik
Universität Duisburg-Essen, Campus Essen
email:btoedtmann@iem.uni-due.de  skype:birger.toedtmann
pgp:0x6FB166C9
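
For readers decoding the "Code:" line of the oops above, a small sketch (Python; the function name is hypothetical). It splits the byte dump at the `<..>` marker, which flags the byte at the faulting EIP. On 32-bit x86 the bytes `8b 00` encode `mov (%eax),%eax`, so, assuming the register dump belongs to the same fault, the kernel was dereferencing eax = 15555b18, which does not look like a kernel virtual address in this trace (all kernel addresses shown start at c0...).

```python
# "Code:" bytes copied verbatim from the oops quoted above.
CODE_LINE = (
    "8b 47 1c c1 ea 16 83 43 14 01 8d 34 90 85 f6 0f 84 52 01 00 00 "
    "89 f2 8b 4d 10 89 f8 e8 4a d1 ff ff 85 c0 89 c2 0f 84 3c 01 00 00 "
    "<8b> 00 a8 81 75 3d 85 c0 0f 84 01 01 00 00 a8 40 0f 84 a4 00 00"
)


def split_at_fault(code_line):
    """Split an oops byte dump into (bytes before EIP, bytes from EIP on)."""
    tokens = code_line.split()
    # The token wrapped in <..> marks the first byte of the faulting instruction.
    fault_index = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    to_byte = lambda tok: int(tok.strip("<>"), 16)
    before = bytes(to_byte(tok) for tok in tokens[:fault_index])
    after = bytes(to_byte(tok) for tok in tokens[fault_index:])
    return before, after


before, after = split_at_fault(CODE_LINE)
print(len(before), "bytes precede the faulting instruction")
print("faulting instruction starts with:", after[:2].hex())
```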
* Re: [Xen-devel] kernel oops/IRQ exception when networking between many domUs
From: Birger Tödtmann @ 2005-06-06  8:56 UTC
To: Keir Fraser
Cc: xen-devel, xen-users

On Monday, 06.06.2005, at 10:52 +0200, Birger Tödtmann wrote:
[...]
> I commented out the free_mfn() and status= lines: the kernel now reports
> the following after it configured the 10th domU and ~80th vif, with
> approx. 20-25 bridges up. Just an idea: the number of vifs + bridges is

Correction: I meant 40-45 bridge devices are then up and running.

> somewhere around the magic 128 (NR_IRQS problem in 2.0.x!) when the
> crash happens - could this hint at something?

-- 
Birger Tödtmann
Technik der Rechnernetze,
Institut für Experimentelle Mathematik
Universität Duisburg-Essen, Campus Essen
email:btoedtmann@iem.uni-due.de  skype:birger.toedtmann
pgp:0x6FB166C9
* Re: kernel oops/IRQ exception when networking between many domUs
From: Keir Fraser @ 2005-06-06  9:26 UTC
To: Birger Tödtmann
Cc: xen-devel, xen-users

On 6 Jun 2005, at 09:52, Birger Tödtmann wrote:

> I commented out the free_mfn() and status= lines: the kernel now
> reports the following after it configured the 10th domU and ~80th vif,
> with approx. 20-25 bridges up. Just an idea: the number of vifs +
> bridges is somewhere around the magic 128 (NR_IRQS problem in 2.0.x!)
> when the crash happens - could this hint at something?

The crashes you see with free_mfn removed will be impossible to debug
-- things are very screwed by that point. Even the crash within
free_mfn might be far removed from the cause of the crash, if it's due
to memory corruption.

It's perhaps worth investigating what critical limit you might be
hitting, and what resource it is that's limited. e.g., can you
create a few vifs, but connected together by some very large number of
bridges (daisy chained together)? Or can you create a large number of
vifs if they are connected together by just one bridge?

This kind of thing will give us an idea of where the bug might be
lurking.

  -- Keir
* Re: kernel oops/IRQ exception when networking between many domUs
From: Birger Tödtmann @ 2005-06-06 12:30 UTC
To: Keir Fraser
Cc: xen-devel, xen-users

On Monday, 06.06.2005, at 10:26 +0100, Keir Fraser wrote:
[...]
> > somewhere around the magic 128 (NR_IRQS problem in 2.0.x!) when the
> > crash happens - could this hint at something?
>
> The crashes you see with free_mfn removed will be impossible to debug
> -- things are very screwed by that point. Even the crash within
> free_mfn might be far removed from the cause of the crash, if it's due
> to memory corruption.
>
> It's perhaps worth investigating what critical limit you might be
> hitting, and what resource it is that's limited. e.g., can you
> create a few vifs, but connected together by some very large number of
> bridges (daisy chained together)? Or can you create a large number of
> vifs if they are connected together by just one bridge?

This is getting really weird - as I found out, I encounter problems
with far fewer vifs/bridges than suspected. I just fired up a network
with 7 nodes, all with four interfaces each connected to the same four
bridge interfaces. The nodes can ping through the network, however
after a short time, the system (dom0) crashes as well. This time, it
dies in net_rx_action() at a slightly different place:

[...]
 [<c02b6e15>] kfree_skbmem+0x12/0x29
 [<c02b6ed1>] __kfree_skb+0xa5/0x13f
 [<c028c9b3>] net_rx_action+0x23d/0x4df
[...]

Funnily, I cannot reproduce this with 5 nodes (domUs) running. I'm a
bit unsure where to go from here... Maybe I should try a different
machine for further testing.

Regards
-- 
Birger Tödtmann
Technik der Rechnernetze,
Institut für Experimentelle Mathematik
Universität Duisburg-Essen, Campus Essen
email:btoedtmann@iem.uni-due.de  skype:birger.toedtmann
pgp:0x6FB166C9
* Re: kernel oops/IRQ exception when networking between many domUs
From: Nils Toedtmann @ 2005-06-07 16:46 UTC
To: Birger Tödtmann
Cc: xen-devel

On Monday, 06.06.2005, at 14:30 +0200, Birger Tödtmann wrote:
> On Monday, 06.06.2005, at 10:26 +0100, Keir Fraser wrote:
> [...]
> > > somewhere around the magic 128 (NR_IRQS problem in 2.0.x!) when the
> > > crash happens - could this hint at something?
> >
> > The crashes you see with free_mfn removed will be impossible to debug
> > -- things are very screwed by that point. Even the crash within
> > free_mfn might be far removed from the cause of the crash, if it's due
> > to memory corruption.
> >
> > It's perhaps worth investigating what critical limit you might be
> > hitting, and what resource it is that's limited. e.g., can you
> > create a few vifs, but connected together by some very large number of
> > bridges (daisy chained together)? Or can you create a large number of
> > vifs if they are connected together by just one bridge?
>
> This is getting really weird - as I found out, I encounter problems
> with far fewer vifs/bridges than suspected. I just fired up a network
> with 7 nodes, all with four interfaces each connected to the same four
> bridge interfaces. The nodes can ping through the network, however
> after a short time, the system (dom0) crashes as well. This time, it
> dies in net_rx_action() at a slightly different place:
>
> [...]
>  [<c02b6e15>] kfree_skbmem+0x12/0x29
>  [<c02b6ed1>] __kfree_skb+0xa5/0x13f
>  [<c028c9b3>] net_rx_action+0x23d/0x4df
> [...]
>
> Funnily, I cannot reproduce this with 5 nodes (domUs) running. I'm a
> bit unsure where to go from here... Maybe I should try a different
> machine for further testing.

I can confirm this bug on an AMD Athlon using xen-unstable from June 5th
(latest ChangeSet 1.1677). All testing domains run OSPF daemons which
start talking to each other via multicast as soon as the network
connections are established.

* 'xm create' 20 domains with 122 vifs (+ vif0.0), but that Xen
  version does not UP the vifs. Everything is fine.

* Create 51 transfer bridges, connect some vifs to them (no more than
  two vifs to each), UP all vifs. Now I have

    lo + eth0 + veth0 + 123 vif* + 51 br* = 177 devices,

  all UP. All transfer networks work, OSPF tables grow, everything is
  fine.

* Create a 52nd bridge. Connect 20 vifs to it but DOWN THEM BEFORE.
  Everything is fine.

* Now UP all the vifs connected to the 52nd bridge one after the other.
  More and more multicast traffic shows up. After UPing the 9th vif,
  dom0 BOOOOOMs (net_rx_action, too).

Further experiments show that it seems to be the amount of traffic (and
the number of connected vifs?) which triggers the oops: with all OSPF
daemons stopped, I could UP all bridges & vifs. But when I did a
flood-broadcast ping (ping -f -b $broadcastadr) on the 52nd bridge (the
one with more than two active ports), dom0 OOPSed again.

I could only reproduce that "too-much-traffic-oops" on bridges
connecting more than 10 vifs.

It would be interesting to know if that happens with unicast traffic,
too. I have no time left, will test more tomorrow.

/nils.
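
The device count in the second step above can be re-derived with a few lines (Python; a sketch of the arithmetic only, with the figure 128 taken from the NR_IRQS remark earlier in the thread and used here purely as a reference point, not as a confirmed limit of this Xen version).

```python
# Device counts as stated in the message above: 122 domU vifs plus vif0.0.
devices = {
    "lo": 1,
    "eth0": 1,
    "veth0": 1,
    "vif*": 122 + 1,
    "br*": 51,
}

NR_IRQS_HINT = 128  # assumption: the "magic 128" mentioned earlier in the thread

total = sum(devices.values())
print("devices up before the 52nd bridge:", total)
print("devices after adding the 52nd bridge:", total + 1)
print("already above the 128 hint:", total > NR_IRQS_HINT)
```

Since everything still worked with that many devices up, the tally is at least consistent with the observation that traffic volume, rather than the raw device count, triggers the oops.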
* Re: kernel oops/IRQ exception when networking between many domUs
From: Nils Toedtmann @ 2005-06-07 16:47 UTC
To: Birger Tödtmann
Cc: xen-devel

On Monday, 06.06.2005, at 14:30 +0200, Birger Tödtmann wrote:
> On Monday, 06.06.2005, at 10:26 +0100, Keir Fraser wrote:
> [...]
> > > somewhere around the magic 128 (NR_IRQS problem in 2.0.x!) when the
> > > crash happens - could this hint at something?
> >
> > The crashes you see with free_mfn removed will be impossible to debug
> > -- things are very screwed by that point. Even the crash within
> > free_mfn might be far removed from the cause of the crash, if it's due
> > to memory corruption.
> >
> > It's perhaps worth investigating what critical limit you might be
> > hitting, and what resource it is that's limited. e.g., can you
> > create a few vifs, but connected together by some very large number of
> > bridges (daisy chained together)? Or can you create a large number of
> > vifs if they are connected together by just one bridge?
>
> This is getting really weird - as I found out, I encounter problems
> with far fewer vifs/bridges than suspected. I just fired up a network
> with 7 nodes, all with four interfaces each connected to the same four
> bridge interfaces. The nodes can ping through the network, however
> after a short time, the system (dom0) crashes as well. This time, it
> dies in net_rx_action() at a slightly different place:
>
> [...]
>  [<c02b6e15>] kfree_skbmem+0x12/0x29
>  [<c02b6ed1>] __kfree_skb+0xa5/0x13f
>  [<c028c9b3>] net_rx_action+0x23d/0x4df
> [...]
>
> Funnily, I cannot reproduce this with 5 nodes (domUs) running. I'm a
> bit unsure where to go from here... Maybe I should try a different
> machine for further testing.

I can confirm this bug on an AMD Athlon using xen-unstable from June 5th
(latest ChangeSet 1.1677). All testing domains run OSPF daemons which
start talking to each other via multicast as soon as the network
connections are established.

* 'xm create' 20 domains with 122 vifs (+ vif0.0), but that Xen
  version does not UP the vifs. Everything is fine.

* Create 51 transfer bridges, connect some vifs to them (no more than
  two vifs to each), UP all vifs. Now I have

    lo + eth0 + veth0 + 123 vif* + 51 br* = 177 devices,

  all UP. All transfer networks work, OSPF tables grow, everything is
  fine.

* Create a 52nd bridge. Connect 20 vifs to it but DOWN THEM BEFORE.
  Everything is fine.

* Now UP all the vifs connected to the 52nd bridge one after the other.
  More and more multicast traffic shows up. After UPing the 9th vif,
  dom0 BOOOOOMs (net_rx_action, too).

Further experiments show that it seems to be the amount of traffic (and
the number of connected vifs?) which triggers the oops: with all OSPF
daemons stopped, I could UP all bridges & vifs. But when I did a
flood-broadcast ping (ping -f -b $broadcastadr) on the 52nd bridge (the
one with more than two active ports), dom0 OOPSed again.

I could only reproduce that "too-much-traffic-oops" on bridges
connecting more than 10 vifs.

It would be interesting to know if that happens with unicast traffic,
too. I have no time left, will test more tomorrow.

/nils.

ps: Shall we continue cross-posting to devel+users?
* Re: kernel oops/IRQ exception when networking between many domUs
From: Nils Toedtmann @ 2005-06-08 12:34 UTC
To: Birger Tödtmann
Cc: xen-devel

On Tuesday, 07.06.2005, at 18:47 +0200, Nils Toedtmann wrote:
> On Monday, 06.06.2005, at 14:30 +0200, Birger Tödtmann wrote:
> > On Monday, 06.06.2005, at 10:26 +0100, Keir Fraser wrote:
> > [...]
> > > > somewhere around the magic 128 (NR_IRQS problem in 2.0.x!) when the
> > > > crash happens - could this hint at something?
> > >
> > > The crashes you see with free_mfn removed will be impossible to debug
> > > -- things are very screwed by that point. Even the crash within
> > > free_mfn might be far removed from the cause of the crash, if it's due
> > > to memory corruption.
> > >
> > > It's perhaps worth investigating what critical limit you might be
> > > hitting, and what resource it is that's limited. e.g., can you
> > > create a few vifs, but connected together by some very large number of
> > > bridges (daisy chained together)? Or can you create a large number of
> > > vifs if they are connected together by just one bridge?
> >
> > This is getting really weird - as I found out, I encounter problems
> > with far fewer vifs/bridges than suspected. I just fired up a network
> > with 7 nodes, all with four interfaces each connected to the same four
> > bridge interfaces. The nodes can ping through the network, however
> > after a short time, the system (dom0) crashes as well. This time, it
> > dies in net_rx_action() at a slightly different place:
> >
> > [...]
> >  [<c02b6e15>] kfree_skbmem+0x12/0x29
> >  [<c02b6ed1>] __kfree_skb+0xa5/0x13f
> >  [<c028c9b3>] net_rx_action+0x23d/0x4df
> > [...]
> >
> > Funnily, I cannot reproduce this with 5 nodes (domUs) running. I'm a
> > bit unsure where to go from here... Maybe I should try a different
> > machine for further testing.
>
> I can confirm this bug on an AMD Athlon using xen-unstable from June 5th
> (latest ChangeSet 1.1677).

[...]

errr ... sorry for the dupe.

> Further experiments show that it seems to be the amount of traffic (and
> the number of connected vifs?) which triggers the oops: with all OSPF
> daemons stopped, I could UP all bridges & vifs. But when I did a
> flood-broadcast ping (ping -f -b $broadcastadr) on the 52nd bridge (the
> one with more than two active ports), dom0 OOPSed again.
>
> I could only reproduce that "too-much-traffic-oops" on bridges
> connecting more than 10 vifs.
>
> It would be interesting to know if that happens with unicast traffic,
> too. I have no time left, will test more tomorrow.

Ok, reproduced the dom0 kernel panic in a simpler situation:

* create some domUs, each having 1 interface in the same subnet
* bridge all the interfaces together (dom0 not having an IP on that
  bridge)
* trigger as much unicast traffic as you want (like unicast flood
  pings): No problem.
* Now trigger some broadcast traffic between the domUs:

    ping -i 0,1 -b 192.168.0.255

  BOOOM.

Alternatively, you may down all vifs first, start the flood broadcast
ping in the first domU and bring up one vif after the other (waiting
each time >15sec until the bridge has put the added port into the
forwarding state). After bringing up 10-15 vifs, dom0 panics.

I could _not_ reproduce this with massive unicast traffic. The problem
disappears if I set "net.ipv4.icmp_echo_ignore_broadcasts=1" in all
domains. Maybe the problem arises if too many domUs answer broadcasts at
the same time (collisions?).

/nils.
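
The reproduction recipe and the workaround above can be written down as a dry-run sketch (Python; it only prints the commands and changes nothing). The bridge name br-test and the vif names vif1.0 to vif15.0 are hypothetical, and the ping interval is written as 0.1 on the assumption that the `0,1` in the message is a comma-decimal locale spelling of the same value.

```python
def bridge_setup_commands(bridge, vifs):
    """Commands for dom0: create one bridge, attach every vif, bring all up."""
    cmds = [["brctl", "addbr", bridge]]
    cmds += [["brctl", "addif", bridge, vif] for vif in vifs]
    cmds += [["ip", "link", "set", dev, "up"] for dev in [bridge] + list(vifs)]
    return cmds


def broadcast_ping(broadcast_addr, interval="0.1"):
    """Command for one domU: the broadcast ping that triggered the panic."""
    return ["ping", "-i", interval, "-b", broadcast_addr]


def ignore_broadcast_workaround():
    """Command for every domain: stop answering broadcast echo requests."""
    return ["sysctl", "-w", "net.ipv4.icmp_echo_ignore_broadcasts=1"]


# Hypothetical names: one vif per domU, domain IDs 1..15.
vifs = ["vif%d.0" % domid for domid in range(1, 16)]
plan = bridge_setup_commands("br-test", vifs)
plan.append(broadcast_ping("192.168.0.255"))
plan.append(ignore_broadcast_workaround())

for cmd in plan:
    print(" ".join(cmd))
```

Printing the plan instead of executing it keeps the sketch safe to run on a machine that is not meant to crash.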
* Re: kernel oops/IRQ exception when networking between many domUs
From: Nils Toedtmann @ 2005-06-08 14:40 UTC
To: Birger Tödtmann
Cc: xen-devel

On Wednesday, 08.06.2005, at 14:34 +0200, Nils Toedtmann wrote:
[...]
> Ok, reproduced the dom0 kernel panic in a simpler situation:
>
> * create some domUs, each having 1 interface in the same subnet
> * bridge all the interfaces together (dom0 not having an IP on that
>   bridge)
> * trigger as much unicast traffic as you want (like unicast flood
>   pings): No problem.
> * Now trigger some broadcast traffic between the domUs:
>
>     ping -i 0,1 -b 192.168.0.255
>
>   BOOOM.
>
> Alternatively, you may down all vifs first, start the flood broadcast
> ping in the first domU and bring up one vif after the other (waiting
> each time >15sec until the bridge has put the added port into the
> forwarding state). After bringing up 10-15 vifs, dom0 panics.
>
> I could _not_ reproduce this with massive unicast traffic. The problem
> disappears if I set "net.ipv4.icmp_echo_ignore_broadcasts=1" in all
> domains. Maybe the problem arises if too many domUs answer broadcasts at
> the same time (collisions?).

More testing: again doing a

    [root@domUtest01 ~]# ping -f -b 192.168.0.255

into the bridged vif-subnet. With all domains having
"net.ipv4.icmp_echo_ignore_broadcasts=1" (so no one answers the pings)
everything is fine. When I switch the pinging domUtest01 itself (and
_only_ that domain) to "net.ipv4.icmp_echo_ignore_broadcasts=0", dom0
immediately panics (if there are 15-20 domUs in that bridged subnet).

Another test: putting dom0's vif0.0 on the bridge too, pinging from
dom0. Then (so far) I needed all domains to have
"net.ipv4.icmp_echo_ignore_broadcasts=0" to get my oops.

The oopses happen in different places, and not all contain
"net_rx_action" (all are "Fatal exception in interrupt"; these "dumps"
may contain typos because I copied them from the monitor by hand):

  [...]
  error_code
  kfree_skbmem
  __kfree_skb
  net_rx_action
  tasklet_action
  __do_softirq
  soft_irq
  irq_exit
  do_IRQ
  evtchn_do_upcall
  hypervisor_callback
  __wake_up
  sock_def_readable
  unix_stream_sendmsg
  sys_sendto
  sys_send
  sys_socketcall
  syscall_call

or

  [...]
  error_code
  tasklet_action
  __do_softirq
  soft_irq
  irq_exit
  do_IRQ
  evtchn_do_upcall
  hypervisor_callback

or

  [...]
  error_code
  tasklet_action
  __do_softirq
  soft_irq
  evtchn_do_upcall
  hypervisor_callback
  cpu_idle
  start_kernel

or

  [...]
  error_code
  kfree_skbmem
  __kfree_skb
  net_rx_action
  tasklet_action
  __do_softirq
  soft_irq
  irq_exit
  do_IRQ
  evtchn_do_upcall
  hypervisor_callback
  __mmx_memcpy
  memcpy
  dup_task_struct
  copy_process
  do_fork
  sys_clone
  syscall_call

or

  [...]
  error_code
  kfree_skbmem
  __kfree_skb
  net_rx_action
  tasklet_action
  __do_softirq
  soft_irq
  irq_exit
  do_IRQ
  evtchn_do_upcall
  hypervisor_callback
  __wake_up
  sock_def_readable
  unix_stream_sendmsg
  sys_sendto
  sys_send
  sys_socketcall
  syscall_call

or

  [...]
  error_code
  kfree_skbmem
  __kfree_skb
  net_rx_action
  tasklet_action
  __do_softirq
  do_softirq
  local_bh_enable
  dev_queue_xmit
  nf_hook_slow
  ip_finish_output
  dst_output
  ip_push_pending_frames
  raw_sendmsg
  sock_sendmsg
  sys_sendmsg
  sys_socketcall
  syscall_call

and more ...
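
To see what the hand-copied traces above have in common, a short sketch (Python; the traces are transcribed from the message as written, including any copying typos, and the variable names are illustrative). It computes the frames present in every trace and counts how many traces pass through net_rx_action.

```python
# The six call traces from the message above, innermost frame first.
TRACES = [
    "error_code kfree_skbmem __kfree_skb net_rx_action tasklet_action "
    "__do_softirq soft_irq irq_exit do_IRQ evtchn_do_upcall "
    "hypervisor_callback __wake_up sock_def_readable unix_stream_sendmsg "
    "sys_sendto sys_send sys_socketcall syscall_call",
    "error_code tasklet_action __do_softirq soft_irq irq_exit do_IRQ "
    "evtchn_do_upcall hypervisor_callback",
    "error_code tasklet_action __do_softirq soft_irq evtchn_do_upcall "
    "hypervisor_callback cpu_idle start_kernel",
    "error_code kfree_skbmem __kfree_skb net_rx_action tasklet_action "
    "__do_softirq soft_irq irq_exit do_IRQ evtchn_do_upcall "
    "hypervisor_callback __mmx_memcpy memcpy dup_task_struct copy_process "
    "do_fork sys_clone syscall_call",
    "error_code kfree_skbmem __kfree_skb net_rx_action tasklet_action "
    "__do_softirq soft_irq irq_exit do_IRQ evtchn_do_upcall "
    "hypervisor_callback __wake_up sock_def_readable unix_stream_sendmsg "
    "sys_sendto sys_send sys_socketcall syscall_call",
    "error_code kfree_skbmem __kfree_skb net_rx_action tasklet_action "
    "__do_softirq do_softirq local_bh_enable dev_queue_xmit nf_hook_slow "
    "ip_finish_output dst_output ip_push_pending_frames raw_sendmsg "
    "sock_sendmsg sys_sendmsg sys_socketcall syscall_call",
]

frames = [trace.split() for trace in TRACES]

# Frames that appear in every single trace.
common = set(frames[0]).intersection(*frames[1:])

# Number of traces that go through the netback receive path.
via_net_rx = sum("net_rx_action" in trace for trace in frames)

print("frames common to all traces:", sorted(common))
print("traces through net_rx_action: %d of %d" % (via_net_rx, len(frames)))
```

On these six samples every fault is taken while a tasklet runs in softirq context, and most but not all of them sit under net_rx_action; the interrupted code further down the stack varies from trace to trace.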
Thread overview: 10+ messages
[not found] <1117904746.7507.31.camel@lomin>
[not found] ` <b60a57e1c8d95c01eb0c5b383b9b8e18@cl.cam.ac.uk>
2005-06-06 6:42 ` kernel oops/IRQ exception when networking between many domUs Birger Toedtmann
[not found] ` <20050605165716.GA1231@exp-math.uni-essen.de>
2005-06-06 8:23 ` [Xen-devel] " Keir Fraser
2005-06-06 8:52 ` Birger Tödtmann
2005-06-06 8:56 ` [Xen-devel] " Birger Tödtmann
2005-06-06 9:26 ` Keir Fraser
2005-06-06 12:30 ` Birger Tödtmann
2005-06-07 16:46 ` Nils Toedtmann
2005-06-07 16:47 ` Nils Toedtmann
2005-06-08 12:34 ` Nils Toedtmann
2005-06-08 14:40 ` Nils Toedtmann