From mboxrd@z Thu Jan 1 00:00:00 1970
From: Konrad Rzeszutek Wilk
Subject: Re: Network dies and kernel errors
Date: Fri, 29 Jul 2011 11:03:48 -0400
Message-ID: <20110729150347.GF5458@dumpdata.com>
References: <201107251418.21569.johnm@advocap.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline
In-Reply-To: <201107251418.21569.johnm@advocap.org>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: John McMonagle, tinnycloud@hotmail.com
Cc: xen-devel@lists.xensource.com
List-Id: xen-devel@lists.xenproject.org

On Mon, Jul 25, 2011 at 02:18:21PM -0500, John McMonagle wrote:
> Have a new amd 6100 based server.
> http://www.supermicro.com/Aplus/system/2U/2022/AS-2022G-URF.cfm
> Running debian squeeze with debian 2.6.32 xen kernel
> Running xen 4.1.1 built from source from xen.org
>
> I'm seeing 2 errors.
> during boot get this:
>
> [ 0.004823] ------------[ cut here ]------------
> [ 0.004833] WARNING: at /build/buildd-linux-2.6_2.6.32-35-amd64-aZSlKL/linux-2.6-2.6.32/debian/build/source_amd64_xen/arch/x86/xen/enlighten.c:726 init_hw_perf_events+0x32d/0x3cd()
> [ 0.004838] Hardware name: H8DGU
> [ 0.004841] Modules linked in:
> [ 0.004847] Pid: 0, comm: swapper Not tainted 2.6.32-5-xen-amd64 #1
> [ 0.004850] Call Trace:
> [ 0.004857] [] ? init_hw_perf_events+0x32d/0x3cd
> [ 0.004862] [] ? init_hw_perf_events+0x32d/0x3cd
> [ 0.004870] [] ? warn_slowpath_common+0x77/0xa3
> [ 0.004875] [] ? init_hw_perf_events+0x32d/0x3cd
> [ 0.004881] [] ? identify_cpu+0x2f7/0x300
> [ 0.004888] [] ? xen_restore_fl_direct_end+0x0/0x1
> [ 0.004895] [] ? kmem_cache_alloc+0x8c/0xf0
> [ 0.004900] [] ? identify_boot_cpu+0x15/0x3e
> [ 0.004904] [] ? check_bugs+0x9/0x2e
> [ 0.004910] [] ? start_kernel+0x3cd/0x3e8
> [ 0.004915] [] ? xen_start_kernel+0x586/0x58a

You can ignore that one. It just means that you can't do profiling, which we
haven't up-ported yet.
..
> Then next one may not be xen but I only had the problem after running a domu.
> After a while I get kernel error and networking stops.

Another user with a bnx2 driver seems to be seeing a similar problem. Let me
CC them here.

> This is the error:
> [ 1411.813376] ------------[ cut here ]------------
> [ 1411.813398] WARNING: at /build/buildd-linux-2.6_2.6.32-35-amd64-aZSlKL/linux-2.6-2.6.32/debian/build/source_amd64_xen/net/sched/sch_generic.c:261 dev_watchdog+0xe2/0x194()

OK, this one is more worrisome.

> [ 1411.813410] Hardware name: H8DGU
> [ 1411.813417] NETDEV WATCHDOG: peth0 (igb): transmit queue 1 timed out
> [ 1411.813424] Modules linked in: xt_physdev iptable_filter tun ip_tables x_tables bridge stp sg sr_mod cdrom xfs exportfs ipmi_si ipmi_devintf ipmi_watchdog ipmi_msghandler xen_evtchn blktap xenfs loop snd_pcm snd_timer snd soundcore snd_page_alloc pcspkr psmouse joydev evdev serio_raw i2c_piix4 edac_core k10temp edac_mce_amd i2c_core processor button acpi_processor ext4 mbcache jbd2 crc16 usbhid hid dm_mod raid1 md_mod sd_mod crc_t10dif ata_generic usb_storage pata_atiixp ahci ohci_hcd libata ehci_hcd usbcore nls_base scsi_mod igb dca thermal thermal_sys [last unloaded: scsi_wait_scan]
> [ 1411.813656] Pid: 4, comm: ksoftirqd/0 Tainted: G W 2.6.32-5-xen-amd64 #1
> [ 1411.813664] Call Trace:
> [ 1411.813671] [] ? dev_watchdog+0xe2/0x194
> [ 1411.813697] [] ? dev_watchdog+0xe2/0x194
> [ 1411.813711] [] ? warn_slowpath_common+0x77/0xa3
> [ 1411.813724] [] ? dev_watchdog+0x0/0x194
> [ 1411.813736] [] ? warn_slowpath_fmt+0x51/0x59
> [ 1411.813751] [] ? _spin_unlock_irqrestore+0xd/0xe
> [ 1411.813762] [] ? try_to_wake_up+0x289/0x29b
> [ 1411.813778] [] ? netif_tx_lock+0x3d/0x69
> [ 1411.813791] [] ? netdev_drivername+0x3b/0x40
> [ 1411.813803] [] ? dev_watchdog+0xe2/0x194
> [ 1411.813816] [] ? check_events+0x12/0x20
> [ 1411.813827] [] ? check_preempt_wakeup+0x0/0x268
> [ 1411.813841] [] ? run_timer_softirq+0x1c9/0x268
> [ 1411.813855] [] ? __do_softirq+0xdd/0x1a6
> [ 1411.813867] [] ? call_softirq+0x1c/0x30
> [ 1411.813873] [] ? do_softirq+0x3f/0x7c
> [ 1411.813893] [] ? ksoftirqd+0x5f/0xd3
> [ 1411.813905] [] ? ksoftirqd+0x0/0xd3
> [ 1411.813915] [] ? kthread+0x79/0x81
> [ 1411.813926] [] ? child_rip+0xa/0x20
> [ 1411.813937] [] ? int_ret_from_sys_call+0x7/0x1b
> [ 1411.813948] [] ? retint_restore_args+0x5/0x6
> [ 1411.813958] [] ? child_rip+0x0/0x20
> [ 1411.813966] ---[ end trace a7919e7f17c0a727 ]---
> [ 1412.052253] eth0: port 1(peth0) entering disabled state
> [ 1635.796207] frontend_changed: backend/vbd/3/768: prepare for reconnect
> [ 1647.137513] eth0: port 3(vif3.0) entering disabled state
> [ 1647.157527] eth0: port 3(vif3.0) entering disabled state
> Kernel logging (proc) stopped.
>
> In this case dom0 locked up. Some times just networking stops and some times
> networking recovers.
>
> Looks like it uses msi-x interrupts.
>
> Concerning igb error I have tried the following one at a time:
> New igb driver from Intel site.
> kernel parameter pcie_aspm=off
> ethtool -K eth0 tx off on dom0
> ethtool -K eth0 gro off on dom0
>

OK.

> It has never died doing iperf from dom0 or domu <> external.
> Never died during network backup.
>
> Usually takes a least a few hours and has never made it a day running a domu.
> Wish I could get it to die faster :-)
> Any ideas?
> I'm pretty much down to trying different network cards

Did you try that? Did that make any difference?

>
> Any ideas?

There is a Xen parameter called 'noirqbalance'. Try that. Also see if you can
limit the CPUs in the dom0 using these two arguments on the Xen hypervisor
command line:

dom0_vcpus=2 dom0_vcpus_pin=1

It would be interesting to narrow down _when_ you trigger this failure.
B/c we can query Xen to see what the MSIs are with 'xl debug-keys M' _before_
and _after_ your failure to see if something is amiss. Mainly to figure out
whether the vectors are moving around the CPUs (or not):

(XEN) MSI 29 vec=21 lowest edge assert log lowest dest=00000001 mask=0/0/-1

and also 'xl debug-keys i' to see if the domain has ACK-ed the interrupt:

(XEN) IRQ: 29 affinity:00000000,00000000,00000000,00000001 vec:21 type=PCI-MSI status=00000010 in-flight=0 domain-list=0:275(----),

(the last '----' might have something else in it - if so, that is a sign
that dom0 hasn't picked up the event/vector).
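
If comparing the two dumps by eye gets tedious, the before/after check described above could be scripted roughly as follows. This is only a sketch: the regular expressions are modeled on the two sample lines quoted in this mail and may need adjusting for other Xen versions, the function names (parse_msi, parse_irq, moved_vectors, pending_irqs) are made up for illustration, and it assumes you have already saved the console output (for example from 'xl dmesg') as text taken before and after the failure.

```python
import re

# Patterns modeled only on the two sample lines quoted in this mail;
# other Xen versions may print these lines differently (assumption).
MSI_RE = re.compile(
    r"\(XEN\)\s+MSI\s+(\d+)\s+vec=([0-9a-fA-F]+).*?dest=([0-9a-fA-F]+)"
)
IRQ_RE = re.compile(
    r"\(XEN\)\s+IRQ:\s*(\d+)\s+affinity:(\S+)\s+vec:([0-9a-fA-F]+).*?"
    r"in-flight=(\d+)\s+domain-list=(.*)"
)
FLAGS_RE = re.compile(r"\(([^)]*)\)")


def parse_msi(text):
    """Map MSI number -> (vector, destination) for each MSI line in text."""
    return {int(m.group(1)): (m.group(2), m.group(3))
            for m in MSI_RE.finditer(text)}


def parse_irq(text):
    """Map IRQ number -> affinity, vector, in-flight count and flag groups."""
    irqs = {}
    for m in IRQ_RE.finditer(text):
        irqs[int(m.group(1))] = {
            "affinity": m.group(2),
            "vec": m.group(3),
            "in_flight": int(m.group(4)),
            # One entry per "(....)" group in the domain-list field.
            "flags": FLAGS_RE.findall(m.group(5)),
        }
    return irqs


def moved_vectors(before, after):
    """Return MSIs whose (vector, destination) differs between two dumps."""
    return {n: (before[n], after[n])
            for n in before.keys() & after.keys()
            if before[n] != after[n]}


def pending_irqs(irqs):
    """Return IRQ numbers with any domain-list flag group other than '----'."""
    return sorted(n for n, info in irqs.items()
                  if any(f != "----" for f in info["flags"]))
```

Feeding the 'M' dumps through parse_msi() and then moved_vectors() lists the MSIs whose vector or destination changed across the failure, and running the 'i' dump taken after the failure through parse_irq() and pending_irqs() lists the IRQs whose trailing flags are no longer '----'.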