From mboxrd@z Thu Jan  1 00:00:00 1970
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Subject: Re: Network dies and  kernel errors
Date: Fri, 29 Jul 2011 11:03:48 -0400
Message-ID: <20110729150347.GF5458@dumpdata.com>
References: <201107251418.21569.johnm@advocap.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable
Return-path: <xen-devel-bounces@lists.xensource.com>
Content-Disposition: inline
In-Reply-To: <201107251418.21569.johnm@advocap.org>
List-Unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xensource.com>
List-Help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-Subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: John McMonagle <johnm@advocap.org>, tinnycloud@hotmail.com
Cc: xen-devel@lists.xensource.com
List-Id: xen-devel@lists.xenproject.org

On Mon, Jul 25, 2011 at 02:18:21PM -0500, John McMonagle wrote:
> Have a new amd 6100 based server.
> http://www.supermicro.com/Aplus/system/2U/2022/AS-2022G-URF.cfm
> Running debian squeeze with debian 2.6.32 xen kernel
> Running xen 4.1.1 built from source from xen.org
>
> I'm seeing 2 errors.
> during boot get this:
>
> [    0.004823] ------------[ cut here ]------------
> [    0.004833] WARNING: at /build/buildd-linux-2.6_2.6.32-35-amd64-aZSlKL/linux-2.6-2.6.32/debian/build/source_amd64_xen/arch/x86/xen/enlighten.c:726 init_hw_perf_events+0x32d/0x3cd()
> [    0.004838] Hardware name: H8DGU
> [    0.004841] Modules linked in:
> [    0.004847] Pid: 0, comm: swapper Not tainted 2.6.32-5-xen-amd64 #1
> [    0.004850] Call Trace:
> [    0.004857]  [<ffffffff81510efc>] ? init_hw_perf_events+0x32d/0x3cd
> [    0.004862]  [<ffffffff81510efc>] ? init_hw_perf_events+0x32d/0x3cd
> [    0.004870]  [<ffffffff8104ef00>] ? warn_slowpath_common+0x77/0xa3
> [    0.004875]  [<ffffffff81510efc>] ? init_hw_perf_events+0x32d/0x3cd
> [    0.004881]  [<ffffffff813044dc>] ? identify_cpu+0x2f7/0x300
> [    0.004888]  [<ffffffff8100eccf>] ? xen_restore_fl_direct_end+0x0/0x1
> [    0.004895]  [<ffffffff810e81d5>] ? kmem_cache_alloc+0x8c/0xf0
> [    0.004900]  [<ffffffff81510a16>] ? identify_boot_cpu+0x15/0x3e
> [    0.004904]  [<ffffffff81510baa>] ? check_bugs+0x9/0x2e
> [    0.004910]  [<ffffffff81509cce>] ? start_kernel+0x3cd/0x3e8
> [    0.004915]  [<ffffffff8150bc93>] ? xen_start_kernel+0x586/0x58a

You can ignore that one. It just means that you can't do profiling, which
we haven't up-ported yet.

..
> Then next one may not be xen but I only had the problem after running a domu.
> After a while I get kernel error and networking stops.

Another user with a bnx2 driver seems to be seeing a similar problem.
Let me CC them here.

> This is the error:
> [ 1411.813376] ------------[ cut here ]------------
> [ 1411.813398] WARNING: at /build/buildd-linux-2.6_2.6.32-35-amd64-aZSlKL/linux-2.6-2.6.32/debian/build/source_amd64_xen/net/sched/sch_generic.c:261 dev_watchdog+0xe2/0x194()

OK, this one is more worrisome.

> [ 1411.813410] Hardware name: H8DGU
> [ 1411.813417] NETDEV WATCHDOG: peth0 (igb): transmit queue 1 timed out
> [ 1411.813424] Modules linked in: xt_physdev iptable_filter tun ip_tables
> x_tables bridge stp sg sr_mod cdrom xfs exportfs ipmi_si ipmi_devintf
> ipmi_watchdog ipmi_msghandler xen_evtchn blktap xenfs loop snd_pcm
> snd_timer snd soundcore snd_page_alloc pcspkr psmouse joydev evdev serio_raw
> i2c_piix4 edac_core k10temp edac_mce_amd i2c_core processor button
> acpi_processor ext4 mbcache jbd2 crc16 usbhid hid dm_mod raid1 md_mod sd_mod
> crc_t10dif ata_generic usb_storage pata_atiixp ahci ohci_hcd libata ehci_hcd
> usbcore nls_base scsi_mod igb dca thermal thermal_sys
> [last unloaded: scsi_wait_scan]
> [ 1411.813656] Pid: 4, comm: ksoftirqd/0 Tainted: G        W  2.6.32-5-xen-amd64 #1
> [ 1411.813664] Call Trace:
> [ 1411.813671]  <IRQ>  [<ffffffff81272e42>] ? dev_watchdog+0xe2/0x194
> [ 1411.813697]  [<ffffffff81272e42>] ? dev_watchdog+0xe2/0x194
> [ 1411.813711]  [<ffffffff8104ef00>] ? warn_slowpath_common+0x77/0xa3
> [ 1411.813724]  [<ffffffff81272d60>] ? dev_watchdog+0x0/0x194
> [ 1411.813736]  [<ffffffff8104ef88>] ? warn_slowpath_fmt+0x51/0x59
> [ 1411.813751]  [<ffffffff8130d42a>] ? _spin_unlock_irqrestore+0xd/0xe
> [ 1411.813762]  [<ffffffff8104b41e>] ? try_to_wake_up+0x289/0x29b
> [ 1411.813778]  [<ffffffff81272d34>] ? netif_tx_lock+0x3d/0x69
> [ 1411.813791]  [<ffffffff8125d7da>] ? netdev_drivername+0x3b/0x40
> [ 1411.813803]  [<ffffffff81272e42>] ? dev_watchdog+0xe2/0x194
> [ 1411.813816]  [<ffffffff8100ece2>] ? check_events+0x12/0x20
> [ 1411.813827]  [<ffffffff81040e42>] ? check_preempt_wakeup+0x0/0x268
> [ 1411.813841]  [<ffffffff8105b5ef>] ? run_timer_softirq+0x1c9/0x268
> [ 1411.813855]  [<ffffffff81054c9b>] ? __do_softirq+0xdd/0x1a6
> [ 1411.813867]  [<ffffffff81012cac>] ? call_softirq+0x1c/0x30
> [ 1411.813873]  <EOI>  [<ffffffff8101422b>] ? do_softirq+0x3f/0x7c
> [ 1411.813893]  [<ffffffff810548c2>] ? ksoftirqd+0x5f/0xd3
> [ 1411.813905]  [<ffffffff81054863>] ? ksoftirqd+0x0/0xd3
> [ 1411.813915]  [<ffffffff81065c39>] ? kthread+0x79/0x81
> [ 1411.813926]  [<ffffffff81012baa>] ? child_rip+0xa/0x20
> [ 1411.813937]  [<ffffffff81011d61>] ? int_ret_from_sys_call+0x7/0x1b
> [ 1411.813948]  [<ffffffff8101251d>] ? retint_restore_args+0x5/0x6
> [ 1411.813958]  [<ffffffff81012ba0>] ? child_rip+0x0/0x20
> [ 1411.813966] ---[ end trace a7919e7f17c0a727 ]---
> [ 1412.052253] eth0: port 1(peth0) entering disabled state
> [ 1635.796207] frontend_changed: backend/vbd/3/768: prepare for reconnect
> [ 1647.137513] eth0: port 3(vif3.0) entering disabled state
> [ 1647.157527] eth0: port 3(vif3.0) entering disabled state
>  Kernel logging (proc) stopped.
>
> In this case dom0 locked up. Some times just networking stops and some times
> networking recovers.
>
> Looks like it uses msi-x interrupts.
>
> Concerning igb error I have tried the following one at a time:
> New igb driver from Intel site.
> kernel parameter pcie_aspm=off
> ethtool -K eth0 tx off on dom0
> ethtool -K eth0 gro off on dom0
>
OK.
> It has never died doing iperf from dom0 or domu <> external.
> Never died during network backup.
>
> Usually takes at least a few hours and has never made it a day running a domu.
> Wish I could get it to die faster :-)
> Any ideas?
> I'm pretty much down to trying different network cards

Did you try that? Did that make any difference?
>
> Any ideas?

There is a Xen parameter called 'noirqbalance'. Try that. Also see if you can
limit the CPUs in dom0 using these two arguments on the Xen hypervisor
command line:

dom0_vcpus=2 dom0_vcpus_pin=1


It would be interesting to narrow down _when_ you trigger this failure, b/c we
can poll Xen to see what the MSIs are ('xl debug-keys M') _before_ and _after_
your failure to see if something is amiss.

Mainly to figure out if the vectors are moving around the CPUs (or not):

(XEN)  MSI    29 vec=21 lowest  edge   assert  log lowest dest=00000001 mask=0/0/-1

and also 'xl debug-keys i' to see if the domain has ACK-ed the interrupt:

(XEN)    IRQ:  29 affinity:00000000,00000000,00000000,00000001 vec:21 type=PCI-MSI         status=00000010 in-flight=0 domain-list=0:275(----),

(the last '----' might have something else in it - if so, that is a sign that
dom0 hasn't picked up the event/vector).
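
If it helps with the before/after comparison, here is a minimal Python sketch
that diffs two captures of the debug-key output. It assumes you save the
hypervisor log (e.g. from 'xl dmesg') to text once before and once after the
failure. The function names are made up for illustration, and the regular
expressions are derived only from the two sample lines above, so they may need
adjusting if your Xen version prints the fields differently.

```python
import re

# Patterns modelled on the two sample lines in this mail (an assumption;
# other Xen versions may format these lines differently).
MSI_RE = re.compile(r"\(XEN\)\s+MSI\s+(\d+)\s+vec=([0-9a-fA-F]+).*?dest=([0-9a-fA-F]+)")
IRQ_RE = re.compile(r"\(XEN\)\s+IRQ:\s*(\d+)\s+affinity:\S+\s+vec:[0-9a-fA-F]+"
                    r".*?domain-list=(.*)")
GUEST_RE = re.compile(r"(\d+):\s*(\d+)\(([^)]*)\)")  # domid:pirq(flags)


def parse_msi(text):
    """Map MSI number -> (vec, dest) for every '(XEN) MSI' line in text."""
    result = {}
    for line in text.splitlines():
        m = MSI_RE.search(line)
        if m:
            result[int(m.group(1))] = (m.group(2), m.group(3))
    return result


def moved_msis(before, after):
    """Return {msi: (old, new)} for MSIs whose vec or dest changed."""
    b, a = parse_msi(before), parse_msi(after)
    return {n: (b[n], a[n]) for n in b.keys() & a.keys() if b[n] != a[n]}


def unacked_irqs(text):
    """Return (irq, domid, pirq, flags) where flags are not all dashes."""
    hits = []
    for line in text.splitlines():
        m = IRQ_RE.search(line)
        if not m:
            continue
        for g in GUEST_RE.finditer(m.group(2)):
            flags = g.group(3)
            if flags.strip("-"):  # anything other than '-' is left over
                hits.append((int(m.group(1)), int(g.group(1)),
                             int(g.group(2)), flags))
    return hits
```

Feed the two saved captures to moved_msis(before, after) to see whether any
vector or destination changed, and run unacked_irqs(after) to list interrupts
whose domain-list flags are something other than '----'.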