* Crashing / unable to start domUs due to high number of LUNs?
From: Nathan March @ 2012-01-31 21:42 UTC
  To: xen-devel@lists.xensource.com

Hi All,

We've got a Xen setup built around a Dell iSCSI array, with each Xen
host seeing two LUNs; we then run multipath on top of that. After adding
a couple of new virtual disks the other day, a couple of our stable
production VMs suddenly hard locked. Attaching to their consoles gave me
nothing; it looked like they had lost their disk devices.

Attempting to restart them on the same dom0 failed with hotplug errors,
as did attempting to start them on a few different dom0s. After doing a
"multipath -F" to remove unused devices and manually bringing in just
the selected LUNs via "multipath diskname", I was able to start them
successfully. This initially made me think I might be hitting some sort
of udev / multipath / iSCSI device limit (136 LUNs, 8 paths per LUN
= 1088 iSCSI connections). Just to be clear, the problem occurred on
multiple dom0s at the same time, so it definitely seems iSCSI related.
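
For reference, the recovery dance was roughly the following (the map
name below is a made-up placeholder, not one of our real LUN aliases):

multipath -F                   # flush all unused multipath maps
multipath mpath_lun01          # placeholder map name; re-add one specific LUN
iscsiadm -m session | wc -l    # count active iSCSI sessions on this host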

Now, a day later, I'm debugging this further and am again unable to
start VMs, even with all the extra multipath devices removed. I rebooted
one of the dom0s and was able to migrate our production VMs off a broken
server (standard live migration, sketched below), so I now have an empty
dom0 that's unable to start any VMs.
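
(The migrations themselves were just ordinary xm live migrates, something
like the following, with placeholder domU / target names:)

xm migrate --live mydomu ukxen2   # mydomu and ukxen2 are placeholders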

Starting a VM results in the following in xend.log:

[2012-01-31 13:06:16 12353] DEBUG (DevController:144) Waiting for 0.
[2012-01-31 13:06:16 12353] DEBUG (DevController:628) hotplugStatusCallback /local/domain/0/backend/vif/35/0/hotplug-status.
[2012-01-31 13:07:56 12353] ERROR (SrvBase:88) Request wait_for_devices failed.
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/xen/web/SrvBase.py", line 85, in perform
    return op_method(op, req)
  File "/usr/lib64/python2.6/site-packages/xen/xend/server/SrvDomain.py", line 85, in op_wait_for_devices
    return self.dom.waitForDevices()
  File "/usr/lib64/python2.6/site-packages/xen/xend/XendDomainInfo.py", line 1237, in waitForDevices
    self.getDeviceController(devclass).waitForDevices()
  File "/usr/lib64/python2.6/site-packages/xen/xend/server/DevController.py", line 140, in waitForDevices
    return map(self.waitForDevice, self.deviceIDs())
  File "/usr/lib64/python2.6/site-packages/xen/xend/server/DevController.py", line 155, in waitForDevice
    (devid, self.deviceClass))
VmError: Device 0 (vif) could not be connected. Hotplug scripts not working.
[2012-01-31 13:07:56 12353] DEBUG (XendDomainInfo:3071) XendDomainInfo.destroy: domid=35
[2012-01-31 13:07:58 12353] DEBUG (XendDomainInfo:2401) Destroying device model
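
Going by the timestamps, xend waits 100 seconds for the vif's
hotplug-status key before giving up. One way to watch that key directly
while starting the domain (domid 35 is from the log above; substitute
the failing domain's id):

xenstore-watch /local/domain/0/backend/vif/35/0/hotplug-status

It never appears, which matches the hotplug scripts never running.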

I tried turning up udev's log level (the udevadm invocation is below,
after the dump) but that didn't reveal anything. Reading the xenstore
entry for the vif doesn't show anything unusual either:

ukxen1 ~ # xenstore-ls /local/domain/0/backend/vif/35
0 = ""
  bridge = "vlan91"
  domain = "nathanxenuk1"
  handle = "0"
  uuid = "2128d0b7-d50f-c2ad-4243-8a42bb598b81"
  script = "/etc/xen/scripts/vif-bridge"
  state = "1"
  frontend = "/local/domain/35/device/vif/0"
  mac = "00:16:3d:03:00:44"
  online = "1"
  frontend-id = "35"
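
For reference, this is how I turned up udev logging (the standard
udevadm call on this udev version; nothing Xen specific):

udevadm control --log-priority=debug   # udev then logs verbosely to syslog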

The bridge device (vlan91) exists, and trying a different bridge makes
no difference. Removing the VIF completely just results in the same
error for the VBD. Adding debugging to the hotplug/network scripts
(sketched below) didn't reveal anything; it looks like they aren't even
being executed. Nothing is logged to xen-hotplug.log.
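
By "adding debugging" I mean roughly the following near the top of
/etc/xen/scripts/vif-bridge (the log path is my own choice, not a
standard one):

exec 2>>/var/log/xen/vif-bridge-debug.log             # capture trace output
set -x                                                # trace every command run
echo "invoked: $0 $* XENBUS_PATH=$XENBUS_PATH" >&2    # XENBUS_PATH is set by xend

The log file stays empty, so the script is never invoked at all.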

The only thing I can think of that this may be related to is that Gentoo
defaulted to a 10 MB /dev tmpfs, which we filled up a few months back.
We upped the size to 50 MB in the mount options and everything has been
completely stable since (~33 days). None of the dom0s' /dev filesystems
is above 25% usage. Aside from adding the new LUNs, no changes have been
made in the past month.
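
For reference, checking and bumping it looks roughly like this (where
the mount actually happens depends on the init scripts, so treat the
remount as a sketch):

df -h /dev                        # confirm the tmpfs is well under 25% used
mount -o remount,size=50m /dev    # enlarge the tmpfs on a running system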

To test whether removing some devices would help, I tried an
"iscsiadm -m node --logout", which promptly hard locked the entire box.
After a reboot, I was unable to reproduce the problem on that particular
dom0.
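
(In hindsight, logging out one target at a time would have been gentler;
per the iscsiadm man page, something like the following, with
placeholders rather than our real target and portal:)

iscsiadm -m node -T <targetname> -p <portal> --logout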

I've still got one dom0 that's exhibiting the problem. Is anyone able to
suggest any further debugging steps?

- Nathan


(XEN) Xen version 4.1.1 (root@) (gcc version 4.3.4 (Gentoo 4.3.4 p1.1, pie-10.1.5) ) Mon Aug 29 16:24:12 PDT 2011

ukxen1 xen # xm info
host                   : ukxen1
release                : 3.0.3
version                : #4 SMP Thu Dec 22 12:44:22 PST 2011
machine                : x86_64
nr_cpus                : 24
nr_nodes               : 2
cores_per_socket       : 6
threads_per_core       : 2
cpu_mhz                : 2261
hw_caps                : bfebfbff:2c100800:00000000:00003f40:029ee3ff:00000000:00000001:00000000
virt_caps              : hvm hvm_directio
total_memory           : 98291
free_memory            : 91908
free_cpus              : 0
xen_major              : 4
xen_minor              : 1
xen_extra              : .1
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
xen_scheduler          : credit
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          : unavailable
xen_commandline        : console=vga dom0_mem=1024M dom0_max_vcpus=1 dom0_vcpus_pin=true
cc_compiler            : gcc version 4.3.4 (Gentoo 4.3.4 p1.1, pie-10.1.5)
cc_compile_by          : root
cc_compile_domain      :
cc_compile_date        : Mon Aug 29 16:24:12 PDT 2011
xend_config_format     : 4
