From: Konrad Rzeszutek Wilk <konrad@darnok.org>
To: Nathan March <nathan@gt.net>
Cc: "xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>
Subject: Re: Crashing / unable to start domUs due to high number of luns?
Date: Tue, 31 Jan 2012 21:30:36 -0400
Message-ID: <20120201013036.GA12637@andromeda.dapyr.net>
In-Reply-To: <4F28603F.8010900@gt.net>

On Tue, Jan 31, 2012 at 01:42:23PM -0800, Nathan March wrote:
> Hi All,
>
> We've got a xen setup based around a dell iscsi device with each xen
> host having 2 lun's, we then run multipath on top of that. After adding
> a couple new virtual disks the other day, a couple of our online stable
> VM's suddenly hard locked up. Attaching to the console gave me nothing,
> looked like they lost their disk devices.
>
> Attempting to restart them on the same dom0 failed with hot plug errors,
> as did attempting to start them on a few different dom0's. After doing a
> "multipath -F" to remove unused devices and manually bringing in just
> the selected LUN's via "multipath diskname", I was able to successfully
> start them. This initially made me think perhaps I was hitting some sort
> of udev / multipath / iscsi device lun limit (136 luns, 8 paths per lun
> = 1088 iscsi connections). Just to be clear, the problem occurred on
> multiple dom0's at the same time so it definitely seems iscsi related.
>
> Now, a day later, I'm debugging this further and I'm again unable to
> start VM's, even with all extra multipath devices removed. I rebooted
> one of the dom0's and was able to successfully migrate our production
> VM's off a broken server, so I've now got an empty dom0 that's unable to
> start any vm's.
>
> Starting a VM results in the following in xend.log:
>
> [2012-01-31 13:06:16 12353] DEBUG (DevController:144) Waiting for 0.
> [2012-01-31 13:06:16 12353] DEBUG (DevController:628)
> hotplugStatusCallback /local/domain/0/backend/vif/35/0/hotplug-status.
> [2012-01-31 13:07:56 12353] ERROR (SrvBase:88) Request wait_for_devices
> failed.
> Traceback (most recent call last):
> File "/usr/lib64/python2.6/site-packages/xen/web/SrvBase.py", line
> 85, in perform
> return op_method(op, req)
> File
> "/usr/lib64/python2.6/site-packages/xen/xend/server/SrvDomain.py", line
> 85, in op_wait_for_devices
> return self.dom.waitForDevices()
> File "/usr/lib64/python2.6/site-packages/xen/xend/XendDomainInfo.py",
> line 1237, in waitForDevices
> self.getDeviceController(devclass).waitForDevices()
> File
> "/usr/lib64/python2.6/site-packages/xen/xend/server/DevController.py",
> line 140, in waitForDevices
> return map(self.waitForDevice, self.deviceIDs())
> File
> "/usr/lib64/python2.6/site-packages/xen/xend/server/DevController.py",
> line 155, in waitForDevice
> (devid, self.deviceClass))
> VmError: Device 0 (vif) could not be connected. Hotplug scripts not working.

Was there anything in the kernel log (dmesg) about vifs? What does your
/proc/interrupts look like? Can you provide the dmesg output that you get
during startup? I am mainly looking for a line like:

NR_IRQS:16640 nr_irqs:1536 16

How many guests are you running when this happens?

One theory is that you are running out of dom0 interrupts, although
I *think* that was made dynamic in 3.0.

Though that does explain the iSCSI network going wonky in the guest -
was there anything in the dmesg when the guest started going bad?
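
To make the interrupt check concrete, here is a minimal Python sketch that pulls the two limits out of the boot-time "NR_IRQS" line and counts the numbered IRQ rows in a copy of /proc/interrupts. The function names are hypothetical, and it assumes the usual /proc/interrupts layout, where each allocated IRQ is a row starting with its number followed by a colon. It only parses text you hand it, so it can be run on saved output from the affected dom0.

```python
import re


def parse_irq_limits(dmesg_text):
    """Return (NR_IRQS, nr_irqs) from the boot-time line, or None if absent."""
    m = re.search(r"NR_IRQS:\s*(\d+)\s+nr_irqs:\s*(\d+)", dmesg_text)
    return (int(m.group(1)), int(m.group(2))) if m else None


def count_irq_rows(interrupts_text):
    """Count rows of /proc/interrupts whose label is a numeric IRQ.

    Rows such as NMI: or LOC: and the CPU header line are skipped.
    """
    count = 0
    for line in interrupts_text.splitlines():
        label = line.split(":", 1)[0].strip()
        if label.isdigit():
            count += 1
    return count
```

If the row count is close to the nr_irqs value, that would be consistent with the theory above; it does not prove it.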
> [2012-01-31 13:07:56 12353] DEBUG (XendDomainInfo:3071)
> XendDomainInfo.destroy: domid=35
> [2012-01-31 13:07:58 12353] DEBUG (XendDomainInfo:2401) Destroying
> device model
>
> I tried turning up udev's log level but that didn't reveal anything.
> Reading the xenstore for the vif doesn't show anything unusual either:
>
> ukxen1 ~ # xenstore-ls /local/domain/0/backend/vif/35
> 0 = ""
> bridge = "vlan91"
> domain = "nathanxenuk1"
> handle = "0"
> uuid = "2128d0b7-d50f-c2ad-4243-8a42bb598b81"
> script = "/etc/xen/scripts/vif-bridge"
> state = "1"
> frontend = "/local/domain/35/device/vif/0"
> mac = "00:16:3d:03:00:44"
> online = "1"
> frontend-id = "35"
>
> The bridge device (vlan91) exists, trying a different bridge doesn't
> matter. Removing the VIF completely results in the same error for the
> VBD. Adding debugging to the hotplug/network scripts didn't reveal
> anything, it looks like they aren't even being executed yet. Nothing is
> logged to xen-hotplug.log.

OK, so that would imply the kernel hasn't been able to do the right
thing. Hmm.

What do you see when this happens with "udevadm monitor --kernel --udev
--property"?
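
One way to read the monitor output is to look for events the kernel emitted that udev never reported back, since those would mean the hotplug rules never ran. The sketch below does that on captured text. It assumes each event header looks like "KERNEL[timestamp] action devpath (subsystem)" with a matching "UDEV  [...]" line once udev has processed it, and that the Xen backend devices appear under the "xen-backend" subsystem. The function name is hypothetical.

```python
import re

# Assumed header layout: "KERNEL[12.345] add /devices/vif-35-0 (xen-backend)"
EVENT_RE = re.compile(r"^(KERNEL|UDEV)\s*\[[\d.]+\]\s+(\S+)\s+(\S+)\s+\((\S+)\)")


def kernel_only_events(monitor_text, subsystem="xen-backend"):
    """Return sorted (action, devpath) pairs seen from the kernel but not from udev."""
    seen = {"KERNEL": set(), "UDEV": set()}
    for line in monitor_text.splitlines():
        m = EVENT_RE.match(line)
        # Property lines (KEY=value) and blank lines do not match and are ignored.
        if m and m.group(4) == subsystem:
            seen[m.group(1)].add((m.group(2), m.group(3)))
    return sorted(seen["KERNEL"] - seen["UDEV"])
```

An empty result with no KERNEL lines at all for the vif would point back at the kernel side rather than at udev.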
>
> The only thing I can think of that this may be related to, is gentoo
> defaulted to a 10mb /dev which we filled up a few months back. We upped
> the size to 50mb in the mount options and everything's been completely
> stable since (~33 days). None of the /dev on the dom0's is higher than
> 25% usage. Aside from adding the new luns, no changes have been made in
> the past month.
>
> To try and test if removing some devices would solve anything, I tried
> doing an "iscsiadm -m node --logout" and it promptly hard locked the
> entire box. After a reboot, I was unable to reproduce the problem on
> that particular dom0.
>
> I've still got 1 dom0 that's exhibiting the problem, if anyone is able
> to suggest any further debugging steps?
>
> - Nathan
>
>
> (XEN) Xen version 4.1.1 (root@) (gcc version 4.3.4 (Gentoo 4.3.4 p1.1,
> pie-10.1.5) ) Mon Aug 29 16:24:12 PDT 2011
>
> ukxen1 xen # xm info
> host : ukxen1
> release : 3.0.3
> version : #4 SMP Thu Dec 22 12:44:22 PST 2011
> machine : x86_64
> nr_cpus : 24
> nr_nodes : 2
> cores_per_socket : 6
> threads_per_core : 2
> cpu_mhz : 2261
> hw_caps :
> bfebfbff:2c100800:00000000:00003f40:029ee3ff:00000000:00000001:00000000
> virt_caps : hvm hvm_directio
> total_memory : 98291
> free_memory : 91908
> free_cpus : 0
> xen_major : 4
> xen_minor : 1
> xen_extra : .1
> xen_caps : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32
> hvm-3.0-x86_32p hvm-3.0-x86_64
> xen_scheduler : credit
> xen_pagesize : 4096
> platform_params : virt_start=0xffff800000000000
> xen_changeset : unavailable
> xen_commandline : console=vga dom0_mem=1024M dom0_max_vcpus=1
> dom0_vcpus_pin=true
> cc_compiler : gcc version 4.3.4 (Gentoo 4.3.4 p1.1, pie-10.1.5)
> cc_compile_by : root
> cc_compile_domain :
> cc_compile_date : Mon Aug 29 16:24:12 PDT 2011
> xend_config_format : 4
>
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel