From: Konrad Rzeszutek Wilk <konrad@darnok.org>
To: Nathan March <nathan@gt.net>
Cc: "xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>
Subject: Re: Crashing / unable to start domUs due to high number of luns?
Date: Tue, 31 Jan 2012 21:30:36 -0400 [thread overview]
Message-ID: <20120201013036.GA12637@andromeda.dapyr.net>
In-Reply-To: <4F28603F.8010900@gt.net>
On Tue, Jan 31, 2012 at 01:42:23PM -0800, Nathan March wrote:
> Hi All,
>
> We've got a Xen setup based around a Dell iSCSI array, with each Xen
> host having 2 LUNs; we then run multipath on top of that. After adding
> a couple of new virtual disks the other day, a couple of our online,
> previously stable VMs suddenly hard-locked. Attaching to the console
> gave me nothing; it looked like they lost their disk devices.
>
> Attempting to restart them on the same dom0 failed with hotplug
> errors, as did attempting to start them on a few different dom0s.
> After doing a "multipath -F" to remove unused devices and manually
> bringing in just the needed LUNs via "multipath <diskname>", I was
> able to successfully start them. This initially made me think I was
> hitting some sort of udev / multipath / iSCSI device limit (136 LUNs,
> 8 paths per LUN = 1088 iSCSI connections). To be clear, the problem
> occurred on multiple dom0s at the same time, so it definitely seems
> iSCSI-related.
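>
> A quick way to tally those numbers (output formats differ a bit
> between open-iscsi and multipath-tools versions, so treat this as a
> rough sketch):
>
>   # active iSCSI sessions
>   iscsiadm -m session | wc -l
>
>   # SCSI disks seen by the kernel (one per LUN per path)
>   ls /sys/block | grep -c '^sd'
>
>   # multipath maps (one per LUN)
>   multipath -ll | grep -c 'dm-'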
>
> Now, a day later, I'm debugging this further and am again unable to
> start VMs, even with all extra multipath devices removed. I rebooted
> one of the dom0s and was able to successfully migrate our production
> VMs off the broken server, so I've now got an empty dom0 that's
> unable to start any VMs.
>
> Starting a VM results in the following in xend.log:
>
> [2012-01-31 13:06:16 12353] DEBUG (DevController:144) Waiting for 0.
> [2012-01-31 13:06:16 12353] DEBUG (DevController:628) hotplugStatusCallback /local/domain/0/backend/vif/35/0/hotplug-status.
> [2012-01-31 13:07:56 12353] ERROR (SrvBase:88) Request wait_for_devices failed.
> Traceback (most recent call last):
>   File "/usr/lib64/python2.6/site-packages/xen/web/SrvBase.py", line 85, in perform
>     return op_method(op, req)
>   File "/usr/lib64/python2.6/site-packages/xen/xend/server/SrvDomain.py", line 85, in op_wait_for_devices
>     return self.dom.waitForDevices()
>   File "/usr/lib64/python2.6/site-packages/xen/xend/XendDomainInfo.py", line 1237, in waitForDevices
>     self.getDeviceController(devclass).waitForDevices()
>   File "/usr/lib64/python2.6/site-packages/xen/xend/server/DevController.py", line 140, in waitForDevices
>     return map(self.waitForDevice, self.deviceIDs())
>   File "/usr/lib64/python2.6/site-packages/xen/xend/server/DevController.py", line 155, in waitForDevice
>     (devid, self.deviceClass))
> VmError: Device 0 (vif) could not be connected. Hotplug scripts not working.
Was there anything in the kernel log (dmesg) about vifs? What does your
/proc/interrupts look like? Can you provide the dmesg that you get
during startup? I am mainly looking for a line like:

  NR_IRQS:16640 nr_irqs:1536 16

How many guests are you running when this happens?

One theory is that you are running out of dom0 interrupts, though
I *think* that was made dynamic in 3.0.. It would explain your iSCSI
network going wonky in the guest, though - was there anything in dmesg
when the guest started going bad?
> [2012-01-31 13:07:56 12353] DEBUG (XendDomainInfo:3071) XendDomainInfo.destroy: domid=35
> [2012-01-31 13:07:58 12353] DEBUG (XendDomainInfo:2401) Destroying device model
>
> I tried turning up udev's log level but that didn't reveal anything.
> Reading the xenstore entries for the vif doesn't show anything
> unusual either:
>
> ukxen1 ~ # xenstore-ls /local/domain/0/backend/vif/35
> 0 = ""
> bridge = "vlan91"
> domain = "nathanxenuk1"
> handle = "0"
> uuid = "2128d0b7-d50f-c2ad-4243-8a42bb598b81"
> script = "/etc/xen/scripts/vif-bridge"
> state = "1"
> frontend = "/local/domain/35/device/vif/0"
> mac = "00:16:3d:03:00:44"
> online = "1"
> frontend-id = "35"
>
> The bridge device (vlan91) exists, and trying a different bridge
> makes no difference. Removing the VIF completely results in the same
> error for the VBD. Adding debugging to the hotplug/network scripts
> didn't reveal anything; it looks like they aren't even being
> executed. Nothing is logged to xen-hotplug.log.
OK, so that would imply the kernel hasn't been able to do the right
thing. Hmm.

What do you see when this happens with:

  udevadm monitor --kernel --udev --property
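
I.e. leave the monitor running in one terminal and trigger the failing
start from another, roughly:

  # terminal 1: watch kernel uevents and udev's processing of them
  udevadm monitor --kernel --udev --property

  # terminal 2: reproduce the failure (guest config path is yours)
  xm create /etc/xen/<guest>.cfg

If the vif add event never appears on the KERNEL side, the problem is
below udev; if it appears there but never on the UDEV side, udev is
stuck or the rules never fire.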
>
> The only thing I can think of that this may be related to is that
> Gentoo defaulted to a 10MB /dev, which we filled up a few months
> back. We upped the size to 50MB in the mount options and everything
> had been completely stable since (~33 days). /dev usage on none of
> the dom0s is higher than 25%. Aside from adding the new LUNs, no
> changes have been made in the past month.
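>
> For reference, the resize was just the stock tmpfs knob (size= is
> the option we changed):
>
>   # current size and usage of /dev
>   df -h /dev
>
>   # grow it on the fly; we persisted it via the mount options
>   mount -o remount,size=50M /dev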
>
> To test whether removing some devices would help, I tried an
> "iscsiadm -m node --logout" and it promptly hard-locked the entire
> box. After a reboot, I was unable to reproduce the problem on that
> particular dom0.
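>
> In hindsight, "-m node --logout" with no target drops every session
> at once, pulling all paths out from under dm-multipath; a gentler
> test would have been one target/portal at a time, e.g. (IQN and
> portal here are placeholders):
>
>   iscsiadm -m node -T <target-iqn> -p <portal-ip>:3260 --logout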
>
> I've still got one dom0 that's exhibiting the problem, so I'd welcome
> any suggestions for further debugging steps.
>
> - Nathan
>
>
> (XEN) Xen version 4.1.1 (root@) (gcc version 4.3.4 (Gentoo 4.3.4 p1.1, pie-10.1.5) ) Mon Aug 29 16:24:12 PDT 2011
>
> ukxen1 xen # xm info
> host : ukxen1
> release : 3.0.3
> version : #4 SMP Thu Dec 22 12:44:22 PST 2011
> machine : x86_64
> nr_cpus : 24
> nr_nodes : 2
> cores_per_socket : 6
> threads_per_core : 2
> cpu_mhz : 2261
> hw_caps :
> bfebfbff:2c100800:00000000:00003f40:029ee3ff:00000000:00000001:00000000
> virt_caps : hvm hvm_directio
> total_memory : 98291
> free_memory : 91908
> free_cpus : 0
> xen_major : 4
> xen_minor : 1
> xen_extra : .1
> xen_caps : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32
> hvm-3.0-x86_32p hvm-3.0-x86_64
> xen_scheduler : credit
> xen_pagesize : 4096
> platform_params : virt_start=0xffff800000000000
> xen_changeset : unavailable
> xen_commandline : console=vga dom0_mem=1024M dom0_max_vcpus=1
> dom0_vcpus_pin=true
> cc_compiler : gcc version 4.3.4 (Gentoo 4.3.4 p1.1, pie-10.1.5)
> cc_compile_by : root
> cc_compile_domain :
> cc_compile_date : Mon Aug 29 16:24:12 PDT 2011
> xend_config_format : 4
>
>
>
Thread overview: 7+ messages
2012-01-31 21:42 Crashing / unable to start domUs due to high number of luns? Nathan March
2012-02-01 1:30 ` Konrad Rzeszutek Wilk [this message]
2012-02-01 19:48 ` Nathan March
2012-02-17 7:15 ` Nathan March
2012-02-17 9:15 ` Fajar A. Nugraha
2012-02-17 9:21 ` Nathan March
2012-04-16 14:18 ` Konrad Rzeszutek Wilk