From mboxrd@z Thu Jan 1 00:00:00 1970
From: Nathan March
Subject: Re: Crashing / unable to start domUs due to high number of luns?
Date: Thu, 16 Feb 2012 23:15:34 -0800
Message-ID: <4F3DFE96.2030803@gt.net>
References: <4F28603F.8010900@gt.net> <20120201013036.GA12637@andromeda.dapyr.net>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <20120201013036.GA12637@andromeda.dapyr.net>
List-Unsubscribe: ,
List-Post:
List-Help:
List-Subscribe: ,
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Konrad Rzeszutek Wilk
Cc: "xen-devel@lists.xensource.com"
List-Id: xen-devel@lists.xenproject.org

On 1/31/2012 5:30 PM, Konrad Rzeszutek Wilk wrote:
> On Tue, Jan 31, 2012 at 01:42:23PM -0800, Nathan March wrote:
>> Hi All,
>>
>> We've got a xen setup based around a dell iscsi device with each xen
>> host having 2 lun's, we then run multipath on top of that. After adding
>> a couple new virtual disks the other day, a couple of our online stable
>> VM's suddenly hard locked up. Attaching to the console gave me nothing,
>> looked like they lost their disk devices.
>>
>> Attempting to restart them on the same dom0 failed with hot plug errors,
>> as did attempting to start them on a few different dom0's. After doing a
>> "multipath -F" to remove unused devices and manually bringing in just
>> the selected LUN's via "multipath diskname", I was able to successfully
>> start them. This initially made me think perhaps I was hitting some sort
>> of udev / multipath / iscsi device lun limit (136 luns, 8 paths per lun
>> = 1088 iscsi connections). Just to be clear, the problem occurred on
>> multiple dom0's at the same time so it definitely seems iscsi related.
>>
>> Now, a day later, I'm debugging this further and I'm again unable to
>> start VM's, even with all extra multipath devices removed.
>> I rebooted one of the dom0's and was able to successfully migrate our
>> production VM's off a broken server, so I've now got an empty dom0
>> that's unable to start any vm's.
>>
>> Starting a VM results in the following in xend.log:
>>
>> [2012-01-31 13:06:16 12353] DEBUG (DevController:144) Waiting for 0.
>> [2012-01-31 13:06:16 12353] DEBUG (DevController:628) hotplugStatusCallback /local/domain/0/backend/vif/35/0/hotplug-status.
>> [2012-01-31 13:07:56 12353] ERROR (SrvBase:88) Request wait_for_devices failed.
>> Traceback (most recent call last):
>>   File "/usr/lib64/python2.6/site-packages/xen/web/SrvBase.py", line 85, in perform
>>     return op_method(op, req)
>>   File "/usr/lib64/python2.6/site-packages/xen/xend/server/SrvDomain.py", line 85, in op_wait_for_devices
>>     return self.dom.waitForDevices()
>>   File "/usr/lib64/python2.6/site-packages/xen/xend/XendDomainInfo.py", line 1237, in waitForDevices
>>     self.getDeviceController(devclass).waitForDevices()
>>   File "/usr/lib64/python2.6/site-packages/xen/xend/server/DevController.py", line 140, in waitForDevices
>>     return map(self.waitForDevice, self.deviceIDs())
>>   File "/usr/lib64/python2.6/site-packages/xen/xend/server/DevController.py", line 155, in waitForDevice
>>     (devid, self.deviceClass))
>> VmError: Device 0 (vif) could not be connected. Hotplug scripts not working.
>
> Was there anything in the kernel (dmesg) about vifs? What does your
> /proc/interrupts look like? Can you provide the dmesg that you get
> during startup? I am mainly looking for:
>
> NR_IRQS:16640 nr_irqs:1536 16
>
> How many guests are you running when this happens?
>
> One theory is that you are running out of dom0 interrupts. Though
> I *think* that was made dynamic in 3.0..
>
>
> Though that does explain your iSCSI network wonky in the guest -
> was there anything in the dmesg when the guest started going bad?
>
>> [2012-01-31 13:07:56 12353] DEBUG (XendDomainInfo:3071) XendDomainInfo.destroy: domid=35
>> [2012-01-31 13:07:58 12353] DEBUG (XendDomainInfo:2401) Destroying device model
>>
>> I tried turning up udev's log level but that didn't reveal anything.
>> Reading the xenstore for the vif doesn't show anything unusual either:
>>
>> ukxen1 ~ # xenstore-ls /local/domain/0/backend/vif/35
>> 0 = ""
>>  bridge = "vlan91"
>>  domain = "nathanxenuk1"
>>  handle = "0"
>>  uuid = "2128d0b7-d50f-c2ad-4243-8a42bb598b81"
>>  script = "/etc/xen/scripts/vif-bridge"
>>  state = "1"
>>  frontend = "/local/domain/35/device/vif/0"
>>  mac = "00:16:3d:03:00:44"
>>  online = "1"
>>  frontend-id = "35"
>>
>> The bridge device (vlan91) exists, trying a different bridge doesn't
>> matter. Removing the VIF completely results in the same error for the
>> VBD. Adding debugging to the hotplug/network scripts didn't reveal
>> anything, it looks like they aren't even being executed yet. Nothing is
>> logged to xen-hotplug.log.
> OK, so that would imply the kernel hasn't been able to do the right
> thing. Hmm.
>
> What do you see when this happens with udev --monitor --kernel --udev
> --property ?

I have this happening again on a server, and running udev monitor
(udevadm monitor --kernel --udev --property) prints absolutely *nothing*.
I've confirmed on a working xen host that it does actually print a ton of
debugging output when I restart a VM. This machine, however, prints
nothing when trying to spawn one, and still returns the same hotplug
failure error.

Any suggestions on what I can do to debug? Still nothing in dmesg.

>
>> The only thing I can think of that this may be related to, is gentoo
>> defaulted to a 10mb /dev which we filled up a few months back. We upped
>> the size to 50mb in the mount options and everything's been completely
>> stable since (~33 days). None of the /dev on the dom0's is higher than
>> 25% usage. Aside from adding the new luns, no changes have been made in
>> the past month.
>>
>> To try and test if removing some devices would solve anything, I tried
>> doing an "iscsiadm -m node --logout" and it promptly hard locked the
>> entire box. After a reboot, I was unable to reproduce the problem on
>> that particular dom0.
>>
>> I've still got 1 dom0 that's exhibiting the problem, if anyone is able
>> to suggest any further debugging steps?
>>
>> - Nathan
>>
>>
>> (XEN) Xen version 4.1.1 (root@) (gcc version 4.3.4 (Gentoo 4.3.4 p1.1, pie-10.1.5) ) Mon Aug 29 16:24:12 PDT 2011
>>
>> ukxen1 xen # xm info
>> host                   : ukxen1
>> release                : 3.0.3
>> version                : #4 SMP Thu Dec 22 12:44:22 PST 2011
>> machine                : x86_64
>> nr_cpus                : 24
>> nr_nodes               : 2
>> cores_per_socket       : 6
>> threads_per_core       : 2
>> cpu_mhz                : 2261
>> hw_caps                : bfebfbff:2c100800:00000000:00003f40:029ee3ff:00000000:00000001:00000000
>> virt_caps              : hvm hvm_directio
>> total_memory           : 98291
>> free_memory            : 91908
>> free_cpus              : 0
>> xen_major              : 4
>> xen_minor              : 1
>> xen_extra              : .1
>> xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
>> xen_scheduler          : credit
>> xen_pagesize           : 4096
>> platform_params        : virt_start=0xffff800000000000
>> xen_changeset          : unavailable
>> xen_commandline        : console=vga dom0_mem=1024M dom0_max_vcpus=1 dom0_vcpus_pin=true
>> cc_compiler            : gcc version 4.3.4 (Gentoo 4.3.4 p1.1, pie-10.1.5)
>> cc_compile_by          : root
>> cc_compile_domain      :
>> cc_compile_date        : Mon Aug 29 16:24:12 PDT 2011
>> xend_config_format     : 4
>>
>>
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xensource.com
>> http://lists.xensource.com/xen-devel