From mboxrd@z Thu Jan 1 00:00:00 1970
From: Matt Ayres
Subject: Re: Bug: xm commands hanging due to poor threading in xend
Date: Mon, 23 Jan 2006 14:46:00 -0500
Message-ID: <43D53278.5090407@tektonic.net>
References: <43D28927.8060803@tektonic.net> <20060123035932.GA28166@localhost.localdomain>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
In-Reply-To: <20060123035932.GA28166@localhost.localdomain>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Ewan Mellor
Cc: xen-devel@lists.xensource.com
List-Id: xen-devel@lists.xenproject.org

Ewan Mellor wrote:
>>
>> Bug url: http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=465
>
> In the /var/log/xend-debug.log for both your bugs #465 and #486 you can see
> the message "error: can't start new thread".  That's going to be fatal --
> there's no way that Xend can proceed if it cannot create new threads.
>
> This points to a resource leak on the machine -- either you are leaking
> threads or processes locally to Xend or globally to your machine, which would
> show up on ps ax, or you are out of memory, which would show up in free or top
> (press m to sort by memory usage).  Possibly, this could be a manifestation of
> a file descriptor leak, which would show up in lsof.
>
> Could you try and track down the leak?  This would give us a much better clue
> as where to look.
>

I went ahead and did find some problems.  On a server that had been up
for 10 days, some processes (mysql/httpd) in dom0 were stressed.  Swap
was 50% in use.  I have put in memory-minimizing config files for both
of these apps.  The file descriptor count is still high on the server
with the higher uptime, even after restarting almost all services.  I
can also try increasing dom0 memory to 512MB or so.
I did 128MB for dom0 with 2.0 and increased this to 256MB with 3.0
because all my hosts can now access their full 8GB.

10 day uptime host:

# lsof -n | wc -l
2775
# free
             total       used       free     shared    buffers     cached
Mem:        262544     218040      44504          0      21300      55592
-/+ buffers/cache:     141148     121396
Swap:       522104      35944     486160

2 day uptime host:

# lsof -n | wc -l
1420
# free
             total       used       free     shared    buffers     cached
Mem:        262544     252076      10468          0      28432      85264
-/+ buffers/cache:     138380     124164
Swap:       522104       3928     518176

The file limit is 14343, so fd's shouldn't be a problem.  I do not have
any OOM errors in my logs, though.

>> I've also run into this once:
>>
>> Message from syslogd@vm20 at Fri Jan 20 23:16:52 2006 ...
>> vm20 xenstored: xenstored corruption: connection id -1: err No such file
>> or directory: No child '(null)' found
>
> If you get this, all bets are off.  There is no way that the system as it
> stands will recover gracefully if the store is corrupted.  At best, you'll
> just lose configuration data regarding the running VMs -- at worst, the
> corruption could persist indefinitely, and you'll be unable to do anything
> through Xend.
>
> Do you have xen-unstable changeset 8269:ac3ceb2d37d1 aka xen-3.0-testing
> changeset 8250:1e3d31952015?  This fixes the only xenstore corruption bug that
> I know of, and if you've got that fix, then it's definitely a new bug.  In
> that case, we would appreciate it if you could either find a test case that
> takes less than a few days to trigger this bug, or get your hands dirty
> yourself and put some tracing and assertions into Xenstored around the TDB
> manipulations to try and catch the corruption.
>

I am running -unstable from the 16th.  If that change exists in there,
then yes, I have the fix.

> Maybe the corrupted TDB file itself might be useful to someone.  Could you
> save that, too?

Yes, normally in a case like this I get a few tdb.xxxxxx files, where
the x's stand for a 6-character hex string.
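
To make the two free outputs above easier to compare, here is a small
sketch of the arithmetic behind the "-/+ buffers/cache" line, plus swap
and open-file usage as percentages.  The figures are copied from the
outputs above.  The dictionary keys and the summarize() helper are my
own names, and treating the lsof line count as a stand-in for open
files against the 14343 limit is only a rough assumption, since lsof
also lists things that are not descriptors.

```python
# Values in kB, copied from the two "free" outputs quoted above.
HOSTS = {
    "10 day uptime": {
        "used": 218040, "free": 44504, "buffers": 21300, "cached": 55592,
        "swap_total": 522104, "swap_used": 35944, "lsof_lines": 2775,
    },
    "2 day uptime": {
        "used": 252076, "free": 10468, "buffers": 28432, "cached": 85264,
        "swap_total": 522104, "swap_used": 3928, "lsof_lines": 1420,
    },
}

FILE_LIMIT = 14343  # the file limit quoted in the mail


def summarize(host):
    """Recompute the '-/+ buffers/cache' figures and usage percentages."""
    reclaimable = host["buffers"] + host["cached"]
    return {
        # Memory held by applications, excluding buffers and page cache.
        "app_used": host["used"] - reclaimable,
        # Memory available once buffers and page cache are reclaimed.
        "app_free": host["free"] + reclaimable,
        "swap_pct": round(100.0 * host["swap_used"] / host["swap_total"], 1),
        # Rough proxy only: lsof lines are not exactly open descriptors.
        "fd_pct": round(100.0 * host["lsof_lines"] / FILE_LIMIT, 1),
    }


if __name__ == "__main__":
    for name, host in HOSTS.items():
        print(name, summarize(host))
```

Both hosts come out at roughly 138-141MB used by applications once
buffers and cache are subtracted, so by these snapshots alone the
longer uptime mostly shows up in the lsof count and in swap.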
>
> As far as I'm aware, you are the only person who's ever seen this message, so
> tracking it down without your help is going to be impossible.  Is there
> anything strange about your setup?  Any network block devices or NFS involved,
> any quotas on your filesystems or SELinux?  Any patches that you've applied,
> non-standard kernel options, anything like that?
>

My setup is fairly standard: -unstable, PAE, LVM, routed networking.  I
am just tracking Xen using Mercurial.
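
For narrowing down whether threads or file descriptors are leaking
inside one process (Xend, say) rather than across the whole machine, a
per-process count may be more telling than the global lsof total.
Below is a minimal sketch, assuming a Linux /proc filesystem, where
/proc/<pid>/task has one entry per thread and /proc/<pid>/fd has one
entry per open descriptor.  The function names and the proc_root
parameter are hypothetical, the pid has to be looked up separately
(e.g. with ps ax), and reading another user's process needs sufficient
privileges.

```python
import os


def count_entries(path):
    """Return the number of entries in a directory, or None if unreadable."""
    try:
        return len(os.listdir(path))
    except OSError:
        return None


def process_usage(pid, proc_root="/proc"):
    """Count threads and open file descriptors for one process.

    proc_root is a parameter only so the function can be pointed at a
    test directory; on a real system the default is the one to use.
    """
    base = os.path.join(proc_root, str(pid))
    return {
        "threads": count_entries(os.path.join(base, "task")),
        "fds": count_entries(os.path.join(base, "fd")),
    }


if __name__ == "__main__":
    # Samples this script's own process; substitute the pid of interest.
    print(process_usage(os.getpid()))
```

Sampling this periodically for the same pid and watching whether either
number only ever grows would be one way to tell a leak apart from
ordinary load.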