* Bug: xm commands hanging due to poor threading in xend
@ 2006-01-21 19:19 Matt Ayres
  2006-01-21 19:20 ` Matt Ayres
  2006-01-23  3:59 ` Ewan Mellor
  0 siblings, 2 replies; 10+ messages in thread
From: Matt Ayres @ 2006-01-21 19:19 UTC (permalink / raw)
  To: xen-devel

The biggest obstacle I have hit in putting xend into full production is
that when many xm commands are issued it hangs, and only starts working
again (sometimes) after a "service xend restart".  I filed a bug for this
a long time ago and have attached 3 different sets of logs collected with
xen-bugtool.  This happens to most servers after running for 3-4 days.
Those with little activity on the xend daemon (older servers that were
upgraded) can go 2 weeks+ at this point.  Once Xen gets into this state,
even after restarting xend so that the list command (and others) work,
running "xm shutdown -a" is guaranteed to produce an internal server
error from xend.

I've also run into this once:

Message from syslogd@vm20 at Fri Jan 20 23:16:52 2006 ...
vm20 xenstored: xenstored corruption: connection id -1: err No such file 
or directory: No child '(null)' found
Error: Error connecting to xend: Connection refused.  Is xend running?

This is all using -unstable.  There are not many commits to 3.0-testing 
specifically regarding xend/tdb/xenstore so tracking it at this point 
seems useless.

Bug url: http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=465


* Re: Bug: xm commands hanging due to poor threading in xend
  2006-01-21 19:19 Bug: xm commands hanging due to poor threading in xend Matt Ayres
@ 2006-01-21 19:20 ` Matt Ayres
  2006-01-23  3:59 ` Ewan Mellor
  1 sibling, 0 replies; 10+ messages in thread
From: Matt Ayres @ 2006-01-21 19:20 UTC (permalink / raw)
  To: Matt Ayres; +Cc: xen-devel




> 
> Bug url: http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=465
> 

I forgot I also opened this one:

http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=486


* RE: Bug: xm commands hanging due to poor threading in xend
@ 2006-01-22  0:05 James Harper
  0 siblings, 0 replies; 10+ messages in thread
From: James Harper @ 2006-01-22  0:05 UTC (permalink / raw)
  To: xen-devel

FWIW, I had a xend crash in -testing where I issued two 'xm create'
commands for two different domains in quick succession, e.g. at the same
command prompt, typing the second one as soon as the first had finished.

I was in a bit of a hurry to get the domains running again so I wasn't
in a position to take any debug information.

James

> -----Original Message-----
> From: xen-devel-bounces@lists.xensource.com
> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Matt Ayres
> Sent: Sunday, 22 January 2006 06:19
> To: xen-devel@lists.xensource.com
> Subject: [Xen-devel] Bug: xm commands hanging due to poor threading in xend
> 
> I have noticed my most major issue with putting xend into full
> production is with many xm commands being issued it hangs and only
> starts working (sometimes) after a "service xend restart".  I created a
> bug a long time for this and have attached 3 different sets of logs
> using xen-bugtool.  This happens to most servers after running for 3-4
> days.  Those that have little activity on the xend daemon (older servers
> that were upgraded) can go 2 weeks+ at this point.  Once Xen gets to
> this state even restarting xend so the list command (and others) work,
> running "xm shutdown -a"  will guarantee an internal server error from
> xend.
> 
> I've also run into this once:
> 
> Message from syslogd@vm20 at Fri Jan 20 23:16:52 2006 ...
> vm20 xenstored: xenstored corruption: connection id -1: err No such file
> or directory: No child '(null)' found
> Error: Error connecting to xend: Connection refused.  Is xend running?
> 
> This is all using -unstable.  There are not many commits to 3.0-testing
> specifically regarding xend/tdb/xenstore so tracking it at this point
> seems useless.
> 
> Bug url: http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=465
> 


* Re: Bug: xm commands hanging due to poor threading in xend
  2006-01-21 19:19 Bug: xm commands hanging due to poor threading in xend Matt Ayres
  2006-01-21 19:20 ` Matt Ayres
@ 2006-01-23  3:59 ` Ewan Mellor
  2006-01-23 19:46   ` Matt Ayres
  1 sibling, 1 reply; 10+ messages in thread
From: Ewan Mellor @ 2006-01-23  3:59 UTC (permalink / raw)
  To: Matt Ayres; +Cc: xen-devel

On Sat, Jan 21, 2006 at 02:19:03PM -0500, Matt Ayres wrote:

> I have noticed my most major issue with putting xend into full 
> production is with many xm commands being issued it hangs and only 
> starts working (sometimes) after a "service xend restart".  I created a 
> bug a long time for this and have attached 3 different sets of logs 
> using xen-bugtool.  This happens to most servers after running for 3-4 
> days.  Those that have little activity on the xend daemon (older servers 
> that were upgraded) can go 2 weeks+ at this point.  Once Xen gets to 
> this state even restarting xend so the list command (and others) work, 
> running "xm shutdown -a"  will guarantee an internal server error from 
> xend.
> 
> Error: Error connecting to xend: Connection refused.  Is xend running?
>
> Bug url: http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=465

In the /var/log/xend-debug.log for both your bugs #465 and #486 you can see
the message "error: can't start new thread".  That's going to be fatal --
there's no way that Xend can proceed if it cannot create new threads.

This points to a resource leak on the machine -- either you are leaking
threads or processes locally to Xend or globally to your machine, which would
show up on ps ax, or you are out of memory, which would show up in free or top
(press m to sort by memory usage).  Possibly, this could be a manifestation of
a file descriptor leak, which would show up in lsof.

Could you try to track down the leak?  This would give us a much better clue
as to where to look.
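
As an illustration of the three checks described above (threads, memory,
file descriptors), the following is a minimal Python sketch for a single
process.  It assumes a Linux /proc layout in which /proc/<pid>/status
carries "Threads:" and "VmRSS:" fields; parse_status and snapshot are
hypothetical helper names, not part of Xend.

```python
import os


def parse_status(text):
    """Extract Name, Threads and VmRSS from the text of /proc/<pid>/status."""
    wanted = {"Name", "Threads", "VmRSS"}
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in wanted:
            fields[key] = value.strip()
    return fields


def snapshot(pid):
    """Return thread count, resident memory and open-fd count for one PID."""
    with open(f"/proc/{pid}/status") as f:
        info = parse_status(f.read())
    # Listing another user's fd directory needs root; report None if denied.
    try:
        info["fds"] = len(os.listdir(f"/proc/{pid}/fd"))
    except OSError:
        info["fds"] = None
    return info
```

Calling snapshot() with the PID of the xend process at intervals and
comparing the results should show which of the three numbers, if any,
keeps climbing.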

> I've also run into this once:
> 
> Message from syslogd@vm20 at Fri Jan 20 23:16:52 2006 ...
> vm20 xenstored: xenstored corruption: connection id -1: err No such file 
> or directory: No child '(null)' found

If you get this, all bets are off.  There is no way that the system as it
stands will recover gracefully if the store is corrupted.  At best, you'll
just lose configuration data regarding the running VMs -- at worst, the
corruption could persist indefinitely, and you'll be unable to do anything
through Xend.

Do you have xen-unstable changeset 8269:ac3ceb2d37d1 aka xen-3.0-testing
changeset 8250:1e3d31952015?  This fixes the only xenstore corruption bug that
I know of, and if you've got that fix, then it's definitely a new bug.  In
that case, we would appreciate it if you could either find a test case that
takes less than a few days to trigger this bug, or get your hands dirty
yourself and put some tracing and assertions into Xenstored around the TDB
manipulations to try and catch the corruption.

Maybe the corrupted TDB file itself might be useful to someone.  Could you
save that, too?

As far as I'm aware, you are the only person who's ever seen this message, so
tracking it down without your help is going to be impossible.  Is there
anything strange about your setup?  Any network block devices or NFS involved,
any quotas on your filesystems or SELinux?  Any patches that you've applied,
non-standard kernel options, anything like that?

Ewan.


* Re: Bug: xm commands hanging due to poor threading in xend
  2006-01-23  3:59 ` Ewan Mellor
@ 2006-01-23 19:46   ` Matt Ayres
  2006-01-23 19:54     ` Matt Ayres
  2006-01-23 20:09     ` Ewan Mellor
  0 siblings, 2 replies; 10+ messages in thread
From: Matt Ayres @ 2006-01-23 19:46 UTC (permalink / raw)
  To: Ewan Mellor; +Cc: xen-devel



Ewan Mellor wrote:

>>
>> Bug url: http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=465
> 
> In the /var/log/xend-debug.log for both your bugs #465 and #486 you can see
> the message "error: can't start new thread".  That's going to be fatal --
> there's no way that Xend can proceed if it cannot create new threads.
> 
> This points to a resource leak on the machine -- either you are leaking
> threads or processes locally to Xend or globally to your machine, which would
> show up on ps ax, or you are out of memory, which would show up in free or top
> (press m to sort by memory usage).  Possibly, this could be a manifestation of
> a file descriptor leak, which would show up in lsof.
> 
> Could you try and track down the leak?  This would give us a much better clue
> as where to look.
> 

I went ahead and did find some problems.  On a server with 10 days of 
uptime, some processes (mysql/httpd) in dom0 were stressed.  Swap was 50% 
in use.  I have put in memory-minimizing config files for both of these 
apps.  The file descriptor count is still high even after restarting 
almost all services on the server with the higher uptime.  I can also try 
increasing dom0 memory to 512MB or so.

I used 128MB for dom0 with 2.0 and increased this to 256MB with 3.0 
because all my hosts can now access their full 8GB.

10 day uptime host:
# lsof -n | wc -l
2775
# free
              total       used       free     shared    buffers     cached
Mem:        262544     218040      44504          0      21300      55592
-/+ buffers/cache:     141148     121396
Swap:       522104      35944     486160

2 day uptime host:
# lsof -n | wc -l
1420
# free
              total       used       free     shared    buffers     cached
Mem:        262544     252076      10468          0      28432      85264
-/+ buffers/cache:     138380     124164
Swap:       522104       3928     518176


File limit is 14343 so fd's shouldn't be a problem.

I do not have any OOM errors in my logs though.

>> I've also run into this once:
>>
>> Message from syslogd@vm20 at Fri Jan 20 23:16:52 2006 ...
>> vm20 xenstored: xenstored corruption: connection id -1: err No such file 
>> or directory: No child '(null)' found
> 
> If you get this, all bets are off.  There is no way that the system as it
> stands will recover gracefully if the store is corrupted.  At best, you'll
> just lose configuration data regarding the running VMs -- at worst, the
> corruption could persist indefinitely, and you'll be unable to do anything
> through Xend.
> 
> Do you have xen-unstable changeset 8269:ac3ceb2d37d1 aka xen-3.0-testing
> changeset 8250:1e3d31952015?  This fixes the only xenstore corruption bug that
> I know of, and if you've got that fix, then it's definitely a new bug.  In
> that case, we would appreciate it if you could either find a test case that
> takes less than a few days to trigger this bug, or get your hands dirty
> yourself and put some tracing and assertions into Xenstored around the TDB
> manipulations to try and catch the corruption.
> 

I am running -unstable from the 16th.  If that change exists in there 
then yes I have the fix.

> Maybe the corrupted TDB file itself might be useful to someone.  Could you
> save that, too?

Yes, normally in a case like this I get a few tdb.xxxxxx files, where the 
x's stand for a 6-character hex string.

> 
> As far as I'm aware, you are the only person who's ever seen this message, so
> tracking it down without your help is going to be impossible.  Is there
> anything strange about your setup?  Any network block devices or NFS involved,
> any quotas on your filesystems or SELinux?  Any patches that you've applied,
> non-standard kernel options, anything like that?
> 

My setup is fairly standard.  -unstable, PAE, LVM, routed networking. 
Just tracking Xen using Mercurial.


* Re: Bug: xm commands hanging due to poor threading in xend
  2006-01-23 19:46   ` Matt Ayres
@ 2006-01-23 19:54     ` Matt Ayres
  2006-01-24 18:11       ` Matt Ayres
  2006-01-23 20:09     ` Ewan Mellor
  1 sibling, 1 reply; 10+ messages in thread
From: Matt Ayres @ 2006-01-23 19:54 UTC (permalink / raw)
  To: Matt Ayres; +Cc: xen-devel, Ewan Mellor



Matt Ayres wrote:

> 
> File limit is 14343 so fd's shouldn't be a problem.

Actually, my thread-max limit was 4400, so in theory this could have 
been hit by many httpd/mysql/xend accesses.
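
A rough way to test this theory is to add up the threads of every process
and compare the total with the limit.  The Python below is a sketch only:
it assumes the limit in question is the kernel's
/proc/sys/kernel/threads-max and that each /proc/<pid>/status has a
"Threads:" field.  total_threads and headroom are hypothetical helper
names.

```python
import os


def total_threads(proc="/proc"):
    """Sum the Threads: field of every <proc>/<pid>/status that can be read."""
    total = 0
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc, entry, "status")) as f:
                for line in f:
                    if line.startswith("Threads:"):
                        total += int(line.split()[1])
                        break
        except OSError:
            continue  # process exited (or is unreadable) during the scan
    return total


def headroom(used, limit):
    """Return (threads still available, fraction of the limit in use)."""
    return limit - used, used / limit
```

For example, headroom(total_threads(), 4400) reports how close the system
is to a limit of 4400 at the moment it is sampled.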


* RE: Bug: xm commands hanging due to poor threading in xend
@ 2006-01-23 19:57 Ian Pratt
  2006-01-23 20:23 ` Matt Ayres
  0 siblings, 1 reply; 10+ messages in thread
From: Ian Pratt @ 2006-01-23 19:57 UTC (permalink / raw)
  To: Matt Ayres, Ewan Mellor; +Cc: xen-devel

> My setup is fairly standard.  -unstable, PAE, LVM, routed networking. 
> Just tracking Xen using mercuial.

routeing is not widely used. Have a look in /proc/slabinfo on a long
running system to see if you have a leak -- there have been complaints
of a per-packet memory leak on some routeing configurations.
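
One way to read /proc/slabinfo for this purpose is to save two copies some
hours apart and see which caches grew.  The Python below is a sketch under
the assumption that the file uses the 2.x column layout (name,
active_objs, num_objs, objsize, ...); parse_slabinfo and growth are
hypothetical helper names.

```python
def parse_slabinfo(text):
    """Map cache name -> (active_objs, objsize) from /proc/slabinfo text."""
    caches = {}
    for line in text.splitlines():
        # Skip blank lines and the two header lines.
        if not line.strip() or line.startswith(("slabinfo", "#")):
            continue
        parts = line.split()
        if len(parts) < 4:
            continue
        try:
            caches[parts[0]] = (int(parts[1]), int(parts[3]))
        except ValueError:
            continue  # not in the assumed column layout
    return caches


def growth(before, after):
    """List (cache, bytes grown) for caches that grew, largest first."""
    deltas = []
    for name, (objs, size) in after.items():
        old_objs, old_size = before.get(name, (0, size))
        delta = objs * size - old_objs * old_size
        if delta > 0:
            deltas.append((name, delta))
    return sorted(deltas, key=lambda item: item[1], reverse=True)
```

A cache that sits at the top of growth() across several successive
snapshots is a candidate for the kind of leak described here.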

Ian


* Re: Bug: xm commands hanging due to poor threading in xend
  2006-01-23 19:46   ` Matt Ayres
  2006-01-23 19:54     ` Matt Ayres
@ 2006-01-23 20:09     ` Ewan Mellor
  1 sibling, 0 replies; 10+ messages in thread
From: Ewan Mellor @ 2006-01-23 20:09 UTC (permalink / raw)
  To: Matt Ayres; +Cc: xen-devel

On Mon, Jan 23, 2006 at 02:46:00PM -0500, Matt Ayres wrote:

> I went ahead and did find some problems.  On a server up with 10 days 
> some processes (mysql/httpd) in dom0 were stressed.  Swap was 50% in 
> use.  I have put in memory minimizing config files for both of these 
> apps.  File descriptors is still high even after restart most all 
> services on the server with the higher uptime.  I can also try 
> increasing dom0 memory to 512MB or so.
> 
> I did 128MB for dom0 with 2.0 and increased this to 256MB with 3.0 
> because all my hosts can now access their full 8GB.
> 
> 10 day uptime host:
> # lsof -n | wc -l
> 2775
> # free
>              total       used       free     shared    buffers     cached
> Mem:        262544     218040      44504          0      21300      55592
> -/+ buffers/cache:     141148     121396
> Swap:       522104      35944     486160
> 
> 2 day uptime host:
> # lsof -n | wc -l
> 1420
> # free
>              total       used       free     shared    buffers     cached
> Mem:        262544     252076      10468          0      28432      85264
> -/+ buffers/cache:     138380     124164
> Swap:       522104       3928     518176

And who owns the 1000 extra fds?  An extra 40MB in use isn't too scary
-- that could just be caching inside your MySQL, for example, but an
extra 1000 fds is certainly a problem.  I believe there is a per-process
1024 fd limit on some systems.
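
To see who owns the entries, the lsof output can be grouped by process
instead of just counted.  The Python below is a sketch that assumes the
usual lsof column order (COMMAND, PID, ...) and command names without
spaces; count_by_process and top_owners are hypothetical helper names.
Note that lsof also lists entries such as cwd, txt and mem, so its line
count is an upper bound on open descriptors rather than an exact figure.

```python
import subprocess
from collections import Counter


def count_by_process(lsof_text):
    """Count lsof output lines per (command, pid)."""
    counts = Counter()
    for line in lsof_text.splitlines():
        parts = line.split()
        # The header line is skipped here because "PID" is not numeric.
        if len(parts) < 2 or not parts[1].isdigit():
            continue
        counts[(parts[0], int(parts[1]))] += 1
    return counts


def top_owners(n=10):
    """Run `lsof -n` and return the n processes with the most entries."""
    out = subprocess.run(["lsof", "-n"], capture_output=True,
                         text=True, check=False).stdout
    return count_by_process(out).most_common(n)
```

Running top_owners() on both hosts and comparing the lists should show
whether the extra entries belong to xend, xenstored, or something
unrelated.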

Ewan.


* Re: Bug: xm commands hanging due to poor threading in xend
  2006-01-23 19:57 Ian Pratt
@ 2006-01-23 20:23 ` Matt Ayres
  0 siblings, 0 replies; 10+ messages in thread
From: Matt Ayres @ 2006-01-23 20:23 UTC (permalink / raw)
  To: Ian Pratt; +Cc: xen-devel, Ewan Mellor



Ian Pratt wrote:
>> My setup is fairly standard.  -unstable, PAE, LVM, routed networking. 
>> Just tracking Xen using mercuial.
> 
> routeing is not widely used. Have a look in /proc/slabinfo on a long
> running system to see if you have a leak -- there have been complaints
> of a per-packet memory leak on some routeing configurations.
> 

I'm being completely honest when I say I look at slabinfo and have NO 
idea what I am seeing :)  If you want I can attach it to a bug.  I 
noticed the run-parts cron job is running on the one... I am going to 
wait until that finishes and see if the number of fds drops drastically.


* Re: Bug: xm commands hanging due to poor threading in xend
  2006-01-23 19:54     ` Matt Ayres
@ 2006-01-24 18:11       ` Matt Ayres
  0 siblings, 0 replies; 10+ messages in thread
From: Matt Ayres @ 2006-01-24 18:11 UTC (permalink / raw)
  To: Matt Ayres; +Cc: xen-devel, Ewan Mellor



Matt Ayres wrote:
> 
> 
> Matt Ayres wrote:
> 
>>
>> File limit is 14343 so fd's shouldn't be a problem.
> 
> Actually, my thread-max limit was 4400, so in theory this could have 
> been hit by many httpd/mysql/xend accesses.
> 

I raised the thread-max to 16384, increased dom0 RAM to 512MB, and dom0 
swap to 1GB and still saw problems throughout the night.  I noticed 
Vincent committed a few memory leak / xenstore fixes today.  I 
immediately pulled that down and am seeing how well things run now.

Users also reported receiving BUGs in their domains; I submitted a bug 
regarding this.  I have set vcpus=1 for every domU now just in case 
SMP inside the domU was affecting Xen / dom0 Linux.

I will update if this solves my problems or if they change / persist.

Thank you to the Xen developers for assisting me with diagnosing these 
odd problems that only those of us brave enough to try -unstable in 
production can probably find :)

Cheers!
Matt Ayres
