* help with xenstored 'hang'
@ 2010-06-30 22:15 Jim Fehlig
From: Jim Fehlig @ 2010-06-30 22:15 UTC
  To: xen-devel

I'm trying to debug an 'xm list' hang on a large (~700 hosts) Xen 3.2
production installation.  The hang occurs randomly, on a random host.
The user has provided cores of the xend and xenstored processes taken
while the hang was in progress.  After poking at these cores I have
discovered the following.

In the xend process, a thread is blocked on a condition variable, waiting
for a response to XS_TRANSACTION_START from xenstored. The reader thread
responsible for reading from xenstored is blocked on read(2).

In the xenstored process, the lone thread is blocked on select(2),
waiting for IO. I examined the connections list and see that it contains
a connection for the XS_TRANSACTION_START request.  Dumping the
connection object:

(gdb) p *(struct connection *)0x526c70
$48 = {list = {next = 0x517c30, prev = 0x5151f0}, fd = 13, id = 0,
  can_write = true, in = 0x523600,
  out_list = {next = 0x526c98, prev = 0x526c98}, transaction = 0x0,
  transaction_list = {next = 0x523560, prev = 0x523560},
  next_transaction_id = 60231445, transaction_started = 1, domain = 0x0,
  watches = {next = 0x51daa0, prev = 0x5267b0},
  write = 0x402460 <writefd>, read = 0x405180 <readfd>}

Notice transaction_started is set to 1, but out_list is empty. AFAICT,
that means the reply has been sent to xend. The reader thread in xend
should have received the response and signaled the condition variable,
allowing execution to progress. Ultimately, xend would send an
XS_TRANSACTION_END message, freeing the connection object in xenstored
and removing it from the connections list.
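
For reference, the "out_list is empty" reading relies on the usual
convention for circular doubly linked lists: the head of an empty list
points back at itself. Assuming a 64-bit build with the fields laid out
in the order gdb prints them, out_list would sit 0x28 bytes into the
connection at 0x526c70, i.e. at 0x526c98, which matches both its next and
prev. By the same logic, transaction_list (next == prev, but not the
head's own address) looks like it holds a single entry. A minimal sketch
of that convention, with illustrative names that are not copied from the
xenstored sources:

```c
#include <stdbool.h>

/* Illustrative list head; the names are hypothetical, not xenstored's. */
struct list_node {
    struct list_node *next;
    struct list_node *prev;
};

/* An empty circular list is a head whose links point back at itself. */
void list_node_init(struct list_node *head)
{
    head->next = head;
    head->prev = head;
}

/* True when the head has no entries linked into it. */
bool list_node_empty(const struct list_node *head)
{
    return head->next == head;
}

/* Link entry in just before head, i.e. at the tail of the list. */
void list_node_add_tail(struct list_node *entry, struct list_node *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}
```

With exactly one entry queued, the head's next and prev both point at
that entry, which is the pattern transaction_list shows in the dump.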

Does my understanding of this code sound correct?  Does anyone have
suggestions or further debugging tips?  Examining cores is about my only
debug option, as the user does not want to deploy debug patches, enable
tracing, etc. across 700 hosts.

Interestingly, when the user runs strace on the xenstored process or
attaches to it with gdb, xenstored "awakes", the hung 'xm list' returns,
and xenstored continues normally.  A new connection to xenstored (e.g.
running xmtop) seems to poke it along as well.  Would a timeout on
select(2) in the main loop of xenstored help at all?
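
To make the select(2) timeout question concrete, here is a rough sketch
of what a bounded wait could look like. This is not the xenstored main
loop: poll_once, POLL_TIMEOUT and the single-fd setup are made up for
illustration. A timeout like this would only bound how long a missed
wakeup can stall things; it would not explain why select(2) is not
returning in the first place.

```c
#include <errno.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical return code meaning "nothing arrived before the timeout". */
#define POLL_TIMEOUT (-2)

/*
 * Wait up to timeout_sec seconds for fd to become readable, then read
 * from it. Returns the result of read(2) if fd was readable,
 * POLL_TIMEOUT if the timeout expired, or -1 if select(2) failed.
 */
ssize_t poll_once(int fd, long timeout_sec, char *buf, size_t len)
{
    for (;;) {
        fd_set inset;
        struct timeval tv;
        int rc;

        /* select(2) may modify both arguments, so rebuild them each pass. */
        FD_ZERO(&inset);
        FD_SET(fd, &inset);
        tv.tv_sec = timeout_sec;
        tv.tv_usec = 0;

        rc = select(fd + 1, &inset, NULL, NULL, &tv);
        if (rc < 0 && errno == EINTR)
            continue;            /* interrupted by a signal: retry */
        if (rc < 0)
            return -1;           /* real select(2) failure */
        if (rc == 0)
            return POLL_TIMEOUT; /* timer expired with no activity */

        return read(fd, buf, len);
    }
}
```

A caller would treat POLL_TIMEOUT as a cue to re-check its connection
state and go around the loop again, rather than as an error.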

Thanks for any insights!
Jim
