xen-devel.lists.xenproject.org archive mirror
* help with xenstored 'hang'
From: Jim Fehlig @ 2010-06-30 22:15 UTC
  To: xen-devel

I'm trying to debug an 'xm list' hang on a large (~700 hosts) Xen 3.2
production installation.  The hang occurs randomly, on a random host.
The user has provided cores of the xend and xenstored processes taken
while the hang was occurring.  After poking at these cores, here is what
I have found:

In the xend process, a thread is blocked on a cond variable, waiting for
a response to XS_TRANSACTION_START from xenstored. A reader thread
responsible for reading from xenstored is blocked on read(2).
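
For readers unfamiliar with this arrangement, here is a minimal sketch of a
reader thread handing replies to a requesting thread through a condition
variable. It is not the libxs source: `struct handle`, `handle_init`,
`reader_thread` and `wait_for_reply` are hypothetical names, the "reply" is a
bare int rather than a xenstore message, and only one outstanding request is
assumed.

```c
#include <assert.h>
#include <pthread.h>
#include <unistd.h>

/* Hypothetical stand-in for a client handle; not the real xs_handle. */
struct handle {
    int fd;                       /* fd the daemon's replies arrive on */
    pthread_mutex_t reply_mutex;
    pthread_cond_t reply_cond;
    int have_reply;               /* predicate guarded by reply_mutex */
    int reply;
};

static void handle_init(struct handle *h, int fd)
{
    assert(fd >= 0);
    h->fd = fd;
    pthread_mutex_init(&h->reply_mutex, NULL);
    pthread_cond_init(&h->reply_cond, NULL);
    h->have_reply = 0;
    h->reply = 0;
}

/* Reader thread: blocks in read(2) until a reply arrives, then wakes the waiter. */
static void *reader_thread(void *arg)
{
    struct handle *h = arg;
    int value;

    while (read(h->fd, &value, sizeof(value)) == (ssize_t)sizeof(value)) {
        pthread_mutex_lock(&h->reply_mutex);
        h->reply = value;
        h->have_reply = 1;
        pthread_cond_signal(&h->reply_cond);
        pthread_mutex_unlock(&h->reply_mutex);
    }
    return NULL;                  /* EOF or error ends the thread */
}

/* Requesting thread: sleeps on the condition variable until a reply is stored. */
static int wait_for_reply(struct handle *h)
{
    int value;

    pthread_mutex_lock(&h->reply_mutex);
    while (!h->have_reply)
        pthread_cond_wait(&h->reply_cond, &h->reply_mutex);
    value = h->reply;
    h->have_reply = 0;
    pthread_mutex_unlock(&h->reply_mutex);
    return value;
}
```

In this pattern, if the reply never reaches the reader's fd, the reader stays
in read(2) and the requester stays in pthread_cond_wait() indefinitely, which
is the shape of the two blocked threads seen in the xend core.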

In the xenstored process, the lone thread is blocked on select(2),
waiting for IO. I examined the connections list and see that it contains
a connection for the XS_TRANSACTION_START request.  Dumping the
connection object:

(gdb) p *(struct connection *)0x526c70
$48 = {list = {next = 0x517c30, prev = 0x5151f0}, fd = 13, id = 0,
  can_write = true, in = 0x523600,
  out_list = {next = 0x526c98, prev = 0x526c98}, transaction = 0x0,
  transaction_list = {next = 0x523560, prev = 0x523560},
  next_transaction_id = 60231445, transaction_started = 1,
  domain = 0x0, watches = {next = 0x51daa0, prev = 0x5267b0},
  write = 0x402460 <writefd>, read = 0x405180 <readfd>}

Notice transaction_started is set to 1, but out_list is empty. AFAICT,
that means the reply has been sent to xend. The reader thread in xend
should have received the response and signaled the cond variable,
allowing execution to progress. Ultimately, xend would send an
XS_TRANSACTION_END message, freeing the connection object in xenstored
and removing it from the connections list.
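
The "out_list is empty" reading can be cross-checked from the dump alone,
since an empty circular list head points back at itself. The sketch below
mirrors the field order printed by gdb; the member types are guesses made
from that output (struct and function pointers are replaced by `void *`), an
LP64 layout is assumed, and `connection_mirror`, `member_addr` and
`head_is_empty` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct list_head { struct list_head *next, *prev; };

/* Field order copied from the gdb dump; types are assumptions, not the real header. */
struct connection_mirror {
    struct list_head list;
    int fd;
    unsigned int id;
    bool can_write;
    void *in;
    struct list_head out_list;
    void *transaction;
    struct list_head transaction_list;
    uint32_t next_transaction_id;
    unsigned int transaction_started;
    void *domain;
    struct list_head watches;
    void *write;
    void *read;
};

/* Address of a member located 'offset' bytes into an object at 'base'. */
static uintptr_t member_addr(uintptr_t base, size_t offset)
{
    return base + offset;
}

/* A circular list is empty when the head's next pointer is the head itself. */
static bool head_is_empty(uintptr_t head_addr, uintptr_t next)
{
    assert(head_addr != 0);
    return next == head_addr;
}
```

Under those assumptions out_list sits at 0x526c70 + 0x28 = 0x526c98, which
equals its own next and prev, so it is empty. transaction_list would sit at
0x526cb0 but points at 0x523560, so it holds one entry, which is consistent
with transaction_started = 1.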

Does my understanding of this code sound correct?  Anyone have
suggestions or further debugging tips?  Examining cores is about my only
debug option as user does not want to deploy debug patches, enable
tracing, etc. across 700 hosts.

Interestingly, when the user runs strace on the xenstored process or
attaches to it with gdb, xenstored "awakes", the hung 'xm list' returns,
and xenstored continues normally.  A new connection to xenstored (e.g.
running xmtop) seems to poke it along as well.  Would a timeout on
select(2) in the main loop of xenstored help at all?
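
For reference, this is what a timeout on select(2) looks like in a generic
event loop. It is a sketch only, not a patch against xenstored, and
`wait_for_input` is a hypothetical helper. A timeout just forces the loop to
re-examine its state periodically, so it might work around a missed wakeup
without explaining it.

```c
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

/*
 * Wait up to timeout_sec seconds for fd to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error (errno set by select).
 */
static int wait_for_input(int fd, long timeout_sec)
{
    fd_set readfds;
    struct timeval tv;

    assert(fd >= 0 && fd < FD_SETSIZE);
    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);

    /* select() may modify tv, so it is re-initialised on every call. */
    tv.tv_sec = timeout_sec;
    tv.tv_usec = 0;

    int rc = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (rc < 0)
        return -1;
    return (rc > 0 && FD_ISSET(fd, &readfds)) ? 1 : 0;
}
```

A main loop built on this would treat a return of 0 as "nothing arrived, but
check pending work anyway" before calling it again.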

Thanks for any insights!
Jim


* Re: help with xenstored 'hang'
From: Patrick Colp @ 2010-06-30 23:17 UTC
  To: Jim Fehlig; +Cc: xen-devel

I was recently struggling with what sounds like a not-too-dissimilar
problem while working with a disaggregated version of xenstore. The
ultimate solution for me was to disable pthreads in xenstore/libxs. I
just commented out the following line in tools/xenstore/Makefile:

xs.opic: CFLAGS += -DUSE_PTHREAD

After I removed that line and rebuilt and installed xenstore, it
worked just fine. I would be curious to know if this also solves your
problem.


Patrick




* Re: help with xenstored 'hang'
From: Jim Fehlig @ 2010-06-30 23:31 UTC
  To: Patrick Colp; +Cc: xen-devel

Patrick Colp wrote:
> I was recently struggling with what sounds like a not-too-dissimilar
> problem while working with a disaggregated version of xenstore. The
> ultimate solution for me was to disable pthreads in xenstore/libxs. I
> just commented out the following line in tools/xenstore/Makefile:
>
> xs.opic: CFLAGS += -DUSE_PTHREAD
>   

Xen 3.2 predates c/s 17405, which made the use of pthreads optional.
Prior to that, pthreads was used unconditionally.

> After I removed that line and rebuilt and installed xenstore, it
> worked just fine. I would be curious to know if this also solves your
> problem.
>   

I can see if the user is receptive to testing backported 17405 with
pthreads disabled.

Thanks for the suggestion.
Jim



* Re: help with xenstored 'hang'
From: Jim Fehlig @ 2010-07-01 21:30 UTC
  To: Patrick Colp; +Cc: xen-devel

Patrick Colp wrote:
> I was recently struggling with what sounds like a not-too-dissimilar
> problem while working with a disaggregated version of xenstore. The
> ultimate solution for me was to disable pthreads in xenstore/libxs. I
> just commented out the following line in tools/xenstore/Makefile:
>
> xs.opic: CFLAGS += -DUSE_PTHREAD
>
> After I removed that line and rebuilt and installed xenstore, it
> worked just fine. I would be curious to know if this also solves your
> problem.
>   

After more thought, this seems like it could cause problems in xend,
which is multi-threaded.  This change essentially makes the xenstore
client library thread-unsafe, correct?

Regards,
Jim



* Re: help with xenstored 'hang'
From: Patrick Colp @ 2010-07-01 22:33 UTC
  To: Jim Fehlig; +Cc: xen-devel

On 1 July 2010 14:30, Jim Fehlig <jfehlig@novell.com> wrote:
> Patrick Colp wrote:
>> I was recently struggling with what sounds like a not-too-dissimilar
>> problem while working with a disaggregated version of xenstore. The
>> ultimate solution for me was to disable pthreads in xenstore/libxs. I
>> just commented out the following line in tools/xenstore/Makefile:
>>
>> xs.opic: CFLAGS += -DUSE_PTHREAD
>>
>> After I removed that line and rebuilt and installed xenstore, it
>> worked just fine. I would be curious to know if this also solves your
>> problem.
>>
>
> After more thought, this seems like it could cause problems in xend,
> which is multi-threaded.  This change essentially make the xenstore
> client library thread-unsafe correct?

I don't think so. I think it just makes the xenstore library single
threaded. In my case, I was using a single threaded application and
still ran into this problem, as the xenstore library seems to have
multiple threads. But the description of your problem sounds a lot
like what was happening with me where it seemed like messages were
disappearing. I can't say if what worked for me would work for you,
though. It just seemed similar enough to me.


Patrick




* Re: help with xenstored 'hang'
From: Jim Fehlig @ 2010-07-01 23:03 UTC
  To: Patrick Colp; +Cc: xen-devel

Patrick Colp wrote:
> On 1 July 2010 14:30, Jim Fehlig <jfehlig@novell.com> wrote:
>   
>> Patrick Colp wrote:
>>     
>>> I was recently struggling with what sounds like a not-too-dissimilar
>>> problem while working with a disaggregated version of xenstore. The
>>> ultimate solution for me was to disable pthreads in xenstore/libxs. I
>>> just commented out the following line in tools/xenstore/Makefile:
>>>
>>> xs.opic: CFLAGS += -DUSE_PTHREAD
>>>
>>> After I removed that line and rebuilt and installed xenstore, it
>>> worked just fine. I would be curious to know if this also solves your
>>> problem.
>>>
>>>       
>> After more thought, this seems like it could cause problems in xend,
>> which is multi-threaded.  This change essentially make the xenstore
>> client library thread-unsafe correct?
>>     
>
> I don't think so. I think it just makes the xenstore library single
> threaded.

Right.  But AFAICT, multiple threads in xend could use the single
xs_handle, allowing these threads to write to the handle's fd
simultaneously.  With the pthreads impl, these threads must acquire the
handle's req_mutex before writing.
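
To illustrate the concern: if a request goes out as a header followed by a
payload, two unsynchronised writers on one fd can interleave their pieces. A
sketch of the locking pattern follows. Only the `req_mutex` name comes from
the discussion above; `struct handle`, `write_all`, `send_request` and the
length-prefixed message layout are hypothetical and not the libxs code.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical handle shared by several threads. */
struct handle {
    int fd;
    pthread_mutex_t req_mutex;    /* held for the duration of one request */
};

/* Write exactly len bytes, retrying on short writes and EINTR. Returns 0 or -1. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Send a length header and payload as one unit with respect to other threads. */
static int send_request(struct handle *h, const char *payload)
{
    uint32_t len;
    int rc;

    assert(h != NULL && payload != NULL);
    len = (uint32_t)strlen(payload);

    pthread_mutex_lock(&h->req_mutex);
    rc = write_all(h->fd, &len, sizeof(len));
    if (rc == 0)
        rc = write_all(h->fd, payload, len);
    pthread_mutex_unlock(&h->req_mutex);
    return rc;
}
```

If the lock and unlock calls were compiled out, nothing would stop a second
thread's header from landing between the first thread's header and payload.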

>  In my case, I was using a single threaded application and
> still ran into this problem, as the xenstore library seems to have
> multiple threads.

It spawns one reader thread only.  Requests and responses are handled on
the caller's thread of control.

Regards,
Jim
