I was recently struggling with what sounds like a similar problem
while working with a disaggregated version of xenstore. The ultimate
solution for me was to disable pthreads in xenstore/libxs. I just
commented out the following line in tools/xenstore/Makefile:

    xs.opic: CFLAGS += -DUSE_PTHREAD

After I commented out that line, then rebuilt and reinstalled
xenstore, it worked just fine. I would be curious to know whether
this also solves your problem.
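
For context, here is a rough sketch of how a compile-time flag such as
USE_PTHREAD can switch a library between a threaded and a
single-threaded build. This is not copied from xs.c; every demo_* name
is made up for illustration, and the real libxs code may be organised
differently.

```c
/* Sketch only: gating locking on a -DUSE_PTHREAD style flag.
 * All demo_* names are hypothetical, not taken from libxs. */
#ifdef USE_PTHREAD
#include <pthread.h>
typedef pthread_mutex_t demo_mutex_t;
#define DEMO_MUTEX_INIT      PTHREAD_MUTEX_INITIALIZER
#define demo_mutex_lock(m)   pthread_mutex_lock(m)
#define demo_mutex_unlock(m) pthread_mutex_unlock(m)
#define DEMO_THREADED        1
#else
/* Without the flag, locking compiles away to nothing. */
typedef int demo_mutex_t;
#define DEMO_MUTEX_INIT      0
#define demo_mutex_lock(m)   ((void)(m))
#define demo_mutex_unlock(m) ((void)(m))
#define DEMO_THREADED        0
#endif

static demo_mutex_t demo_lock = DEMO_MUTEX_INIT;
static int demo_counter;

/* Increment a shared counter under the (possibly no-op) lock. */
int demo_increment(void)
{
    int v;

    demo_mutex_lock(&demo_lock);
    v = ++demo_counter;
    demo_mutex_unlock(&demo_lock);
    return v;
}

/* Report which variant was compiled in. */
int demo_is_threaded(void)
{
    return DEMO_THREADED;
}
```

Built without -DUSE_PTHREAD, the same source has no locks and no
helper thread, which is the configuration that worked for me.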
Patrick
On 30 June 2010 15:15, Jim Fehlig <jfehlig@xxxxxxxxxx> wrote:
> I'm trying to debug an 'xm list' hang on a large (~700 hosts) Xen 3.2
> production installation. The hang occurs randomly, on a random host.
> The user has provided cores of the xend and xenstored processes taken
> when the hang occurs. After poking at these cores, I have discovered
> the following:
>
> In xend process, a thread is blocked on a cond variable, waiting for a
> response to XS_TRANSACTION_START from xenstored. A reader thread
> responsible for reading from xenstored is blocked on read(2).
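
The two-thread arrangement described here can be sketched roughly as
follows. This is a minimal illustration, not the libxs code: the
struct, the function names, and the use of a pipe in place of the
xenstored socket are all assumptions made for the example.

```c
#include <pthread.h>
#include <unistd.h>

/* Hypothetical state shared by the requester and the reader thread. */
struct reply_slot {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             have_reply;
    char            reply;
    int             fd;          /* descriptor the reader blocks on */
};

/* Reader thread: blocks in read(2), then publishes the reply and signals. */
static void *reader_main(void *arg)
{
    struct reply_slot *s = arg;
    char c;

    if (read(s->fd, &c, 1) == 1) {
        pthread_mutex_lock(&s->lock);
        s->reply = c;
        s->have_reply = 1;
        pthread_cond_signal(&s->cond);
        pthread_mutex_unlock(&s->lock);
    }
    return NULL;
}

/* Requester: sleeps on the condition variable until a reply is published. */
static char wait_for_reply(struct reply_slot *s)
{
    char c;

    pthread_mutex_lock(&s->lock);
    while (!s->have_reply)
        pthread_cond_wait(&s->cond, &s->lock);
    c = s->reply;
    pthread_mutex_unlock(&s->lock);
    return c;
}

/* Demo: a pipe stands in for the socket; returns the byte received or -1. */
int demo_round_trip(char msg)
{
    struct reply_slot s;
    pthread_t tid;
    int p[2];
    int got = -1;

    if (pipe(p) != 0)
        return -1;
    pthread_mutex_init(&s.lock, NULL);
    pthread_cond_init(&s.cond, NULL);
    s.have_reply = 0;
    s.reply = 0;
    s.fd = p[0];

    if (pthread_create(&tid, NULL, reader_main, &s) == 0) {
        /* The "daemon" side sends its one-byte reply. */
        if (write(p[1], &msg, 1) == 1)
            got = (unsigned char)wait_for_reply(&s);
        close(p[1]);             /* EOF unblocks the reader if write failed */
        pthread_join(tid, NULL);
    } else {
        close(p[1]);
    }
    close(p[0]);
    pthread_cond_destroy(&s.cond);
    pthread_mutex_destroy(&s.lock);
    return got;
}
```

In this sketch, if the reply byte never reaches the reader, the reader
stays in read(2) and the requester stays in pthread_cond_wait(), which
is the same pair of blocked threads seen in the xend core.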
>
> In the xenstored process, the lone thread is blocked on select(2),
> waiting for IO. I examined the connections list and see that it contains
> a connection for the XS_TRANSACTION_START request. Dumping the
> connection object:
>
> (gdb) p *(struct connection *)0x526c70
> $48 = {list = {next = 0x517c30, prev = 0x5151f0}, fd = 13, id = 0,
>   can_write = true, in = 0x523600,
>   out_list = {next = 0x526c98, prev = 0x526c98}, transaction = 0x0,
>   transaction_list = {next = 0x523560, prev = 0x523560},
>   next_transaction_id = 60231445, transaction_started = 1,
>   domain = 0x0, watches = {next = 0x51daa0, prev = 0x5267b0},
>   write = 0x402460 <writefd>, read = 0x405180 <readfd>}
>
> Notice transaction_started is set to 1, but out_list is empty (its
> next and prev both hold 0x526c98, consistent with a list head pointing
> back at itself). AFAICT, that means the reply has been sent to xend.
> The reader thread in xend should have received the response and
> signaled the cond variable, allowing execution to progress.
> Ultimately, xend would send an XS_TRANSACTION_END message, freeing the
> connection object in xenstored and removing it from the connections
> list.
>
> Does my understanding of this code sound correct? Does anyone have
> suggestions or further debugging tips? Examining cores is about my only
> debug option, as the user does not want to deploy debug patches, enable
> tracing, etc. across 700 hosts.
>
> Interestingly, when the user straces the xenstored process or attaches
> to it with gdb, xenstored "awakes", the hung 'xm list' returns, and
> xenstored continues normally. A new connection to xenstored (e.g.
> running xmtop) seems to poke it along as well. Would a timeout on
> select(2) in the main loop of xenstored help at all?
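
On the select(2) timeout question, a bounded wait looks roughly like
the sketch below. wait_readable() is a hypothetical helper written for
this example, not a xenstored function, and the real main loop watches
many descriptors rather than one. A timeout of this kind would
presumably only limit how long a stall can last; it would not explain
why the wakeup is missed.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Wait until fd is readable or timeout_ms elapses.
 * Returns 1 if readable, 0 on timeout, -1 on error (errno set by select). */
int wait_readable(int fd, int timeout_ms)
{
    fd_set rfds;
    struct timeval tv;
    int rc;

    /* FD_SET is only defined for descriptors below FD_SETSIZE. */
    assert(fd >= 0 && fd < FD_SETSIZE);

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* A non-NULL timeout makes select return even if nothing arrives. */
    rc = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (rc < 0)
        return -1;
    return (rc > 0 && FD_ISSET(fd, &rfds)) ? 1 : 0;
}
```

A loop built on this would re-check its connection list each time the
call returns 0, instead of sleeping until some unrelated event (such as
a new connection) happens to wake it.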
>
> Thanks for any insights!
> Jim
>
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-devel