On 1 July 2010 14:30, Jim Fehlig <jfehlig@xxxxxxxxxx> wrote:
> Patrick Colp wrote:
>> I was recently struggling with what sounds like a not-too-dissimilar
>> problem while working with a disaggregated version of xenstore. The
>> ultimate solution for me was to disable pthreads in xenstore/libxs. I
>> just commented out the following line in tools/xenstore/Makefile:
>>
>> xs.opic: CFLAGS += -DUSE_PTHREAD
>>
>> After I removed that line and rebuilt and installed xenstore, it
>> worked just fine. I would be curious to know if this also solves your
>> problem.
>>
>
> After more thought, this seems like it could cause problems in xend,
> which is multi-threaded. This change essentially makes the xenstore
> client library thread-unsafe, correct?
I don't think so. I think it just makes the xenstore library
single-threaded. In my case, I was using a single-threaded application and
still ran into this problem, as the xenstore library seems to have
multiple threads. But the description of your problem sounds a lot
like what was happening to me, where it seemed like messages were
disappearing. I can't say whether what worked for me would work for you,
though. It just seemed similar enough to me.
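
For reference, the reader-thread arrangement under discussion looks
roughly like the sketch below. This is not the libxs code: struct client,
reader_thread(), wait_for_reply() and demo_round_trip() are made-up names,
and a pipe stands in for the socket to xenstored. It only illustrates why
a requester sleeps indefinitely if the reply is never read or the
condition variable is never signaled.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for a client handle; not the real libxs handle. */
struct client {
    int fd;                       /* read end of the "daemon" connection */
    pthread_mutex_t lock;
    pthread_cond_t reply_ready;
    char reply[64];
    int have_reply;
};

/* Reader thread: blocks in read(2), then wakes the waiting requester. */
static void *reader_thread(void *arg)
{
    struct client *c = arg;
    char buf[sizeof(c->reply)];
    ssize_t n = read(c->fd, buf, sizeof(buf) - 1);

    if (n <= 0)
        return NULL;
    buf[n] = '\0';

    pthread_mutex_lock(&c->lock);
    memcpy(c->reply, buf, (size_t)n + 1);
    c->have_reply = 1;
    pthread_cond_signal(&c->reply_ready);
    pthread_mutex_unlock(&c->lock);
    return NULL;
}

/* Requester: sleeps on the condition variable until a reply is delivered. */
static void wait_for_reply(struct client *c, char *out, size_t len)
{
    assert(len > 0);
    pthread_mutex_lock(&c->lock);
    while (!c->have_reply)
        pthread_cond_wait(&c->reply_ready, &c->lock);
    strncpy(out, c->reply, len - 1);
    out[len - 1] = '\0';
    c->have_reply = 0;
    pthread_mutex_unlock(&c->lock);
}

/* One request/reply round trip; returns 0 on success, -1 on setup failure. */
int demo_round_trip(char *out, size_t len)
{
    static const char msg[] = "txn-id 1";   /* pretend reply from the daemon */
    struct client c = { .have_reply = 0 };
    pthread_t tid;
    int p[2];

    if (pipe(p) != 0)
        return -1;
    c.fd = p[0];
    pthread_mutex_init(&c.lock, NULL);
    pthread_cond_init(&c.reply_ready, NULL);
    if (pthread_create(&tid, NULL, reader_thread, &c) != 0)
        return -1;

    if (write(p[1], msg, sizeof(msg) - 1) != (ssize_t)(sizeof(msg) - 1))
        return -1;
    wait_for_reply(&c, out, len);

    pthread_join(tid, NULL);
    pthread_cond_destroy(&c.reply_ready);
    pthread_mutex_destroy(&c.lock);
    close(p[0]);
    close(p[1]);
    return 0;
}
```

If the reader never returns from read(2), have_reply stays 0 and
wait_for_reply() never wakes, which is the shape of the hang described in
the quoted report below.
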
Patrick
>
> Regards,
> Jim
>
>>
>> Patrick
>>
>>
>> On 30 June 2010 15:15, Jim Fehlig <jfehlig@xxxxxxxxxx> wrote:
>>
>>> I'm trying to debug an 'xm list' hang on a large (~700 hosts) Xen 3.2
>>> production installation. The hang occurs randomly, on a random host.
>>> The user has provided cores of the xend and xenstored processes taken
>>> when the hang occurs. After poking at these cores I have discovered the
>>> following:
>>>
>>> In the xend process, a thread is blocked on a cond variable, waiting for a
>>> response to XS_TRANSACTION_START from xenstored. A reader thread
>>> responsible for reading from xenstored is blocked on read(2).
>>>
>>> In the xenstored process, the lone thread is blocked on select(2),
>>> waiting for IO. I examined the connections list and see that it contains
>>> a connection for the XS_TRANSACTION_START request. Dumping the
>>> connection object:
>>>
>>> (gdb) p *(struct connection *)0x526c70
>>> $48 = {list = {next = 0x517c30, prev = 0x5151f0}, fd = 13, id = 0,
>>>   can_write = true, in = 0x523600,
>>>   out_list = {next = 0x526c98, prev = 0x526c98}, transaction = 0x0,
>>>   transaction_list = {next = 0x523560, prev = 0x523560},
>>>   next_transaction_id = 60231445, transaction_started = 1,
>>>   domain = 0x0, watches = {next = 0x51daa0, prev = 0x5267b0},
>>>   write = 0x402460 <writefd>, read = 0x405180 <readfd>}
>>>
>>> Notice transaction_started is set to 1, but out_list is empty. AFAICT,
>>> that means the reply has been sent to xend. The reader thread in xend
>>> should have received the response and signaled the cond variable -
>>> allowing execution to progress. Ultimately, xend would send an
>>> XS_TRANSACTION_END message, freeing the connection object in xenstored
>>> and removing it from connections list.
>>>
>>> Does my understanding of this code sound correct? Anyone have
>>> suggestions or further debugging tips? Examining cores is about my only
>>> debug option, as the user does not want to deploy debug patches, enable
>>> tracing, etc. across 700 hosts.
>>>
>>> Interestingly, when the user runs strace on or attaches gdb to the
>>> xenstored process, xenstored "awakes", the hung 'xm list' returns, and
>>> xenstored continues normally. A new connection to xenstored (e.g. running
>>> xmtop) seems to poke it along as well. Would a timeout on select(2) in the
>>> main loop of xenstored help at all?
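
On the select(2) timeout idea: a minimal sketch of what a bounded wait
could look like, assuming a hypothetical helper wait_readable() (not a
xenstored function) and a single descriptor rather than the daemon's full
fd sets. A timeout like this would probably only make the loop re-check
its state periodically; it would not by itself explain why the wakeup was
missed.

```c
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms for fd to become readable.
 * Returns 1 if readable, 0 if the timeout expired, -1 on error
 * (including EINTR, which a real main loop would normally retry).
 */
int wait_readable(int fd, long timeout_ms)
{
    fd_set rfds;
    struct timeval tv;
    int r;

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* Passing a non-NULL timeval makes select(2) return after the interval. */
    r = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (r < 0)
        return -1;
    return (r > 0 && FD_ISSET(fd, &rfds)) ? 1 : 0;
}
```

A main loop built on this would treat a 0 return as "nothing arrived, go
round again and re-examine pending work" instead of sleeping until the
next descriptor event.
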
>>>
>>> Thanks for any insights!
>>> Jim
>>>
>>>
>>>
>>> _______________________________________________
>>> Xen-devel mailing list
>>> Xen-devel@xxxxxxxxxxxxxxxxxxx
>>> http://lists.xensource.com/xen-devel
>>>
>>>
>>>
>