Keir Fraser <keir.fraser@xxxxxxxxxxxxx> writes:
> On 29/12/2008 21:54, "Ferenc Wagner" <wferi@xxxxxxx> wrote:
>
>>> I think the console daemon tries to discard contiguous chunks of
>>> data, rather than odd characters here and there. How effective it
>>> really is I'm not sure, but certainly you can expect the discards to
>>> be in reasonable-sized chunks and also to be pretty random.
>>
>> Huh, now I'm no closer to having an idea about the expected behaviour.
>> What does the console daemon try to achieve? Does the randomness stem
>> from the scheduling irregularity? On which side of the daemon is the
>> 1 MB buffer?
>
> The daemon greedily takes characters from a small ring buffer shared with
> the guest, and places them in the much bigger 1MB buffer. The guest can
> expect characters to not sit around in the shared ring, and it is possible
> that a guest could lock up if that were to happen (although in your example
> only the user process writing to the console should hang in that
> [impossible] case).
Does your "hang" above mean a total lockup, or rather a delay until
dom0 schedules xenconsoled to remove those characters from the shared
ring?

With that example I could only probe the console buffer behaviour; I
never managed to freeze the guest. In real life, user processes
practically never write to the guest console; its output is dominated
by iptables (kernel packet filter) logs.
> Rather than stopping reading characters when the 1MB buffer fills,
> the daemon instead discards character sequences from the 1MB buffer.
From the test it looks like the daemon discards from the end of the
buffer, but then why are the first 158 lines missing?

(As a side note, wouldn't it be more useful if the big buffer were a
ring as well, discarding the oldest input at the beginning?)
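
To illustrate what I mean by a ring that discards at the beginning,
here is a rough sketch. It is in Python rather than the daemon's C, the
RingLog name and the tiny 16-byte capacity are made up for the
example, and it says nothing about what io.c actually does:

```python
from collections import deque


class RingLog:
    """Bounded byte log that drops the OLDEST bytes once it is full.

    Hypothetical illustration only; not taken from xenconsoled.
    """

    def __init__(self, capacity):
        # A deque with maxlen discards items from the opposite end
        # automatically when new items are appended to a full deque.
        self._buf = deque(maxlen=capacity)

    def write(self, data: bytes) -> None:
        # Iterating over bytes yields ints, one per byte.
        self._buf.extend(data)

    def contents(self) -> bytes:
        return bytes(self._buf)


log = RingLog(capacity=16)
for i in range(5):
    log.write(b"line%d\n" % i)

# Only the most recent 16 bytes survive.
print(log.contents())
```

With such a buffer the tail of the log, which is what one wants to see
after a lockup, would always be intact, at the cost of losing the
oldest output.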
>>> Why do you think this has something to do with pv_ops lockups?
>>
>> That's just the only trace I can start with. On kernel lockups, I
>> usually look for clues in the console output. Now I found it garbled.
>> Either it is normal and I should look elsewhere, or it is a buffer
>> handling bug, possibly overwriting some memory and causing havoc
>> later. I know that's a long shot... But even SysRq didn't work, so I
>> have nothing more to work with. I'm looking at tools/console/daemon/io.c
>> now.
>
> It does sound like a long shot! Did you expect those VMs to be producing a
> lot of console output?
They definitely filled up their 1 MB buffers in a couple of days.
Usually they produce about 100 characters per minute, possibly with
some peaks if they get scanned, but I wouldn't call this "a lot".
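
As a back-of-the-envelope check of those numbers (assuming one byte per
character and that "1 MB" means 2**20 bytes; both are my assumptions):

```python
BUFFER_BYTES = 1024 * 1024  # assumed meaning of "1 MB"
RATE_PER_MIN = 100          # steady-state characters per minute

minutes = BUFFER_BYTES / RATE_PER_MIN
days = minutes / (60 * 24)
print(f"{days:.1f} days to fill the buffer at the steady rate")
```

So the steady rate alone would need about a week to fill the buffer;
if it fills within a couple of days, the peaks must contribute a fair
share of the output.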
Now your input reminded me of something else.
Dom0s show considerable latency. They run heartbeat to form an HA pair,
and heartbeat complains regularly. To quote a recent extreme value:
WARN: Gmain_timeout_dispatch: Dispatch function for send local status
took too long to execute: 320 ms (> 50 ms) (GSource: 0x811b930)
And heartbeat even runs on locked pages! So other processes probably
experience longer delays, which could cause problems if Xen is
sensitive to such things (but there's no significant swapping on the
dom0s). By the way, the above line was logged when 3 VMs were locked up
(constantly running from the dom0 POV, doing nothing otherwise).
Also, dom0 loses characters on its serial ports (console and heartbeat
medium). This is probably a manifestation of this latency, isn't it?
Anyway, I'll look at this some more tomorrow. I wonder why
xenconsoled on a machine running lots of VMs isn't much bigger (as VSZ
in ps output) than on another running nothing but dom0. Isn't this
1 MB/VM allocated by xenconsoled?
--
Thanks,
Feri.
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel