On 15.3.2006 13:32, Keir Fraser wrote:
>
> On 14 Mar 2006, at 18:05, Tomas Kopal wrote:
>
>> Well, in my case, I traced the problem down to a buggy chipset. The
>> VIA686a PIT timer randomly loses its programming and needs to be
>> reset. The Linux kernel has a workaround for this, but it does not get
>> used when Xen comes into play, as the hypervisor takes over control of
>> the PIT.
>> I have implemented a similar workaround in the Xen hypervisor. So far I
>> have been running it for about three weeks and the server is perfectly
>> stable.
>>
>> I am interested in your comments, and I would be happy if you could
>> apply this patch to xen sources.
>
> Do you have any details on what mode the timer enters when it loses its
> programming, whether this affects all PIT channels, etc?
Well, there is not much info on this. There is no official VIA info,
only speculation. Most of what I found is on LKML. The best summaries
I found are here:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0111.0/1613.html
and
http://www.uwsg.iu.edu/hypermail/linux/kernel/0205.3/1068.html
One of the initial problem descriptions:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0205.2/1405.html
It seems to affect only one channel AFAIK, but not always the same one
(the Linux kernel uses channel 0, Xen uses channel 2, and the problem is
the same for both). It probably does not affect all channels together, as
a bug on channel 1 (traditionally the DRAM refresh timer) could be quite
disastrous for the memory contents.
Similar problems may exist in other chipsets too:
http://support.microsoft.com/default.aspx?scid=kb;en-us;Q274323
http://support.microsoft.com/default.aspx?scid=kb;en-us;Q266344
So having a bit more "robust" PIT handling should generally help.
> The patch is
> potentially okay -- it differs from Linux in that we free-run channel 2
> (we don't periodically and automatically re-latch) and so the Linux test
> for count > latch does not work. The test you use (diff > 2*latch) is
> kind of weird, even if it does seem to work for you: I wonder what kind
> of mode it enters where readings make it look like it is running at
> three times normal speed?
I think the mode does not change, just the current value in the timer.
My explanation is that the timer sometimes (probably when the system is
under heavy load, like during domU shutdown) makes a "random jump",
probably by resetting the current counter value to some other, random
one, and then continues counting. If this happens during a calibration
call, the calibrated values are completely off, and the system time
starts to run away because it uses invalid calibration data.
Together with xntpd it can get even messier. (Just for the record, I
tried turning xntpd off in dom0, but the problem remained.)
But this is not backed up by any real evidence, so take it with heaps of
salt :-).
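To illustrate why a single jump would matter so much, here is a minimal
sketch in C. It is not Xen's calibration code: the function name
cycles_per_pit_tick is made up, and it assumes that calibration boils
down to dividing a TSC delta by the PIT delta seen over the same period.

```c
#include <stdint.h>

/*
 * Hypothetical calibration step (an assumption, not the real Xen code):
 * estimate how many TSC cycles pass per PIT count, from the deltas
 * measured over one calibration period.
 */
static uint64_t cycles_per_pit_tick(uint64_t tsc_delta, uint32_t pit_delta)
{
    if (pit_delta == 0)
        return 0;   /* no PIT progress seen; the caller should retry */
    return tsc_delta / pit_delta;
}
```

If the counter jumps during the period, pit_delta contains bogus counts
and the estimate shrinks accordingly. With a period of about one latch,
a jump of a few tens of thousands of counts makes the result several
times too small, and everything derived from it is wrong until the next
calibration.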
The test for diff > 2*latch is a bit of a heuristic :-). You are right
that this differs from Linux: Xen does not reset the counter to latch
but lets it free-run. But the diff between subsequent readings should
always be near the latch value, as the readings are driven by channel 0,
which is set to interrupt every latch counts.
I was printing out the real diff values (tracking min and max over
periods of time) and they varied by about 40% around the latch value. I
didn't want to get too many false positives, so I set the threshold to
double the expected value. As the problematic values tend to be quite
high, I think this is a safe threshold.
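To make the heuristic concrete, here is a minimal sketch in C of the
check described above. It is not the code from the patch: the names
pit_ticks_elapsed and pit_reading_suspect are made up for illustration,
and it assumes channel 2 is a free-running 16-bit down counter that is
read once per channel 0 interrupt.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Counts elapsed between two reads of a free-running 16-bit down
 * counter. The counter counts down, so elapsed = prev - now, and the
 * cast makes the subtraction wrap modulo 2^16.
 */
static uint16_t pit_ticks_elapsed(uint16_t prev, uint16_t now)
{
    return (uint16_t)(prev - now);
}

/*
 * Hypothetical form of the heuristic: a reading is suspect when the
 * elapsed count since the previous tick is more than twice the latch.
 */
static bool pit_reading_suspect(uint16_t prev, uint16_t now, uint16_t latch)
{
    uint32_t diff = pit_ticks_elapsed(prev, now);

    return diff > 2u * latch;
}
```

Assuming a latch of 11932 (a 1193180 Hz PIT clock and HZ=100), a diff
40% above the latch still passes, while anything above 23864 counts is
flagged and the channel can be reprogrammed.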
>
> Also, although you detect and fix up channel 2 problems, all that code
> is driven off the channel 0 timer interrupt handler. What happens if ch0
> loses its programming? :-)
Don't know. Either it does not lose it, or the effect of losing it is
not that obvious. Do you know any easy way to detect this (i.e. to
detect missing or late interrupts)? We can't use channel 2, as we can't
trust it. Maybe we can use the TSC? As I said, I expect the timer to
continue counting, so if I am right, the only problem it can cause is
that the timer interrupt comes a bit later. Apart from timekeeping, this
should not be a big deal, or is it?
Thinking about this now, the cause may even be that the counter problem
is in channel 0 only. Then the timer interrupt would come a lot later
and the difference in the channel 2 values could overflow to negative
values?
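A rough sketch in C of what the TSC idea could look like. This is only
an illustration, not part of the patch: tick_was_late is a made-up name,
the TSC values are passed in by the caller, and cycles_per_tick is
assumed to come from a calibration we can trust, which is exactly the
hard part.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check for the channel 0 interrupt handler: compare the
 * TSC cycles elapsed since the previous tick with the expected number
 * of cycles per tick. The factor of two mirrors the channel 2 test.
 */
static bool tick_was_late(uint64_t prev_tsc, uint64_t now_tsc,
                          uint64_t cycles_per_tick)
{
    uint64_t elapsed = now_tsc - prev_tsc;   /* unsigned, wraps safely */

    return elapsed > 2 * cycles_per_tick;
}
```

A late tick detected this way would also tell us that the channel 2
diff for that tick cannot be trusted, since more than a full 16-bit
counter period may have passed in between.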
>
> Really I want to understand this problem rather better before committing
> a patch for a six-year-old chipset.
>
> -- Keir
Yes, the chipset is quite old. We were already thinking about replacing
it, but after this fix it will probably have to serve a bit longer :-).
I share your desire to understand the problem, but I still don't
understand it, and it seems that the people on LKML didn't completely
understand it either. And according to the Microsoft knowledge base
articles, it may be quite widespread, even on newer chipsets...
Feel free to make it a compile-time option, or just move it to contrib.
But if it can save anyone the trouble I had to go through, it would
definitely be beneficial to have it in mainline, especially as it does
not add any penalty on fault-free systems.
Thanks a lot
Tomas