On 11/09/2011 05:35 AM, Jan Beulich wrote:
>>>> On 18.10.11 at 22:42, Laszlo Ersek <lersek@xxxxxxxxxx> wrote:
>> ... because the "clock_event_device framework" already accounts for idle
>> time through the "event_handler" function pointer in
>> xen_timer_interrupt().
>>
>> The patch is intended as the completion of [1]. It should fix the double
>> idle times seen in PV guests' /proc/stat [2]. It should be orthogonal to
>> stolen time accounting (the removed code seems to be isolated).
> After some more looking around I still think it's incorrect, albeit for
> a different reason: What tick_nohz_restart_sched_tick() accounts
> as idle time is *all* time that passed while in cpu_idle(). What gets
> accounted in do_stolen_accounting() (without your patch) is
> different:
> - time the vCPU was in RUNSTATE_blocked gets accounted as idle
> - time the vCPU was in RUNSTATE_runnable and RUNSTATE_offline
> gets accounted as stolen.
>
> That is, on an overcommitted system (and without your patch) I
> would expect you to not see the (full) double idle increment for a not
> fully idle and not fully loaded vCPU.
>
> Now we can of course say that for most purposes it's fine to
> ignore the difference between idle and steal time, but there must
> be a reason these have been getting accounted separately in Linux
> without regard to Xen.
>
> So I really think that what needs to be suppressed to address this
> is tick_nohz_restart_sched_tick()'s call to account_idle_ticks(). But
> simply forcing CONFIG_VIRT_CPU_ACCOUNTING is not an immediate
> option (not even when considering a Xen-only kernel), as that has
> further implications. Instead I wonder whether under Xen we should
> e.g. update per_cpu(tick_cpu_sched, cpu).idle_jiffies prior to calling
> tick_nohz_restart_sched_tick() (possibly implicitly by doing the
> update in do_stolen_accounting(), though that wouldn't work when
> the wakeup is due to an interrupt other than the timer one).
>
> The alternative of accounting the negative of the steal time as
> idle time in do_stolen_accounting() doesn't look correct either,
> since not all stolen time was necessarily accounted as idle (some
> would have got - also incorrectly - accounted as process or system
> time).
>
> So my conclusion is that only the full implementation of what
> CONFIG_VIRT_CPU_ACCOUNTING expects would actually get
> things sorted out properly (but that would imply Xen *and* native
> are willing to pay the performance price, or the enabling of this
> can be made dynamic so that at least native could remain
> unaffected).
>
> Or the negative stolen time should be accounted to the current
> process, the system, or as idle, depending on the context which
> do_stolen_accounting() runs in (just like timer_interrupt() did in
> the XenoLinux kernels for positive accounting).
I was always suspicious of the timer-interrupt-based stolen time
accounting code. It's really a hold-over from when ticks were the main
timekeeping mechanism, but with highres timers its massive overkill and
probably a source of performance degradation.
Overall, the stolen time accounting isn't really very important. The
kernel doesn't use it at all internally - it doesn't affect scheduling
decisions, for example. It's only exported in /proc/stat, and some
tools like top will display it if its non-zero. It could be useful for
diagnosing performance problems on a heavily loaded host system, but
other tools like "xentop" would give you as much information.
So really, all this code is nice to have, but I'm not sure its worth a
lot of time to fix, especially if it makes idle accounting hard (which
*is* important).
However, that said, PeterZ recently added some code to the scheduler so
that time "stolen" by interrupt handling isn't accounted against the
task running at the time, and the design is specifically intended to
also be useful for virtualization stolen time as well. There are some
KVM patches floating around to implement this, and we should look at
doing a Xen implementation. That would be much more practically useful,
since (I think) it allows the scheduler to be aware of stolen time and
not penalize tasks for time they never got to use. Or at the very least
it could give you per-task stolen stats (maybe?) which would be useful.
J
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
|