[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v4 04/30] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration



On Mon, 2026-05-18 at 00:52 -0700, Dongli Zhang wrote:
> On 5/9/26 3:46 PM, David Woodhouse wrote:

Huh, I didn't write that then; it isn't September yet. Did you mean
2026-05-09? We aren't all in the US... 

Strictly speaking, you just misattributed a quote of mine, which is
very poor form :)

What mailer are you using? Can it be fixed?

> > From: Jack Allister <jalliste@xxxxxxxxxx>
> > 
> > Where kvm->arch.use_master_clock is false (because the host TSC is
> > unreliable, or the guest TSCs are configured strangely), the KVM clock
> > is *not* defined as a function of the guest TSC so KVM_GET_CLOCK_GUEST
> > returns an error. In this case, as documented, userspace shall use the
> > legacy KVM_GET_CLOCK ioctl. The loss of precision is acceptable in this
> 
> The description here confused me a little. It sounds like userspace should 
> call
> KVM_SET_CLOCK if KVM_SET_CLOCK_GUEST fails. However, I assume it actually 
> means
> that userspace should do nothing extra if KVM_SET_CLOCK_GUEST fails, and 
> simply
> rely on the prior KVM_SET_CLOCK and KVM_VCPU_TSC_OFFSET workflow described in
> patch 07. Is that correct?

Yes. If KVM_SET_CLOCK_GUEST doesn't work (which might be because
KVM_GET_CLOCK_GUEST didn't work so userspace doesn't have the data in
the first place, or because the actual ioctl returns failure), then
userspace should rely on the old method using KVM_SET_CLOCK imprecisely
instead. That includes on a migration from an older kernel that *lacks*
KVM_GET_CLOCK_GUEST, of course.

I don't think it strictly matters whether userspace does KVM_SET_CLOCK
first, then *tries* KVM_SET_CLOCK_GUEST, or whether it tries
KVM_SET_CLOCK_GUEST and then only calls KVM_SET_CLOCK on failure? I'd
probably be inclined not to use KVM_SET_CLOCK at all unless it is known
to be needed?

> > +4.145 KVM_GET_CLOCK_GUEST
> > +----------------------------
> > +
> > +:Capability: none
> > +:Architectures: x86_64
> > +:Type: vcpu ioctl
> > +:Parameters: struct pvclock_vcpu_time_info (out)
> > +:Returns: 0 on success, <0 on error
> > +
> > +Retrieves the current time information structure used for KVM/PV clocks,
> > +in precisely the form advertised to the guest vCPU, which gives parameters
> > +for a direct conversion from a guest TSC value to nanoseconds.
> > +
> > +When the KVM clock is not in "master clock" mode, for example because the
> > +host TSC is unreliable or the guest TSCs are oddly configured, the KVM 
> > clock
> > +is actually defined by the host CLOCK_MONOTONIC_RAW instead of the guest 
> > TSC.
> > +In this case, the KVM_GET_CLOCK_GUEST ioctl returns -EINVAL.
> > +
> > +4.146 KVM_SET_CLOCK_GUEST
> > +----------------------------
> > +
> > +:Capability: none
> 
> Do we need a KVM_CHECK_EXTENSION capability for this? If userspace wants to
> support the new API, should it detect availability via KVM_CHECK_EXTENSION, or
> simply try the ioctl and handle failure?

That might be conventional, I suppose. But I suspect Jack's thinking
was that userspace is going to have to *try* it anyway, and still might
have to fall back to what KVM_SET_CLOCK can manage, so userspace
probably wouldn't even bother to check that capability; it doesn't
matter.

Since then, we've added some more attributes in this series though, and
it probably is worth adding a cap which advertises them *all*?
Something like KVM_CAP_CLOCK_PRECISION_API?

> > +#ifdef CONFIG_X86_64
> > +static int kvm_vcpu_ioctl_get_clock_guest(struct kvm_vcpu *v, void __user 
> > *argp)
> > +{
> > +   struct pvclock_vcpu_time_info hv_clock = {};
> > +   struct kvm_vcpu_arch *vcpu = &v->arch;
> > +   struct kvm_arch *ka = &v->kvm->arch;
> > +   unsigned int seq;
> > +
> > +   /*
> > +    * If KVM_REQ_CLOCK_UPDATE is already pending, or if the pvclock
> > +    * has never been generated at all, call kvm_guest_time_update().
> > +    */
> > +   if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, v) || !vcpu->hw_tsc_hz) {
> 
> This was flagged by AI, and I am still checking whether it is a real issue.
> 
> What happens if KVM_REQ_MASTERCLOCK_UPDATE and KVM_REQ_CLOCK_UPDATE are both
> pending?
> 
> From my perspective, I am also curious how we should reason about this in 
> other
> scenarios in the future. Specifically, when do we need to process
> KVM_REQ_MASTERCLOCK_UPDATE before KVM_REQ_CLOCK_UPDATE, and when is it
> acceptable not to? I noticed that kvm_cpuid() already processes only
> KVM_REQ_CLOCK_UPDATE.

The way I've been thinking about it — and I'm only two cups of coffee
into Monday so take those words literally and don't think of them as
British understatement of something I believe is absolute truth — is
that MASTERCLOCK_UPDATE is updating the actual clock for the whole VM,
while CLOCK_UPDATE is about *putting* that information into the per-
vCPU pvclock structures.

So after a MASTERCLOCK_UPDATE, we need to do a CLOCK_UPDATE on all
vCPUs to disseminate the result. Which means that if CLOCK_UPDATE is
already pending before a MASTERCLOCK_UPDATE, it's probably redundant
and might as well be cleared because it's only going to get set *again*
in kvm_end_pvclock_update()? 


> > +   /*
> > +    * Calculate the guest TSC at the new reference point, and the
> > +    * corresponding KVM clock value according to user_hv_clock.
> > +    * Adjust kvmclock_offset so both definitions agree.
> > +    */
> > +   guest_tsc = kvm_read_l1_tsc(v, ka->master_cycle_now);
> > +   user_clk_ns = __pvclock_read_cycles(&user_hv_clock, guest_tsc);
> > +   ka->kvmclock_offset = user_clk_ns - ka->master_kernel_ns;
> 
> I used to explore adjusting ka->kvmclock_offset in KVM_SET_CLOCK based on the
> old hv_clock and the new hv_clock long time ago. At that time, my concern was
> what would happen if userspace provided bogus values. Theoretically, this is
> possible with any ioctl. My concern may be unnecessary.
> 
> Would it be helpful to validate that the delta is within a reasonable range,
> e.g. that the drift can never be more than five minutes (forward or backward)?

Setting confidential guests aside, which have their own way of trusting
the TSC and should never even *consider* using kvmclock, surely this is
supposed to be *entirely* under the control of the VMM? The kernel has
no business deciding what is 'bogus'?

If a guest has been running for months on a previous host and is
migrated to a new host, don't we expect that the KVM clock of the new
VM on the new host is tweaked from its default near-zero after
creation, to some large amount?

Attachment: smime.p7s
Description: S/MIME cryptographic signature


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.