Xen project Mailing List

Re: [PATCH v4 04/30] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration

To: David Woodhouse <dwmw2@xxxxxxxxxxxxx>, kvm@xxxxxxxxxxxxxxx

From: Dongli Zhang <dongli.zhang@xxxxxxxxxx>

Date: Mon, 18 May 2026 17:57:40 -0700

Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=oracle.com; dmarc=pass action=none header.from=oracle.com; dkim=pass header.d=oracle.com; arc=none

Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector10001; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=u+PFvMDHJ33oo7sGA3cLlGmVs2NUml79uY9VX5VPNXc=; b=aM3dfNqEaF+2GaLRInOI9LI2lhsvA6DtHoR24cKSCgMiZDZHjdecPJJY/lu2a80+h7HxH5gy/I9KLx/rN9N5/zbk1dHh4s7FwIuRLaJRx03uR9+WAABvTOWbFp0f5raRmWIJ/NKPASmFEoPUOQ9MKRKOEWplFPIbXbeqwo08dFyRjjsp6OoW7hs9y6TEjh7wqENcqc5TaoOPHXKX9ek0v25KwLwHN677MxId0xImJlB7TaCSBuQa4lYS+S9M3o/5WSLver3HQ0+S0vf2HKxLjqITGzn6JOggloXRNLQavyNFHhhAQuw1PjPEzELr18RUTSapqdWMaqedVTUsdoyoJg==

Arc-seal: i=1; a=rsa-sha256; s=arcselector10001; d=microsoft.com; cv=none; b=tYvxjLHz0pT2q1GlB5Iur0+SSs3mOds/bcF8DL80LTKP4KkQ3Zv3WO7riHP80ivS7aX8LmGTi8Gdb0OMUsz5ry85HbNFGwMICws76wD+zSDNpQ9LfXuPaOzWWvlZ5V4tjgmWB4AoRNwqL8MtvQyu/cRNGffOXy47UlUt5vfEo75nS5YcEyfJn9xMxjd4WD0KpbrfJZm+StCiEpTSyd8dv6b5WmU0q8CYyBCOJop3M2baLAItyB/W3tI9Kskv0cFUUW0yAV20ysnZY99D7AUw9++8HrP+TUNzoUHpiZS6tU6igtGGKsCJJ4cwfeSeQ2Fd/4KhTs0lhMu/nLNSjpkXAg==

Authentication-results: eu.smtp.expurgate.cloud; dkim=pass header.s=corp-2025-04-25 header.d=oracle.com header.i="@oracle.com" header.h="Cc:Content-Transfer-Encoding:Content-Type:Date:From:In-Reply-To:Message-ID:MIME-Version:References:Subject:To"; dkim=pass header.s=selector2-oracle-onmicrosoft-com header.d=oracle.onmicrosoft.com header.i="@oracle.onmicrosoft.com" header.h="From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck"

Cc: Paolo Bonzini <pbonzini@xxxxxxxxxx>, Jonathan Corbet <corbet@xxxxxxx>, Shuah Khan <skhan@xxxxxxxxxxxxxxxxxxx>, Thomas Gleixner <tglx@xxxxxxxxxx>, Sean Christopherson <seanjc@xxxxxxxxxx>, Ingo Molnar <mingo@xxxxxxxxxx>, Borislav Petkov <bp@xxxxxxxxx>, "H. Peter Anvin" <hpa@xxxxxxxxx>, Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>, Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>, x86@xxxxxxxxxx, Marc Zyngier <maz@xxxxxxxxxx>, Juergen Gross <jgross@xxxxxxxx>, Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>, Paul Durrant <paul@xxxxxxx>, Jonathan Cameron <jic23@xxxxxxxxxx>, Sascha Bischoff <Sascha.Bischoff@xxxxxxx>, Jack Allister <jalliste@xxxxxxxxxx>, Joey Gouly <joey.gouly@xxxxxxx>, joe.jin@xxxxxxxxxx, linux-doc@xxxxxxxxxxxxxxx, linux-kernel@xxxxxxxxxxxxxxx, xen-devel@xxxxxxxxxxxxxxxxxxxx, linux-kselftest@xxxxxxxxxxxxxxx

Delivery-date: Tue, 19 May 2026 00:58:33 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 2026-05-18 1:48 AM, David Woodhouse wrote: > On Mon, 2026-05-18 at 00:52 -0700, Dongli Zhang wrote: >> On 5/9/26 3:46 PM, David Woodhouse wrote: > > Huh, I didn't write that then; it isn't September yet. Did you mean > 2026-05-09? We aren't all in the US... > > Strictly speaking, you just misattributed a quote of mine, which is > very poor form :) > > What mailer are you using? Can it be fixed? Thunderbird. I have fixed the Thunderbird configuration. Does it look better to you? > >>> From: Jack Allister <jalliste@xxxxxxxxxx> >>> >>> Where kvm->arch.use_master_clock is false (because the host TSC is >>> unreliable, or the guest TSCs are configured strangely), the KVM clock >>> is *not* defined as a function of the guest TSC so KVM_GET_CLOCK_GUEST >>> returns an error. In this case, as documented, userspace shall use the >>> legacy KVM_GET_CLOCK ioctl. The loss of precision is acceptable in this >> >> The description here confused me a little. It sounds like userspace should >> call >> KVM_SET_CLOCK if KVM_SET_CLOCK_GUEST fails. However, I assume it actually >> means >> that userspace should do nothing extra if KVM_SET_CLOCK_GUEST fails, and >> simply >> rely on the prior KVM_SET_CLOCK and KVM_VCPU_TSC_OFFSET workflow described in >> patch 07. Is that correct? > > Yes. If KVM_SET_CLOCK_GUEST doesn't work (which might be because > KVM_GET_CLOCK_GUEST didn't work so userspace doesn't have the data in > the first place, or because the actual ioctl returns failure), then > userspace should rely on the old method using KVM_SET_CLOCK imprecisely > instead. That includes on a migration from an older kernel that *lacks* > KVM_GET_CLOCK_GUEST, of course. > > I don't think it strictly matters whether userspace does KVM_SET_CLOCK > first, then *tries* KVM_SET_CLOCK_GUEST, or whether it tries > KVM_SET_CLOCK_GUEST and then only calls KVM_SET_CLOCK on failure? I'd > probably be inclined not to use KVM_SET_CLOCK at all unless it is known > to be needed? I really appreciate guidelines like the ones below. https://lore.kernel.org/all/20240522001817.619072-8-dwmw2@xxxxxxxxxxxxx Assuming I am a user of the new API, I feel confused about whether the goal is to replace KVM_SET_CLOCK with KVM_SET_CLOCK_GUEST, or whether the latter is meant to supplement the former. If we are going to use KVM_SET_CLOCK_GUEST when KVM_SET_CLOCK is not needed, I would appreciate it if the API could carry more data in addition to struct pvclock_vcpu_time_info. +#define KVM_SET_CLOCK_GUEST _IOW(KVMIO, 0xd6, struct pvclock_vcpu_time_info) +#define KVM_GET_CLOCK_GUEST _IOR(KVMIO, 0xd7, struct pvclock_vcpu_time_info) In the future, if we need to carry additional data, we could simply reuse the padding fields instead of introducing another KVM_SET_CLOCK_GUEST2. The following is an example of how additional data could be carried. KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c68dc1b577eabd5605c6c7c08f3e07ae18d30d5d So far, I believe this guideline resolves most of my concerns. https://lore.kernel.org/all/20240522001817.619072-8-dwmw2@xxxxxxxxxxxxx > >>> +4.145 KVM_GET_CLOCK_GUEST >>> +---------------------------- >>> + >>> +:Capability: none >>> +:Architectures: x86_64 >>> +:Type: vcpu ioctl >>> +:Parameters: struct pvclock_vcpu_time_info (out) >>> +:Returns: 0 on success, <0 on error >>> + >>> +Retrieves the current time information structure used for KVM/PV clocks, >>> +in precisely the form advertised to the guest vCPU, which gives parameters >>> +for a direct conversion from a guest TSC value to nanoseconds. >>> + >>> +When the KVM clock is not in "master clock" mode, for example because the >>> +host TSC is unreliable or the guest TSCs are oddly configured, the KVM >>> clock >>> +is actually defined by the host CLOCK_MONOTONIC_RAW instead of the guest >>> TSC. >>> +In this case, the KVM_GET_CLOCK_GUEST ioctl returns -EINVAL. >>> + >>> +4.146 KVM_SET_CLOCK_GUEST >>> +---------------------------- >>> + >>> +:Capability: none >> >> Do we need a KVM_CHECK_EXTENSION capability for this? If userspace wants to >> support the new API, should it detect availability via KVM_CHECK_EXTENSION, >> or >> simply try the ioctl and handle failure? > > That might be conventional, I suppose. But I suspect Jack's thinking > was that userspace is going to have to *try* it anyway, and still might > have to fall back to what KVM_SET_CLOCK can manage, so userspace > probably wouldn't even bother to check that capability; it doesn't > matter. > > Since then, we've added some more attributes in this series though, and > it probably is worth adding a cap which advertises them *all*? > Something like KVM_CAP_CLOCK_PRECISION_API? >From an API user's perspective, userspace may need to distinguish between an >API failure and the API not being available. I don't see any existing "Capability: none" entries in Documentation/virt/kvm/api.rst. > >>> +#ifdef CONFIG_X86_64 >>> +static int kvm_vcpu_ioctl_get_clock_guest(struct kvm_vcpu *v, void __user >>> *argp) >>> +{ >>> + struct pvclock_vcpu_time_info hv_clock = {}; >>> + struct kvm_vcpu_arch *vcpu = &v->arch; >>> + struct kvm_arch *ka = &v->kvm->arch; >>> + unsigned int seq; >>> + >>> + /* >>> + * If KVM_REQ_CLOCK_UPDATE is already pending, or if the pvclock >>> + * has never been generated at all, call kvm_guest_time_update(). >>> + */ >>> + if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, v) || !vcpu->hw_tsc_hz) { >> >> This was flagged by AI, and I am still checking whether it is a real issue. >> >> What happens if KVM_REQ_MASTERCLOCK_UPDATE and KVM_REQ_CLOCK_UPDATE are both >> pending? >> >> From my perspective, I am also curious how we should reason about this in >> other >> scenarios in the future. Specifically, when do we need to process >> KVM_REQ_MASTERCLOCK_UPDATE before KVM_REQ_CLOCK_UPDATE, and when is it >> acceptable not to? I noticed that kvm_cpuid() already processes only >> KVM_REQ_CLOCK_UPDATE. > > The way I've been thinking about it — and I'm only two cups of coffee > into Monday so take those words literally and don't think of them as > British understatement of something I believe is absolute truth — is > that MASTERCLOCK_UPDATE is updating the actual clock for the whole VM, > while CLOCK_UPDATE is about *putting* that information into the per- > vCPU pvclock structures. > > So after a MASTERCLOCK_UPDATE, we need to do a CLOCK_UPDATE on all > vCPUs to disseminate the result. Which means that if CLOCK_UPDATE is > already pending before a MASTERCLOCK_UPDATE, it's probably redundant > and might as well be cleared because it's only going to get set *again* > in kvm_end_pvclock_update()? Another scenario is when only MASTERCLOCK_UPDATE is pending and there is no pending CLOCK_UPDATE. In this scenario, is it fine to skip processing MASTERCLOCK_UPDATE before saving pvclock_vcpu_time_info? This should be a very rare scenario. Although it is not mandatory, I think most users call these APIs only when the VM is already stopped. I am just curious how I should handle this in the future if I am implementing similar code, that is, processing a pending request outside vcpu_enter_guest(). > > >>> + /* >>> + * Calculate the guest TSC at the new reference point, and the >>> + * corresponding KVM clock value according to user_hv_clock. >>> + * Adjust kvmclock_offset so both definitions agree. >>> + */ >>> + guest_tsc = kvm_read_l1_tsc(v, ka->master_cycle_now); >>> + user_clk_ns = __pvclock_read_cycles(&user_hv_clock, guest_tsc); >>> + ka->kvmclock_offset = user_clk_ns - ka->master_kernel_ns; >> >> I used to explore adjusting ka->kvmclock_offset in KVM_SET_CLOCK based on the >> old hv_clock and the new hv_clock long time ago. At that time, my concern was >> what would happen if userspace provided bogus values. Theoretically, this is >> possible with any ioctl. My concern may be unnecessary. >> >> Would it be helpful to validate that the delta is within a reasonable range, >> e.g. that the drift can never be more than five minutes (forward or >> backward)? > > Setting confidential guests aside, which have their own way of trusting > the TSC and should never even *consider* using kvmclock, surely this is > supposed to be *entirely* under the control of the VMM? The kernel has > no business deciding what is 'bogus'? Yes, I both think and agree that this is supposed to be entirely under the control of the VMM. Sometimes security researchers use fuzzing tools to interact with APIs in an attempt to leak data or crash the hypervisor in order to turn it into a CVE. My understanding is that, in the worst-case scenario here, the guest clock would simply get stuck. > > If a guest has been running for months on a previous host and is > migrated to a new host, don't we expect that the KVM clock of the new > VM on the new host is tweaked from its default near-zero after > creation, to some large amount? > Regarding live migration, my own investigation does not show a proportional relationship between VM uptime and the amount of drift. Just taking QEMU + KVM as an example: suppose TSC scaling is inactive, the amount of drift does not depend on how long the VM has been running before live migration. Instead, it depends on the delta between when we call MSR_IA32_TSC and KVM_GET_CLOCK, and between MSR_IA32_TSC and KVM_SET_CLOCK. The guest TSC stops at P1 and resumes at P3. The kvmclock stops at P2 and resumes at P4. We expect P1 == P2 and P3 == P4. On source host. - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=0 ===> P1 - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=1 - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=2 - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=3 - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=4 ... ... - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=N - KVM_GET_CLOCK ===> P2 On target host. - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=1 ===> P3 - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=2 - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=3 - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=4 - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=5 ... ... - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=N - KVM_SET_CLOCK ====> P4 Here is my equiation to predict the drift. T1_ns = P2 - P1 (nanoseconds) T2_tsc = P4 - P3 (cycles) T2_ns = pvclock_scale_delta(T2_tsc, old_hv_clock_src.tsc_to_system_mul, old_hv_clock_src.tsc_shift) if (T2_ns > T1_ns) backward drift: T2_ns - T1_ns else if (T1_ns > T2_ns) forward drift: T1_ns - T2_ns Theoretically, if P1 == P2 and P3 == P4, we won't encounter any kvm-clock drift. Thank you very much! Dongli Zhang

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.