
Re: [RFC PATCH v1 00/10] Xen flamegraph (hypervisor stacktrace profile) support



On Fri, Jul 25, 2025 at 11:26 PM Demi Marie Obenour
<demiobenour@xxxxxxxxx> wrote:
>
> On 7/25/25 11:06, Edwin Török wrote:
> > Caveats:
> >  * x86-only for now
> >  * only tested on AMD EPYC 8124P
> >  * Xen PMU support was broken to begin with on Xeon Silver 4514Y, so I
> >  wasn't able to test there ('perf top' fails to parse samples). I'll
> >  try to figure out what is wrong there separately
> >  * for now I edit the release config in xen.spec to enable frame
> >  pointers. Eventually it might be useful to have a 3rd build variant:
> >  release-fp. Or teach Xen to produce/parse ORC or SFrame formats without
> >  requiring frame pointers.
>
> That would definitely be nice.
>
> >  * perf produces raw hex addresses, and a python script is used to
> >  post-process it and obtain symbols. Eventually perf should be updated
> >  to do this processing itself (there was an old patch for Linux 3.12 by 
> > Borislav Petkov)
> >  * I've only tested capturing Dom0 stack traces. Linux doesn't support
> >   guest stacktraces yet (it can only look up the guest RIP)
>
> What would be needed to fix this?  Capturing guest stacktraces from the host
> or Xen seems like a really bad idea, but it might make sense to interrupt the
> guest and allow it to provide a (strictly validated) stack trace for use by
> the host.  This would need to be done asynchronously, as Linux is moving
> towards generating stack traces outside of the NMI handler.

The way perf captures stacktraces for userspace is that it either
walks the stack in the kernel by following frame pointers and copying
memory from userspace, or it takes a copy of the entire userspace
stack (up to a limit of ~64KiB) and lets the perf userspace tooling
reconstruct a stacktrace from that (for --call-graph=dwarf).
I'd expect copying from userspace to be a lot faster than copying
from a guest, because for a guest you'd also need to map the page
first, which is an additional cost (and you'd have to be careful not
to recurse infinitely if another interrupt arrives while mapping),
unless you keep the entire guest address space mapped, or keep a
cache of mapped stack pages.
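
To make the two modes concrete, here is a minimal sketch of how they
map onto the perf_event_open() syscall (a simplified illustration,
not code from this series; error handling is omitted, and the event
choice, period, register mask and stack size are arbitrary):
```
/* Sketch: the two user-stack capture modes of perf. */
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int perf_open(struct perf_event_attr *attr)
{
    /* profile the calling thread, on any CPU */
    return syscall(SYS_perf_event_open, attr, 0, -1, -1, 0);
}

int main(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_SOFTWARE;
    attr.config = PERF_COUNT_SW_CPU_CLOCK;
    attr.sample_period = 100000;

    /* --call-graph=fp: the kernel walks frame pointers at sample
     * time and emits the resulting callchain. */
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN;
    int fd_fp = perf_open(&attr);

    /* --call-graph=dwarf: the kernel copies a bounded chunk of the
     * raw user stack (8 KiB here; perf allows up to ~64 KiB) plus a
     * few registers, and 'perf report' unwinds it later in
     * userspace. */
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_REGS_USER |
                       PERF_SAMPLE_STACK_USER;
    /* BP|SP|IP on x86-64 (PERF_REG_X86_{BP,SP,IP}); arch-specific */
    attr.sample_regs_user = (1ULL << 6) | (1ULL << 7) | (1ULL << 8);
    attr.sample_stack_user = 8192;
    int fd_dwarf = perf_open(&attr);

    (void)fd_fp;
    (void)fd_dwarf;
    return 0;
}
```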

You can let a guest profile itself though, in which case it can
process its own stacktrace, but exposing Xen's stacktrace to untrusted
guests is probably not a good idea.

You could also try to do what I've done with Xen here: have the guest
provide the stacktrace to the hypervisor, which then provides it to
Dom0 (a hypothetical sketch of such an interface follows below). But
then you'd need to run some code inside the guest, and that may not
be possible if Xen is currently handling something on behalf of the
guest.
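
To make that concrete, here is a purely hypothetical sketch of the
record such a guest could fill in; none of these names exist in Xen
today:
```
/* Hypothetical sketch only -- no such interface exists in Xen today.
 * The interrupted guest would fill this in and hand it to Xen, which
 * would attach it to the sample delivered to the Dom0 profiler. */
#include <stdint.h>

#define PMU_GUEST_TRACE_MAX 64            /* arbitrary depth cap */

struct xen_pmu_guest_trace {
    uint32_t nr_frames;                   /* entries actually filled */
    uint32_t flags;                       /* e.g. "trace truncated" */
    /* Guest virtual RIPs; Xen and Dom0 must treat these as
     * untrusted, opaque data and never dereference them. */
    uint64_t frames[PMU_GUEST_TRACE_MAX];
};
```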

AFAICT KVM doesn't support this either, so I'd first wait to see
whether KVM implements it, and then implement something similar for
Xen.

>
> >  * the Linux patch will need to be forward ported to master before
> > submission
> >  * All the caveats for using regular VPMU apply, except for the lack of
> >   stacktraces, which is fixed here!

> What would be needed to fix these limitations?

See below for my answers to each one, although others on this mailing
list may be able to give more authoritative answers.

> >     * Dom0 must run hard pinned on all host CPUs

Not sure. I think Dom0 needs to be able to run some code whenever the
NMI arrives, and that code needs to run on the CPU the NMI arrived
on, unless you define a way for one CPU to also receive and process
interrupts on behalf of CPUs that Dom0 doesn't run on.
The pinning requirement could be lifted if everything were correctly
context switched.

> >     * Watchdog must be disabled

IIUC the Xen watchdog and the profiling interrupt both use NMIs, so
you can only have one of them active.
In fact, even on bare-metal Linux the NMI watchdog sometimes needs to
be disabled for certain perf counters to work, although basic
timer-based profiling and most counters work with the NMI watchdog
enabled. When needed, 'perf' prints a message telling you to disable
the Linux NMI watchdog, but if you follow those instructions
literally the host will panic and reboot 20 seconds later because the
soft lockup detector stops working properly (so that too would need
to be disabled; see the sketch below).
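
For completeness, a minimal sketch of turning both detectors off via
the usual Linux sysctl files (assuming the standard procfs paths;
must be run as root):
```
/* Sketch: disable the Linux NMI watchdog and the soft lockup
 * detector before profiling. */
#include <stdio.h>

static int write_sysctl(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");

    if (!f)
        return -1;
    fputs(value, f);
    return fclose(f);
}

int main(void)
{
    if (write_sysctl("/proc/sys/kernel/nmi_watchdog", "0") ||
        write_sysctl("/proc/sys/kernel/soft_watchdog", "0")) {
        perror("write_sysctl");
        return 1;
    }
    return 0;
}
```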

> >     * not security supported

See https://xenbits.xen.org/xsa/advisory-163.html (the advisory that
declared vPMU security-unsupported).

Also, even if you ignore security support, using vPMU on production
systems is probably not a good idea at the moment: there are probably
lots of pre-existing bugs to fix, and the bugs might be
micro-architecture specific.
E.g. with vPMU enabled, running 'perf stat -ddd' in Dom0 caused one
of my (older) hosts to freeze completely (all vCPUs except one stuck
in a spinlock, and the last one not running anywhere), whereas it ran
perfectly fine on other (newer) hosts. I haven't yet debugged what is
causing it (it could also be a bug in Linux, or in the Linux Xen PMU
driver, and not Xen).

There is a way to restrict which performance counters are exposed to
guests, and e.g. I think EC2 used to expose some of these.
Initially temperatures/turbo boost could be measured from guests, but
that got disabled following an XSA:
https://www.brendangregg.com/blog/2014-09-15/the-msrs-of-ec2.html
Later a restricted set of PMCs got exposed (vpmu=ipc, or vpmu=arch),
which then got enabled for EC2 guests (I don't know whether they
still expose these):
https://www.brendangregg.com/blog/2017-05-04/the-pmcs-of-ec2.html

If that is enabled, the stacktrace is already suitably restricted to
Dom0 only, so it should be safe to use: even if you can't use
`vpmu=on`, you might be able to use `vpmu=ipc` (see the example
below). Currently neither of these is security supported though.
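
For example, assuming a grub2-based Dom0 where the Xen command line
is assembled from /etc/default/grub (the exact file and variable
names vary by distro):
```
# /etc/default/grub -- append to the Xen command line, then
# regenerate grub.cfg (e.g. with update-grub)
GRUB_CMDLINE_XEN_DEFAULT="vpmu=ipc"
```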

> >     * x86 only

This one should be fixable: all it needs is a way to produce a
stacktrace, which should already be present in the arch-specific
traps.c (although AFAICT only x86 and ARM implement stack traces
currently). That of course assumes the other arches have a PMU
implementation to begin with.
AFAICT xenpmu_op is only implemented on x86:
```
/* Only wired up for x86 (xen/arch/x86/cpu/vpmu.c): */
#ifdef CONFIG_X86
long do_xenpmu_op(unsigned int op,
                  XEN_GUEST_HANDLE_PARAM(xen_pmu_params_t) arg);
#endif
```

> >     * secureboot needs to be disabled
>

This is because enabling vpmu requires modifying the Xen command
line, and that is restricted under secure boot.
If you enabled vpmu at build time instead, it might work, but see
above about the lack of security support.

>  With them it isn't really
> possible to do profiling on production systems, only on dedicated development
> boxes.

I'd like to be able to do profiling in production too, but I'm taking
it one step at a time: at least now I'll have a way to do profiling
on development/test boxes.

For production use, a different approach might be needed, e.g. LBR, or
a dedicated way to get just a hypervisor stacktrace on a timer,
without involving the (v)PMU at all.
That would require some new integration with `perf` too.

> That works great if you have a dev box and can create a realistic
> workload with non-sensitive data, but less great if you have a problem that
> you can't reproduce on a non-production system.  It's also not usable
> for real-time monitoring of production environments.

Best regards,
--Edwin

> --
> Sincerely,
> Demi Marie Obenour (she/her/hers)



 

