
Re: [RFC PATCH v1 00/10] Xen flamegraph (hypervisor stacktrace profile) support



On Fri, Jul 25, 2025 at 11:26 PM Demi Marie Obenour
<demiobenour@xxxxxxxxx> wrote:
>
> On 7/25/25 11:06, Edwin Török wrote:
> > Caveats:
> >  * x86-only for now
> >  * only tested on AMD EPYC 8124P
> >  * Xen PMU support was broken to begin with on Xeon Silver 4514Y, so I
> >  wasn't able to test there ('perf top' fails to parse samples). I'll
> >  try to figure out what is wrong there separately
> >  * for now I edit the release config in xen.spec to enable frame
> >  pointers. Eventually it might be useful to have a 3rd build variant:
> >  release-fp. Or teach Xen to produce/parse ORC or SFrame formats without
> >  requiring frame pointers.
>
> That would definitely be nice.
>
> >  * perf produces raw hex addresses, and a python script is used to
> >  post-process it and obtain symbols. Eventually perf should be updated
> >  to do this processing itself (there was an old patch for Linux 3.12 by 
> > Borislav Petkov)
> >  * I've only tested capturing Dom0 stack traces. Linux doesn't support
> >   guest stacktraces yet (it can only look up the guest RIP)
>
> What would be needed to fix this?  Capturing guest stacktraces from the host
> or Xen seems like a really bad idea, but it might make sense to interrupt the
> guest and allow it to provide a (strictly validated) stack trace for use by
> the host.  This would need to be done asynchronously, as Linux is moving
> towards generating stack traces outside of the NMI handler.

The way perf captures stacktraces for userspace is that it either
walks the stack in the kernel by following frame pointers and copying
memory from userspace, or it takes a copy of the entire userspace
stack (up to a limit of ~64KiB) and lets the perf userspace tooling
reconstruct a stacktrace from that (for --call-graph=dwarf).
I'd expect copying from userspace to be a lot faster than copying
from a guest, because for a guest you'd also need to map the page
first, which is an additional cost (and you'd have to be careful not
to recurse infinitely if another interrupt arrives while mapping),
unless you keep the entire guest address space mapped, or keep a
cache of mapped stack pages.
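
To make the two modes concrete, here is a minimal sketch of how they
map onto the perf_event_open() syscall (a simplified illustration,
not code from this series; error handling is omitted, and the event
choice, period, register mask and stack size are arbitrary):
```
/* Sketch: the two user-stack capture modes of perf. */
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int perf_open(struct perf_event_attr *attr)
{
    /* profile the calling thread, on any CPU */
    return syscall(SYS_perf_event_open, attr, 0, -1, -1, 0);
}

int main(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_SOFTWARE;
    attr.config = PERF_COUNT_SW_CPU_CLOCK;
    attr.sample_period = 100000;

    /* --call-graph=fp: the kernel walks frame pointers at sample
     * time and emits the resulting callchain. */
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN;
    int fd_fp = perf_open(&attr);

    /* --call-graph=dwarf: the kernel copies a bounded chunk of the
     * raw user stack (8 KiB here; perf allows up to ~64 KiB) plus a
     * few registers, and 'perf report' unwinds it later in
     * userspace. */
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_REGS_USER |
                       PERF_SAMPLE_STACK_USER;
    /* BP|SP|IP on x86-64 (PERF_REG_X86_{BP,SP,IP}); arch-specific */
    attr.sample_regs_user = (1ULL << 6) | (1ULL << 7) | (1ULL << 8);
    attr.sample_stack_user = 8192;
    int fd_dwarf = perf_open(&attr);

    (void)fd_fp;
    (void)fd_dwarf;
    return 0;
}
```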

You can let a guest profile itself though, in which case it can
process its own stacktrace, but exposing Xen's stacktrace to untrusted
guests is probably not a good idea.

You could also try to do what I've done with Xen here: have the guest
provide the stacktrace to the hypervisor, which then provides it to
Dom0 (a hypothetical sketch of such an interface follows below). But
then you'd need to run some code inside the guest, and that may not
be possible if Xen is currently handling something on behalf of the
guest.
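
To make that concrete, here is a purely hypothetical sketch of the
record such a guest could fill in; none of these names exist in Xen
today:
```
/* Hypothetical sketch only -- no such interface exists in Xen today.
 * The interrupted guest would fill this in and hand it to Xen, which
 * would attach it to the sample delivered to the Dom0 profiler. */
#include <stdint.h>

#define PMU_GUEST_TRACE_MAX 64            /* arbitrary depth cap */

struct xen_pmu_guest_trace {
    uint32_t nr_frames;                   /* entries actually filled */
    uint32_t flags;                       /* e.g. "trace truncated" */
    /* Guest virtual RIPs; Xen and Dom0 must treat these as
     * untrusted, opaque data and never dereference them. */
    uint64_t frames[PMU_GUEST_TRACE_MAX];
};
```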

AFAICT KVM doesn't support this either, so I'd first wait to see
whether KVM implements it, and then implement something similar for
Xen.

>
> >  * the Linux patch will need to be forward ported to master before
> > submission
> >  * All the caveats for using regular VPMU apply, except for the lack of
> >   stacktraces, which is fixed here!

> What would be needed to fix these limitations?

See below for my answers to each one, although others on this mailing
list may be able to give more authoritative answers.

> >     * Dom0 must run hard pinned on all host CPUs

Not sure. I think Dom0 needs to be able to run some code whenever the
NMI arrives, and that code needs to run on the CPU the NMI arrived
on, unless you define a way for one CPU to also receive and process
interrupts on behalf of CPUs that Dom0 doesn't run on.
The pinning requirement could be lifted if everything were correctly
context switched.

> >     * Watchdog must be disabled

IIUC the Xen watchdog and the profiling interrupt both use NMIs, so
you can only have one of them active.
In fact, even on bare-metal Linux the NMI watchdog sometimes needs to
be disabled for certain perf counters to work, although basic
timer-based profiling and most counters work with the NMI watchdog
enabled. When needed, 'perf' prints a message telling you to disable
the Linux NMI watchdog, but if you follow those instructions
literally the host will panic and reboot 20 seconds later because the
soft lockup detector stops working properly (so that too would need
to be disabled; see the sketch below).
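
For completeness, a minimal sketch of turning both detectors off via
the usual Linux sysctl files (assuming the standard procfs paths;
must be run as root):
```
/* Sketch: disable the Linux NMI watchdog and the soft lockup
 * detector before profiling. */
#include <stdio.h>

static int write_sysctl(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");

    if (!f)
        return -1;
    fputs(value, f);
    return fclose(f);
}

int main(void)
{
    if (write_sysctl("/proc/sys/kernel/nmi_watchdog", "0") ||
        write_sysctl("/proc/sys/kernel/soft_watchdog", "0")) {
        perror("write_sysctl");
        return 1;
    }
    return 0;
}
```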

> >     * not security supported

See https://xenbits.xen.org/xsa/advisory-163.html (the advisory that
declared vPMU security-unsupported).

Also, even if you ignore security support, using vPMU on production
systems is probably not a good idea at the moment: there are probably
lots of pre-existing bugs to fix, and the bugs might be
micro-architecture specific.
E.g. with vPMU enabled, running 'perf stat -ddd' in Dom0 caused one
of my (older) hosts to freeze completely (all vCPUs except one stuck
in a spinlock, and the last one not running anywhere), whereas it ran
perfectly fine on other (newer) hosts. I haven't yet debugged what is
causing it (it could also be a bug in Linux, or in the Linux Xen PMU
driver, and not Xen).

There is a way to restrict which performance counters are exposed to
guests, and e.g. I think EC2 used to expose some of these.
Initially temperatures/turbo boost could be measured from guests, but
that got disabled following an XSA:
https://www.brendangregg.com/blog/2014-09-15/the-msrs-of-ec2.html
Later a restricted set of PMCs got exposed (vpmu=ipc, or vpmu=arch),
which then got enabled for EC2 guests (I don't know whether they
still expose these):
https://www.brendangregg.com/blog/2017-05-04/the-pmcs-of-ec2.html

If that is enabled, the stacktrace is already suitably restricted to
Dom0 only, so it should be safe to use: even if you can't use
`vpmu=on`, you might be able to use `vpmu=ipc` (see the example
below). Currently neither of these is security supported though.
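
For example, assuming a grub2-based Dom0 where the Xen command line
is assembled from /etc/default/grub (the exact file and variable
names vary by distro):
```
# /etc/default/grub -- append to the Xen command line, then
# regenerate grub.cfg (e.g. with update-grub)
GRUB_CMDLINE_XEN_DEFAULT="vpmu=ipc"
```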

> >     * x86 only

This one should be fixable: all it needs is a way to produce a
stacktrace, which should already be present in the arch-specific
traps.c (although AFAICT only x86 and ARM implement stack traces
currently). That of course assumes the other arches have a PMU
implementation to begin with.
AFAICT xenpmu_op is only implemented on x86:
```
/* Only wired up for x86 (xen/arch/x86/cpu/vpmu.c): */
#ifdef CONFIG_X86
long do_xenpmu_op(unsigned int op,
                  XEN_GUEST_HANDLE_PARAM(xen_pmu_params_t) arg);
#endif
```

> >     * secureboot needs to be disabled
>

This is because enabling vpmu requires modifying the Xen command
line, and that is restricted under secure boot.
If you enabled vpmu at build time instead, it might work, but see
above about the lack of security support.

>  With them it isn't really
> possible to do profiling on production systems, only on dedicated development
> boxes.

I'd like to be able to do profiling in production too, but I'm taking
it one step at a time: at least now I'll have a way to do profiling
on development/test boxes.

For production use, a different approach might be needed, e.g. LBR, or
a dedicated way to get just a hypervisor stacktrace on a timer,
without involving the (v)PMU at all.
That would require some new integration with `perf` too.

> That works great if you have a dev box and can create a realistic
> workload with non-sensitive data, but less great if you have a problem that
> you can't reproduce on a non-production system.  It's also not usable
> for real-time monitoring of production environments.

Best regards,
--Edwin

> --
> Sincerely,
> Demi Marie Obenour (she/her/hers)



 

