|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH for-4.22] xen/x86: Change stub page allocation/free
On 11/06/2026 8:53 am, Roger Pau Monne wrote:
> From: Jason Andryuk <jason.andryuk@xxxxxxx>
>
> Today the inline tracking of the stub page is problematic. 0xcc is used to
> indicate unused, but it is also a "clear value." A !CONFIG_PV build with
> smt=0 will bring up CPU0, bring up CPU1, bring down CPU1, and free the
> in-use stub page. CPU0 or subsequent onlined CPUs can write to the re-used
> page.
I'm pretty sure a CONFIG_PV build booted on a FRED enabled system will
do the same.
This is the other case where we (now) forgo writing out the LSTAR/CSTAR
stubs.
>
> The new approach uses a global, CPU-indexed dynamically allocated array of
> stub addresses. However, to handle NUMA aware allocations, we cannot
> allocate all the memory in advance because of the NUMA dependency. Take
> advantage of the fact that Xen will attempt to contiguously pack CPUs on
> the same NUMA node (see normalise_cpu_order()), and on CPU bringup use the
> same stubs page the previous CPU did if suitable. Note the code would
> still function properly even if CPUs from NUMA nodes are not contiguously
> packed, it just consumes more memory.
>
> stub pages are no longer freed. They remain referenced in the global
> CPU-indexed array and are re-used if the CPU is re-onlined.
>
> stubs and node_stubs don't have an explicit lock. During boot they are
> accessed single threaded. During runtime, &cpu_add_remove_lock serializes
> access.
Is node_stubs stale? Stub(s) should be capitalised at the start of the
sentence. In context, it's not clear that it's a variable name, and it
doesn't need to be the literal variable name to convey the intended meaning.
>
> Fixes: 7a66ac8d1633 ("x86: move syscall trampolines off the stack")
> Signed-off-by: Jason Andryuk <jason.andryuk@xxxxxxx>
> Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
> ---
> There are other even more simple options here: for example Andrew proposed
> to pack stubs contiguously in both the physical and the linear address
> spaces, at the cost of possibly loosing the NUMA memory affinity between
> the allocated page and the CPU using it. We have decided to go for a more
> conservative approach here, that keeps the same properties as the current
> logic regarding NUMA memory affinity of the stub region.
Part of the suggestion was made in error, but there's a second aspect
which I'll discuss at the end of the email.
> diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
> index 4192edf635b6..cddf8806c877 100644
> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -2089,9 +2089,7 @@ void asmlinkage __init noreturn __start_xen(void)
>
> init_idle_domain();
>
> - this_cpu(stubs.addr) = alloc_stub_page(smp_processor_id(),
> - &this_cpu(stubs).mfn);
> - BUG_ON(!this_cpu(stubs.addr));
> + init_bsp_stub();
Personally, I'd name this init_stubs(). It does work for more than just
the BSP, and the bsp_* part is only really needed for clarity when
there's a matching ap_* variant, which is not the case here.
>
> bsp_traps_reinit(); /* Needs stubs allocated, must be before
> presmp_initcalls. */
>
> diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
> index d8fd71ffab37..3282392317f4 100644
> --- a/xen/arch/x86/smpboot.c
> +++ b/xen/arch/x86/smpboot.c
> @@ -641,41 +642,61 @@ static int do_boot_cpu(int apicid, int cpu)
> return rc;
> }
>
> -#define STUB_BUF_CPU_OFFS(cpu) (((cpu) & (STUBS_PER_PAGE - 1)) *
> STUB_BUF_SIZE)
> +/* Dynamically allocated, indexed by CPU. Store physical address of stubs.
> */
> +static paddr_t *__ro_after_init stubs;
>
> -unsigned long alloc_stub_page(unsigned int cpu, unsigned long *mfn)
> +static bool assign_stub_page(unsigned int cpu)
> {
> unsigned long stub_va;
> - struct page_info *pg;
> + paddr_t addr = stubs[cpu];
>
> - BUILD_BUG_ON(STUBS_PER_PAGE & (STUBS_PER_PAGE - 1));
> -
> - if ( *mfn )
> - pg = mfn_to_page(_mfn(*mfn));
> - else
> + if ( addr == INVALID_PADDR )
> {
> - nodeid_t node = cpu_to_node(cpu);
> - unsigned int memflags = node != NUMA_NO_NODE ? MEMF_node(node) : 0;
I think you need to retain this calculation of memflags.
MEMF_node(NUMA_NO_NODE) doesn't work as expected.
> + nodeid_t nid = cpu_to_node(cpu);
>
> - pg = alloc_domheap_page(NULL, memflags);
> - if ( !pg )
> - return 0;
> + /*
> + * Attempt to use the same page as the previous CPU if possible,
> + * otherwise allocate a new one.
> + */
> + if ( cpu && nid == cpu_to_node(cpu - 1) &&
> + PAGE_OFFSET(stubs[cpu - 1] + STUB_BUF_SIZE) )
> + addr = stubs[cpu - 1] + STUB_BUF_SIZE;
> + else
> + {
> + struct page_info *pg = alloc_domheap_page(NULL, MEMF_node(nid));
>
> - unmap_domain_page(memset(__map_domain_page(pg), 0xcc, PAGE_SIZE));
You've dropped this memset() of the whole page to 0xcc.
As a consequence, the stubs for not-yet-onlined CPUs, or for gaps
because of NUMA, are rubble yet mapped executably.
> + if ( !pg )
> + return false;
> + addr = page_to_maddr(pg);
> + }
> + stubs[cpu] = addr;
> }
>
> stub_va = XEN_VIRT_END - FIXADDR_X_SIZE - (cpu + 1) * PAGE_SIZE;
> - if ( map_pages_to_xen(stub_va, page_to_mfn(pg), 1,
> + if ( map_pages_to_xen(stub_va, maddr_to_mfn(addr), 1,
> PAGE_HYPERVISOR_RX | MAP_SMALL_PAGES) )
> - {
> - if ( !*mfn )
> - free_domheap_page(pg);
> - stub_va = 0;
> - }
> - else if ( !*mfn )
> - *mfn = mfn_x(page_to_mfn(pg));
> + return false;
>
> - return stub_va;
> + per_cpu(stubs.mfn, cpu) = PFN_DOWN(addr);
> + per_cpu(stubs.addr, cpu) = stub_va + PAGE_OFFSET(addr);
> + return true;
> +}
> +
> +void __init init_bsp_stub(void)
> +{
> + const unsigned int num_cpus = num_present_cpus();
> + unsigned int i;
> +
> + ASSERT(!stubs);
> + stubs = xvmalloc_array(typeof(*stubs), num_cpus);
> + if ( !stubs )
> + panic("Unable to allocate stub array");
> +
> + for ( i = 0; i < num_cpus; i++ )
> + stubs[i] = INVALID_PADDR;
> +
> + if ( !assign_stub_page(0) )
> + panic("Unable to initialize BSP stub region");
\n's for both panic messages.
With the above stuff addressed, I think this is looking ok, but
definitely subject to Jason confirming it resolves his issue. And for
4.22, that might even be sufficient to go in.
The other thing I want to discuss is this:
> stub_va = XEN_VIRT_END - FIXADDR_X_SIZE - (cpu + 1) * PAGE_SIZE;
because it creates 32 virtual aliases of every stub.
AIUI, this was a hard requirement for the old freeing scheme, but the
optimisation guides recommend against creating aliases like this.
Besides microarchitectural tracking/safety effects, one consequence is
that we end up with 31 aliases which have unsafe branches in them;
disp32's depend on the linear address the code is executed at.
The ideal solution would be allocate VAs just like we allocate paddrs,
and for the map_pages_to_xen() be beside the alloc_domheap_page(),
rather than outside of INVALID_PADDR check.
This reduces the amount of VA space (and L1 pagetables) used by 32
times, and removes the risk of accidentally using the wrong alias.
~Andrew
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |