|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH 5/5] RFC: pci: Migrate pci_mmcfg_{read,write} to pci.c
On 18/05/2026 4:21 pm, Teddy Astie wrote:
> Key parts of MMCFG access bits are in mmconfig_64.c (in particular
> pci_mmcfg_{read,write}()) while PCI configuration primitives (used accross the
> codebase) are in pci.c.
> This leads to situations where the compiler cannot optimize the `switch (len)`
> for MMCFG access for all pci_conf_{read,write}N(), because they are not from
> the same file.
>
> Move the pci_mmcfg_{read,write} in pci.c and hint the compiler to inline these
> functions such that it's more likely that the compiler eliminates the
> `switch (len)``.
>
> Also take the opportunity to migrate to pci_sbdf_t to reduce the parameter
> count
> and drop many parameter domains checks.
>
> On GCC 16.1, this leads to codegen where pci_conf_{read,write}N() doesn't call
> pci_mmcfg_{read,write}() anymore and directly perform the MMIO RW.
>
> <pci_conf_read32>:
> 55 push %rbp
> 48 89 e5 mov %rsp,%rbp
> 53 push %rbx
> 89 f8 mov %edi,%eax
> 89 f3 mov %esi,%ebx
> c1 ef 10 shr $0x10,%edi
> 81 fe ff 00 00 00 cmp $0xff,%esi
> 77 26 ja ffff82d040301fab <pci_conf_read32+0x3a>
> 85 ff test %edi,%edi
> 75 22 jne ffff82d040301fab <pci_conf_read32+0x3a>
> 0f b7 f8 movzwl %ax,%edi
> c1 e7 08 shl $0x8,%edi
> 83 e3 fc and $0xfffffffc,%ebx
> 09 df or %ebx,%edi
> 81 cf 00 00 00 80 or $0x80000000,%edi
> ba 04 00 00 00 mov $0x4,%edx
> be 00 00 00 00 mov $0x0,%esi
> e8 2a 1c 03 00 call ffff82d040333bd3 <pci_conf_read>
> eb 22 jmp ffff82d040301fcd <pci_conf_read32+0x5c>
> 81 fb ff 0f 00 00 cmp $0xfff,%ebx
> 77 24 ja ffff82d040301fd7 <pci_conf_read32+0x66>
> 0f b6 d0 movzbl %al,%edx
> 0f b6 f4 movzbl %ah,%esi
> 0f b7 ff movzwl %di,%edi
> e8 f5 fd ff ff call ffff82d040301db6 <pci_dev_base>
> 48 85 c0 test %rax,%rax
> 74 18 je ffff82d040301fde <pci_conf_read32+0x6d>
> 89 db mov %ebx,%ebx
> 48 01 d8 add %rbx,%rax
> 8b 00 mov (%rax),%eax
> 48 8b 5d f8 mov -0x8(%rbp),%rbx
> c9 leave
> e9 89 12 f0 ff jmp ffff82d040203260 <__x86_return_thunk>
> b8 ff ff ff ff mov $0xffffffff,%eax
> eb ef jmp ffff82d040301fcd <pci_conf_read32+0x5c>
> b8 ff ff ff ff mov $0xffffffff,%eax
> eb e8 jmp ffff82d040301fcd <pci_conf_read32+0x5c>
This is not the whole function because it's missing a pop %rbx. Also,
right at the bottom here are the -1's from bad error paths (discussed
later).
But, this should be after the ---. Disassembly this long isn't
interesting to stay in the commit message.
> diff --git a/xen/arch/x86/x86_64/mmconfig_64.c
> b/xen/arch/x86/x86_64/mmconfig_64.c
> index 940cf6d747..483dff9c2c 100644
> --- a/xen/arch/x86/x86_64/mmconfig_64.c
> +++ b/xen/arch/x86/x86_64/mmconfig_64.c
> @@ -133,6 +46,25 @@ static void __iomem *mcfg_ioremap(const struct
> acpi_mcfg_allocation *cfg,
> return (void __iomem *) virt;
> }
>
> +char __iomem *pci_mmcfg_base(unsigned int seg, unsigned int *bus)
> +{
> + struct acpi_mcfg_allocation *cfg;
> + int cfg_num;
> +
> + for (cfg_num = 0; cfg_num < pci_mmcfg_config_num; cfg_num++) {
> + cfg = pci_mmcfg_virt[cfg_num].cfg;
> + if (cfg->pci_segment == seg &&
> + (cfg->start_bus_number <= *bus) &&
> + (cfg->end_bus_number >= *bus)) {
> + *bus -= cfg->start_bus_number;
> + return pci_mmcfg_virt[cfg_num].virt;
> + }
> + }
> +
> + /* Fall back to type 0 */
> + return NULL;
> +}
This is a horrid function. Accessing and modifying bus like that causes
poor code generation, and by now having this in a separate translation
unit, the optimiser can't fold it into it's single caller and undo the
poor decisions which went into writing this function.
Instead, you want:
void __iomem *pci_mmcfg_base(pci_sbdf_t sbdf)
{
...
}
base which takes care of the bus adjustment internally. This can be
broken out into a separate patch, and take the opportunity to write it
to Xen style.
> diff --git a/xen/arch/x86/x86_64/pci.c b/xen/arch/x86/x86_64/pci.c
> index 8d33429103..c37e3edade 100644
> --- a/xen/arch/x86/x86_64/pci.c
> +++ b/xen/arch/x86/x86_64/pci.c
> @@ -11,13 +11,123 @@
> #define PCI_CONF_ADDRESS(sbdf, reg) \
> (0x80000000U | ((sbdf).bdf << 8) | ((reg) & ~3))
>
> +/*
> + * AMD Fam10h CPUs are buggy, and cannot access MMIO config space
> + * on their northbrige except through the * %eax register. As such, you MUST
> + * NOT use normal IOMEM accesses, you need to only use the magic mmio-config
> + * accessor functions.
> + * In fact just use pci_config_*, nothing else please.
I know you're just copying an existing comment, but it's mostly an
opinion and not terribly helpful in place.
"AMD Fam10h CPUs can only make MMCFG accesses via MOV %eax/%ax/%al",
would be better, except...
... this claim cannot be true. It's been made since the K8 RevF BKWG
and exists even into the latest PPRs, but that's simply not how
load/store ops work in the pipeline.
It was added to Linux in
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3320ad994afb2c44ad34b3b34c3c5cf0da297331
but without adequate justification. I've made some enquiries.
> + */
> +static inline unsigned char mmio_config_readb(void __iomem *pos)
> +{
> + u8 val;
> + asm volatile("movb (%1),%%al" : "=a" (val) : "r" (pos));
> + return val;
> +}
These need corrections, either in this patch or a followup.
Switch to Xen types, and correct the memory operand constraint to be "m"
(*(uint8_t *)ptr).
The Fam10h BKWG states that any memory encoding is acceptable, and this
allows the optimiser more flexibility (which will get used).
> +
> +static inline unsigned short mmio_config_readw(void __iomem *pos)
> +{
> + u16 val;
> + asm volatile("movw (%1),%%ax" : "=a" (val) : "r" (pos));
> + return val;
> +}
> +
> +static inline unsigned int mmio_config_readl(void __iomem *pos)
> +{
> + u32 val;
> + asm volatile("movl (%1),%%eax" : "=a" (val) : "r" (pos));
> + return val;
> +}
> +
> +static inline void mmio_config_writeb(void __iomem *pos, u8 val)
> +{
> + asm volatile("movb %%al,(%1)" :: "a" (val), "r" (pos) : "memory");
> +}
> +
> +static inline void mmio_config_writew(void __iomem *pos, u16 val)
> +{
> + asm volatile("movw %%ax,(%1)" :: "a" (val), "r" (pos) : "memory");
> +}
> +
> +static inline void mmio_config_writel(void __iomem *pos, u32 val)
> +{
> + asm volatile("movl %%eax,(%1)" :: "a" (val), "r" (pos) : "memory");
> +}
> +
> +static char __iomem *pci_dev_base(unsigned int seg, unsigned int bus,
> unsigned int devfn)
> +{
> + char __iomem *addr;
> +
> + addr = pci_mmcfg_base(seg, &bus);
> + if (!addr)
> + return NULL;
> + return addr + ((bus << 20) | (devfn << 12));
> +}
> +
> +static inline
> +int pci_mmcfg_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int len, u32
> *value)
> +{
> + char __iomem *addr;
> +
> + /* Why do we have this when nobody checks it. How about a BUG()!? -AK */
> + if (unlikely(reg > 4095)) {
> +err: *value = -1;
> + return -EINVAL;
> + }
> +
> + addr = pci_dev_base(sbdf.seg, sbdf.bus, sbdf.devfn);
> + if (!addr)
> + goto err;
> +
> + switch (len) {
> + case 1:
> + *value = mmio_config_readb(addr + reg);
> + break;
> + case 2:
> + *value = mmio_config_readw(addr + reg);
> + break;
> + case 4:
> + *value = mmio_config_readl(addr + reg);
> + break;
> + }
> +
> + return 0;
> +}
Again, for this patch or a later cleanup, drop the output-by-pointer and
return value directly. The optimiser is hopefully doing this already
but it also makes the function simpler.
At best, we want ASSERT_UNREACHBLE()'s in the error paths (including no
mapping), and to consistently return -1. Returning 0 for a bad length
is bogus.
Strictly speaking we also need to check reg & (len - 1) because accesses
must be naturally aligned, but even with ASSERT_UNREACHABLE() and a
failsafe -1, that's still logic emitted and I'm not sure if it's worth
having. Amongst other things you really need to know that len is 1, 2
or 4 before the alignment check reads correctly.
> +
> +inline int pci_mmcfg_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int
> len, u32 value)
> +{
> + char __iomem *addr;
> +
> + /* Why do we have this when nobody checks it. How about a BUG()!? -AK */
> + if (unlikely(reg > 4095))
> + return -EINVAL;
> +
> + addr = pci_dev_base(sbdf.seg, sbdf.bus, sbdf.devfn);
> + if (!addr)
> + return -EINVAL;
> +
> + switch (len) {
> + case 1:
> + mmio_config_writeb(addr + reg, value);
> + break;
> + case 2:
> + mmio_config_writew(addr + reg, value);
> + break;
> + case 4:
> + mmio_config_writel(addr + reg, value);
> + break;
> + }
> +
> + return 0;
> +}
> +
> uint8_t pci_conf_read8(pci_sbdf_t sbdf, unsigned int reg)
> {
> uint32_t value;
>
> if ( sbdf.seg || reg > 255 )
> {
> - pci_mmcfg_read(sbdf.seg, sbdf.bus, sbdf.devfn, reg, 1, &value);
> + pci_mmcfg_read(sbdf, reg, 1, &value);
> return value;
> }
Not for this patch, but we also need to junk the if() condition here and
elsewhere.
We should be using MMCFG unilaterally if it's available; the IO port
pairs require use of a global spinlock, and behind the scenes all the
CPU is doing is translating it back into MMCFG-like accesses.
At this juncture we should probably change it at the start of the 4.23
dev window to give it maximum time to settle before getting into a
release, so probably best to tack it on as a final commit in this series?
~Andrew
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |