[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 5/5] RFC: pci: Migrate pci_mmcfg_{read,write} to pci.c


  • To: Teddy Astie <teddy.astie@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx
  • From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • Date: Mon, 18 May 2026 18:35:48 +0100
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=citrix.com; dmarc=pass action=none header.from=citrix.com; dkim=pass header.d=citrix.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector10001; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=Q3V/3xqU8rvEI4kDqvuV4q+H2XEtjmhTKcs+k/WkhGM=; b=UxLxa7qxJEkL7L/+3Ua1fQb7s4oiDqSGFUFFyt6b6FwFyKmQG8Krfxu3tU25ufKZn8L9jg+I3VnYzWjJbE9KaaoEPK9BOb8a/CWG8dMbSttopzvoXp3wB66WR9g3rOo5ngYAC8pvoyY55gTDNL9r25GeA25Gu51JlXjapsIvdPUXVLAQ8NcIX4hCUbvTfbXyCvvquD20aqmAZu//HtHIzOksX7KyRad2Yp1yAXSE+rq/ZSgNb68DHsJT7ROyljjjh+E75ilJXVzegZT4Y5tle1UQ2h0ddrJKhxD1QtTCChTlAxiu/fI2RG3SIro2K9d5q2O9VlHfWqZTyydLuCObIw==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector10001; d=microsoft.com; cv=none; b=DhwJeJX732cqNiiMNFydYFznN24c8TPZwn+axWo8K/NH7CGicJfIPhifQJQkIuLkZc3LquwKqXzLxnRXCWHR6bNfAc38FCqbaNQxGLfdIGPMJrd2JxvWA0rQSFqwzGaor+PYY3tW/S+swL31YsYPyW4D684OeXwF6jtLoQfw2mKCxgwRQlmMrnWLbvHOYu8HZRFNN2kN77GuRwbVhIaefVKmuyhg8mtTj/lccbeykQ4+R7G5rOpAHl92j4UKBV5OXRSfT0V4eus8hSgZNN3Vq5v8AWvPsG1IZDo+U3IH3Uc6SXTcOSM9+6eBT0aQpSH3Iz1+4cxP4jQY55h/BSWA0Q==
  • Authentication-results: eu.smtp.expurgate.cloud; dkim=pass header.s=selector1 header.d=citrix.com header.i="@citrix.com" header.h="From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck"
  • Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=citrix.com;
  • Autocrypt: addr=andrew.cooper3@xxxxxxxxxx; keydata= xsFNBFLhNn8BEADVhE+Hb8i0GV6mihnnr/uiQQdPF8kUoFzCOPXkf7jQ5sLYeJa0cQi6Penp VtiFYznTairnVsN5J+ujSTIb+OlMSJUWV4opS7WVNnxHbFTPYZVQ3erv7NKc2iVizCRZ2Kxn srM1oPXWRic8BIAdYOKOloF2300SL/bIpeD+x7h3w9B/qez7nOin5NzkxgFoaUeIal12pXSR Q354FKFoy6Vh96gc4VRqte3jw8mPuJQpfws+Pb+swvSf/i1q1+1I4jsRQQh2m6OTADHIqg2E ofTYAEh7R5HfPx0EXoEDMdRjOeKn8+vvkAwhviWXTHlG3R1QkbE5M/oywnZ83udJmi+lxjJ5 YhQ5IzomvJ16H0Bq+TLyVLO/VRksp1VR9HxCzItLNCS8PdpYYz5TC204ViycobYU65WMpzWe LFAGn8jSS25XIpqv0Y9k87dLbctKKA14Ifw2kq5OIVu2FuX+3i446JOa2vpCI9GcjCzi3oHV e00bzYiHMIl0FICrNJU0Kjho8pdo0m2uxkn6SYEpogAy9pnatUlO+erL4LqFUO7GXSdBRbw5 gNt25XTLdSFuZtMxkY3tq8MFss5QnjhehCVPEpE6y9ZjI4XB8ad1G4oBHVGK5LMsvg22PfMJ ISWFSHoF/B5+lHkCKWkFxZ0gZn33ju5n6/FOdEx4B8cMJt+cWwARAQABzSlBbmRyZXcgQ29v cGVyIDxhbmRyZXcuY29vcGVyM0BjaXRyaXguY29tPsLBegQTAQgAJAIbAwULCQgHAwUVCgkI CwUWAgMBAAIeAQIXgAUCWKD95wIZAQAKCRBlw/kGpdefoHbdD/9AIoR3k6fKl+RFiFpyAhvO 59ttDFI7nIAnlYngev2XUR3acFElJATHSDO0ju+hqWqAb8kVijXLops0gOfqt3VPZq9cuHlh IMDquatGLzAadfFx2eQYIYT+FYuMoPZy/aTUazmJIDVxP7L383grjIkn+7tAv+qeDfE+txL4 SAm1UHNvmdfgL2/lcmL3xRh7sub3nJilM93RWX1Pe5LBSDXO45uzCGEdst6uSlzYR/MEr+5Z JQQ32JV64zwvf/aKaagSQSQMYNX9JFgfZ3TKWC1KJQbX5ssoX/5hNLqxMcZV3TN7kU8I3kjK mPec9+1nECOjjJSO/h4P0sBZyIUGfguwzhEeGf4sMCuSEM4xjCnwiBwftR17sr0spYcOpqET ZGcAmyYcNjy6CYadNCnfR40vhhWuCfNCBzWnUW0lFoo12wb0YnzoOLjvfD6OL3JjIUJNOmJy RCsJ5IA/Iz33RhSVRmROu+TztwuThClw63g7+hoyewv7BemKyuU6FTVhjjW+XUWmS/FzknSi dAG+insr0746cTPpSkGl3KAXeWDGJzve7/SBBfyznWCMGaf8E2P1oOdIZRxHgWj0zNr1+ooF /PzgLPiCI4OMUttTlEKChgbUTQ+5o0P080JojqfXwbPAyumbaYcQNiH1/xYbJdOFSiBv9rpt TQTBLzDKXok86M7BTQRS4TZ/ARAAkgqudHsp+hd82UVkvgnlqZjzz2vyrYfz7bkPtXaGb9H4 Rfo7mQsEQavEBdWWjbga6eMnDqtu+FC+qeTGYebToxEyp2lKDSoAsvt8w82tIlP/EbmRbDVn 7bhjBlfRcFjVYw8uVDPptT0TV47vpoCVkTwcyb6OltJrvg/QzV9f07DJswuda1JH3/qvYu0p vjPnYvCq4NsqY2XSdAJ02HrdYPFtNyPEntu1n1KK+gJrstjtw7KsZ4ygXYrsm/oCBiVW/OgU g/XIlGErkrxe4vQvJyVwg6YH653YTX5hLLUEL1NS4TCo47RP+wi6y+TnuAL36UtK/uFyEuPy wwrDVcC4cIFhYSfsO0BumEI65yu7a8aHbGfq2lW251UcoU48Z27ZUUZd2Dr6O/n8poQHbaTd 6bJJSjzGGHZVbRP9UQ3lkmkmc0+XCHmj5WhwNNYjgbbmML7y0fsJT5RgvefAIFfHBg7fTY/i kBEimoUsTEQz+N4hbKwo1hULfVxDJStE4sbPhjbsPCrlXf6W9CxSyQ0qmZ2bXsLQYRj2xqd1 bpA+1o1j2N4/au1R/uSiUFjewJdT/LX1EklKDcQwpk06Af/N7VZtSfEJeRV04unbsKVXWZAk uAJyDDKN99ziC0Wz5kcPyVD1HNf8bgaqGDzrv3TfYjwqayRFcMf7xJaL9xXedMcAEQEAAcLB XwQYAQgACQUCUuE2fwIbDAAKCRBlw/kGpdefoG4XEACD1Qf/er8EA7g23HMxYWd3FXHThrVQ HgiGdk5Yh632vjOm9L4sd/GCEACVQKjsu98e8o3ysitFlznEns5EAAXEbITrgKWXDDUWGYxd pnjj2u+GkVdsOAGk0kxczX6s+VRBhpbBI2PWnOsRJgU2n10PZ3mZD4Xu9kU2IXYmuW+e5KCA vTArRUdCrAtIa1k01sPipPPw6dfxx2e5asy21YOytzxuWFfJTGnVxZZSCyLUO83sh6OZhJkk b9rxL9wPmpN/t2IPaEKoAc0FTQZS36wAMOXkBh24PQ9gaLJvfPKpNzGD8XWR5HHF0NLIJhgg 4ZlEXQ2fVp3XrtocHqhu4UZR4koCijgB8sB7Tb0GCpwK+C4UePdFLfhKyRdSXuvY3AHJd4CP 4JzW0Bzq/WXY3XMOzUTYApGQpnUpdOmuQSfpV9MQO+/jo7r6yPbxT7CwRS5dcQPzUiuHLK9i nvjREdh84qycnx0/6dDroYhp0DFv4udxuAvt1h4wGwTPRQZerSm4xaYegEFusyhbZrI0U9tJ B8WrhBLXDiYlyJT6zOV2yZFuW47VrLsjYnHwn27hmxTC/7tvG3euCklmkn9Sl9IAKFu29RSo d5bD8kMSCYsTqtTfT6W4A3qHGvIDta3ptLYpIAOD2sY3GYq2nf3Bbzx81wZK14JdDDHUX2Rs 6+ahAA==
  • Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Anthony PERARD <anthony.perard@xxxxxxxxxx>, Michal Orzel <michal.orzel@xxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>
  • Delivery-date: Mon, 18 May 2026 17:36:05 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 18/05/2026 4:21 pm, Teddy Astie wrote:
> Key parts of MMCFG access bits are in mmconfig_64.c (in particular
> pci_mmcfg_{read,write}()) while PCI configuration primitives (used accross the
> codebase) are in pci.c.
> This leads to situations where the compiler cannot optimize the `switch (len)`
> for MMCFG access for all pci_conf_{read,write}N(), because they are not from
> the same file.
>
> Move the pci_mmcfg_{read,write} in pci.c and hint the compiler to inline these
> functions such that it's more likely that the compiler eliminates the
> `switch (len)``.
>
> Also take the opportunity to migrate to pci_sbdf_t to reduce the parameter 
> count
> and drop many parameter domains checks.
>
> On GCC 16.1, this leads to codegen where pci_conf_{read,write}N() doesn't call
> pci_mmcfg_{read,write}() anymore and directly perform the MMIO RW.
>
> <pci_conf_read32>:
>        55                      push   %rbp
>        48 89 e5                mov    %rsp,%rbp
>        53                      push   %rbx
>        89 f8                   mov    %edi,%eax
>        89 f3                   mov    %esi,%ebx
>        c1 ef 10                shr    $0x10,%edi
>        81 fe ff 00 00 00       cmp    $0xff,%esi
>        77 26                   ja     ffff82d040301fab <pci_conf_read32+0x3a>
>        85 ff                   test   %edi,%edi
>        75 22                   jne    ffff82d040301fab <pci_conf_read32+0x3a>
>        0f b7 f8                movzwl %ax,%edi
>        c1 e7 08                shl    $0x8,%edi
>        83 e3 fc                and    $0xfffffffc,%ebx
>        09 df                   or     %ebx,%edi
>        81 cf 00 00 00 80       or     $0x80000000,%edi
>        ba 04 00 00 00          mov    $0x4,%edx
>        be 00 00 00 00          mov    $0x0,%esi
>        e8 2a 1c 03 00          call   ffff82d040333bd3 <pci_conf_read>
>        eb 22                   jmp    ffff82d040301fcd <pci_conf_read32+0x5c>
>        81 fb ff 0f 00 00       cmp    $0xfff,%ebx
>        77 24                   ja     ffff82d040301fd7 <pci_conf_read32+0x66>
>        0f b6 d0                movzbl %al,%edx
>        0f b6 f4                movzbl %ah,%esi
>        0f b7 ff                movzwl %di,%edi
>        e8 f5 fd ff ff          call   ffff82d040301db6 <pci_dev_base>
>        48 85 c0                test   %rax,%rax
>        74 18                   je     ffff82d040301fde <pci_conf_read32+0x6d>
>        89 db                   mov    %ebx,%ebx
>        48 01 d8                add    %rbx,%rax
>        8b 00                   mov    (%rax),%eax
>        48 8b 5d f8             mov    -0x8(%rbp),%rbx
>        c9                      leave
>        e9 89 12 f0 ff          jmp    ffff82d040203260 <__x86_return_thunk>
>        b8 ff ff ff ff          mov    $0xffffffff,%eax
>        eb ef                   jmp    ffff82d040301fcd <pci_conf_read32+0x5c>
>        b8 ff ff ff ff          mov    $0xffffffff,%eax
>        eb e8                   jmp    ffff82d040301fcd <pci_conf_read32+0x5c>

This is not the whole function because it's missing a pop %rbx.  Also,
right at the bottom here are the -1's from bad error paths (discussed
later).

But, this should be after the ---.  Disassembly this long isn't
interesting to stay in the commit message.


> diff --git a/xen/arch/x86/x86_64/mmconfig_64.c 
> b/xen/arch/x86/x86_64/mmconfig_64.c
> index 940cf6d747..483dff9c2c 100644
> --- a/xen/arch/x86/x86_64/mmconfig_64.c
> +++ b/xen/arch/x86/x86_64/mmconfig_64.c
> @@ -133,6 +46,25 @@ static void __iomem *mcfg_ioremap(const struct 
> acpi_mcfg_allocation *cfg,
>      return (void __iomem *) virt;
>  }
>  
> +char __iomem *pci_mmcfg_base(unsigned int seg, unsigned int *bus)
> +{
> +    struct acpi_mcfg_allocation *cfg;
> +    int cfg_num;
> +
> +    for (cfg_num = 0; cfg_num < pci_mmcfg_config_num; cfg_num++) {
> +        cfg = pci_mmcfg_virt[cfg_num].cfg;
> +        if (cfg->pci_segment == seg &&
> +            (cfg->start_bus_number <= *bus) &&
> +            (cfg->end_bus_number >= *bus)) {
> +            *bus -= cfg->start_bus_number;
> +            return pci_mmcfg_virt[cfg_num].virt;
> +        }
> +    }
> +
> +    /* Fall back to type 0 */
> +    return NULL;
> +}

This is a horrid function.  Accessing and modifying bus like that causes
poor code generation, and by now having this in a separate translation
unit, the optimiser can't fold it into it's single caller and undo the
poor decisions which went into writing this function.

Instead, you want:

void __iomem *pci_mmcfg_base(pci_sbdf_t sbdf)
{
    ...
}

base which takes care of the bus adjustment internally.  This can be
broken out into a separate patch, and take the opportunity to write it
to Xen style.

> diff --git a/xen/arch/x86/x86_64/pci.c b/xen/arch/x86/x86_64/pci.c
> index 8d33429103..c37e3edade 100644
> --- a/xen/arch/x86/x86_64/pci.c
> +++ b/xen/arch/x86/x86_64/pci.c
> @@ -11,13 +11,123 @@
>  #define PCI_CONF_ADDRESS(sbdf, reg) \
>      (0x80000000U | ((sbdf).bdf << 8) | ((reg) & ~3))
>  
> +/*
> + * AMD Fam10h CPUs are buggy, and cannot access MMIO config space
> + * on their northbrige except through the * %eax register. As such, you MUST
> + * NOT use normal IOMEM accesses, you need to only use the magic mmio-config
> + * accessor functions.
> + * In fact just use pci_config_*, nothing else please.

I know you're just copying an existing comment, but it's mostly an
opinion and not terribly helpful in place.

"AMD Fam10h CPUs can only make MMCFG accesses via MOV %eax/%ax/%al",
would be better, except...

... this claim cannot be true.  It's been made since the K8 RevF BKWG
and exists even into the latest PPRs, but that's simply not how
load/store ops work in the pipeline.

It was added to Linux in
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3320ad994afb2c44ad34b3b34c3c5cf0da297331
but without adequate justification.  I've made some enquiries.

> + */
> +static inline unsigned char mmio_config_readb(void __iomem *pos)
> +{
> +    u8 val;
> +    asm volatile("movb (%1),%%al" : "=a" (val) : "r" (pos));
> +    return val;
> +}

These need corrections, either in this patch or a followup.

Switch to Xen types, and correct the memory operand constraint to be "m"
(*(uint8_t *)ptr).

The Fam10h BKWG states that any memory encoding is acceptable, and this
allows the optimiser more flexibility (which will get used).

> +
> +static inline unsigned short mmio_config_readw(void __iomem *pos)
> +{
> +    u16 val;
> +    asm volatile("movw (%1),%%ax" : "=a" (val) : "r" (pos));
> +    return val;
> +}
> +
> +static inline unsigned int mmio_config_readl(void __iomem *pos)
> +{
> +    u32 val;
> +    asm volatile("movl (%1),%%eax" : "=a" (val) : "r" (pos));
> +    return val;
> +}
> +
> +static inline void mmio_config_writeb(void __iomem *pos, u8 val)
> +{
> +    asm volatile("movb %%al,(%1)" :: "a" (val), "r" (pos) : "memory");
> +}
> +
> +static inline void mmio_config_writew(void __iomem *pos, u16 val)
> +{
> +    asm volatile("movw %%ax,(%1)" :: "a" (val), "r" (pos) : "memory");
> +}
> +
> +static inline void mmio_config_writel(void __iomem *pos, u32 val)
> +{
> +    asm volatile("movl %%eax,(%1)" :: "a" (val), "r" (pos) : "memory");
> +}
> +
> +static char __iomem *pci_dev_base(unsigned int seg, unsigned int bus, 
> unsigned int devfn)
> +{
> +    char __iomem *addr;
> +
> +    addr = pci_mmcfg_base(seg, &bus);
> +    if (!addr)
> +        return NULL;
> +     return addr + ((bus << 20) | (devfn << 12));
> +}
> +
> +static inline
> +int pci_mmcfg_read(pci_sbdf_t sbdf, unsigned int reg, unsigned int len, u32 
> *value)
> +{
> +    char __iomem *addr;
> +
> +    /* Why do we have this when nobody checks it. How about a BUG()!? -AK */
> +    if (unlikely(reg > 4095)) {
> +err:        *value = -1;
> +        return -EINVAL;
> +    }
> +
> +    addr = pci_dev_base(sbdf.seg, sbdf.bus, sbdf.devfn);
> +    if (!addr)
> +        goto err;
> +
> +    switch (len) {
> +    case 1:
> +        *value = mmio_config_readb(addr + reg);
> +        break;
> +    case 2:
> +        *value = mmio_config_readw(addr + reg);
> +        break;
> +    case 4:
> +        *value = mmio_config_readl(addr + reg);
> +        break;
> +    }
> +
> +    return 0;
> +}

Again, for this patch or a later cleanup, drop the output-by-pointer and
return value directly.  The optimiser is hopefully doing this already
but it also makes the function simpler.

At best, we want ASSERT_UNREACHBLE()'s in the error paths (including no
mapping), and to consistently return -1.  Returning 0 for a bad length
is bogus.

Strictly speaking we also need to check reg & (len - 1) because accesses
must be naturally aligned, but even with ASSERT_UNREACHABLE() and a
failsafe -1, that's still logic emitted and I'm not sure if it's worth
having.  Amongst other things you really need to know that len is 1, 2
or 4 before the alignment check reads correctly.

> +
> +inline int pci_mmcfg_write(pci_sbdf_t sbdf, unsigned int reg, unsigned int 
> len, u32 value)
> +{
> +    char __iomem *addr;
> +
> +    /* Why do we have this when nobody checks it. How about a BUG()!? -AK */
> +    if (unlikely(reg > 4095))
> +        return -EINVAL;
> +
> +    addr = pci_dev_base(sbdf.seg, sbdf.bus, sbdf.devfn);
> +    if (!addr)
> +        return -EINVAL;
> +
> +    switch (len) {
> +    case 1:
> +        mmio_config_writeb(addr + reg, value);
> +        break;
> +    case 2:
> +        mmio_config_writew(addr + reg, value);
> +        break;
> +    case 4:
> +        mmio_config_writel(addr + reg, value);
> +        break;
> +    }
> +
> +    return 0;
> +}
> +
>  uint8_t pci_conf_read8(pci_sbdf_t sbdf, unsigned int reg)
>  {
>      uint32_t value;
>  
>      if ( sbdf.seg || reg > 255 )
>      {
> -        pci_mmcfg_read(sbdf.seg, sbdf.bus, sbdf.devfn, reg, 1, &value);
> +        pci_mmcfg_read(sbdf, reg, 1, &value);
>          return value;
>      }

Not for this patch, but we also need to junk the if() condition here and
elsewhere.

We should be using MMCFG unilaterally if it's available; the IO port
pairs require use of a global spinlock, and behind the scenes all the
CPU is doing is translating it back into MMCFG-like accesses.

At this juncture we should probably change it at the start of the 4.23
dev window to give it maximum time to settle before getting into a
release, so probably best to tack it on as a final commit in this series?

~Andrew



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.