This is an archived copy of the Xen.org mailing list, which we have preserved to ensure that existing links to archives are not broken. The live archive, which contains the latest emails, can be found at http://lists.xen.org/
Home Products Support Community News


[Xen-devel] [RFC PATCH linux-2.6.18-xen] pciback: clean up (MSI-X vec, e

To: "xen-devel@xxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxx>
Subject: [Xen-devel] [RFC PATCH linux-2.6.18-xen] pciback: clean up (MSI-X vec, entrynr) list when resetting PCI device
From: Laszlo Ersek <lersek@xxxxxxxxxx>
Date: Wed, 01 Jun 2011 12:05:44 +0200
Delivery-date: Wed, 01 Jun 2011 03:17:15 -0700
Envelope-to: www-data@xxxxxxxxxxxxxxxxxxx
List-help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-id: Xen developer discussion <xen-devel.lists.xensource.com>
List-post: <mailto:xen-devel@lists.xensource.com>
List-subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
List-unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>, <mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
Sender: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv: Gecko/20110428 Fedora/3.1.10-1.fc14 Lightning/1.0b3pre Mnenhy/0.8.3 Thunderbird/3.1.10

this is more of an RFC than a patch now for linux-2.6.18-xen. Describing the 
situation captured in RHBZ#688673, in a nutshell:

- let's say we have an Intel 82576 card (igb),
- the card exports virtual functions (igbvf),
- one virtual function is passed through to a PV guest,
- the igbvf driver, co-operating with pcifront, pciback, and the hypervisor, 
sets up three MSI-X vectors for the virtual function PCI device,
- when the domU is shut down, the MSI-X vectors are "leaked" in dom0, because 
nobody ever reaches dom0's pci_disable_msix() / msi_remove_pci_irq_vectors() 
during shutdown,
- therefore configuration of the VF during the next boot of such a guest 
doesn't succeed (dom0 thinks, based on its stale list, that MSI-X vectors are 
already allocated/mapped for the device)

  dom0 (msix_capability_init(), output beautified a bit):
    msix entry 0 for dev 01:10:0 are not freed before acquire again.
    msix entry 1 for dev 01:10:0 are not freed before acquire again.
    msix entry 2 for dev 01:10:0 are not freed before acquire again.

      Determining IP information for eth1...
      Failed to obtain physical IRQ 255
      Failed to obtain physical IRQ 254
      Failed to obtain physical IRQ 253

- if the device or the full igbvf module is removed before shutdown in the 
guest (in case the boot was successful to begin with!), then pci_disable_msix() 
is called and everything works fine; the next boot / ifup succeeds.

I backported c/s 1070 to the RHEL-5 kernel, but it was not enough -- this is 
not an abrupt termination (destroy), but a contolled shutdown. Here's what I 
suspect happens (in RHEL-5) when device_shutdown() walks the list of devices in 
the PV domU and calls the appropriate shutdown hook for igbvf:

domU: igbvf_shutdown()
                    dev, PCI_COMMAND,
                    orig & ~(PCI_COMMAND_IO | PCI_COMMAND_MEMORY)
dom0:                   pciback_do_op()
                              command_write() [via PCI_COMMAND funcptr]
                                    dev->msix_enabled = 0;

The final assignment above precludes c/s 1070 from doing the job.

Of course igbvf_shutdown() could invoke igbvf_remove() instead (or another 
lower-level function that ends up in dom0's pci_disable_msix()), but it should 
not depend on the guest playing nice.

Below's my proposed patch for RHEL-5, ported to upstream. 

introduce msi_prune_pci_irq_vectors()

extend pciback_reset_device() with the following functionality 
- remove all (MSI-X vector, entry number) pairs that might still belong, in 
dom0's mind, to the PCI device being reset,
- unmap some space "used for saving/restoring MSI-X tables" that might 
otherwise go leaked

The function, modeled after msi_remove_pci_irq_vectors(), doesn't touch the 
hypervisor, nor the PCI device; it only trims the list used by dom0 for 
filtering. Justification being: when this is called, either the owner domain is 
gone already (or very close to being gone), or the device was already correctly 

Now I think one comment in the patch below does not hold for upstream: 
"msi_dev_head is only maintained in dom0". According to c/s 680, this doesn't 
seem to be the case for upstream.

Is the case described above possible at all in upstream? Do you think the fix I 
propose is correct? It seemed to do the job with RHEL-5 in my testing.

    PCI: 0000:01:10.0: cleaning up MSI-X entry 0
    PCI: 0000:01:10.0: cleaning up MSI-X entry 1
    PCI: 0000:01:10.0: cleaning up MSI-X entry 2
    PCI: 0000:01:10.0: cleaning up mask_base

Just in case:

 drivers/pci/msi-xen.c             |   46 ++++++++++++++++++++++++++++++++++++++
 drivers/xen/pciback/pciback_ops.c |    5 ++++
 include/linux/pci.h               |    1 
 3 files changed, 52 insertions(+)

Signed-off-by: Laszlo Ersek<lersek@xxxxxxxxxx>

Thank you very much!

diff -r 876a5aaac026 drivers/pci/msi-xen.c
--- a/drivers/pci/msi-xen.c     Thu May 26 12:33:41 2011 +0100
+++ b/drivers/pci/msi-xen.c     Tue May 31 19:17:26 2011 +0200
@@ -894,6 +894,51 @@ void msi_remove_pci_irq_vectors(struct p
        dev->irq = msi_dev_entry->default_irq;
+void msi_prune_pci_irq_vectors(struct pci_dev *dev)
+       unsigned long flags;
+       struct msi_dev_list *msi_dev_scan, *msi_dev_entry = NULL;
+       struct msi_pirq_entry *pirq_entry, *tmp;
+       if (!pci_msi_enable || !dev)
+               return;
+       /* msi_dev_head is only maintained in dom0 */
+       BUG_ON(!is_initial_xendomain());
+       /* search for the set of MSI-X vectors, don't extend list */
+       spin_lock_irqsave(&msi_dev_lock, flags);
+       list_for_each_entry(msi_dev_scan, &msi_dev_head, list)
+               if (msi_dev_scan->dev == dev) {
+                       msi_dev_entry = msi_dev_scan;
+                       break;
+               }
+       spin_unlock_irqrestore(&msi_dev_lock, flags);
+       if (msi_dev_entry == NULL)
+               return;
+       /* Empty the set. No PHYSDEVOP_unmap_pirq hypercalls: the domU is
+        * already gone, or the device is already unplugged.
+        */
+       spin_lock_irqsave(&msi_dev_entry->pirq_list_lock, flags);
+       if (!list_empty(&msi_dev_entry->pirq_list_head))
+               list_for_each_entry_safe(pirq_entry, tmp,
+                                        &msi_dev_entry->pirq_list_head, list) {
+                       printk(KERN_INFO "PCI: %s: cleaning up MSI-X entry "
+                              "%d\n", pci_name(dev), pirq_entry->entry_nr);
+                       list_del(&pirq_entry->list);
+                       kfree(pirq_entry);
+               }
+       spin_unlock_irqrestore(&msi_dev_entry->pirq_list_lock, flags);
+       if (msi_dev_entry->mask_base != NULL) {
+               printk(KERN_INFO "PCI: %s: cleaning up mask_base\n",
+                      pci_name(dev));
+               iounmap(msi_dev_entry->mask_base);
+               msi_dev_entry->mask_base = NULL;
+       }
 void pci_no_msi(void)
        pci_msi_enable = 0;
@@ -906,5 +951,6 @@ EXPORT_SYMBOL(pci_disable_msix);
 #ifdef CONFIG_XEN
diff -r 876a5aaac026 drivers/xen/pciback/pciback_ops.c
--- a/drivers/xen/pciback/pciback_ops.c Thu May 26 12:33:41 2011 +0100
+++ b/drivers/xen/pciback/pciback_ops.c Tue May 31 19:17:26 2011 +0200
@@ -29,6 +29,11 @@ void pciback_reset_device(struct pci_dev
                if (dev->msi_enabled)
+               /* After a controlled shutdown or the crash fixup above, prune
+                * dom0's MSI-X vector list for the device.
+                */
+               msi_prune_pci_irq_vectors(dev);
diff -r 876a5aaac026 include/linux/pci.h
--- a/include/linux/pci.h       Thu May 26 12:33:41 2011 +0100
+++ b/include/linux/pci.h       Tue May 31 19:17:26 2011 +0200
@@ -640,6 +640,7 @@ extern void msi_remove_pci_irq_vectors(s
 #ifdef CONFIG_XEN
 extern int register_msi_get_owner(int (*func)(struct pci_dev *dev));
 extern int unregister_msi_get_owner(int (*func)(struct pci_dev *dev));
+extern void msi_prune_pci_irq_vectors(struct pci_dev *dev);

Xen-devel mailing list