Hello,
I am currently investigating an issue with MSI allocation/deallocation
which appears to be an MSI resource leak in Xen. This is XenServer 6.0
based on Xen 4.1.1, with no changesets I can see affecting the relevant
Xen codepaths.
The box in question is a NetScaler SDX box with 24 logical cores (2
Nehalem sockets, 6 cores each, hyperthreading), 96GB RAM, and 4 dual-port
Intel 10G ixgbe cards (plus two SSL 'Xcelerator' cards, which I have
disabled for debugging purposes). Each of the 8 NIC ports exports
40 virtual functions. There are 40 (identical) VMs, each with 1 VF from
each NIC passed through to it, giving each VM 8 VFs. Each VF itself
uses 3 MSI-X interrupts. Therefore, for all VMs to be working
correctly, that is 3 IRQs per VF x 8 VFs x 40 VMs = 960 MSI-X
interrupts.
The symptoms are: reboot the VMs a couple of times, and eventually Xen
says "(XEN) ../physdev.c:140: domXXX: can't create irq for msi!". After
adding extra debugging, the call to create_irq() was returning
-ENOSPC. At the point at which create_irq() was failing, huge numbers
of IRQs were listed by the 'i' debugkey with a descriptor
affinity mask of all CPUs, which I believe is interfering with the
calculations in __assign_irq_vector().
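To illustrate why I think the full-affinity descriptors matter: as I
understand the vector allocator, a candidate vector has to be free on
every PCPU in the descriptor's affinity mask, so once many IRQs sit with
an affinity of all CPUs the usable pool collapses to roughly one CPU's
worth of dynamic vectors, which is well short of 960. A minimal
standalone model of that counting argument (simplified, not the actual
__assign_irq_vector() code; the vector range and CPU count below are
assumptions):

/* Simplified model of per-CPU vector allocation: a vector can only be
 * assigned to an IRQ if it is free on *every* CPU in the IRQ's affinity
 * mask.  With affinity == all CPUs, the whole system shares one CPU's
 * worth of dynamic vectors. */
#include <stdio.h>
#include <stdbool.h>

#define NR_CPUS        24      /* assumption: 24 logical cores */
#define FIRST_DYNAMIC  0x20    /* assumption: rough dynamic vector range */
#define LAST_DYNAMIC   0xdf

static bool vector_in_use[NR_CPUS][256];

/* Try to find one vector free on every CPU set in 'affinity' (bitmask). */
static int assign_vector(unsigned long affinity)
{
    for (int vec = FIRST_DYNAMIC; vec <= LAST_DYNAMIC; vec++) {
        bool ok = true;
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
            if ((affinity & (1UL << cpu)) && vector_in_use[cpu][vec])
                ok = false;
        if (!ok)
            continue;
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
            if (affinity & (1UL << cpu))
                vector_in_use[cpu][vec] = true;
        return vec;
    }
    return -1; /* the real code returns -ENOSPC here */
}

int main(void)
{
    unsigned long all_cpus = (1UL << NR_CPUS) - 1;
    int allocated = 0;

    while (assign_vector(all_cpus) >= 0)
        allocated++;

    /* Prints ~192 on this model, nowhere near the 960 MSI-X IRQs needed. */
    printf("IRQs allocatable with affinity=all: %d\n", allocated);
    return 0;
}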
I suspected that this might be because scheduling under load swaps
VCPUs across PCPUs, resulting in the irq descriptor being written into
every PCPU's IDT. As a result, I pinned each VM to a specific PCPU in
the hope that the problem would go away.
When starting each VM individually, the problem appears to go away.
However, when starting all VMs at once, there are still some irqs with
an affinity mask of all CPUs.
Specifically, one case is this (I added extra debugging to include
irq_cfg->cpu_mask in the 'i' debugkey output):
(XEN) IRQ: 845 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff
cfg_aff:00000000,00000000,00000000,00000010 vec:7e type=PCI-MSI
status=00000050 in-flight=0 domain-list=34: 55(----),
(XEN) IRQ: 846 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff
cfg_aff:00000000,00000000,00000000,00000020 vec:86 type=PCI-MSI
status=00000050 in-flight=0 domain-list=34: 54(----),
(XEN) IRQ: 847 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff
cfg_aff:00000000,00000000,00000000,00000020 vec:96 type=PCI-MSI
status=00000050 in-flight=0 domain-list=34: 53(----),
(XEN) IRQ: 848 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff
cfg_aff:00000000,00000000,00000000,00000020 vec:be type=PCI-MSI
status=00000050 in-flight=0 domain-list=34: 52(----),
(XEN) IRQ: 849 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff
cfg_aff:00000000,00000000,00000000,00000020 vec:c6 type=PCI-MSI
status=00000050 in-flight=0 domain-list=34: 51(----),
(XEN) IRQ: 850 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff
cfg_aff:00000000,00000000,00000000,00000020 vec:ce type=PCI-MSI
status=00000050 in-flight=0 domain-list=34: 50(----),
(XEN) IRQ: 851 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff
cfg_aff:00000000,00000000,00000000,00000020 vec:b7 type=PCI-MSI
status=00000050 in-flight=0 domain-list=34: 49(----),
(XEN) IRQ: 852 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff
cfg_aff:00000000,00000000,00000000,00000020 vec:cf type=PCI-MSI
status=00000050 in-flight=0 domain-list=34: 48(----),
(XEN) IRQ: 853 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff
cfg_aff:00000000,00000000,00000000,00000020 vec:d7 type=PCI-MSI
status=00000050 in-flight=0 domain-list=34: 47(----),
(XEN) IRQ: 854 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff
cfg_aff:00000000,00000000,00000000,00000020 vec:d9 type=PCI-MSI
status=00000050 in-flight=0 domain-list=34: 46(----),
(XEN) IRQ: 855 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff
cfg_aff:00000000,00000000,00000000,00000020 vec:22 type=PCI-MSI
status=00000050 in-flight=0 domain-list=34: 45(----),
(XEN) IRQ: 856 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff
cfg_aff:00000000,00000000,00000000,00000020 vec:2a type=PCI-MSI
status=00000050 in-flight=0 domain-list=34: 44(----),
(XEN) IRQ: 857 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff
cfg_aff:00000000,00000000,00000000,00000010 vec:3c type=PCI-MSI
status=00000050 in-flight=0 domain-list=34: 43(----),
(XEN) IRQ: 858 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff
cfg_aff:00000000,00000000,00000000,00000020 vec:4c type=PCI-MSI
status=00000050 in-flight=0 domain-list=34: 42(----),
(XEN) IRQ: 859 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff
cfg_aff:00000000,00000000,00000000,00000020 vec:54 type=PCI-MSI
status=00000050 in-flight=0 domain-list=34: 41(----),
(XEN) IRQ: 860 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff
cfg_aff:00000000,00000000,00000000,00000020 vec:b5 type=PCI-MSI
status=00000050 in-flight=0 domain-list=34: 40(----),
(XEN) IRQ: 861 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff
cfg_aff:00000000,00000000,00000000,00000020 vec:ae type=PCI-MSI
status=00000050 in-flight=0 domain-list=34: 39(----),
(XEN) IRQ: 862 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff
cfg_aff:00000000,00000000,00000000,00000020 vec:de type=PCI-MSI
status=00000050 in-flight=0 domain-list=34: 38(----),
(XEN) IRQ: 863 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff
cfg_aff:00000000,00000000,00000000,00000010 vec:55 type=PCI-MSI
status=00000050 in-flight=0 domain-list=34: 37(----),
(XEN) IRQ: 864 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff
cfg_aff:00000000,00000000,00000000,00000020 vec:9d type=PCI-MSI
status=00000050 in-flight=0 domain-list=34: 36(----),
(XEN) IRQ: 865 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff
cfg_aff:00000000,00000000,00000000,00000020 vec:46 type=PCI-MSI
status=00000050 in-flight=0 domain-list=34: 35(----),
(XEN) IRQ: 866 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff
cfg_aff:00000000,00000000,00000000,00000020 vec:a6 type=PCI-MSI
status=00000050 in-flight=0 domain-list=34: 34(----),
(XEN) IRQ: 867 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff
cfg_aff:00000000,00000000,00000000,00000020 vec:5f type=PCI-MSI
status=00000050 in-flight=0 domain-list=34: 33(----),
(XEN) IRQ: 868 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff
cfg_aff:00000000,00000000,00000000,00000020 vec:7f type=PCI-MSI
status=00000050 in-flight=0 domain-list=34: 32(----),
This shows all IRQs for dom34. The descriptors have an affinity of all
CPUs, but the irq_cfg cpu_mask is a single CPU (alternating between
0x10 and 0x20, i.e. CPU 4 or CPU 5).
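(For anyone decoding these by hand: assuming the masks are printed as
comma-separated 32-bit words, most significant word first, a throwaway
decoder like the one below maps them back to CPU numbers. It is purely
illustrative and not from the Xen tree.)

/* Decode a debugkey-'i' style cpumask string such as
 * "00000000,00000000,00000000,00000020" into the set CPU numbers.
 * Assumes 32-bit words, most significant word first. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void decode_mask(const char *mask)
{
    /* Count the words so we know which bit position each word starts at. */
    int words = 1;
    for (const char *p = mask; *p; p++)
        if (*p == ',')
            words++;

    char *copy = strdup(mask);
    char *tok = strtok(copy, ",");
    printf("%s ->", mask);
    for (int w = 0; tok; w++, tok = strtok(NULL, ",")) {
        unsigned long word = strtoul(tok, NULL, 16);
        int base = (words - 1 - w) * 32;
        for (int bit = 0; bit < 32; bit++)
            if (word & (1UL << bit))
                printf(" cpu%d", base + bit);
    }
    printf("\n");
    free(copy);
}

int main(void)
{
    decode_mask("00000000,00000000,00000000,00000010"); /* -> cpu4 */
    decode_mask("00000000,00000000,00000000,00000020"); /* -> cpu5 */
    return 0;
}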
The domain dump for dom34 is
(XEN) General information for domain 34:
(XEN) refcnt=3 dying=0 nr_pages=131065 xenheap_pages=8 dirty_cpus={}
max_pages=133376
(XEN) handle=97ef6eef-69c2-024c-1bbb-a150ca668691 vm_assist=00000000
(XEN) paging assistance: hap refcounts translate external
(XEN) Rangesets belonging to domain 34:
(XEN) I/O Ports { }
(XEN) Interrupts { 32-55 }
(XEN) I/O Memory { f9f00-f9f03, fa001-fa003, fa19c-fa19f,
fa29d-fa29f, fa39c-fa39f, fa49d-fa49f, fa59c-fa59f, fa69d-fa69f,
fa79c-fa79f, fa89d-fa89f, fa99c-fa99f, faa9d-faa9f, fab9c-fab9f,
fac9d-fac9f, fad9c-fad9f, fae9d-fae9f }
(XEN) Memory pages belonging to domain 34:
(XEN) DomPage list too long to display
(XEN) P2M entry stats:
(XEN) L1: 1590 entries, 6512640 bytes
(XEN) L2: 253 entries, 530579456 bytes
(XEN) PoD entries=0 cachesize=0 superpages=0
(XEN) XenPage 00000000001146e1: caf=c000000000000001,
taf=7400000000000001
(XEN) XenPage 00000000001146e0: caf=c000000000000001,
taf=7400000000000001
(XEN) XenPage 00000000001146df: caf=c000000000000001,
taf=7400000000000001
(XEN) XenPage 00000000001146de: caf=c000000000000001,
taf=7400000000000001
(XEN) XenPage 00000000000bdc0e: caf=c000000000000001,
taf=7400000000000001
(XEN) XenPage 0000000000114592: caf=c000000000000001,
taf=7400000000000001
(XEN) XenPage 000000000011458f: caf=c000000000000001,
taf=7400000000000001
(XEN) XenPage 000000000011458c: caf=c000000000000001,
taf=7400000000000001
(XEN) VCPU information and callbacks for domain 34:
(XEN) VCPU0: CPU3 [has=F] flags=1 poll=0 upcall_pend = 00,
upcall_mask = 00 dirty_cpus={} cpu_affinity={3}
(XEN) paging assistance: hap, 4 levels
(XEN) No periodic timer
(XEN) VCPU1: CPU3 [has=F] flags=1 poll=0 upcall_pend = 00,
upcall_mask = 00 dirty_cpus={3} cpu_affinity={3}
(XEN) paging assistance: hap, 4 levels
(XEN) No periodic timer
This shows that the domain is indeed pinned to pcpu 3.
Am I misinterpreting the information, or does this indicate that the
(credit) scheduler is not obeying cpu_affinity? The virtual functions
seem to be passing network traffic correctly, so I assume that
interrupts are getting where they are supposed to go.
Another question, which may or may not be related: irq_cfg has a vector
and a cpu_mask. From this, I assume that the same interrupt must occupy
the same IDT entry on every pcpu it might be received on. Is there an
architectural reason why this should be the case, or is it just the way
Xen is coded?
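To make the assumption behind that question concrete: a single vector
field shared by every CPU in cpu_mask means the IRQ must be bound to the
same slot in each of those CPUs' vector-to-irq tables, and hence the
same IDT entry. A contrived sketch of that invariant (illustrative only;
the structure and names below are made up, not the real Xen
definitions):

/* Illustrative only: one vector field shared by every CPU in cpu_mask
 * implies the IRQ occupies the same IDT slot on each of those CPUs. */
#include <stdio.h>

#define NR_CPUS     24
#define NR_VECTORS  256

struct irq_cfg_model {
    unsigned int vector;            /* single vector for the whole mask */
    unsigned long cpu_mask;         /* CPUs this IRQ may be delivered on */
};

/* per-CPU vector -> irq tables, one per logical CPU */
static int vector_irq[NR_CPUS][NR_VECTORS];

static void bind_irq(int irq, struct irq_cfg_model *cfg)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (cfg->cpu_mask & (1UL << cpu))
            vector_irq[cpu][cfg->vector] = irq;  /* same slot everywhere */
}

int main(void)
{
    struct irq_cfg_model cfg = { .vector = 0x7e, .cpu_mask = 0x30 };
    bind_irq(845, &cfg);
    printf("cpu4 vec 0x7e -> irq %d, cpu5 vec 0x7e -> irq %d\n",
           vector_irq[4][0x7e], vector_irq[5][0x7e]);
    return 0;
}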
(Also, it seems that <asm/irq.h> and <xen/irq.h> both define struct
irq_cfg, and while one is strictly an extension of the other, there
appear to be no guards around them, meaning that sizeof(struct irq_cfg)
depends on which header file you include. I don't know whether this is
relevant, but it strikes me that code which gets confused about which
definition it is using could be computing on junk if it expects the
longer irq_cfg and actually gets the shorter one.)
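To make that hazard concrete, here is a contrived standalone example of
the pattern I mean (the member names below are made up, not the actual
Xen definitions): a translation unit that only sees the shorter
definition computes a different sizeof than one that sees the longer
one, so any code whose allocation or array indexing is sized on the
"wrong" irq_cfg could silently misread or corrupt data.

/* Contrived illustration (not the real Xen headers): imagine both of
 * these were called 'struct irq_cfg', one declared in <xen/irq.h> and a
 * longer one in <asm/irq.h>, with nothing preventing a file from seeing
 * only the shorter definition. */
#include <stdio.h>

struct irq_cfg_short {              /* what a generic header might declare */
    unsigned int vector;
};

struct irq_cfg_long {               /* what an arch header might declare */
    unsigned int vector;
    unsigned long cpu_mask;
    unsigned long old_cpu_mask;
    unsigned int move_in_progress;
};

int main(void)
{
    /* A file indexing an array of irq_cfg with the short sizeof, while
     * the array was allocated with the long sizeof (or vice versa),
     * would be reading and writing the wrong elements. */
    printf("sizeof(short) = %zu, sizeof(long) = %zu\n",
           sizeof(struct irq_cfg_short), sizeof(struct irq_cfg_long));
    return 0;
}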
--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com