On 03/08/11 12:51, Andrew Cooper wrote:
> Hello,
>
> I am currently investigating an issue with MSI allocation/deallocation
> which appears to be an MSI resource leak in Xen. This is XenServer 6.0
> based on Xen 4.1.1, with no changesets I can see affecting the relevant
> Xen codepaths.
>
> The box in question is a NetScaler SDX box with 24 logical cores (2
> Nehalem sockets, 6 cores each, hyperthreading), 96GB RAM, with 4
> dual-port Intel 10G ixgbe cards (and two SSL 'Xcelerator' cards, but I
> have disabled these for debugging purposes). Each of the 8 NIC ports
> exports 40 virtual functions. There are 40 (identical) VMs which have
> 1 VF from each NIC passed through to them, giving each VM 8 VFs. Each
> VF itself uses 3 MSI-X interrupts. Therefore, for all VMs to be
> working correctly, there are 3 IRQs per VF x 8 VFs x 40 VMs = 960
> MSI-X interrupts.
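>
> To put that figure in context, a back-of-envelope sum (the vector
> counts are approximate -- x86 has 256 IDT entries per CPU, of which
> only a subset is usable for dynamic device vectors):
>
>     /* Rough MSI-X vector budget for this box (illustrative only). */
>     #include <stdio.h>
>
>     int main(void)
>     {
>         int irqs = 3 * 8 * 40;   /* IRQs/VF * VFs/VM * VMs = 960    */
>         int pcpus = 24;          /* 2 sockets * 6 cores * 2 threads */
>         int vecs_per_cpu = 200;  /* ~256 minus reserved entries     */
>
>         printf("IRQs needed:                   %d\n", irqs);
>         printf("Vectors, spread across CPUs:   %d\n",
>                pcpus * vecs_per_cpu);
>         printf("Vectors, if each IRQ occupies the same slot on "
>                "every CPU: %d\n", vecs_per_cpu);
>         return 0;
>     }
>
> Spread across CPUs there is plenty of headroom (~4800 slots), but if
> each IRQ ends up consuming its vector on every CPU, the effective
> budget collapses to ~200 and 960 cannot possibly fit.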
>
> The symptoms are: reboot the VMs a couple of times, and eventually Xen
> says "(XEN) ../physdev.c:140: domXXX: can't create irq for msi!". After
> adding extra debugging, the call to create_irq() was returning
> -ENOSPC. At the point at which create_irq() was failing, huge numbers
> of irqs were listed by the 'i' debug key with a descriptor affinity
> mask of all CPUs, which I believe is interfering with the calculations
> in __assign_irq_vector().
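>
> A much-simplified sketch of the kind of search I believe
> __assign_irq_vector() performs (illustrative only; the names, bounds
> and data layout here are mine, not the real Xen code):
>
>     #include <stdio.h>
>
>     #define NR_CPUS_SK 24
>     #define FIRST_VEC  0x20
>     #define LAST_VEC   0xef
>
>     static int used[NR_CPUS_SK][256];   /* per-CPU IDT occupancy */
>
>     /* Find a vector free on EVERY listed CPU, then claim it there. */
>     static int assign_vector(const int *cpus, int n)
>     {
>         for (int vec = FIRST_VEC; vec <= LAST_VEC; vec++) {
>             int free = 1;
>             for (int i = 0; i < n; i++)
>                 if (used[cpus[i]][vec]) { free = 0; break; }
>             if (!free)
>                 continue;
>             for (int i = 0; i < n; i++)
>                 used[cpus[i]][vec] = 1;
>             return vec;
>         }
>         return -1;   /* where the real code returns -ENOSPC */
>     }
>
>     int main(void)
>     {
>         int all[NR_CPUS_SK], ok = 0;
>         for (int i = 0; i < NR_CPUS_SK; i++)
>             all[i] = i;
>         /* An IRQ whose mask spans all CPUs burns one vector in
>          * every CPU's IDT at once: */
>         while (assign_vector(all, NR_CPUS_SK) != -1)
>             ok++;
>         printf("all-CPU IRQs before -ENOSPC: %d\n", ok);  /* 208 */
>         return 0;
>     }
>
> If enough descriptors report an affinity of all CPUs, every search has
> to find a vector simultaneously free on all 24 CPUs, and the space
> runs out long before 960 allocations.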
>
> I suspected that this might be caused by scheduling under load moving
> VCPUs across PCPUs, resulting in the irq descriptor being written into
> every PCPU's IDT. As a result, I pinned each VM to a specific PCPU in
> the hope that the problem would go away.
>
> When starting each VM individually, the problem appears to go away.
> However, when starting all VMs at once, there are still some irqs with
> an affinity mask of all CPUs.
>
> Specifically, one case is this (I added extra debugging to include
> irq_cfg->cpu_mask in the 'i' debug key output):
>
> (XEN) IRQ: 845 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff cfg_aff:00000000,00000000,00000000,00000010 vec:7e type=PCI-MSI status=00000050 in-flight=0 domain-list=34: 55(----),
> (XEN) IRQ: 846 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff cfg_aff:00000000,00000000,00000000,00000020 vec:86 type=PCI-MSI status=00000050 in-flight=0 domain-list=34: 54(----),
> (XEN) IRQ: 847 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff cfg_aff:00000000,00000000,00000000,00000020 vec:96 type=PCI-MSI status=00000050 in-flight=0 domain-list=34: 53(----),
> (XEN) IRQ: 848 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff cfg_aff:00000000,00000000,00000000,00000020 vec:be type=PCI-MSI status=00000050 in-flight=0 domain-list=34: 52(----),
> (XEN) IRQ: 849 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff cfg_aff:00000000,00000000,00000000,00000020 vec:c6 type=PCI-MSI status=00000050 in-flight=0 domain-list=34: 51(----),
> (XEN) IRQ: 850 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff cfg_aff:00000000,00000000,00000000,00000020 vec:ce type=PCI-MSI status=00000050 in-flight=0 domain-list=34: 50(----),
> (XEN) IRQ: 851 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff cfg_aff:00000000,00000000,00000000,00000020 vec:b7 type=PCI-MSI status=00000050 in-flight=0 domain-list=34: 49(----),
> (XEN) IRQ: 852 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff cfg_aff:00000000,00000000,00000000,00000020 vec:cf type=PCI-MSI status=00000050 in-flight=0 domain-list=34: 48(----),
> (XEN) IRQ: 853 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff cfg_aff:00000000,00000000,00000000,00000020 vec:d7 type=PCI-MSI status=00000050 in-flight=0 domain-list=34: 47(----),
> (XEN) IRQ: 854 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff cfg_aff:00000000,00000000,00000000,00000020 vec:d9 type=PCI-MSI status=00000050 in-flight=0 domain-list=34: 46(----),
> (XEN) IRQ: 855 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff cfg_aff:00000000,00000000,00000000,00000020 vec:22 type=PCI-MSI status=00000050 in-flight=0 domain-list=34: 45(----),
> (XEN) IRQ: 856 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff cfg_aff:00000000,00000000,00000000,00000020 vec:2a type=PCI-MSI status=00000050 in-flight=0 domain-list=34: 44(----),
> (XEN) IRQ: 857 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff cfg_aff:00000000,00000000,00000000,00000010 vec:3c type=PCI-MSI status=00000050 in-flight=0 domain-list=34: 43(----),
> (XEN) IRQ: 858 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff cfg_aff:00000000,00000000,00000000,00000020 vec:4c type=PCI-MSI status=00000050 in-flight=0 domain-list=34: 42(----),
> (XEN) IRQ: 859 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff cfg_aff:00000000,00000000,00000000,00000020 vec:54 type=PCI-MSI status=00000050 in-flight=0 domain-list=34: 41(----),
> (XEN) IRQ: 860 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff cfg_aff:00000000,00000000,00000000,00000020 vec:b5 type=PCI-MSI status=00000050 in-flight=0 domain-list=34: 40(----),
> (XEN) IRQ: 861 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff cfg_aff:00000000,00000000,00000000,00000020 vec:ae type=PCI-MSI status=00000050 in-flight=0 domain-list=34: 39(----),
> (XEN) IRQ: 862 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff cfg_aff:00000000,00000000,00000000,00000020 vec:de type=PCI-MSI status=00000050 in-flight=0 domain-list=34: 38(----),
> (XEN) IRQ: 863 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff cfg_aff:00000000,00000000,00000000,00000010 vec:55 type=PCI-MSI status=00000050 in-flight=0 domain-list=34: 37(----),
> (XEN) IRQ: 864 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff cfg_aff:00000000,00000000,00000000,00000020 vec:9d type=PCI-MSI status=00000050 in-flight=0 domain-list=34: 36(----),
> (XEN) IRQ: 865 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff cfg_aff:00000000,00000000,00000000,00000020 vec:46 type=PCI-MSI status=00000050 in-flight=0 domain-list=34: 35(----),
> (XEN) IRQ: 866 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff cfg_aff:00000000,00000000,00000000,00000020 vec:a6 type=PCI-MSI status=00000050 in-flight=0 domain-list=34: 34(----),
> (XEN) IRQ: 867 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff cfg_aff:00000000,00000000,00000000,00000020 vec:5f type=PCI-MSI status=00000050 in-flight=0 domain-list=34: 33(----),
> (XEN) IRQ: 868 desc_aff:ffffffff,ffffffff,ffffffff,ffffffff cfg_aff:00000000,00000000,00000000,00000020 vec:7f type=PCI-MSI status=00000050 in-flight=0 domain-list=34: 32(----),
>
> This shows all IRQs for dom34. The descriptors claim full affinity,
> but each irq_cfg cpu_mask is a single CPU, either 00000010 or 00000020
> (i.e. CPU 4 or CPU 5).
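>
> To decode the masks above, a trivial helper (hypothetical, only to
> show how I am reading the hex):
>
>     #include <stdio.h>
>
>     /* Print the CPU numbers set in the low word of a cpumask dump. */
>     static void print_cpus(unsigned long mask)
>     {
>         for (int cpu = 0; mask; cpu++, mask >>= 1)
>             if (mask & 1)
>                 printf(" %d", cpu);
>         printf("\n");
>     }
>
>     int main(void)
>     {
>         printf("0x10 ->"); print_cpus(0x10);  /* -> 4 */
>         printf("0x20 ->"); print_cpus(0x20);  /* -> 5 */
>         return 0;
>     }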
>
> The domain dump for dom34 is:
> (XEN) General information for domain 34:
> (XEN) refcnt=3 dying=0 nr_pages=131065 xenheap_pages=8 dirty_cpus={} max_pages=133376
> (XEN) handle=97ef6eef-69c2-024c-1bbb-a150ca668691 vm_assist=00000000
> (XEN) paging assistance: hap refcounts translate external
> (XEN) Rangesets belonging to domain 34:
> (XEN) I/O Ports { }
> (XEN) Interrupts { 32-55 }
> (XEN) I/O Memory { f9f00-f9f03, fa001-fa003, fa19c-fa19f,
> fa29d-fa29f, fa39c-fa39f, fa49d-fa49f, fa59c-fa59f, fa69d-fa69f,
> fa79c-fa79f, fa89d-fa89f, fa99c-fa99f, faa9d-faa9f, fab9c-fab9f,
> fac9d-fac9f, fad9c-fad9f, fae9d-fae9f }
> (XEN) Memory pages belonging to domain 34:
> (XEN) DomPage list too long to display
> (XEN) P2M entry stats:
> (XEN) L1: 1590 entries, 6512640 bytes
> (XEN) L2: 253 entries, 530579456 bytes
> (XEN) PoD entries=0 cachesize=0 superpages=0
> (XEN) XenPage 00000000001146e1: caf=c000000000000001, taf=7400000000000001
> (XEN) XenPage 00000000001146e0: caf=c000000000000001, taf=7400000000000001
> (XEN) XenPage 00000000001146df: caf=c000000000000001, taf=7400000000000001
> (XEN) XenPage 00000000001146de: caf=c000000000000001, taf=7400000000000001
> (XEN) XenPage 00000000000bdc0e: caf=c000000000000001, taf=7400000000000001
> (XEN) XenPage 0000000000114592: caf=c000000000000001, taf=7400000000000001
> (XEN) XenPage 000000000011458f: caf=c000000000000001, taf=7400000000000001
> (XEN) XenPage 000000000011458c: caf=c000000000000001, taf=7400000000000001
> (XEN) VCPU information and callbacks for domain 34:
> (XEN) VCPU0: CPU3 [has=F] flags=1 poll=0 upcall_pend = 00, upcall_mask = 00 dirty_cpus={} cpu_affinity={3}
> (XEN) paging assistance: hap, 4 levels
> (XEN) No periodic timer
> (XEN) VCPU1: CPU3 [has=F] flags=1 poll=0 upcall_pend = 00, upcall_mask = 00 dirty_cpus={3} cpu_affinity={3}
> (XEN) paging assistance: hap, 4 levels
> (XEN) No periodic timer
>
> This shows that the domain is indeed pinned to PCPU 3.
>
> Am I misinterpreting the information, or does this indicate that the
> (credit) scheduler is not obeying cpu_affinity? The virtual functions
> seem to be passing network traffic correctly, so I assume that
> interrupts are arriving where they are supposed to.
>
>
> Another question which may or may not be related: irq_cfg has a vector
> and a cpu_mask. From this, I assume that the same interrupt must occupy
> the same IDT entry on every PCPU it might be received on. Is there an
> architectural reason why this should be the case, or is it just the way
> Xen is coded?
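>
> The shape I am describing, as a sketch (the field names are from my
> reading of the code, and the typedef is a stand-in; not a verbatim
> copy):
>
>     typedef unsigned long cpumask_t[4];  /* stand-in for Xen's type */
>
>     struct irq_cfg {
>         int       vector;    /* a single IDT slot number...           */
>         cpumask_t cpu_mask;  /* ...used on every CPU set in this mask */
>     };
>
> With only one vector field per irq_cfg, the interrupt has to sit in
> the same IDT slot on every CPU in cpu_mask.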
>
> (Also, it seems that <asm/irq.h> and <xen/irq.h> both define struct
> irq_cfg, and while one is strictly an extension of the other, there
> appear to be no guards around them, meaning that sizeof(irq_cfg)
> depends on which header file you include. I don't know whether this is
> relevant, but it strikes me that code getting confused about which
> definition it is using could be computing on junk if it expects the
> longer irq_cfg and actually gets the shorter one.)
Correction: I wasn't reading the source closely enough. There are
#ifdef __ia64__ guards around this.
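
For the record, the guard arrangement I mean looks roughly like this
(a schematic, not a verbatim copy of the headers):

    /* The extended definition is only compiled in for ia64, so any
     * one build sees exactly one definition of struct irq_cfg and
     * sizeof(struct irq_cfg) stays consistent: */
    #ifdef __ia64__
    struct irq_cfg {
        int vector;
        /* ia64-specific extension fields live here */
    };
    #endif
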
--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel