I need help tracking down an IRQ SMP affinity problem.
Xen version: 3.4 unstable
dom0: Linux 2.6.30.3 (Debian)
domU: Linux 2.6.30.1 (Debian)
Hardware platform: HP ProLiant G6, dual-socket Xeon 5540, hyperthreading enabled in BIOS and kernel (total of 16 CPUs: 2 sockets * 4 cores per socket * 2 threads per core)
With vcpus < 5, I can change /proc/irq/<irq#>/smp_affinity and see the interrupts get routed to the proper CPU(s) by checking /proc/interrupts. With vcpus > 4, any change to /proc/irq/<irq#>/smp_affinity results in a complete loss of interrupts for <irq#>.
I noticed in the domU /var/log/kern.log that APIC routing changes from "flat" for vcpus=4 to "physical flat" for vcpus=5. Looking at the source code for linux-2.6.30.1/arch/x86/kernel/apic/probe_64.c, this switch occurs when "max_physical_apicid >= 8".
In the domU /var/log/kern.log and /proc/cpuinfo, only even-numbered APIC IDs (starting from 0) are used, so by the 5th CPU it is already at APIC ID 8, which triggers the physical flat APIC routing.
dom0 has all 16 CPUs available to it. The mapping between CPU numbers and APIC IDs is 1-to-1 (CPU0:APIC ID0 ... CPU15:APIC ID15). domU is configured with either vcpus=4 or vcpus=5. In both cases, the mapping uses only even numbers for the APIC IDs (CPU0:APIC ID0 ... CPU4:APIC ID8).
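To make the threshold concrete, here is a minimal Python sketch of the arithmetic. It models only the one condition quoted above (max_physical_apicid >= 8), not the rest of probe_64.c, and the function names are my own, not kernel symbols.

```python
# Models only the single condition quoted from probe_64.c:
# routing becomes "physical flat" once max_physical_apicid >= 8.
PHYSFLAT_THRESHOLD = 8

def apic_routing(apic_ids):
    """Return the routing mode name for a list of APIC IDs."""
    return "physical flat" if max(apic_ids) >= PHYSFLAT_THRESHOLD else "flat"

def even_apic_ids(vcpus):
    """APIC IDs as observed in domU: 0, 2, 4, ... (even numbers only)."""
    return [2 * cpu for cpu in range(vcpus)]

def consecutive_apic_ids(vcpus):
    """1-to-1 numbering as observed in dom0: 0, 1, 2, ..."""
    return list(range(vcpus))

if __name__ == "__main__":
    for vcpus in (4, 5, 8, 9):
        print(vcpus,
              apic_routing(even_apic_ids(vcpus)),
              apic_routing(consecutive_apic_ids(vcpus)))
```

With even-only IDs the highest APIC ID is 2 * (vcpus - 1), so vcpus=4 gives 6 (flat) and vcpus=5 gives 8 (physical flat), which matches what kern.log reports.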
I'm using an ATTO/PMC Tachyon-based Fibre Channel PCIe card on this platform. It uses PCI-MSI-edge for its interrupt. I use pciback.hide in my dom0 Xen 3.5 kernel stanza to pass the device directly to domU. I'm also using "iommu=1,no-intremap,passthrough" in the stanza. I'm able to see the device in dom0 via "lspci -vv", along with the MSI message address and data that have been programmed into the Tachyon registers; it uses IRQ 32. Regardless of changes to IRQ 32's SMP affinity in domU, the MSI message address and data as seen from dom0 do not change. I can only conclude that domU is running some sort of IRQ emulation.
# lspci -vv in dom0
07:00.0 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 05)
Subsystem: Atto Technology Device 003c
Interrupt: pin A routed to IRQ 32
Capabilities: [60] Message Signalled Interrupts: Mask- 64bit+ Queue=0/1 Enable+
Address: 00000000fee00000 Data: 40ba (dest ID=0, RH=DM=0, fixed interrupt, vector=0xba)
Kernel driver in use: pciback
In domU, the device has been remapped (intentionally in the dom0 config file) to bus 0, device 8 and can also be seen via "lspci -vv" with the same MSI message address but different data and using IRQ 48.
# lspci -vv in domU with vcpus=5
00:08.0 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 05)
Subsystem: Atto Technology Device 003c
Interrupt: pin A routed to IRQ 48
Capabilities: [60] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
Address: 00000000fee00000 Data: 4059 (dest ID=0, RH=DM=0, fixed interrupt, vector=0x59)
Kernel driver in use: hwdrv
Kernel modules: hbas-hw
At this point, the kernel driver for the device has been loaded and the number of interrupts can be seen in /proc/interrupts. The default IRQ SMP affinity has not been changed and yet the interrupts are all being routed to CPU0. This is for vcpus=5 (physical flat APIC routing). Changing IRQ 48's SMP affinity to any value results in a complete loss of all interrupts. domU and dom0 need to be rebooted to restore normal operation.
# cat /proc/irq/48/smp_affinity
1f
# cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3 CPU4
48: 60920 0 0 0 0 PCI-MSI-edge HW_TACHYON
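For reference, this is how I read the smp_affinity mask: a hex bitmask in which bit N selects CPU N. The small Python sketch below expands a mask into CPU numbers and checks whether it selects exactly one CPU. The helper names are hypothetical and it is only meant to illustrate the bit layout.

```python
def parse_affinity(mask_hex):
    """Parse an smp_affinity string (hex, optionally comma-grouped) to an int."""
    return int(mask_hex.strip().replace(",", ""), 16)

def cpus_from_affinity(mask_hex):
    """Return the CPU numbers whose bits are set in the mask."""
    mask = parse_affinity(mask_hex)
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]

def selects_single_cpu(mask_hex):
    """True if exactly one bit is set in the mask."""
    mask = parse_affinity(mask_hex)
    return mask != 0 and (mask & (mask - 1)) == 0

if __name__ == "__main__":
    print(cpus_from_affinity("1f"))   # [0, 1, 2, 3, 4]
    print(selects_single_cpu("1f"))   # False
```

So the default mask of 1f allows all five vcpus, yet every interrupt is delivered to CPU0.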
With vcpus=4 (flat APIC routing), IRQ 48's SMP affinity behaves as expected (each of the 4 bits in /proc/irq/48/smp_affinity corresponds to a CPU where the interrupts may be routed). The MSI message address and data have different attributes compared to vcpus=5. The address has dest ID=f (matches the default /proc/irq/48/smp_affinity) and RH=DM=1, and the data specifies lowest priority instead of fixed interrupt delivery.
# lspci -vv in domU with vcpus=4
00:08.0 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 05)
Subsystem: Atto Technology Device 003c
Interrupt: pin A routed to IRQ 48
Capabilities: [60] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
Address: 00000000fee0f00c Data: 4159 (dest ID=f, RH=DM=1, lowest priority interrupt, vector=0x59)
Kernel driver in use: hwdrv
Kernel modules: hbas-hw
# cat /proc/irq/48/smp_affinity
f
# cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3
48: 14082 19052 15337 14645 PCI-MSI-edge HW_TACHYON
Changing IRQ 48's SMP affinity to 8 shows that all the interrupts are being routed to CPU3 as expected and the MSI message address has changed to reflect the new dest ID while the vector stays the same.
# echo 8 > /proc/irq/48/smp_affinity
# cat /proc/interrupts
48: 14082 19052 15338 351361 PCI-MSI-edge HW_TACHYON
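To see where the new interrupts went, the two /proc/interrupts rows for IRQ 48 (before and after the affinity change) can be diffed per CPU. This is a rough Python sketch with a hypothetical helper name. It assumes the per-CPU counters are the leading numeric fields after the IRQ label.

```python
def parse_irq_counts(line):
    """Split a /proc/interrupts row into (irq label, per-CPU counts)."""
    fields = line.split()
    irq = fields[0].rstrip(":")
    counts = []
    for field in fields[1:]:
        if not field.isdigit():
            break  # reached the interrupt type / handler name columns
        counts.append(int(field))
    return irq, counts

before = "48: 14082 19052 15337 14645 PCI-MSI-edge HW_TACHYON"
after = "48: 14082 19052 15338 351361 PCI-MSI-edge HW_TACHYON"

_, old = parse_irq_counts(before)
_, new = parse_irq_counts(after)
deltas = [n - o for o, n in zip(old, new)]

if __name__ == "__main__":
    print(deltas)  # [0, 0, 1, 336716]
```

Nearly all of the new interrupts land on CPU3, which is what a mask of 8 should produce. The same diff is a quick way to confirm a total loss of interrupts, since every delta stays at zero.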
# lspci -vv in domU with vcpus=4
00:08.0 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 05)
Subsystem: Atto Technology Device 003c
Interrupt: pin A routed to IRQ 48
Capabilities: [60] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
Address: 00000000fee0800c Data: 4159 (dest ID=8, RH=DM=1, lowest priority interrupt, vector=0x59)
Kernel driver in use: hwdrv
Kernel modules: hbas-hw
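The fields in parentheses above are my own decode of the MSI address and data words, following the standard x86 MSI layout: destination ID in address bits 19:12, RH in bit 3, DM in bit 2, delivery mode in data bits 10:8, vector in data bits 7:0. A minimal Python sketch of that decode follows. The function name is hypothetical, and only the two delivery modes seen here are named.

```python
# Only the two delivery-mode encodings that appear in this report are named.
DELIVERY_MODES = {0: "fixed", 1: "lowest priority"}

def decode_msi(address, data):
    """Decode the x86 MSI address/data fields discussed in this report."""
    return {
        "dest_id": (address >> 12) & 0xFF,   # address bits 19:12
        "rh": (address >> 3) & 1,            # redirection hint
        "dm": (address >> 2) & 1,            # destination mode (1 = logical)
        "delivery": DELIVERY_MODES.get((data >> 8) & 0x7, "other"),
        "vector": data & 0xFF,               # data bits 7:0
    }

if __name__ == "__main__":
    samples = [
        (0xFEE00000, 0x4059),  # domU, vcpus=5
        (0xFEE0F00C, 0x4159),  # domU, vcpus=4, default affinity
        (0xFEE0800C, 0x4159),  # domU, vcpus=4, affinity set to 8
    ]
    for address, data in samples:
        print(hex(address), hex(data), decode_msi(address, data))
```

In flat mode the destination ID tracks the smp_affinity mask (f, then 8) with logical destination mode, while in physical flat mode it stays at physical destination 0.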
My hunch is that there is something wrong with physical flat APIC routing in domU. If I boot this same platform to straight Linux 2.6.30.1 (no Xen), /var/log/kern.log shows that it too is using physical flat APIC routing, which is expected since it has a total of 16 CPUs. Unlike domU though, changing the IRQ SMP affinity to any one-hot value (only one bit out of 16 is set to 1) behaves as expected. A non-one-hot value results in all interrupts being routed to CPU0, but at least the interrupts are not lost.
One of my questions is "Why does domU use only even-numbered APIC IDs?" If it used odd numbers as well, then physical flat APIC routing would only trigger when vcpus > 8 (the highest APIC ID would not reach 8 until the 9th CPU).
I welcome any suggestions on how to pursue this problem or hopefully, someone will say that a patch for this already exists.
Thanks.
Dante Cinco