* IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
@ 2009-10-08  0:08 Cinco, Dante
  2009-10-08 16:07 ` Bruce Edge
  2009-10-08 18:05 ` Keir Fraser
  0 siblings, 2 replies; 55+ messages in thread
From: Cinco, Dante @ 2009-10-08  0:08 UTC
  To: xen-devel@lists.xensource.com



I need help tracking down an IRQ SMP affinity problem.

Xen version: 3.4 unstable
dom0: Linux 2.6.30.3 (Debian)
domU: Linux 2.6.30.1 (Debian)
Hardware platform: HP ProLiant G6, dual-socket Xeon 5540, hyperthreading enabled in BIOS and kernel (total of 16 CPUs: 2 sockets * 4 cores per socket * 2 threads per core)

With vcpus < 5, I can change /proc/irq/<irq#>/smp_affinity and see the interrupts get routed to the proper CPU(s) by checking /proc/interrupts. With vcpus > 4, any change to /proc/irq/<irq#>/smp_affinity results in a complete loss of interrupts for <irq#>.

I noticed in the domU /var/log/kern.log that APIC routing changes from "flat" for vcpus=4 to "physical flat" for vcpus=5. Looking at the source code for linux-2.6.30.1/arch/x86/kernel/apic/probe_64.c, this switch occurs when "max_physical_apicid >= 8". In the domU /var/log/kern.log and /proc/cpuinfo, only even-numbered APIC IDs (starting from 0) are used, so by the 5th CPU the APIC ID is already 8, which triggers physical flat APIC routing.

dom0 has all 16 CPUs available to it. The mapping between CPU numbers and APIC IDs is 1-to-1 (CPU0:APIC ID0 ... CPU15:APIC ID15). domU is configured with either vcpus=4 or vcpus=5. In both cases, the mapping uses only even numbers for the APIC IDs (CPU0:APIC ID0 ... CPU4:APIC ID8 for vcpus=5).
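
A minimal sketch of the threshold arithmetic described above. It assumes only the rule quoted from probe_64.c (physical flat is chosen when the highest APIC ID is 8 or more) and the even-only numbering seen in domU; the function names and the `stride` parameter are hypothetical and are not taken from the kernel or Xen.

```python
def apic_ids(vcpus, stride):
    """APIC IDs for `vcpus` CPUs when neighbouring CPUs are `stride` IDs apart.

    stride=2 models the even-only IDs seen in domU,
    stride=1 models the 1-to-1 mapping seen in dom0.
    """
    return [cpu * stride for cpu in range(vcpus)]


def routing_mode(ids, threshold=8):
    """Apply the rule quoted from probe_64.c: max_physical_apicid >= 8."""
    return "physical flat" if max(ids) >= threshold else "flat"


def first_physflat_vcpus(stride, limit=64):
    """Smallest vcpu count that switches to physical flat, or None."""
    for vcpus in range(1, limit + 1):
        if routing_mode(apic_ids(vcpus, stride)) == "physical flat":
            return vcpus
    return None


if __name__ == "__main__":
    for vcpus in (4, 5):
        ids = apic_ids(vcpus, stride=2)
        print(vcpus, ids, routing_mode(ids))
```

Under these assumptions, even-only IDs reach the threshold at 5 vcpus, while consecutive IDs would not reach it until 9 vcpus.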

I'm using an ATTO/PMC Tachyon-based Fibre Channel PCIe card on this platform. It uses PCI-MSI-edge for its interrupt. I use pciback.hide in my dom0 Xen 3.5 kernel stanza to pass the device directly to domU. I'm also using "iommu=1,no-intremap,passthrough" in the stanza. In dom0, "lspci -vv" shows the device using IRQ 32, along with the MSI message address and data that have been programmed into the Tachyon registers. Regardless of changes to IRQ 32's SMP affinity in domU, the MSI message address and data as seen from dom0 do not change. I can only conclude that domU is running some sort of IRQ emulation.

# lspci -vv in dom0
07:00.0 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 05)
        Subsystem: Atto Technology Device 003c
        Interrupt: pin A routed to IRQ 32
        Capabilities: [60] Message Signalled Interrupts: Mask- 64bit+ Queue=0/1 Enable+
                Address: 00000000fee00000  Data: 40ba (dest ID=0, RH=DM=0, fixed interrupt, vector=0xba)
        Kernel driver in use: pciback

In domU, the device has been remapped (intentionally in the dom0 config file) to bus 0, device 8 and can also be seen via "lspci -vv" with the same MSI message address but different data and using IRQ 48.

# lspci -vv in domU with vcpus=5
00:08.0 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 05)
        Subsystem: Atto Technology Device 003c
        Interrupt: pin A routed to IRQ 48
        Capabilities: [60] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
                Address: 00000000fee00000  Data: 4059 (dest ID=0, RH=DM=0, fixed interrupt, vector=0x59)
        Kernel driver in use: hwdrv
        Kernel modules: hbas-hw

At this point, the kernel driver for the device has been loaded and the number of interrupts can be seen in /proc/interrupts. The default IRQ SMP affinity has not been changed and yet the interrupts are all being routed to CPU0. This is for vcpus=5 (physical flat APIC routing). Changing IRQ 48's SMP affinity to any value results in a complete loss of all interrupts. domU and dom0 need to be rebooted to restore normal operation.
# cat /proc/irq/48/smp_affinity
1f
# cat /proc/interrupts
            CPU0       CPU1       CPU2       CPU3       CPU4
  48:      60920          0          0          0          0   PCI-MSI-edge      HW_TACHYON

With vcpus=4 (flat APIC routing), IRQ 48's SMP affinity behaves as expected (each of the 4 bits in /proc/irq/48/smp_affinity corresponds to a CPU where the interrupts will be routed). The MSI message address and data have different attributes compared to vcpus=5. The address has dest ID=f (matches the default /proc/irq/48/smp_affinity) and RH=DM=1, and the data uses lowest priority instead of fixed delivery.

# lspci -vv in domU with vcpus=4
00:08.0 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 05)
        Subsystem: Atto Technology Device 003c
        Interrupt: pin A routed to IRQ 48
        Capabilities: [60] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
                Address: 00000000fee0f00c  Data: 4159 (dest ID=f, RH=DM=1, lowest priority interrupt, vector=0x59)
        Kernel driver in use: hwdrv
        Kernel modules: hbas-hw
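
The fields that lspci prints in parentheses can be recomputed from the raw address and data values. The sketch below assumes the standard x86 MSI layout (destination ID in address bits 19:12, RH in bit 3, DM in bit 2, vector in data bits 7:0, delivery mode in data bits 10:8). The function name and the returned dictionary are hypothetical and for illustration only.

```python
# Only the two delivery modes that appear in this thread are named.
DELIVERY_MODES = {0: "fixed", 1: "lowest priority"}


def decode_msi(address, data):
    """Split an x86 MSI address/data pair into the fields lspci reports."""
    return {
        "dest_id": (address >> 12) & 0xFF,  # address bits 19:12
        "rh": (address >> 3) & 1,           # redirection hint
        "dm": (address >> 2) & 1,           # destination mode
        "vector": data & 0xFF,              # data bits 7:0
        "delivery": DELIVERY_MODES.get((data >> 8) & 0x7, "other"),
    }


if __name__ == "__main__":
    # vcpus=5 (physical flat) and vcpus=4 (flat) values from the lspci output.
    print(decode_msi(0xFEE00000, 0x4059))
    print(decode_msi(0xFEE0F00C, 0x4159))
```

Decoding both pairs side by side makes the difference between the two routing modes visible without reading the hex by eye.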

# cat /proc/irq/48/smp_affinity
f
# cat /proc/interrupts
            CPU0       CPU1       CPU2       CPU3
  48:      14082      19052      15337      14645   PCI-MSI-edge      HW_TACHYON

Changing IRQ 48's SMP affinity to 8 shows that all the interrupts are being routed to CPU3 as expected and the MSI message address has changed to reflect the new dest ID while the vector stays the same.

# echo 8 > /proc/irq/48/smp_affinity
# cat /proc/interrupts
  48:      14082      19052      15338     351361   PCI-MSI-edge      HW_TACHYON

# lspci -vv in domU with vcpus=4
00:08.0 Fibre Channel: PMC-Sierra Inc. Device 8032 (rev 05)
        Subsystem: Atto Technology Device 003c
        Interrupt: pin A routed to IRQ 48
        Capabilities: [60] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable+
                Address: 00000000fee0800c  Data: 4159 (dest ID=8, RH=DM=1, lowest priority interrupt, vector=0x59)
        Kernel driver in use: hwdrv
        Kernel modules: hbas-hw

My hunch is that there is something wrong with physical flat APIC routing in domU. If I boot this same platform to straight Linux 2.6.30.1 (no Xen), /var/log/kern.log shows that it too is using physical flat APIC routing, which is expected since it has a total of 16 CPUs. Unlike domU though, changing the IRQ SMP affinity to any one-hot value (only one bit out of 16 is set to 1) behaves as expected. A non-one-hot value results in all interrupts being routed to CPU0, but at least the interrupts are not lost.
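
For reference, a small sketch of how an smp_affinity value maps to CPU numbers and how to check whether it is one-hot. It assumes the usual hexadecimal bitmask format of /proc/irq/<irq#>/smp_affinity (bit N selects CPU N, with commas separating 32-bit groups on larger systems). Both helper names are hypothetical.

```python
def _parse_mask(mask_hex):
    """Turn an smp_affinity string such as '1f' or '0002' into an int."""
    return int(mask_hex.strip().replace(",", ""), 16)


def cpus_in_mask(mask_hex):
    """List the CPU numbers whose bits are set in the mask."""
    mask = _parse_mask(mask_hex)
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


def is_one_hot(mask_hex):
    """True when exactly one bit of the mask is set."""
    mask = _parse_mask(mask_hex)
    return mask != 0 and (mask & (mask - 1)) == 0


if __name__ == "__main__":
    for value in ("1f", "f", "8"):
        print(value, cpus_in_mask(value), is_one_hot(value))
```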

One of my questions is "Why does domU use only even-numbered APIC IDs?" If it also used odd-numbered IDs (that is, consecutive IDs), then physical flat APIC routing would only trigger when vcpus > 8.

I welcome any suggestions on how to pursue this problem or hopefully, someone will say that a patch for this already exists.

Thanks.

Dante Cinco



* RE: IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem)
@ 2009-10-16  1:38 Cinco, Dante
  2009-10-16  2:34 ` Qing He
  0 siblings, 1 reply; 55+ messages in thread
From: Cinco, Dante @ 2009-10-16  1:38 UTC
  To: Qing He; +Cc: xen-devel@lists.xensource.com, Keir Fraser,
	xiantao.zhang@intel.com

I'm still trying to track down the problem of lost interrupts when I change /proc/irq/<irq#>/smp_affinity in domU. I'm now at Xen 3.5-unstable changeset 20320 and using pvops dom0 2.6.31.1.

In domU, my PCI devices are at virtual slots 5, 6, 7 and 8 so I use "lspci -vv" to get their respective IRQs and MSI message address/data and I can also see their IRQs in /proc/interrupts (I'm not showing all 16 CPUs):

lspci -vv -s 00:05.0 | grep IRQ; lspci -vv -s 00:06.0 | grep IRQ; lspci -vv -s 00:07.0 | grep IRQ; lspci -vv -s 00:08.0 | grep IRQ
        Interrupt: pin A routed to IRQ 48
        Interrupt: pin B routed to IRQ 49
        Interrupt: pin C routed to IRQ 50
        Interrupt: pin D routed to IRQ 51
lspci -vv -s 00:05.0 | grep Address; lspci -vv -s 00:06.0 | grep Address; lspci -vv -s 00:07.0 | grep Address; lspci -vv -s 00:08.0 | grep Address
                Address: 00000000fee00000  Data: 4071 (vector=113)
                Address: 00000000fee00000  Data: 4089 (vector=137)
                Address: 00000000fee00000  Data: 4099 (vector=153)
                Address: 00000000fee00000  Data: 40a9 (vector=169)
egrep '(HW_TACHYON|CPU0)' /proc/interrupts 
            CPU0       CPU1       
  48:    1571765          0          PCI-MSI-edge      HW_TACHYON
  49:    3204403          0          PCI-MSI-edge      HW_TACHYON
  50:    2643008          0          PCI-MSI-edge      HW_TACHYON
  51:    3270322          0          PCI-MSI-edge      HW_TACHYON

In dom0, my PCI devices show up as a 4-function device: 0:07:0.0, 0:07:0.1, 0:07:0.2, 0:07:0.3 and I also use "lspci -vv" to get the IRQs and MSI info:

lspci -vv -s 0:07:0.0 | grep IRQ;lspci -vv -s 0:07:0.1 | grep IRQ;lspci -vv -s 0:07:0.2 | grep IRQ;lspci -vv -s 0:07:0.3 | grep IRQ
        Interrupt: pin A routed to IRQ 11
        Interrupt: pin B routed to IRQ 10
        Interrupt: pin C routed to IRQ 7
        Interrupt: pin D routed to IRQ 5
lspci -vv -s 0:07:0.0 | grep Address;lspci -vv -s 0:07:0.1 | grep Address;lspci -vv -s 0:07:0.2 | grep Address;lspci -vv -s 0:07:0.3 | grep Address
                Address: 00000000fee00000  Data: 403c (vector=60)
                Address: 00000000fee00000  Data: 4044 (vector=68)
                Address: 00000000fee00000  Data: 404c (vector=76)
                Address: 00000000fee00000  Data: 4054 (vector=84)

I used the "Ctrl-a" "Ctrl-a" "Ctrl-a" "i" key sequence from the Xen console to print the guest interrupt information and the PCI devices. The vectors shown here are actually the vectors as seen from dom0 so I don't understand the label "Guest interrupt information." Meanwhile, the IRQs (74 - 77) do not match those from dom0 (11, 10, 7, 5) or domU (48, 49, 50, 51) as seen by "lspci -vv" but they do match those reported by the "Ctrl-a" key sequence followed by "Q" for PCI devices.

(XEN) Guest interrupt information:
(XEN)    IRQ:  74, IRQ affinity:0x00000001, Vec: 60 type=PCI-MSI         status=00000010 in-flight=0 domain-list=1: 79(----),
(XEN)    IRQ:  75, IRQ affinity:0x00000001, Vec: 68 type=PCI-MSI         status=00000010 in-flight=0 domain-list=1: 78(----),
(XEN)    IRQ:  76, IRQ affinity:0x00000001, Vec: 76 type=PCI-MSI         status=00000010 in-flight=0 domain-list=1: 77(----),
(XEN)    IRQ:  77, IRQ affinity:0x00000001, Vec: 84 type=PCI-MSI         status=00000010 in-flight=0 domain-list=1: 76(----),

(XEN) ==== PCI devices ====
(XEN) 07:00.3 - dom 1   - MSIs < 77 >
(XEN) 07:00.2 - dom 1   - MSIs < 76 >
(XEN) 07:00.1 - dom 1   - MSIs < 75 >
(XEN) 07:00.0 - dom 1   - MSIs < 74 >
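
The "Guest interrupt information" lines can be split into fields for easier comparison with the other listings. This is a sketch that assumes the line format is exactly as printed above. The interpretation of the number after "domain-list=1:" as the pirq follows the observation made in this message rather than any documented format, and the function name is hypothetical.

```python
import re

GUEST_IRQ_LINE = re.compile(
    r"IRQ:\s*(?P<irq>\d+), IRQ affinity:0x(?P<affinity>[0-9a-fA-F]+), "
    r"Vec:\s*(?P<vec>\d+)\s+type=(?P<type>\S+)\s+"
    r"status=(?P<status>[0-9a-fA-F]+)\s+in-flight=(?P<in_flight>\d+)\s+"
    r"domain-list=(?P<domain>\d+):\s*(?P<pirq>\d+)\((?P<flags>[^)]*)\)"
)


def parse_guest_irq_line(line):
    """Return the fields of one '(XEN) IRQ: ...' line, or None if no match."""
    m = GUEST_IRQ_LINE.search(line)
    if m is None:
        return None
    return {
        "irq": int(m["irq"]),
        "affinity": int(m["affinity"], 16),
        "vec": int(m["vec"]),          # printed in decimal
        "type": m["type"],
        "in_flight": int(m["in_flight"]),
        "domain": int(m["domain"]),
        "pirq": int(m["pirq"]),        # assumed meaning, see lead-in
        "flags": m["flags"],
    }


if __name__ == "__main__":
    sample = ("(XEN)    IRQ:  74, IRQ affinity:0x00000001, Vec: 60 "
              "type=PCI-MSI         status=00000010 in-flight=0 "
              "domain-list=1: 79(----),")
    print(parse_guest_irq_line(sample))
```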

If I look at /var/log/xen/qemu-dm-dpm.log, I see these 4 lines showing the pirqs, which match those in the last column of the guest interrupt information:

pt_msi_setup: msi mapped with pirq 4f (79)
pt_msi_setup: msi mapped with pirq 4e (78)
pt_msi_setup: msi mapped with pirq 4d (77)
pt_msi_setup: msi mapped with pirq 4c (76)

The gvecs (71, 89, 99, a9) match the vectors as seen by lspci in domU:

pt_msgctrl_reg_write: guest enabling MSI, disable MSI-INTx translation
pt_msi_update: Update msi with pirq 4f gvec 71 gflags 0
pt_msgctrl_reg_write: guest enabling MSI, disable MSI-INTx translation
pt_msi_update: Update msi with pirq 4e gvec 89 gflags 0
pt_msgctrl_reg_write: guest enabling MSI, disable MSI-INTx translation
pt_msi_update: Update msi with pirq 4d gvec 99 gflags 0
pt_msgctrl_reg_write: guest enabling MSI, disable MSI-INTx translation
pt_msi_update: Update msi with pirq 4c gvec a9 gflags 0
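
The qemu-dm log prints pirq and gvec in hexadecimal, while lspci and the Xen console print decimal vectors, so a small converter helps when cross-checking. This sketch assumes the pt_msi_update line format shown above and also assumes gflags is hexadecimal, which the log does not confirm. The function name is hypothetical.

```python
import re

PT_MSI_UPDATE = re.compile(
    r"pt_msi_update: Update msi with pirq ([0-9a-f]+) "
    r"gvec ([0-9a-f]+) gflags ([0-9a-f]+)"
)


def parse_pt_msi_update(line):
    """Return pirq, gvec and gflags as integers, or None if no match."""
    m = PT_MSI_UPDATE.search(line)
    if m is None:
        return None
    pirq, gvec, gflags = (int(group, 16) for group in m.groups())
    return {"pirq": pirq, "gvec": gvec, "gflags": gflags}


if __name__ == "__main__":
    rec = parse_pt_msi_update(
        "pt_msi_update: Update msi with pirq 4f gvec 71 gflags 0")
    # pirq 0x4f is 79 and gvec 0x71 is 113, matching the domU lspci vector.
    print(rec)
```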

I see these same pirqs in the output of "xm dmesg":

(XEN) [VT-D]iommu.c:1289:d0 domain_context_unmap:PCIe: bdf = 7:0.0
(XEN) [VT-D]iommu.c:1175:d0 domain_context_mapping:PCIe: bdf = 7:0.0
(XEN) [VT-D]io.c:291:d0 VT-d irq bind: m_irq = 4f device = 5 intx = 0
(XEN) [VT-D]iommu.c:1289:d0 domain_context_unmap:PCIe: bdf = 7:0.1
(XEN) [VT-D]iommu.c:1175:d0 domain_context_mapping:PCIe: bdf = 7:0.1
(XEN) [VT-D]io.c:291:d0 VT-d irq bind: m_irq = 4e device = 6 intx = 0
(XEN) [VT-D]iommu.c:1289:d0 domain_context_unmap:PCIe: bdf = 7:0.2
(XEN) [VT-D]iommu.c:1175:d0 domain_context_mapping:PCIe: bdf = 7:0.2
(XEN) [VT-D]io.c:291:d0 VT-d irq bind: m_irq = 4d device = 7 intx = 0
(XEN) [VT-D]iommu.c:1289:d0 domain_context_unmap:PCIe: bdf = 7:0.3
(XEN) [VT-D]iommu.c:1175:d0 domain_context_mapping:PCIe: bdf = 7:0.3
(XEN) [VT-D]io.c:291:d0 VT-d irq bind: m_irq = 4c device = 8 intx = 0

The machine_gsi values match the pirqs, while the m_irq values match the IRQs reported by lspci in dom0. What are the guest_gsi values?

(XEN) io.c:316:d0 pt_irq_destroy_bind_vtd: machine_gsi=79 guest_gsi=36, device=5, intx=0.
(XEN) io.c:371:d0 XEN_DOMCTL_irq_unmapping: m_irq = 0x4f device = 0x5 intx = 0x0
(XEN) [VT-D]io.c:291:d0 VT-d irq bind: m_irq = b device = 5 intx = 0
(XEN) io.c:316:d0 pt_irq_destroy_bind_vtd: machine_gsi=78 guest_gsi=40, device=6, intx=0.
(XEN) io.c:371:d0 XEN_DOMCTL_irq_unmapping: m_irq = 0x4e device = 0x6 intx = 0x0
(XEN) [VT-D]io.c:291:d0 VT-d irq bind: m_irq = a device = 6 intx = 0
(XEN) io.c:316:d0 pt_irq_destroy_bind_vtd: machine_gsi=77 guest_gsi=44, device=7, intx=0.
(XEN) io.c:371:d0 XEN_DOMCTL_irq_unmapping: m_irq = 0x4d device = 0x7 intx = 0x0
(XEN) [VT-D]io.c:291:d0 VT-d irq bind: m_irq = 7 device = 7 intx = 0
(XEN) io.c:316:d0 pt_irq_destroy_bind_vtd: machine_gsi=76 guest_gsi=17, device=8, intx=0.
(XEN) io.c:371:d0 XEN_DOMCTL_irq_unmapping: m_irq = 0x4c device = 0x8 intx = 0x0
(XEN) [VT-D]io.c:291:d0 VT-d irq bind: m_irq = 5 device = 8 intx = 0

So now when I finally get to the part where I change the smp_affinity, I see a corresponding change in the guest interrupt information, qemu-dm-dpm.log and lspci on both dom0 and domU:

cat /proc/irq/48/smp_affinity 
ffff
echo 2 > /proc/irq/48/smp_affinity
cat /proc/irq/48/smp_affinity 
0002

(XEN) Guest interrupt information: (IRQ affinity changed from 1 to 2, while vector changed from 60 to 92)
(XEN)    IRQ:  74, IRQ affinity:0x00000002, Vec: 92 type=PCI-MSI         status=00000010 in-flight=1 domain-list=1: 79(---M),

pt_msi_update: Update msi with pirq 4f gvec 71 gflags 2 (What is the significance of gflags 2?)
pt_msi_update: Update msi with pirq 4f gvec b1 gflags 2

domU: lspci -vv -s 00:05.0 | grep Address
                Address: 00000000fee02000  Data: 40b1 (dest ID changed from 0 to 2 and vector changed from 0x71 to 0xb1)

dom0: lspci -vv -s 0:07:0.0 | grep Address
                Address: 00000000fee00000  Data: 405c (vector changed from 0x3c (60 decimal) to 0x5c (92 decimal))

I'm confused about why there are 4 sets of IRQs: dom0 lspci:[11,10,7,5], domU lspci and /proc/interrupts:[48,49,50,51], pirq:[76,77,78,79], guest int info:[74,75,76,77].
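
To make the four numbering spaces easier to compare, the values reported earlier in this message (before the affinity change) can be joined per PCI function. The sketch below only collects numbers already shown in the lspci, Xen console, qemu-dm and "xm dmesg" output. It makes no claim about what each number means inside Xen, and the record layout and helper names are hypothetical.

```python
from collections import namedtuple

IrqRow = namedtuple(
    "IrqRow",
    "dom0_bdf dom0_irq host_vec xen_irq pirq domu_slot domu_irq guest_vec",
)

# One row per PCI function, values copied from the output above.
ROWS = [
    IrqRow("07:00.0", 11, 60, 74, 79, 5, 48, 0x71),
    IrqRow("07:00.1", 10, 68, 75, 78, 6, 49, 0x89),
    IrqRow("07:00.2", 7, 76, 76, 77, 7, 50, 0x99),
    IrqRow("07:00.3", 5, 84, 77, 76, 8, 51, 0xA9),
]


def find_by(field, value):
    """Return all rows whose `field` equals `value`."""
    return [row for row in ROWS if getattr(row, field) == value]


def format_table(rows=ROWS):
    """Render the rows as fixed-width text, one line per PCI function."""
    header = "{:<8} {:>8} {:>8} {:>7} {:>4} {:>9} {:>8} {:>9}".format(*IrqRow._fields)
    lines = [header]
    for row in rows:
        lines.append("{:<8} {:>8} {:>8} {:>7} {:>4} {:>9} {:>8} {:>9}".format(*row))
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_table())
```

Looking up a single value, for example `find_by("domu_irq", 48)`, returns the row for 07:00.0 with all of its other identifiers.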

Are the changes resulting from changing the IRQ smp_affinity consistent with what is expected? Any recommendation on where to go from here?

Thanks in advance.

Dante


end of thread, other threads:[~2009-10-26 13:34 UTC | newest]

Thread overview: 55+ messages
2009-10-08  0:08 IRQ SMP affinity problems in domU with vcpus > 4 on HP ProLiant G6 with dual Xeon 5540 (Nehalem) Cinco, Dante
2009-10-08 16:07 ` Bruce Edge
2009-10-08 18:05 ` Keir Fraser
2009-10-08 18:11   ` Cinco, Dante
2009-10-08 21:35     ` Keir Fraser
2009-10-09  9:07       ` Qing He
2009-10-09 15:59         ` Cinco, Dante
2009-10-09 23:39         ` Cinco, Dante
2009-10-10  9:43           ` Qing He
2009-10-10 10:10             ` Keir Fraser
2009-10-12  5:25             ` Cinco, Dante
2009-10-12  5:54               ` Qing He
2009-10-14 19:54                 ` Cinco, Dante
2009-10-16  0:09                   ` Konrad Rzeszutek Wilk
2009-10-16  1:40                     ` Konrad Rzeszutek Wilk
  -- strict thread matches above, loose matches on Subject: below --
2009-10-16  1:38 Cinco, Dante
2009-10-16  2:34 ` Qing He
2009-10-16  6:37   ` Keir Fraser
2009-10-16  7:32     ` Zhang, Xiantao
2009-10-16  8:24       ` Qing He
2009-10-16  8:22         ` Zhang, Xiantao
2009-10-16  8:34           ` Qing He
2009-10-16  8:35             ` Zhang, Xiantao
2009-10-16  9:01               ` Qing He
2009-10-16  9:42                 ` Qing He
2009-10-16  9:49                 ` Zhang, Xiantao
2009-10-16 14:54                   ` Zhang, Xiantao
2009-10-16 18:24                     ` Cinco, Dante
2009-10-17  0:59                       ` Zhang, Xiantao
2009-10-20  0:19                         ` Cinco, Dante
2009-10-20  5:46                           ` Zhang, Xiantao
2009-10-20  7:51                             ` Zhang, Xiantao
2009-10-20 17:26                               ` Cinco, Dante
2009-10-21  1:10                                 ` Zhang, Xiantao
2009-10-22  1:00                                   ` Cinco, Dante
2009-10-22  1:58                                     ` Zhang, Xiantao
2009-10-22  2:42                                       ` Zhang, Xiantao
2009-10-22  6:25                                         ` Keir Fraser
2009-10-22 21:11                                           ` Jeremy Fitzhardinge
2009-10-22  5:10                                       ` Qing He
2009-10-23  0:10                                         ` Cinco, Dante
2009-10-22  6:46                               ` Jan Beulich
2009-10-22  7:11                                 ` Zhang, Xiantao
2009-10-22  7:31                                   ` Jan Beulich
2009-10-22  8:41                                     ` Zhang, Xiantao
2009-10-22  9:42                                       ` Keir Fraser
2009-10-22 16:32                                         ` Zhang, Xiantao
2009-10-22 16:33                                         ` Cinco, Dante
2009-10-23  1:06                                           ` Zhang, Xiantao
2009-10-26 13:02                                         ` Zhang, Xiantao
2009-10-26 13:34                                           ` Keir Fraser
2009-10-16  9:41               ` Keir Fraser
2009-10-16  9:57                 ` Qing He
2009-10-16  9:58                 ` Zhang, Xiantao
2009-10-16 10:21                   ` Jan Beulich
