xen-devel.lists.xenproject.org archive mirror
* Kernel 3.7.[12] - irq 16: nobody cared
@ 2013-01-15  3:27 Steven Haigh
  2013-01-15 15:23 ` Jan Beulich
  0 siblings, 1 reply; 7+ messages in thread
From: Steven Haigh @ 2013-01-15  3:27 UTC
  To: xen-devel


Hi all,

Firstly, please include me in any replies as I am not a list subscriber.

I'm trying to nail down a problem using Xen 4.2.1 & Kernel 3.7.1 (also 
3.7.2). It seems at random periods of time I get the following via the 
syslog:

Message from syslogd@xenhost at Jan 15 09:02:36 ...
  kernel:Disabling IRQ #16

Looking at IRQ16:
[root@xenhost xen]# cat /proc/interrupts | grep 16
  16:    1900000  xen-pirq-ioapic-level  sata_mv
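
Note that `grep 16` matches any line containing "16" anywhere, including counts and other IRQ numbers. A minimal Python sketch that selects only the row whose label is exactly the requested IRQ follows. The function name `irq_row` and the third sample row (IRQ 116) are hypothetical and only there for illustration; the first two rows are taken from this thread.

```python
def irq_row(interrupts_text, irq):
    """Return (total_count, description) for the row labelled exactly `irq`, or None."""
    for line in interrupts_text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep or label.strip() != str(irq):
            continue
        fields = rest.split()
        # Leading numeric fields are per-CPU counts; the rest describes the source.
        counts = []
        while fields and fields[0].isdigit():
            counts.append(int(fields.pop(0)))
        return sum(counts), " ".join(fields)
    return None


# Sample text: rows 16 and 50 come from the thread, row 116 is made up.
sample = """\
 16:    1900000  xen-pirq-ioapic-level  sata_mv
 50:    9004117  xen-pirq-msi       ahci
116:         16  xen-dyn-event     hypothetical
"""
print(irq_row(sample, 16))
```

On a live system the text would come from reading `/proc/interrupts` instead of the sample string.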

I also see this in the dmesg:
irq 16: nobody cared (try booting with the "irqpoll" option)
Pid: 0, comm: swapper/0 Not tainted 3.7.2-1.el6xen.x86_64 #1
Call Trace:
  <IRQ>  [<ffffffff810a77f2>] __report_bad_irq+0x3a/0xc6
  [<ffffffff810a79e7>] note_interrupt+0x169/0x1e5
  [<ffffffff810a59b7>] handle_irq_event_percpu+0x16e/0x1b6
  [<ffffffff810a5a37>] handle_irq_event+0x38/0x54
  [<ffffffff810a8199>] handle_fasteoi_irq+0x88/0xd5
  [<ffffffff812c23f5>] __xen_evtchn_do_upcall+0x15a/0x1f7
  [<ffffffff812c3707>] xen_evtchn_do_upcall+0x2f/0x42
  [<ffffffff814a44be>] xen_do_hypervisor_callback+0x1e/0x30
  <EOI>  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
  [<ffffffff81007047>] ? xen_safe_halt+0x10/0x1a
  [<ffffffff810169b1>] ? default_idle+0x50/0x8a
  [<ffffffff81016318>] ? cpu_idle+0xc0/0xff
  [<ffffffff8148160e>] ? rest_init+0x72/0x74
  [<ffffffff81745b22>] ? start_kernel+0x3b0/0x3bd
  [<ffffffff817455a7>] ? repair_env_string+0x58/0x58
  [<ffffffff817452dd>] ? x86_64_start_reservations+0xb8/0xbd
  [<ffffffff81748cad>] ? xen_start_kernel+0x4f2/0x4f4
handlers:
[<ffffffffa012edd9>] mv_interrupt [sata_mv]
Disabling IRQ #16

I have tried booting with the irqpoll option on the kernel boot line, 
but the same problem occurs.

It seems disk throughput almost drops dead when this happens - as the 
SATA controller seems to go into some different mode of operation. It 
also seems like this has only happened recently - I was using builds of 
3.6.x as my Xen Dom0 kernel with no signs of this problem.

Has anyone else seen this in recent kernel releases? I'm not quite sure 
how to try and track this down.

Some system specs follow:
# dmidecode 2.11
SMBIOS 2.7 present.
75 structures occupying 3098 bytes.
Table at 0x000EB420.

Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
         Vendor: American Megatrends Inc.
         Version: U1f
         Release Date: 06/13/2012
         Address: 0xF0000
         Runtime Size: 64 kB
         ROM Size: 4096 kB
         Characteristics:
                 PCI is supported
                 BIOS is upgradeable
                 BIOS shadowing is allowed
                 Boot from CD is supported
                 Selectable boot is supported
                 BIOS ROM is socketed
                 EDD is supported
                 5.25"/1.2 MB floppy services are supported (int 13h)
                 3.5"/720 kB floppy services are supported (int 13h)
                 3.5"/2.88 MB floppy services are supported (int 13h)
                 Print screen service is supported (int 5h)
                 8042 keyboard services are supported (int 9h)
                 Serial services are supported (int 14h)
                 Printer services are supported (int 17h)
                 ACPI is supported
                 USB legacy is supported
                 BIOS boot specification is supported
                 Targeted content distribution is supported
                 UEFI is supported
         BIOS Revision: 4.6

Handle 0x0001, DMI type 1, 27 bytes
System Information
         Manufacturer: Gigabyte Technology Co., Ltd.
         Product Name: To be filled by O.E.M.
         Version: To be filled by O.E.M.
         Serial Number: To be filled by O.E.M.
         UUID: 03E50250-0449-054D-4A06-F60700080009
         Wake-up Type: Power Switch
         SKU Number: To be filled by O.E.M.
         Family: To be filled by O.E.M.

Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
         Manufacturer: Gigabyte Technology Co., Ltd.
         Product Name: Z68M-D2H
         Version: To be filled by O.E.M.
         Serial Number: To be filled by O.E.M.
         Asset Tag: To be filled by O.E.M.
         Features:
                 Board is a hosting board
                 Board is replaceable
         Location In Chassis: To be filled by O.E.M.
         Chassis Handle: 0x0003
         Type: Motherboard
         Contained Object Handles: 0

Handle 0x0003, DMI type 3, 22 bytes
Chassis Information
         Manufacturer: Gigabyte Technology Co., Ltd.
         Type: Desktop
         Lock: Not Present
         Version: To Be Filled By O.E.M.
         Serial Number: To Be Filled By O.E.M.
         Asset Tag: To Be Filled By O.E.M.
         Boot-up State: Safe
         Power Supply State: Safe
         Thermal State: Safe
         Security Status: None
         OEM Information: 0x00000000
         Height: Unspecified
         Number Of Power Cords: 1
         Contained Elements: 0
         SKU Number: To be filled by O.E.M.

Handle 0x0004, DMI type 7, 19 bytes
Cache Information
         Socket Designation: CPU Internal L1
         Configuration: Enabled, Not Socketed, Level 1
         Operational Mode: Write Through
         Location: Internal
         Installed Size: 128 kB
         Maximum Size: 128 kB
         Supported SRAM Types:
                 Unknown
         Installed SRAM Type: Unknown
         Speed: Unknown
         Error Correction Type: Parity
         System Type: Other
         Associativity: 16-way Set-associative

Handle 0x0005, DMI type 7, 19 bytes
Cache Information
         Socket Designation: CPU Internal L2
         Configuration: Enabled, Not Socketed, Level 2
         Operational Mode: Write Through
         Location: Internal
         Installed Size: 1024 kB
         Maximum Size: 1024 kB
         Supported SRAM Types:
                 Unknown
         Installed SRAM Type: Unknown
         Speed: Unknown
         Error Correction Type: Multi-bit ECC
         System Type: Instruction
         Associativity: 16-way Set-associative

Handle 0x0006, DMI type 7, 19 bytes
Cache Information
         Socket Designation: CPU Internal L3
         Configuration: Enabled, Not Socketed, Level 3
         Operational Mode: Write Back
         Location: Internal
         Installed Size: 6144 kB
         Maximum Size: 6144 kB
         Supported SRAM Types:
                 Unknown
         Installed SRAM Type: Unknown
         Speed: Unknown
         Error Correction Type: Multi-bit ECC
         System Type: Instruction
         Associativity: 48-way Set-associative

... snip a bit ...

Handle 0x0020, DMI type 9, 17 bytes
System Slot Information
         Designation: J6B2
         Type: x16 PCI Express
         Current Usage: In Use
         Length: Long
         ID: 0
         Characteristics:
                 3.3 V is provided
                 Opening is shared
                 PME signal is supported
         Bus Address: 0000:00:02.0

Handle 0x0021, DMI type 9, 17 bytes
System Slot Information
         Designation: J6B1
         Type: x1 PCI Express
         Current Usage: In Use
         Length: Short
         ID: 1
         Characteristics:
                 3.3 V is provided
                 Opening is shared
                 PME signal is supported
         Bus Address: 0000:00:1c.0

Handle 0x0022, DMI type 9, 17 bytes
System Slot Information
         Designation: J6D1
         Type: x8 PCI Express
         Current Usage: In Use
         Length: Short
         ID: 2
         Characteristics:
                 3.3 V is provided
                 Opening is shared
                 PME signal is supported
         Bus Address: 0000:00:01.0

Handle 0x0023, DMI type 9, 17 bytes
System Slot Information
         Designation: J7B1
         Type: x16 PCI Express
         Current Usage: In Use
         Length: Short
         ID: 3
         Characteristics:
                 3.3 V is provided
                 Opening is shared
                 PME signal is supported
         Bus Address: 0000:00:03.0

Handle 0x0024, DMI type 9, 17 bytes
System Slot Information
         Designation: J8B4
         Type: x1 PCI Express
         Current Usage: In Use
         Length: Short
         ID: 4
         Characteristics:
                 3.3 V is provided
                 Opening is shared
                 PME signal is supported
         Bus Address: 0000:00:1c.7

Handle 0x0025, DMI type 9, 17 bytes
System Slot Information
         Designation: J8B3
         Type: 32-bit PCI
         Current Usage: In Use
         Length: Short
         ID: 6
         Characteristics:
                 3.3 V is provided
                 Opening is shared
                 PME signal is supported
         Bus Address: 0000:14:1e.0

... snip a bit more ....

Handle 0x0043, DMI type 4, 42 bytes
Processor Information
         Socket Designation: Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz
         Type: Central Processor
         Family: Core i7
         Manufacturer: Intel
         ID: A7 06 02 00 FF FB EB BF
         Signature: Type 0, Family 6, Model 42, Stepping 7
         Flags:
                 FPU (Floating-point unit on-chip)
                 VME (Virtual mode extension)
                 DE (Debugging extension)
                 PSE (Page size extension)
                 TSC (Time stamp counter)
                 MSR (Model specific registers)
                 PAE (Physical address extension)
                 MCE (Machine check exception)
                 CX8 (CMPXCHG8 instruction supported)
                 APIC (On-chip APIC hardware supported)
                 SEP (Fast system call)
                 MTRR (Memory type range registers)
                 PGE (Page global enable)
                 MCA (Machine check architecture)
                 CMOV (Conditional move instruction supported)
                 PAT (Page attribute table)
                 PSE-36 (36-bit page size extension)
                 CLFSH (CLFLUSH instruction supported)
                 DS (Debug store)
                 ACPI (ACPI supported)
                 MMX (MMX technology supported)
                 FXSR (FXSAVE and FXSTOR instructions supported)
                 SSE (Streaming SIMD extensions)
                 SSE2 (Streaming SIMD extensions 2)
                 SS (Self-snoop)
                 HTT (Multi-threading)
                 TM (Thermal monitor supported)
                 PBE (Pending break enabled)
         Version: Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz
         Voltage: 1.2 V
         External Clock: 100 MHz
         Max Speed: 7000 MHz
         Current Speed: 3700 MHz
         Status: Populated, Enabled
         Upgrade: Other
         L1 Cache Handle: 0x0004
         L2 Cache Handle: 0x0005
         L3 Cache Handle: 0x0006
         Serial Number: Not Specified
         Asset Tag: Fill By OEM
         Part Number: Fill By OEM
         Core Count: 4
         Core Enabled: 1
         Characteristics:
                 64-bit capable

... end

# lspci
00:00.0 Host bridge: Intel Corporation 2nd Generation Core Processor Family DRAM Controller (rev 09)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200/2nd Generation Core Processor Family PCI Express Root Port (rev 09)
00:02.0 VGA compatible controller: Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Controller (rev 09)
00:16.0 Communication controller: Intel Corporation 6 Series/C200 Series Chipset Family MEI Controller #1 (rev 04)
00:1a.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 05)
00:1c.0 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 1 (rev b5)
00:1c.6 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 7 (rev b5)
00:1d.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 (rev 05)
00:1f.0 ISA bridge: Intel Corporation Z68 Express Chipset Family LPC Controller (rev 05)
00:1f.2 SATA controller: Intel Corporation 6 Series/C200 Series Chipset Family SATA AHCI Controller (rev 05)
00:1f.3 SMBus: Intel Corporation 6 Series/C200 Series Chipset Family SMBus Controller (rev 05)
01:00.0 SCSI storage controller: Marvell Technology Group Ltd. 88SX7042 PCI-e 4-port SATA-II (rev 02)
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 06)

Disks are configured as such:
# cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid1 sda1[1] sdb1[0]
       204788 blocks super 1.0 [2/2] [UU]

md2 : active raid6 sdc[5] sde[1] sdf[4] sdd[0]
       3907026688 blocks super 1.2 level 6, 128k chunk, algorithm 2 [4/4] [UUUU]

md1 : active raid1 sdb2[0] sda2[1]
       77942716 blocks super 1.1 [2/2] [UU]

-- 
Steven Haigh

Email: netwiz@crc.id.au
Web: http://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Fax: (03) 8338 0299



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


* Re: Kernel 3.7.[12] - irq 16: nobody cared
  2013-01-15  3:27 Kernel 3.7.[12] - irq 16: nobody cared Steven Haigh
@ 2013-01-15 15:23 ` Jan Beulich
  2013-01-15 17:15   ` Steven Haigh
  0 siblings, 1 reply; 7+ messages in thread
From: Jan Beulich @ 2013-01-15 15:23 UTC
  To: Steven Haigh; +Cc: xen-devel

>>> On 15.01.13 at 04:27, Steven Haigh <netwiz@crc.id.au> wrote:
> irq 16: nobody cared (try booting with the "irqpoll" option)
> Pid: 0, comm: swapper/0 Not tainted 3.7.2-1.el6xen.x86_64 #1
> Call Trace:
>   <IRQ>  [<ffffffff810a77f2>] __report_bad_irq+0x3a/0xc6
>   [<ffffffff810a79e7>] note_interrupt+0x169/0x1e5
>   [<ffffffff810a59b7>] handle_irq_event_percpu+0x16e/0x1b6
>   [<ffffffff810a5a37>] handle_irq_event+0x38/0x54
>   [<ffffffff810a8199>] handle_fasteoi_irq+0x88/0xd5
>   [<ffffffff812c23f5>] __xen_evtchn_do_upcall+0x15a/0x1f7
>   [<ffffffff812c3707>] xen_evtchn_do_upcall+0x2f/0x42
>   [<ffffffff814a44be>] xen_do_hypervisor_callback+0x1e/0x30
>   <EOI>  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>   [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>   [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>   [<ffffffff81007047>] ? xen_safe_halt+0x10/0x1a
>   [<ffffffff810169b1>] ? default_idle+0x50/0x8a
>   [<ffffffff81016318>] ? cpu_idle+0xc0/0xff
>   [<ffffffff8148160e>] ? rest_init+0x72/0x74
>   [<ffffffff81745b22>] ? start_kernel+0x3b0/0x3bd
>   [<ffffffff817455a7>] ? repair_env_string+0x58/0x58
>   [<ffffffff817452dd>] ? x86_64_start_reservations+0xb8/0xbd
>   [<ffffffff81748cad>] ? xen_start_kernel+0x4f2/0x4f4
> handlers:
> [<ffffffffa012edd9>] mv_interrupt [sata_mv]
> Disabling IRQ #16
> 
> I have tried booting with the irqpoll option on the kernel boot line, 
> but the same problem occurs.
> 
> It seems disk throughput almost drops dead when this happens - as the 
> SATA controller seems to go into some different mode of operation. It 
> also seems like this has only happened recently - I was using builds of 
> 3.6.x as my Xen Dom0 kernel with no signs of this problem.
> 
> Has anyone else seen this in recent kernel releases? I'm not quite sure 
> how to try and track this down.

First of all, you'll want to clarify whether this problem is present
_only_ when running under Xen, or also when running the same
kernel without Xen underneath. This is primarily because the
output you provided shows that IRQ 16 actually has a handler,
just that it apparently ignores the interrupts (and that's nothing
that Xen controls).

Then, if this is a Xen-only problem, you will want to provide full
hypervisor and kernel (boot) logs, the hypervisor one including
debug key 'i' output, and the kernel one once with and once
without Xen.

Finally you'll want to clarify whether, when updating the kernel,
you also updated the hypervisor (and if so, try the known good
and known bad kernels on identical hypervisors).

Jan


* Re: Kernel 3.7.[12] - irq 16: nobody cared
  2013-01-15 15:23 ` Jan Beulich
@ 2013-01-15 17:15   ` Steven Haigh
  2013-01-16  9:42     ` Jan Beulich
  0 siblings, 1 reply; 7+ messages in thread
From: Steven Haigh @ 2013-01-15 17:15 UTC
  To: Jan Beulich; +Cc: xen-devel

Hi Jan,

On 16/01/2013 2:23 AM, Jan Beulich wrote:
>>>> On 15.01.13 at 04:27, Steven Haigh <netwiz@crc.id.au> wrote:
>> irq 16: nobody cared (try booting with the "irqpoll" option)
>> Pid: 0, comm: swapper/0 Not tainted 3.7.2-1.el6xen.x86_64 #1
>> Call Trace:
>>    <IRQ>  [<ffffffff810a77f2>] __report_bad_irq+0x3a/0xc6
>>    [<ffffffff810a79e7>] note_interrupt+0x169/0x1e5
>>    [<ffffffff810a59b7>] handle_irq_event_percpu+0x16e/0x1b6
>>    [<ffffffff810a5a37>] handle_irq_event+0x38/0x54
>>    [<ffffffff810a8199>] handle_fasteoi_irq+0x88/0xd5
>>    [<ffffffff812c23f5>] __xen_evtchn_do_upcall+0x15a/0x1f7
>>    [<ffffffff812c3707>] xen_evtchn_do_upcall+0x2f/0x42
>>    [<ffffffff814a44be>] xen_do_hypervisor_callback+0x1e/0x30
>>    <EOI>  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>>    [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>>    [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
>>    [<ffffffff81007047>] ? xen_safe_halt+0x10/0x1a
>>    [<ffffffff810169b1>] ? default_idle+0x50/0x8a
>>    [<ffffffff81016318>] ? cpu_idle+0xc0/0xff
>>    [<ffffffff8148160e>] ? rest_init+0x72/0x74
>>    [<ffffffff81745b22>] ? start_kernel+0x3b0/0x3bd
>>    [<ffffffff817455a7>] ? repair_env_string+0x58/0x58
>>    [<ffffffff817452dd>] ? x86_64_start_reservations+0xb8/0xbd
>>    [<ffffffff81748cad>] ? xen_start_kernel+0x4f2/0x4f4
>> handlers:
>> [<ffffffffa012edd9>] mv_interrupt [sata_mv]
>> Disabling IRQ #16
>>
>> I have tried booting with the irqpoll option on the kernel boot line,
>> but the same problem occurs.
>>
>> It seems disk throughput almost drops dead when this happens - as the
>> SATA controller seems to go into some different mode of operation. It
>> also seems like this has only happened recently - I was using builds of
>> 3.6.x as my Xen Dom0 kernel with no signs of this problem.
>>
>> Has anyone else seen this in recent kernel releases? I'm not quite sure
>> how to try and track this down.
> First of all, you'll want to clarify whether this problem is present
> _only_ when running under Xen, or also when running the same
> kernel without Xen underneath. This is primarily because the
> output you provided shows that IRQ 16 actually has a handler,
> just that it apparently ignores the interrupts (and that's nothing
> that Xen controls).

I'm not 100% sure how to do this. I haven't been able to find a method 
to cause the problem to happen... It just does - and it seems random 
when it does happen. Part of the problem with running the system without 
the hypervisor in place is that I can't replicate any kind of workload 
that would normally trigger the problem.

> Then, if this is a Xen-only problem, you will want to provide full
> hypervisor and kernel (boot) logs, the hypervisor one including
> debug key 'i' output, and the kernel one once with and once
> without Xen.
>
> Finally you'll want to clarify whether, when updating the kernel,
> you also updated the hypervisor (and if so, try the known good
> and known bad kernels on identical hypervisors).

I have been running Xen 4.2.1 for a while - and used multiple kernel 
versions with it. Sadly, I don't have an archive of the RPMs that I used 
(even though I built them!). I've only really noticed this happening in 
the last month - when I've been running kernel 3.7.1+

On the off chance today, I have moved the card from one 16x PCIe slot to 
the second one on the mainboard. This has moved the card from IRQ16 to 
IRQ19. As of yet, I haven't had the problem occur - however as it is a 
seemingly random occurrence, there is no guarantee that the problem is 
solved. I've tried loading up the i/o by doing a resync of the RAID6 (of 
which, 2 drives are on the sata_mv card) as well as hammering i/o in the 
DomUs (rather random stuff), but still no reliable way to force the 
problem to occur :(

I'm open to any suggestions :)

-- 
Steven Haigh

Email: netwiz@crc.id.au
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Fax: (03) 8338 0299


* Re: Kernel 3.7.[12] - irq 16: nobody cared
  2013-01-15 17:15   ` Steven Haigh
@ 2013-01-16  9:42     ` Jan Beulich
  2013-01-16  9:54       ` Steven Haigh
  0 siblings, 1 reply; 7+ messages in thread
From: Jan Beulich @ 2013-01-16  9:42 UTC
  To: Steven Haigh; +Cc: xen-devel

>>> On 15.01.13 at 18:15, Steven Haigh <netwiz@crc.id.au> wrote:
> I'm not 100% sure how to do this. I haven't been able to find a method 
> to cause the problem to happen... It just does - and it seems random 
> when it does happen. Part of the problem with running the system without 
> the hypervisor in place is that I can't replicate any kind of workload 
> that would normally trigger the problem.

That's pretty odd - there need to be almost 100,000 unhandled
interrupts within a tenth of a second, so there _must_ be
something triggering this if the device is otherwise working fine.
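
To make the thresholds concrete, here is a simplified Python model of that accounting. It is a sketch only, assuming the logic works roughly as in kernel/irq/spurious.c: interrupts are evaluated in windows of 100,000, the unhandled counter restarts whenever the previous unhandled interrupt is more than a tenth of a second old, and the IRQ is disabled when more than 99,900 of a window went unhandled. The class name, the `HZ` value and the three test workloads are illustrative assumptions, and the real code may differ between kernel versions.

```python
HZ = 1000  # assumed tick rate, for illustration only

WINDOW = 100_000  # interrupts per evaluation window
CUTOFF = 99_900   # more unhandled interrupts than this disables the IRQ


class SpuriousModel:
    """Toy model of the per-IRQ 'nobody cared' accounting."""

    def __init__(self):
        self.irq_count = 0
        self.irqs_unhandled = 0
        self.last_unhandled = None
        self.disabled = False

    def note_interrupt(self, tick, handled):
        if not handled:
            # Only unhandled interrupts arriving within HZ/10 ticks of the
            # previous unhandled one keep the counter growing.
            recent = (self.last_unhandled is not None
                      and tick <= self.last_unhandled + HZ // 10)
            self.irqs_unhandled = self.irqs_unhandled + 1 if recent else 1
            self.last_unhandled = tick
        self.irq_count += 1
        if self.irq_count < WINDOW:
            return self.disabled
        # End of window: decide, then start counting afresh.
        self.irq_count = 0
        if self.irqs_unhandled > CUTOFF:
            self.disabled = True
        self.irqs_unhandled = 0
        return self.disabled


def run(events):
    """Feed (tick, handled) pairs to a fresh model; return True if the IRQ got disabled."""
    model = SpuriousModel()
    for tick, handled in events:
        model.note_interrupt(tick, handled)
    return model.disabled


storm = [(i // 1000, False) for i in range(WINDOW)]         # dense, never handled
sparse = [(i * 200, False) for i in range(WINDOW)]          # unhandled, but far apart
mostly_handled = [(0, i % 500 == 0) for i in range(WINDOW)]  # 200 of 100,000 handled

print(run(storm), run(sparse), run(mostly_handled))  # prints: True False False
```

In this model only the dense, never-handled stream trips the cutoff; widely spaced unhandled interrupts, or a stream in which a few hundred interrupts per window are handled, do not.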

You're not by chance passing through to a guest any other
device using the same IRQ?

Jan


* Re: Kernel 3.7.[12] - irq 16: nobody cared
  2013-01-16  9:42     ` Jan Beulich
@ 2013-01-16  9:54       ` Steven Haigh
  2013-01-16 10:05         ` Jan Beulich
  0 siblings, 1 reply; 7+ messages in thread
From: Steven Haigh @ 2013-01-16  9:54 UTC
  To: Jan Beulich; +Cc: xen-devel


On 16/01/2013 8:42 PM, Jan Beulich wrote:
>>>> On 15.01.13 at 18:15, Steven Haigh <netwiz@crc.id.au> wrote:
>> I'm not 100% sure how to do this. I haven't been able to find a method
>> to cause the problem to happen... It just does - and it seems random
>> when it does happen. Part of the problem with running the system without
>> the hypervisor in place is that I can't replicate any kind of workload
>> that would normally trigger the problem.
> That's pretty odd - there need to be almost 100,000 unhandled
> interrupts within a tenth of a second, so there _must_ be
> something triggering this if the device is otherwise working fine.
>
> You're not by chance passing through to a guest any other
> device using the same IRQ?

Hi Jan,

I don't pass any devices at all to any DomUs. All guests are PV Linux 
systems, all EL6. The only thing each DomU has is a disk, a network 
interface, and 2 x vcpus.

So far, I have:
# uptime
  20:50:40 up 1 day,  1:11,  1 user,  load average: 0.36, 0.17, 0.13

As I mentioned, I moved the sata card to the second 16x PCIe slot in the 
mainboard - which changed the IRQ from 16 to 19. Currently I see:
# grep sata_mv /proc/interrupts
  19:   21243495  xen-pirq-ioapic-level  sata_mv

Which is interestingly more than the onboard SATA ports:
# grep ahci /proc/interrupts
  50:    9004117  xen-pirq-msi       ahci
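
For scale, the average rates implied by these counters can be worked out from the uptime shown above. This is a rough sketch that assumes both counters cover the whole 1 day 1:11 of uptime; the helper names are illustrative. An average says nothing about short spikes, but it shows how far normal operation is from the roughly 100,000 interrupts in a tenth of a second mentioned earlier in the thread.

```python
def uptime_seconds(days, hours, minutes):
    """Convert an uptime such as 'up 1 day, 1:11' to seconds."""
    return (days * 24 + hours) * 3600 + minutes * 60


def average_rate(count, seconds):
    """Average interrupts per second over the given interval."""
    return count / seconds


up = uptime_seconds(1, 1, 11)               # from the uptime output above
sata_mv_rate = average_rate(21243495, up)   # IRQ 19, sata_mv
ahci_rate = average_rate(9004117, up)       # IRQ 50, ahci
storm_rate = 100_000 / 0.1                  # rate implied by the cutoff discussion

print(f"uptime: {up} s")
print(f"sata_mv: {sata_mv_rate:.1f}/s, ahci: {ahci_rate:.1f}/s")
print(f"storm would need about {storm_rate:.0f}/s")
```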

I'm not sure if this will give any further info:
# xm dmesg
  __  __            _  _    ____    _    _       _  __
  \ \/ /___ _ __   | || |  |___ \  / |  / |  ___| |/ /_
   \  // _ \ '_ \  | || |_   __) | | |__| | / _ \ | '_ \
   /  \  __/ | | | |__   _| / __/ _| |__| ||  __/ | (_) |
  /_/\_\___|_| |_|    |_|(_)_____(_)_|  |_(_)___|_|\___/

(XEN) Xen version 4.2.1 (mockbuild@crc.id.au) (gcc (GCC) 4.4.6 20120305 (Red Hat 4.4.6-4)) Wed Dec 19 01:32:40 EST 2012
(XEN) Latest ChangeSet: unavailable
(XEN) Bootloader: GNU GRUB 0.97
(XEN) Command line: dom0_mem=1024M cpufreq=xen dom0_max_vcpus=1 dom0_vcpus_pin
(XEN) Video information:
(XEN)  VGA is text mode 80x25, font 8x16
(XEN)  VBE/DDC methods: none; EDID transfer time: 0 seconds
(XEN)  EDID info not retrieved because no DDC retrieval method detected
(XEN) Disc information:
(XEN)  Found 2 MBR signatures
(XEN)  Found 3 EDD information structures
(XEN) Xen-e820 RAM map:
(XEN)  0000000000000000 - 000000000009d800 (usable)
(XEN)  000000000009d800 - 00000000000a0000 (reserved)
(XEN)  00000000000e0000 - 0000000000100000 (reserved)
(XEN)  0000000000100000 - 0000000020000000 (usable)
(XEN)  0000000020000000 - 0000000020200000 (reserved)
(XEN)  0000000020200000 - 0000000040000000 (usable)
(XEN)  0000000040000000 - 0000000040200000 (reserved)
(XEN)  0000000040200000 - 00000000dbb1b000 (usable)
(XEN)  00000000dbb1b000 - 00000000dc3c7000 (reserved)
(XEN)  00000000dc3c7000 - 00000000dc647000 (ACPI NVS)
(XEN)  00000000dc647000 - 00000000dc64c000 (ACPI data)
(XEN)  00000000dc64c000 - 00000000dc68f000 (ACPI NVS)
(XEN)  00000000dc68f000 - 00000000dcdca000 (usable)
(XEN)  00000000dcdca000 - 00000000dcfdd000 (reserved)
(XEN)  00000000dcfdd000 - 00000000dd000000 (usable)
(XEN)  00000000dd800000 - 00000000dfa00000 (reserved)
(XEN)  00000000f8000000 - 00000000fc000000 (reserved)
(XEN)  00000000fec00000 - 00000000fec01000 (reserved)
(XEN)  00000000fed00000 - 00000000fed04000 (reserved)
(XEN)  00000000fed1c000 - 00000000fed20000 (reserved)
(XEN)  00000000fee00000 - 00000000fee01000 (reserved)
(XEN)  00000000ff000000 - 0000000100000000 (reserved)
(XEN)  0000000100000000 - 000000021f600000 (usable)
(XEN) ACPI: RSDP 000F0490, 0024 (r2 ALASKA)
(XEN) ACPI: XSDT DC629070, 0064 (r1 ALASKA    A M I  1072009 AMI 10013)
(XEN) ACPI: FACP DC632928, 00F4 (r4 ALASKA    A M I  1072009 AMI 10013)
(XEN) ACPI: DSDT DC629170, 97B8 (r2 ALASKA    A M I       12 INTL 20051117)
(XEN) ACPI: FACS DC645F80, 0040
(XEN) ACPI: APIC DC632A20, 0072 (r3 ALASKA    A M I  1072009 AMI 10013)
(XEN) ACPI: MCFG DC632A98, 003C (r1 ALASKA    A M I  1072009 MSFT       97)
(XEN) ACPI: HPET DC632AD8, 0038 (r1 ALASKA    A M I  1072009 AMI.        5)
(XEN) ACPI: SSDT DC632B10, 036D (r1 SataRe SataTabl     1000 INTL 20091112)
(XEN) ACPI: SSDT DC632E80, 09AA (r1  PmRef  Cpu0Ist     3000 INTL 20051117)
(XEN) ACPI: SSDT DC633830, 0A92 (r1  PmRef    CpuPm     3000 INTL 20051117)
(XEN) ACPI: MATS DC6342C8, 0034 (r2 ALASKA    A M I        2 wx2        0)
(XEN) System RAM: 8116MB (8310872kB)
(XEN) Domain heap initialised
(XEN) ACPI: 32/64X FACS address mismatch in FADT - dc645f80/0000000000000000, using 32
(XEN) Processor #0 6:10 APIC version 21
(XEN) Processor #2 6:10 APIC version 21
(XEN) Processor #4 6:10 APIC version 21
(XEN) Processor #6 6:10 APIC version 21
(XEN) IOAPIC[0]: apic_id 2, version 32, address 0xfec00000, GSI 0-23
(XEN) Enabling APIC mode:  Flat.  Using 1 I/O APICs
(XEN) Table is not found!
(XEN) Using scheduler: SMP Credit Scheduler (credit)
(XEN) Detected 3303.320 MHz processor.
(XEN) Initing memory sharing.
(XEN) xstate_init: using cntxt_size: 0x340 and states: 0x7
(XEN) I/O virtualisation disabled
(XEN) Enabled directed EOI with ioapic_ack_old on!
(XEN) ENABLING IO-APIC IRQs
(XEN)  -> Using old ACK method
(XEN) Platform timer is 14.318MHz HPET
(XEN) Allocated console ring of 16 KiB.
(XEN) VMX: Supported advanced features:
(XEN)  - APIC MMIO access virtualisation
(XEN)  - APIC TPR shadow
(XEN)  - Extended Page Tables (EPT)
(XEN)  - Virtual-Processor Identifiers (VPID)
(XEN)  - Virtual NMI
(XEN)  - MSR direct-access bitmap
(XEN)  - Unrestricted Guest
(XEN) HVM: ASIDs enabled.
(XEN) HVM: VMX enabled
(XEN) HVM: Hardware Assisted Paging (HAP) detected
(XEN) HVM: HAP page sizes: 4kB, 2MB
(XEN) Brought up 4 CPUs
(XEN) *** LOADING DOMAIN 0 ***
(XEN)  Xen  kernel: 64-bit, lsb, compat32
(XEN)  Dom0 kernel: 64-bit, PAE, lsb, paddr 0x1000000 -> 0x1d87000
(XEN) PHYSICAL MEMORY ARRANGEMENT:
(XEN)  Dom0 alloc.:   0000000210000000->0000000214000000 (236799 pages to be allocated)
(XEN)  Init. ramdisk: 000000021d2ff000->000000021f5ff800
(XEN) VIRTUAL MEMORY ARRANGEMENT:
(XEN)  Loaded kernel: ffffffff81000000->ffffffff81d87000
(XEN)  Init. ramdisk: ffffffff81d87000->ffffffff84087800
(XEN)  Phys-Mach map: ffffffff84088000->ffffffff84288000
(XEN)  Start info:    ffffffff84288000->ffffffff842884b4
(XEN)  Page tables:   ffffffff84289000->ffffffff842ae000
(XEN)  Boot stack:    ffffffff842ae000->ffffffff842af000
(XEN)  TOTAL:         ffffffff80000000->ffffffff84400000
(XEN)  ENTRY ADDRESS: ffffffff81745210
(XEN) Dom0 has maximum 1 VCPUs
(XEN) Scrubbing Free RAM: ......................................................................done.
(XEN) Initial low memory virq threshold set at 0x4000 pages.
(XEN) Std. Loglevel: Errors and warnings
(XEN) Guest Loglevel: Nothing (Rate-limited: Errors and warnings)
(XEN) Xen is relinquishing VGA console.
(XEN) *** Serial input -> DOM0 (type 'CTRL-a' three times to switch input to Xen)
(XEN) Freed 252kB init memory.
(XEN) no cpu_id for acpi_id 5
(XEN) no cpu_id for acpi_id 6
(XEN) no cpu_id for acpi_id 7
(XEN) no cpu_id for acpi_id 8

-- 
Steven Haigh

Email: netwiz@crc.id.au
Web: http://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Fax: (03) 8338 0299





* Re: Kernel 3.7.[12] - irq 16: nobody cared
  2013-01-16  9:54       ` Steven Haigh
@ 2013-01-16 10:05         ` Jan Beulich
  2013-01-16 10:13           ` Steven Haigh
  0 siblings, 1 reply; 7+ messages in thread
From: Jan Beulich @ 2013-01-16 10:05 UTC
  To: Steven Haigh; +Cc: xen-devel

>>> On 16.01.13 at 10:54, Steven Haigh <netwiz@crc.id.au> wrote:
> So far, I have:
> # uptime
>   20:50:40 up 1 day,  1:11,  1 user,  load average: 0.36, 0.17, 0.13
> 
> As I mentioned, I moved the sata card to the second 16x PCIe slot in the 
> mainboard - which changed the IRQ from 16 to 19. Currently I see:
> # grep sata_mv /proc/interrupts
>   19:   21243495  xen-pirq-ioapic-level  sata_mv
> 
> Which is interestingly more than the onboard SATA ports:
> # grep ahci /proc/interrupts
>   50:    9004117  xen-pirq-msi       ahci

Whether the former count is too high depends on the I/O amount
going through each controller. Of course it is possible for there to
be spikes that usually don't reach the 99,900 cutoff point, but
once in a while do. Figuring out whether that's the case would require
adding a little bit more verbosity to
kernel/irq/spurious.c:note_interrupt(), e.g. to warn when having
reached half the threshold.
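
The idea of an early warning can be tried out in user space before touching the kernel. The following Python sketch replays a stream of interrupts through a simplified version of the accounting and logs a message once the unhandled count reaches half the cutoff. It is not kernel code: the function `scan`, the `HZ` value, the `WARN_AT` constant and the sample burst are all assumptions made for illustration.

```python
HZ = 1000               # assumed tick rate, for illustration only
WINDOW = 100_000        # interrupts per evaluation window
CUTOFF = 99_900         # more unhandled interrupts than this disables the IRQ
WARN_AT = CUTOFF // 2   # hypothetical early-warning level (49,950)


def scan(events):
    """Replay (tick, handled) pairs; return a list of (event_index, message)."""
    log = []
    irq_count = 0
    unhandled = 0
    last_unhandled = None
    warned = False
    for i, (tick, handled) in enumerate(events):
        if not handled:
            if last_unhandled is None or tick > last_unhandled + HZ // 10:
                # Previous unhandled interrupt is too old: restart the count.
                unhandled = 1
                warned = False
            else:
                unhandled += 1
            last_unhandled = tick
            if unhandled >= WARN_AT and not warned:
                warned = True
                log.append((i, f"warning: {unhandled} unhandled, half of cutoff reached"))
        irq_count += 1
        if irq_count == WINDOW:
            if unhandled > CUTOFF:
                log.append((i, "disabling IRQ"))
            irq_count = 0
            unhandled = 0
            warned = False
    return log


# A burst that stays below the cutoff: 60,000 unhandled, then 40,000 handled.
burst = [(0, False)] * 60_000 + [(0, True)] * 40_000
for index, message in scan(burst):
    print(index, message)
```

With such a warning in place, a spike that comes close to the cutoff without crossing it would leave a trace in the log instead of going unnoticed.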

Jan


* Re: Kernel 3.7.[12] - irq 16: nobody cared
  2013-01-16 10:05         ` Jan Beulich
@ 2013-01-16 10:13           ` Steven Haigh
  0 siblings, 0 replies; 7+ messages in thread
From: Steven Haigh @ 2013-01-16 10:13 UTC
  To: Jan Beulich; +Cc: xen-devel


On 16/01/2013 9:05 PM, Jan Beulich wrote:
>>>> On 16.01.13 at 10:54, Steven Haigh <netwiz@crc.id.au> wrote:
>> So far, I have:
>> # uptime
>>    20:50:40 up 1 day,  1:11,  1 user,  load average: 0.36, 0.17, 0.13
>>
>> As I mentioned, I moved the sata card to the second 16x PCIe slot in the
>> mainboard - which changed the IRQ from 16 to 19. Currently I see:
>> # grep sata_mv /proc/interrupts
>>    19:   21243495  xen-pirq-ioapic-level  sata_mv
>>
>> Which is interestingly more than the onboard SATA ports:
>> # grep ahci /proc/interrupts
>>    50:    9004117  xen-pirq-msi       ahci
> Whether the former count is too high depends on the I/O amount
> going through each controller. Of course it is possible for there to
> be spikes that usually don't reach the 99,900 cutoff point, but
> once in a while do. Figuring out whether that's the case would require
> adding a little bit more verbosity to
> kernel/irq/spurious.c:note_interrupt(), e.g. to warn when having
> reached half the threshold.

Interestingly, I just realised I have 3 of the 4 drives in this RAID6 on
the sata_mv card. I originally thought I had 2 drives on the onboard
SATA ports and the other 2 on the sata_mv card. This means 3/4 of the
RAID6 I/O goes via this card, and only 1/4 via the onboard ports.

# lsdrv
PCI [ahci] 00:1f.2 SATA controller: Intel Corporation 6 Series/C200 
Series Chipset Family SATA AHCI Controller (rev 05)
.scsi 0:0:0:0 ATA ST380815AS {6RAB72DZ}
..sda 74.53g [8:0] Partitioned (dos)
. .sda1 200.00m [8:1] MD raid1 (1/2) (w/ sdb1) in_sync 
'localhost.localdomain:0' {9f19116a-d280-8216-cc87-af34eae68242}
. ..md0 199.99m [9:0] MD v1.0 raid1 (2) clean
. . .                 Partitioned (dos) 
{6578dbc0-9e07-4ccc-8eff-15f2a1da8df1}
. . .Mounted as /dev/md0 @ /boot
. .sda2 74.33g [8:2] MD raid1 (1/2) (w/ sdb2) in_sync 
'localhost.localdomain:1' {afb92c19-b9b1-e3ae-07af-315d738e38be}
.  .md1 74.33g [9:1] MD v1.1 raid1 (2) clean
.   .                PV LVM2_member 74.33g used, 0 free 
{2koqPs-U1IA-9erV-ua4N-mxW1-BhRs-V3mlAH}
.   .VG RAID1 74.33g 0 free {HEGjco-Ptil-M5ZG-2qQR-zNo4-3cc5-b9Z3Kj}
.    .dm-0 9.77g [253:0] LV xenhost ext4 
{d2fa50d5-1a51-4599-9b72-f38f86b8f99e}
.    ..Mounted as /dev/mapper/RAID1-xenhost @ /
.    .dm-7 64.56g [253:7] LV zeus.vm ext4 
{67310780-b15c-47e4-812e-d954aa7d8e3b}
.scsi 1:0:0:0 ATA ST380815AS {6QZ6L9SD}
..sdb 74.53g [8:16] Partitioned (dos)
. .sdb1 200.00m [8:17] MD raid1 (0/2) (w/ sda1) in_sync 
'localhost.localdomain:0' {9f19116a-d280-8216-cc87-af34eae68242}
. ..md0 199.99m [9:0] MD v1.0 raid1 (2) clean
. .                   Partitioned (dos) 
{6578dbc0-9e07-4ccc-8eff-15f2a1da8df1}
. .sdb2 74.33g [8:18] MD raid1 (0/2) (w/ sda2) in_sync 
'localhost.localdomain:1' {afb92c19-b9b1-e3ae-07af-315d738e38be}
.  .md1 74.33g [9:1] MD v1.1 raid1 (2) clean
.                    PV LVM2_member 74.33g used, 0 free 
{2koqPs-U1IA-9erV-ua4N-mxW1-BhRs-V3mlAH}
.scsi 2:x:x:x [Empty]
.scsi 3:0:0:0 ATA ST2000VX000-9YW1 {Z1E10QQJ}
..sdc 1.82t [8:32] MD raid6 (3/4) (w/ sdd,sde,sdf) in_sync 
'xenhost.lan.crc.id.au:2' {cd8cc032-4898-fa88-3ba1-af64cf91583b}
. .md2 3.64t [9:2] MD v1.2 raid6,left-sym (4) active, 128k Chunk
.  .               PV LVM2_member 2.12t used, 1.52t free 
{8pyp2G-D268-fqKW-mBvf-wZbI-Qurt-aeTvOh}
.  .VG vg_raid6 3.64t 1.52t free {UrqTRc-AozJ-2RDf-qcZB-UdX3-tno9-3KHjjv}
.   .dm-6 2.00t [253:6] LV fileshare xfs 
{af405459-7569-4d82-82d9-ca27912316c7}
.   .dm-3 10.00g [253:3] LV lamp.vm ext4 
{67310780-b15c-47e4-812e-d954aa7d8e3b}
.   .dm-2 40.00g [253:2] LV mail.vm ext4 
{67310780-b15c-47e4-812e-d954aa7d8e3b}
.   .dm-4 20.00g [253:4] LV remotedesktop.vm Partitioned (dos)
.   .dm-5 2.00g [253:5] LV template.vm ext4 
{67310780-b15c-47e4-812e-d954aa7d8e3b}
.   .dm-1 50.00g [253:1] LV tsm.vm ext4 
{67310780-b15c-47e4-812e-d954aa7d8e3b}
.scsi 4:x:x:x [Empty]
.scsi 5:x:x:x [Empty]
PCI [sata_mv] 04:00.0 SCSI storage controller: Marvell Technology Group 
Ltd. 88SX7042 PCI-e 4-port SATA-II (rev 02)
.scsi 6:0:0:0 ATA ST2000VX000-9YW1 {Z1E11E7R}
..sdd 1.82t [8:48] MD raid6 (0/4) (w/ sdc,sde,sdf) in_sync 
'xenhost.lan.crc.id.au:2' {cd8cc032-4898-fa88-3ba1-af64cf91583b}
. .md2 3.64t [9:2] MD v1.2 raid6,left-sym (4) active, 128k Chunk
.                  PV LVM2_member 2.12t used, 1.52t free 
{8pyp2G-D268-fqKW-mBvf-wZbI-Qurt-aeTvOh}
.scsi 7:x:x:x [Empty]
.scsi 8:0:0:0 ATA ST2000VX000-9YW1 {Z1E0MD58}
..sde 1.82t [8:64] MD raid6 (1/4) (w/ sdc,sdd,sdf) in_sync 
'xenhost.lan.crc.id.au:2' {cd8cc032-4898-fa88-3ba1-af64cf91583b}
. .md2 3.64t [9:2] MD v1.2 raid6,left-sym (4) active, 128k Chunk
.                  PV LVM2_member 2.12t used, 1.52t free 
{8pyp2G-D268-fqKW-mBvf-wZbI-Qurt-aeTvOh}
.scsi 9:0:0:0 ATA ST2000VX000-9YW1 {Z1E17C3X}
  .sdf 1.82t [8:80] MD raid6 (2/4) (w/ sdc,sdd,sde) in_sync 
'xenhost.lan.crc.id.au:2' {cd8cc032-4898-fa88-3ba1-af64cf91583b}
   .md2 3.64t [9:2] MD v1.2 raid6,left-sym (4) active, 128k Chunk
                    PV LVM2_member 2.12t used, 1.52t free 
{8pyp2G-D268-fqKW-mBvf-wZbI-Qurt-aeTvOh}

I'm going to leave it as is for the moment to see if it happens again, as
it has been happening at random over the last 3-4 weeks. I only recently
found this problem, so this time I'll try to pull any info off the system
before rebooting it. Hopefully either changing the slot, or even just
reseating the card, has had some effect - but I guess only time will
tell.

-- 
Steven Haigh

Email: netwiz@crc.id.au
Web: http://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Fax: (03) 8338 0299





end of thread, other threads:[~2013-01-16 10:13 UTC | newest]

Thread overview: 7+ messages
2013-01-15  3:27 Kernel 3.7.[12] - irq 16: nobody cared Steven Haigh
2013-01-15 15:23 ` Jan Beulich
2013-01-15 17:15   ` Steven Haigh
2013-01-16  9:42     ` Jan Beulich
2013-01-16  9:54       ` Steven Haigh
2013-01-16 10:05         ` Jan Beulich
2013-01-16 10:13           ` Steven Haigh
