* Unhandled IRQs on AMD E-450
From: Jeroen Van den Keybus @ 2011-11-29 21:44 UTC
To: linux-kernel

On an Asus E45M1-M PRO (AMD E-450) board with 64-bit Linux 3.0.0
(Ubuntu) and 3.2.0, I regularly get (more detailed logs at the end):

Nov 28 04:35:29 zacate kernel: [29581.259926] irq 16: nobody cared (try booting with the "irqpoll" option)
Nov 28 04:35:29 zacate kernel: [29581.259945] Pid: 0, comm: swapper Tainted: P 3.0.0-13-generic #22-Ubuntu
...
Nov 28 04:35:29 zacate kernel: [29581.260171] handlers:
Nov 28 04:35:29 zacate kernel: [29581.260204] [<ffffffffa0085ee0>] irq_handler
Nov 28 04:35:29 zacate kernel: [29581.260216] [<ffffffffa048efe0>] azx_interrupt
Nov 28 04:35:29 zacate kernel: [29581.260223] Disabling IRQ #16

Nov 24 21:25:41 zacate kernel: [ 190.503838] irq 19: nobody cared (try booting with the "irqpoll" option)
Nov 24 21:25:41 zacate kernel: [ 190.503856] Pid: 0, comm: swapper Tainted: P 3.0.0-13-generic #22-Ubuntu
...
Nov 24 21:25:41 zacate kernel: [ 190.504052] handlers:
Nov 24 21:25:41 zacate kernel: [ 190.504085] [<ffffffffa0001f40>] ahci_interrupt
Nov 24 21:25:41 zacate kernel: [ 190.504101] [<ffffffffa004e6c0>] e1000_intr
Nov 24 21:25:41 zacate kernel: [ 190.504108] Disabling IRQ #19

I also tried with an untainted 3.2.0-rc2 kernel, in which I also
disabled threadirqs:

Nov 24 20:50:41 zacate kernel: [ 57.366678] irq 19: nobody cared (try booting with the "irqpoll" option)
Nov 24 20:50:41 zacate kernel: [ 57.366690] Pid: 0, comm: swapper Not tainted 3.2.0-rc2 #5

The affected IRQ lines in /proc/interrupts:

 16:        333        771   IO-APIC-fasteoi   firewire_ohci, hda_intel
...
 19:      39128      15165   IO-APIC-fasteoi   ahci, eth1
 40:      25641         59   PCI-MSI-edge      eth0
 41:          0          0   PCI-MSI-edge      xhci_hcd
 42:          0          0   PCI-MSI-edge      xhci_hcd
 43:          0          0   PCI-MSI-edge      xhci_hcd
 44:          2        404   PCI-MSI-edge      hda_intel
 45:          0          3   PCI-MSI-edge      fglrx[0]@PCI:0:1:0

The dmesg lines directly pertaining to IRQ 16 and IRQ 19:

[ 0.328032] pci 0000:00:15.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[ 0.328056] pci 0000:00:15.1: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[ 0.328077] pci 0000:00:15.2: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[ 0.328127] pci 0000:00:15.3: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[ 2.671164] firewire_ohci 0000:05:02.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[ 5.074619] HDA Intel 0000:00:14.2: PCI INT A -> GSI 16 (level, low) -> IRQ 16

[ 2.010643] ahci 0000:00:11.0: PCI INT A -> GSI 19 (level, low) -> IRQ 19
[ 2.073026] xhci_hcd 0000:06:00.0: PCI INT A -> GSI 19 (level, low) -> IRQ 19
[ 2.090881] xhci_hcd 0000:06:00.0: irq 19, io mem 0xfe900000
[ 2.091040] xhci_hcd 0000:06:00.0: irq 41 for MSI/MSI-X
[ 2.091050] xhci_hcd 0000:06:00.0: irq 42 for MSI/MSI-X
[ 2.091059] xhci_hcd 0000:06:00.0: irq 43 for MSI/MSI-X
[ 2.115098] e1000 0000:05:01.0: PCI INT A -> GSI 19 (level, low) -> IRQ 19
[ 5.041614] HDA Intel 0000:00:01.1: PCI INT B -> GSI 19 (level, low) -> IRQ 19
[ 5.041711] HDA Intel 0000:00:01.1: irq 44 for MSI/MSI-X

What I noted:

- The problem (IRQ lines 16 and 19 getting disabled) occurs fairly
  often, but losing 19 occurs much more frequently than 16.
- The problem with IRQ19 goes away (at least sufficiently long not to
  be occurring within 24h) when module e1000 is unloaded.
- The problem persists with and without forced IRQ threading.
- The problem persists with pci=nocsr.
- The problem persists with irqfixup.
- When IRQ19 dies, disk I/O access becomes very slow and unreliable.

What could be going wrong here ? I note that at least 3 devices
(00:15.x, 05:02.0 and 00:14.2) have their IRQ lines routed to IRQ 16,
but I see only 2 handlers in the dmesg log and /proc/interrupts. The
same applies to IRQ 19 (4 devices: 00:11.0, 06:00.0, 05:01.0 and
00:01.1). It is true that some of these ultimately seem to switch to
MSI (06:00.0 and 00:01.1), but so does the video card (00:01.0), which
does not route to any IRQ beforehand.

Before I try finding the problem, I would like to know what a
plausible failure mechanism is, so if anyone could give a hint on
where to start looking...

Thanks for your opinion (and please also CC to my PM address - not
(yet) subscribed to the LKML),

J.

More detailed logs:

Nov 28 04:35:29 zacate kernel: [29581.259926] irq 16: nobody cared (try booting with the "irqpoll" option)
Nov 28 04:35:29 zacate kernel: [29581.259945] Pid: 0, comm: swapper Tainted: P 3.0.0-13-generic #22-Ubuntu
Nov 28 04:35:29 zacate kernel: [29581.259952] Call Trace:
Nov 28 04:35:29 zacate kernel: [29581.259958] <IRQ> [<ffffffff810cf96d>] __report_bad_irq+0x3d/0xe0
Nov 28 04:35:29 zacate kernel: [29581.259986] [<ffffffff810cfd95>] note_interrupt+0x135/0x180
Nov 28 04:35:29 zacate kernel: [29581.259998] [<ffffffff810cdd89>] handle_irq_event_percpu+0xa9/0x220
Nov 28 04:35:29 zacate kernel: [29581.260008] [<ffffffff810cdf4e>] handle_irq_event+0x4e/0x80
Nov 28 04:35:29 zacate kernel: [29581.260019] [<ffffffff810d06c4>] handle_fasteoi_irq+0x64/0xf0
Nov 28 04:35:29 zacate kernel: [29581.260029] [<ffffffff8100c252>] handle_irq+0x22/0x40
Nov 28 04:35:29 zacate kernel: [29581.260040] [<ffffffff815f422a>] do_IRQ+0x5a/0xe0
Nov 28 04:35:29 zacate kernel: [29581.260050] [<ffffffff815ea913>] common_interrupt+0x13/0x13
Nov 28 04:35:29 zacate kernel: [29581.260056] <EOI> [<ffffffff813725fb>] ? arch_local_irq_enable+0x8/0xd
Nov 28 04:35:29 zacate kernel: [29581.260079] [<ffffffff810887a5>] ? sched_clock_idle_wakeup_event+0x15/0x20
Nov 28 04:35:29 zacate kernel: [29581.260089] [<ffffffff813730ed>] acpi_idle_enter_simple+0xcc/0x102
Nov 28 04:35:29 zacate kernel: [29581.260100] [<ffffffff814ab5c2>] cpuidle_idle_call+0xa2/0x1d0
Nov 28 04:35:29 zacate kernel: [29581.260112] [<ffffffff8100920b>] cpu_idle+0xab/0x100
Nov 28 04:35:29 zacate kernel: [29581.260124] [<ffffffff815b858e>] rest_init+0x72/0x74
Nov 28 04:35:29 zacate kernel: [29581.260134] [<ffffffff81ad0c2b>] start_kernel+0x3d4/0x3df
Nov 28 04:35:29 zacate kernel: [29581.260144] [<ffffffff81ad0388>] x86_64_start_reservations+0x132/0x136
Nov 28 04:35:29 zacate kernel: [29581.260156] [<ffffffff81ad0140>] ? early_idt_handlers+0x140/0x140
Nov 28 04:35:29 zacate kernel: [29581.260165] [<ffffffff81ad0459>] x86_64_start_kernel+0xcd/0xdc
Nov 28 04:35:29 zacate kernel: [29581.260171] handlers:
Nov 28 04:35:29 zacate kernel: [29581.260204] [<ffffffffa0085ee0>] irq_handler
Nov 28 04:35:29 zacate kernel: [29581.260216] [<ffffffffa048efe0>] azx_interrupt
Nov 28 04:35:29 zacate kernel: [29581.260223] Disabling IRQ #16

Nov 24 21:25:41 zacate kernel: [ 190.503838] irq 19: nobody cared (try booting with the "irqpoll" option)
Nov 24 21:25:41 zacate kernel: [ 190.503856] Pid: 0, comm: swapper Tainted: P 3.0.0-13-generic #22-Ubuntu
Nov 24 21:25:41 zacate kernel: [ 190.503864] Call Trace:
Nov 24 21:25:41 zacate kernel: [ 190.503870] <IRQ> [<ffffffff810cf96d>] __report_bad_irq+0x3d/0xe0
Nov 24 21:25:41 zacate kernel: [ 190.503898] [<ffffffff810cfd95>] note_interrupt+0x135/0x180
Nov 24 21:25:41 zacate kernel: [ 190.503909] [<ffffffff810cdd89>] handle_irq_event_percpu+0xa9/0x220
Nov 24 21:25:41 zacate kernel: [ 190.503920] [<ffffffff810cdf4e>] handle_irq_event+0x4e/0x80
Nov 24 21:25:41 zacate kernel: [ 190.503930] [<ffffffff810d06c4>] handle_fasteoi_irq+0x64/0xf0
Nov 24 21:25:41 zacate kernel: [ 190.503940] [<ffffffff8100c252>] handle_irq+0x22/0x40
Nov 24 21:25:41 zacate kernel: [ 190.503952] [<ffffffff815f422a>] do_IRQ+0x5a/0xe0
Nov 24 21:25:41 zacate kernel: [ 190.503961] [<ffffffff815ea913>] common_interrupt+0x13/0x13
Nov 24 21:25:41 zacate kernel: [ 190.503967] <EOI> [<ffffffff81094482>] ? tick_nohz_stop_sched_tick+0x2a2/0x3f0
Nov 24 21:25:41 zacate kernel: [ 190.503992] [<ffffffff810091d5>] cpu_idle+0x75/0x100
Nov 24 21:25:41 zacate kernel: [ 190.504004] [<ffffffff815b858e>] rest_init+0x72/0x74
Nov 24 21:25:41 zacate kernel: [ 190.504014] [<ffffffff81ad0c2b>] start_kernel+0x3d4/0x3df
Nov 24 21:25:41 zacate kernel: [ 190.504024] [<ffffffff81ad0388>] x86_64_start_reservations+0x132/0x136
Nov 24 21:25:41 zacate kernel: [ 190.504036] [<ffffffff81ad0140>] ? early_idt_handlers+0x140/0x140
Nov 24 21:25:41 zacate kernel: [ 190.504045] [<ffffffff81ad0459>] x86_64_start_kernel+0xcd/0xdc
Nov 24 21:25:41 zacate kernel: [ 190.504052] handlers:
Nov 24 21:25:41 zacate kernel: [ 190.504085] [<ffffffffa0001f40>] ahci_interrupt
Nov 24 21:25:41 zacate kernel: [ 190.504101] [<ffffffffa004e6c0>] e1000_intr
Nov 24 21:25:41 zacate kernel: [ 190.504108] Disabling IRQ #19

Nov 24 20:50:41 zacate kernel: [ 57.366678] irq 19: nobody cared (try booting with the "irqpoll" option)
Nov 24 20:50:41 zacate kernel: [ 57.366690] Pid: 0, comm: swapper Not tainted 3.2.0-rc2 #5
Nov 24 20:50:41 zacate kernel: [ 57.366694] Call Trace:
Nov 24 20:50:41 zacate kernel: [ 57.366697] <IRQ> [<ffffffff810bb9cd>] __report_bad_irq+0x3d/0xe0
Nov 24 20:50:41 zacate kernel: [ 57.366715] [<ffffffff810bbe0d>] note_interrupt+0x14d/0x210
Nov 24 20:50:41 zacate kernel: [ 57.366721] [<ffffffff810b98a4>] handle_irq_event_percpu+0xc4/0x290
Nov 24 20:50:41 zacate kernel: [ 57.366728] [<ffffffff810b9ab8>] handle_irq_event+0x48/0x70
Nov 24 20:50:41 zacate kernel: [ 57.366733] [<ffffffff810bc7fa>] handle_fasteoi_irq+0x5a/0xe0
Nov 24 20:50:41 zacate kernel: [ 57.366740] [<ffffffff81004012>] handle_irq+0x22/0x40
Nov 24 20:50:41 zacate kernel: [ 57.366747] [<ffffffff81506b6a>] do_IRQ+0x5a/0xd0
Nov 24 20:50:41 zacate kernel: [ 57.366753] [<ffffffff814fe72b>] common_interrupt+0x6b/0x6b
Nov 24 20:50:41 zacate kernel: [ 57.366756] <EOI> [<ffffffff81009906>] ? native_sched_clock+0x26/0x70
Nov 24 20:50:41 zacate kernel: [ 57.366773] [<ffffffffa00cc0d3>] ? acpi_idle_enter_simple+0xc5/0x102 [processor]
Nov 24 20:50:41 zacate kernel: [ 57.366781] [<ffffffffa00cc0ce>] ? acpi_idle_enter_simple+0xc0/0x102 [processor]
Nov 24 20:50:41 zacate kernel: [ 57.366788] [<ffffffff814223a8>] cpuidle_idle_call+0xb8/0x230
Nov 24 20:50:41 zacate kernel: [ 57.366795] [<ffffffff81001215>] cpu_idle+0xc5/0x130
Nov 24 20:50:41 zacate kernel: [ 57.366802] [<ffffffff814e2360>] rest_init+0x94/0xa4
Nov 24 20:50:41 zacate kernel: [ 57.366809] [<ffffffff81aafba4>] start_kernel+0x3a7/0x3b4
Nov 24 20:50:41 zacate kernel: [ 57.366815] [<ffffffff81aaf322>] x86_64_start_reservations+0x132/0x136
Nov 24 20:50:41 zacate kernel: [ 57.366821] [<ffffffff81aaf416>] x86_64_start_kernel+0xf0/0xf7
Nov 24 20:50:41 zacate kernel: [ 57.366824] handlers:
Nov 24 20:50:41 zacate kernel: [ 57.366834] [<ffffffffa0043c10>] ahci_interrupt
Nov 24 20:50:41 zacate kernel: [ 57.366843] [<ffffffffa006f4f0>] e1000_intr
Nov 24 20:50:41 zacate kernel: [ 57.366847] Disabling IRQ #19
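The comparison made in the report (devices routed to a line versus handlers registered on it) can be repeated mechanically. The following is a minimal sketch, assuming the "PCI INT x -> GSI n (level, low) -> IRQ n" message format shown in the dmesg excerpt of this report; the names `ROUTE_RE` and `group_by_irq` are made up for illustration and other kernel versions may word these messages differently.

```python
import re
from collections import defaultdict

# Matches kernel messages of the form seen in the report, e.g.
#   ahci 0000:00:11.0: PCI INT A -> GSI 19 (level, low) -> IRQ 19
# The layout is an assumption taken from the quoted 3.0/3.2 dmesg lines.
ROUTE_RE = re.compile(
    r"(?P<owner>\S.*?) "
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"PCI INT (?P<pin>[A-D]) -> GSI (?P<gsi>\d+) "
    r"\((?P<trigger>\w+), (?P<polarity>\w+)\) -> IRQ (?P<irq>\d+)"
)

def group_by_irq(dmesg_lines):
    """Return {irq: [(bdf, owner, pin), ...]} for every INTx routing message."""
    groups = defaultdict(list)
    for line in dmesg_lines:
        # Drop a leading "[    2.010643] " style timestamp if present.
        text = re.sub(r"^\[\s*\d+\.\d+\]\s*", "", line.strip())
        m = ROUTE_RE.search(text)
        if m:
            groups[int(m["irq"])].append((m["bdf"], m["owner"], m["pin"]))
    return dict(groups)

# Sample input copied from the dmesg excerpt in the report.
SAMPLE = """\
[ 0.328032] pci 0000:00:15.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[ 2.671164] firewire_ohci 0000:05:02.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[ 5.074619] HDA Intel 0000:00:14.2: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[ 2.010643] ahci 0000:00:11.0: PCI INT A -> GSI 19 (level, low) -> IRQ 19
[ 2.073026] xhci_hcd 0000:06:00.0: PCI INT A -> GSI 19 (level, low) -> IRQ 19
[ 2.115098] e1000 0000:05:01.0: PCI INT A -> GSI 19 (level, low) -> IRQ 19
[ 5.041614] HDA Intel 0000:00:01.1: PCI INT B -> GSI 19 (level, low) -> IRQ 19
""".splitlines()

groups = group_by_irq(SAMPLE)
for irq in sorted(groups):
    print(irq, [bdf for bdf, _owner, _pin in groups[irq]])
```

Devices that appear in a group but have no matching name on the same line in /proc/interrupts are either using MSI by then or have no driver bound; the script only lists candidates and does not decide which case applies.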
* Re: Unhandled IRQs on AMD E-450
From: Clemens Ladisch @ 2011-11-30 8:30 UTC
To: Jeroen Van den Keybus; +Cc: linux-kernel

Jeroen Van den Keybus wrote:
> On an Asus E45M1-M PRO (AMD E-450) board with 64-bit Linux 3.0.0
> (Ubuntu) and 3.2.0, I regularly get (more detailed logs at the end):
>
>
> Nov 28 04:35:29 zacate kernel: [29581.259926] irq 16: nobody cared (try booting with the "irqpoll" option)
> Nov 24 21:25:41 zacate kernel: [ 190.503838] irq 19: nobody cared (try booting with the "irqpoll" option)
> What could be going wrong here ?

* Some buggy driver might not realize that an interrupt came from its
  device.
* Some buggy device might raise an interrupt without telling the driver
  that it needs attention.
* Some buggy device might raise a wrong interrupt. (This might include
  devices that generate a PCI interrupt although they are configured
  for MSI.)
* Some buggy interrupt controller might be doing 'interesting' things.

> - The problem with IRQ19 goes away (at least sufficiently long not to
> be occurring within 24h) when module e1000 is unloaded.

This looks like a bug in the e1000 hardware or software.

> I note that at least 3 devices
> (00:15.x, 05:02.0 and 00:14.2) have their IRQ lines routed to IRQ 16,
> but I see only 2 handlers in the dmesg log and /proc/interrupts.

lspci -s 0:15 -vv

Regards,
Clemens
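For the third possibility in the list above (a device configured for MSI that still drives its INTx line), one rough way to shortlist candidates is to scan saved `lspci -vv` text for functions that show MSI or MSI-X enabled while the command register still reports `DisINTx-`. This is only a sketch under stated assumptions: the helper names are made up, the input is assumed to be unflattened `lspci -vv` output, the device `ff:00.0` in the sample is hypothetical, and a hit is merely a hint to look closer, not proof of a hardware fault.

```python
import re

# A device block starts with an unindented "bb:dd.f description" line;
# an optional "dddd:" PCI domain prefix is accepted as well.
DEVICE_RE = re.compile(
    r"^(?P<bdf>(?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) (?P<desc>.+)$"
)
MSI_ON_RE = re.compile(r"MSI(-X)?: Enable\+")

def split_devices(lspci_text):
    """Split `lspci -vv` text into [bdf, description, [body lines]] entries."""
    blocks = []
    for line in lspci_text.splitlines():
        m = DEVICE_RE.match(line)
        if m:
            blocks.append([m["bdf"], m["desc"], []])
        elif blocks and line.strip():
            blocks[-1][2].append(line.strip())
    return blocks

def msi_with_intx_enabled(lspci_text):
    """Return [(bdf, desc)] for devices with MSI enabled and INTx not disabled."""
    suspects = []
    for bdf, desc, body in split_devices(lspci_text):
        text = "\n".join(body)
        if MSI_ON_RE.search(text) and "DisINTx-" in text:
            suspects.append((bdf, desc))
    return suspects

# Two entries abridged from this thread plus one hypothetical device.
SAMPLE_LSPCI = """\
00:01.0 VGA compatible controller: ATI Technologies Inc Device 9806
\tControl: I/O+ Mem+ BusMaster+ DisINTx+
\tCapabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
00:11.0 SATA controller: ATI Technologies Inc SB7x0/SB8x0/SB9x0 SATA Controller
\tControl: I/O+ Mem+ BusMaster+ DisINTx-
\tCapabilities: [70] SATA HBA v1.0 InCfgSpace
ff:00.0 Hypothetical device
\tControl: I/O+ Mem+ BusMaster+ DisINTx-
\tCapabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
"""

print(msi_with_intx_enabled(SAMPLE_LSPCI))
```

The check is deliberately textual so that it can be run on output collected from the affected machine and mailed around; it does not read PCI configuration space itself.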
* Re: Unhandled IRQs on AMD E-450
From: Borislav Petkov @ 2011-11-30 15:44 UTC
To: Jeroen Van den Keybus; +Cc: linux-kernel, shane.huang

+ Shane.

Shane, can you guys take a look at this, sounds like some unfortunate
sharing of AHCI and network IRQ numbers.

Thanks.

On Tue, Nov 29, 2011 at 10:44:42PM +0100, Jeroen Van den Keybus wrote:
> On an Asus E45M1-M PRO (AMD E-450) board with 64-bit Linux 3.0.0
> (Ubuntu) and 3.2.0, I regularly get (more detailed logs at the end):
>
> [...]
>
> Nov 24 21:25:41 zacate kernel: [ 190.504052] handlers:
> Nov 24 21:25:41 zacate kernel: [ 190.504085] [<ffffffffa0001f40>] ahci_interrupt
> Nov 24 21:25:41 zacate kernel: [ 190.504101] [<ffffffffa004e6c0>] e1000_intr
> Nov 24 21:25:41 zacate kernel: [ 190.504108] Disabling IRQ #19
>
> [...]

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632
WEEE Registernr: 129 19551
* RE: Unhandled IRQs on AMD E-450
From: Huang, Shane @ 2011-12-01 8:01 UTC
To: Borislav Petkov, Jeroen Van den Keybus, Nguyen, Dong
Cc: linux-kernel, Huang, Shane

Boris,

> Shane, can you guys take a look at this, sounds like some unfortunate
> sharing of AHCI and network IRQ numbers.

I'm adding Dong who might help on this.

Thanks,
Shane
* Re: Unhandled IRQs on AMD E-450
From: Jeroen Van den Keybus @ 2011-12-03 20:36 UTC
To: Huang, Shane; +Cc: Borislav Petkov, Nguyen, Dong, linux-kernel

I have tried the following kernel options:

- acpi=noirq acpi_irq_nobalance
- acpi=noirq
- acpi=irq_nobalance
- irqfixup
- pci=nomsi

all on the 3.2.0-rc2 kernel.

I also removed the sound modules (snd_hda_intel, which drives both the
Realtek and the built-in HDMI interfaces; the two sound interfaces
shared the same driver and used IRQ16 and IRQ19). But to no avail:
both IRQ19 and IRQ16 still get lost after a while.

I suspect that the post to this list made by Alan Stern on 22-10
concerning an Asus E35M1-M PRO refers to the same problem.

I'm adding the full /proc/interrupts and lspci -vv output at the
bottom, all from the 3.0.0 Ubuntu kernel. Feel free to mention any bad
guys you recognize in this log. I could also add the dmesg log, but I
fear it is too big to be appropriate for the list. If wanted, let me
know.

One point of interest, though: /proc/interrupts shows ERR: 1. Could
this be related ?

Is there any way of obtaining more output, such as IO-APIC register
states, to verify that it is indeed a stuck IRQ input line and not an
unsuccessful EOI ack ? I'm still a bit cautious about blaming a stuck
device IRQ, as there would have to be already two of them misbehaving.

Rgds,

J.
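When comparing several boots or kernel options as described above, it can help to pull the shared level-triggered lines and the trailing ERR and MIS counters out of /proc/interrupts automatically. The following is a minimal sketch, assuming the column layout printed by this 3.0 kernel (per-CPU counts, then a single chip field such as `IO-APIC-fasteoi`, then a comma-separated handler list); newer kernels print extra columns, and the function names here are made up for illustration.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts text into (irqs, counters).

    irqs:     {irq: {"counts": [...], "chip": str, "handlers": [...]}}
    counters: {"NMI": [...], "ERR": [...], ...} for the non-numeric rows
    """
    lines = text.splitlines()
    ncpu = len(lines[0].split())              # header row: "CPU0 CPU1 ..."
    irqs, counters = {}, {}
    for line in lines[1:]:
        if ":" not in line:
            continue
        label, rest = line.split(":", 1)       # split on the first colon only
        label, fields = label.strip(), rest.split()
        if label.isdigit():
            chip = fields[ncpu] if len(fields) > ncpu else ""
            actions = " ".join(fields[ncpu + 1:])
            irqs[int(label)] = {
                "counts": [int(f) for f in fields[:ncpu]],
                "chip": chip,
                "handlers": [a.strip() for a in actions.split(",") if a.strip()],
            }
        else:
            # Rows such as "NMI: 0 0 Non-maskable interrupts" or "ERR: 1".
            nums = []
            for f in fields:
                if not f.isdigit():
                    break
                nums.append(int(f))
            counters[label] = nums
    return irqs, counters

def shared_level_irqs(irqs):
    """Return IRQ numbers of fasteoi (level) lines with more than one handler."""
    return sorted(n for n, info in irqs.items()
                  if info["chip"].endswith("fasteoi") and len(info["handlers"]) > 1)

# Sample rows copied from the /proc/interrupts output attached to this message.
SAMPLE_INTERRUPTS = """\
           CPU0       CPU1
 16:          1        783   IO-APIC-fasteoi   firewire_ohci, hda_intel
 19:        700       7243   IO-APIC-fasteoi   ahci
 44:          1        400   PCI-MSI-edge      hda_intel
 45:          0          3   PCI-MSI-edge      fglrx[0]@PCI:0:1:0
NMI:          0          0   Non-maskable interrupts
ERR:          1
MIS:          0
"""

irqs, counters = parse_interrupts(SAMPLE_INTERRUPTS)
print(shared_level_irqs(irqs), counters["ERR"], counters["MIS"])
```

On a live system the same functions can be fed `open("/proc/interrupts").read()`; saving the parsed result before and after a line gets disabled makes it easy to see which counters moved.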
$ cat /proc/interrupts (Note ERR:1)
           CPU0       CPU1
  0:         45          1   IO-APIC-edge      timer
  1:          1          1   IO-APIC-edge      i8042
  5:          0          0   IO-APIC-edge      parport0
  7:          1          0   IO-APIC-edge
  8:          1          0   IO-APIC-edge      rtc0
  9:          0          0   IO-APIC-fasteoi   acpi
 12:          0          4   IO-APIC-edge      i8042
 16:          1        783   IO-APIC-fasteoi   firewire_ohci, hda_intel
 17:          3        112   IO-APIC-fasteoi   ehci_hcd:usb1, ehci_hcd:usb2, ehci_hcd:usb3
 18:          0          4   IO-APIC-fasteoi   ohci_hcd:usb4, ohci_hcd:usb5, ohci_hcd:usb6, ohci_hcd:usb7
 19:        700       7243   IO-APIC-fasteoi   ahci
 40:        269         54   PCI-MSI-edge      eth0
 41:          0          0   PCI-MSI-edge      xhci_hcd
 42:          0          0   PCI-MSI-edge      xhci_hcd
 43:          0          0   PCI-MSI-edge      xhci_hcd
 44:          1        400   PCI-MSI-edge      hda_intel
 45:          0          3   PCI-MSI-edge      fglrx[0]@PCI:0:1:0
NMI:          0          0   Non-maskable interrupts
LOC:       8508       9855   Local timer interrupts
SPU:          0          0   Spurious interrupts
PMI:          0          0   Performance monitoring interrupts
IWI:          0          0   IRQ work interrupts
RES:       4350       2704   Rescheduling interrupts
CAL:        180        287   Function call interrupts
TLB:        413        296   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
MCE:          0          0   Machine check exceptions
MCP:          1          1   Machine check polls
ERR:          1
MIS:          0

# lspci -vv
00:00.0 Host bridge: Advanced Micro Devices [AMD] Family 14h Processor Root Complex
	Subsystem: ASUSTeK Computer Inc. Device 84e7
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap- 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 32

00:01.0 VGA compatible controller: ATI Technologies Inc Device 9806 (prog-if 00 [VGA controller])
	Subsystem: ASUSTeK Computer Inc. Device 84e7
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 45
	Region 0: Memory at c0000000 (32-bit, prefetchable) [size=256M]
	Region 1: I/O ports at f000 [size=256]
	Region 2: Memory at feb00000 (32-bit, non-prefetchable) [size=256K]
	Expansion ROM at <unassigned> [disabled]
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [58] Express (v2) Root Complex Integrated Endpoint, MSI 00
		DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
			ExtTag+ RBE+ FLReset-
		DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 128 bytes
		DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap: Port #0, Speed unknown, Width x0, ASPM unknown, Latency L0 <64ns, L1 <1us
			ClockPM- Surprise- LLActRep- BwNot-
		LnkCtl: ASPM Disabled; Disabled- Retrain- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta: Speed unknown, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
		LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
			Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee0300c Data: 4181
	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Kernel driver in use: fglrx_pci
	Kernel modules: fglrx, radeon

00:01.1 Audio device: ATI Technologies Inc Wrestler HDMI Audio [Radeon HD 6250/6310]
	Subsystem: ASUSTeK Computer Inc. Device 84e7
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin B routed to IRQ 44
	Region 0: Memory at feb44000 (32-bit, non-prefetchable) [size=16K]
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [58] Express (v2) Root Complex Integrated Endpoint, MSI 00
		DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
			ExtTag+ RBE+ FLReset-
		DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 128 bytes
		DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap: Port #0, Speed unknown, Width x0, ASPM unknown, Latency L0 <64ns, L1 <1us
			ClockPM- Surprise- LLActRep- BwNot-
		LnkCtl: ASPM Disabled; Disabled- Retrain- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta: Speed unknown, Width x0, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
		LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
			Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee0100c Data: 4179
	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Kernel driver in use: HDA Intel
	Kernel modules: snd-hda-intel

00:11.0 SATA controller: ATI Technologies Inc SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] (rev 40) (prog-if 01 [AHCI 1.0])
	Subsystem: ASUSTeK Computer Inc. Device 8496
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 32
	Interrupt: pin A routed to IRQ 19
	Region 0: I/O ports at f140 [size=8]
	Region 1: I/O ports at f130 [size=4]
	Region 2: I/O ports at f120 [size=8]
	Region 3: I/O ports at f110 [size=4]
	Region 4: I/O ports at f100 [size=16]
	Region 5: Memory at feb4f000 (32-bit, non-prefetchable) [size=1K]
	Capabilities: [70] SATA HBA v1.0 InCfgSpace
	Capabilities: [a4] PCI Advanced Features
		AFCap: TP+ FLR+
		AFCtrl: FLR-
		AFStatus: TP-
	Kernel driver in use: ahci
	Kernel modules: ahci

00:12.0 USB Controller: ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB OHCI0 Controller (prog-if 10 [OHCI])
	Subsystem: ASUSTeK Computer Inc. Device 8496
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap- 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 32, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 18
	Region 0: Memory at feb4e000 (32-bit, non-prefetchable) [size=4K]
	Kernel driver in use: ohci_hcd

00:12.2 USB Controller: ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB EHCI Controller (prog-if 20 [EHCI])
	Subsystem: ASUSTeK Computer Inc. Device 8496
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 32, Cache Line Size: 64 bytes
	Interrupt: pin B routed to IRQ 17
	Region 0: Memory at feb4d000 (32-bit, non-prefetchable) [size=256]
	Capabilities: [c0] Power Management version 2
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
		Bridge: PM- B3+
	Capabilities: [e4] Debug port: BAR=1 offset=00e0
	Kernel driver in use: ehci_hcd

00:13.0 USB Controller: ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB OHCI0 Controller (prog-if 10 [OHCI])
	Subsystem: ASUSTeK Computer Inc. Device 8496
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap- 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 32, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 18
	Region 0: Memory at feb4c000 (32-bit, non-prefetchable) [size=4K]
	Kernel driver in use: ohci_hcd

00:13.2 USB Controller: ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB EHCI Controller (prog-if 20 [EHCI])
	Subsystem: ASUSTeK Computer Inc. Device 8496
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 32, Cache Line Size: 64 bytes
	Interrupt: pin B routed to IRQ 17
	Region 0: Memory at feb4b000 (32-bit, non-prefetchable) [size=256]
	Capabilities: [c0] Power Management version 2
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
		Bridge: PM- B3+
	Capabilities: [e4] Debug port: BAR=1 offset=00e0
	Kernel driver in use: ehci_hcd

00:14.0 SMBus: ATI Technologies Inc SBx00 SMBus Controller (rev 42)
	Subsystem: ASUSTeK Computer Inc. Device 8496
	Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap- 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Kernel driver in use: piix4_smbus
	Kernel modules: sp5100_tco, i2c-piix4

00:14.2 Audio device: ATI Technologies Inc SBx00 Azalia (Intel HDA) (rev 40)
	Subsystem: ASUSTeK Computer Inc. Device 8445
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=slow >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 32, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 16
	Region 0: Memory at feb40000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: [50] Power Management version 2
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=55mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Kernel driver in use: HDA Intel
	Kernel modules: snd-hda-intel

00:14.3 ISA bridge: ATI Technologies Inc SB7x0/SB8x0/SB9x0 LPC host controller (rev 40)
	Subsystem: ASUSTeK Computer Inc.
Device 8496 Control: I/O+ Mem+ BusMaster+ SpecCycle+ MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap- 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0 00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge (rev 40) (prog-if 01 [Subtractive decode]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop+ ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap- 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 64 Bus: primary=00, secondary=01, subordinate=01, sec-latency=64 Secondary status: 66MHz- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort+ <SERR- <PERR- BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B- PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn- 00:14.5 USB Controller: ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB OHCI2 Controller (prog-if 10 [OHCI]) Subsystem: ASUSTeK Computer Inc. Device 8496 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap- 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 32, Cache Line Size: 64 bytes Interrupt: pin C routed to IRQ 18 Region 0: Memory at feb4a000 (32-bit, non-prefetchable) [size=4K] Kernel driver in use: ohci_hcd 00:15.0 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 0) (prog-if 00 [Normal decode]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Bus: primary=00, secondary=02, subordinate=02, sec-latency=0 Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR- BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B- PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn- Capabilities: [50] 
Power Management version 3 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- Capabilities: [58] Express (v2) Root Port (Slot-), MSI 00 DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us ExtTag+ RBE+ FLReset- DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ MaxPayload 128 bytes, MaxReadReq 128 bytes DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend- LnkCap: Port #247, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <64ns, L1 <1us ClockPM- Surprise- LLActRep+ BwNot+ LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk- ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed unknown, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible- RootCap: CRSVisible- RootSta: PME ReqID 0000, PMEStatus- PMEPending- DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ ARIFwd- DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis- ARIFwd- LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -3.5dB Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB LnkSta2: Current De-emphasis Level: -6dB Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+ Address: 0000000000000000 Data: 0000 Capabilities: [b0] Subsystem: ATI Technologies Inc Device 0000 Capabilities: [b8] HyperTransport: MSI Mapping Enable+ Fixed+ Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?> Kernel driver in use: pcieport Kernel modules: shpchp 00:15.1 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 1) (prog-if 00 [Normal decode]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- 
<TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Bus: primary=00, secondary=03, subordinate=03, sec-latency=0 I/O behind bridge: 0000e000-0000efff Prefetchable memory behind bridge: 00000000d0000000-00000000d00fffff Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR- BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B- PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn- Capabilities: [50] Power Management version 3 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- Capabilities: [58] Express (v2) Root Port (Slot-), MSI 00 DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us ExtTag+ RBE+ FLReset- DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ MaxPayload 128 bytes, MaxReadReq 128 bytes DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend- LnkCap: Port #1, Speed 5GT/s, Width x1, ASPM L0s L1, Latency L0 <64ns, L1 <1us ClockPM- Surprise- LLActRep+ BwNot+ LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt- RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible- RootCap: CRSVisible- RootSta: PME ReqID 0000, PMEStatus- PMEPending- DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ ARIFwd- DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis- ARIFwd- LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -3.5dB Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB LnkSta2: Current De-emphasis Level: -3.5dB Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+ Address: 0000000000000000 Data: 0000 Capabilities: [b0] Subsystem: ATI Technologies Inc Device 0000 Capabilities: 
[b8] HyperTransport: MSI Mapping Enable+ Fixed+ Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?> Kernel driver in use: pcieport Kernel modules: shpchp 00:15.2 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 2) (prog-if 00 [Normal decode]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Bus: primary=00, secondary=04, subordinate=05, sec-latency=0 I/O behind bridge: 0000d000-0000dfff Memory behind bridge: fea00000-feafffff Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR- BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B- PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn- Capabilities: [50] Power Management version 3 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- Capabilities: [58] Express (v2) Root Port (Slot-), MSI 00 DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us ExtTag+ RBE+ FLReset- DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ MaxPayload 128 bytes, MaxReadReq 128 bytes DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend- LnkCap: Port #2, Speed 5GT/s, Width x1, ASPM L0s L1, Latency L0 <64ns, L1 <1us ClockPM- Surprise- LLActRep+ BwNot+ LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk- ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt- RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible- RootCap: CRSVisible- RootSta: PME ReqID 0000, PMEStatus- PMEPending- DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ ARIFwd- DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis- ARIFwd- 
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -3.5dB Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB LnkSta2: Current De-emphasis Level: -3.5dB Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+ Address: 0000000000000000 Data: 0000 Capabilities: [b0] Subsystem: ATI Technologies Inc Device 0000 Capabilities: [b8] HyperTransport: MSI Mapping Enable+ Fixed+ Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?> Kernel driver in use: pcieport Kernel modules: shpchp 00:15.3 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 3) (prog-if 00 [Normal decode]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Bus: primary=00, secondary=06, subordinate=06, sec-latency=0 Memory behind bridge: fe900000-fe9fffff Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR- BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B- PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn- Capabilities: [50] Power Management version 3 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- Capabilities: [58] Express (v2) Root Port (Slot-), MSI 00 DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us ExtTag+ RBE+ FLReset- DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ MaxPayload 128 bytes, MaxReadReq 128 bytes DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend- LnkCap: Port #3, Speed 5GT/s, Width x1, ASPM L0s L1, Latency L0 <64ns, L1 <1us ClockPM- Surprise- LLActRep+ BwNot+ LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+ ExtSynch- 
ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt+ RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible- RootCap: CRSVisible- RootSta: PME ReqID 0000, PMEStatus- PMEPending- DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ ARIFwd- DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis- ARIFwd- LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -3.5dB Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB LnkSta2: Current De-emphasis Level: -3.5dB Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+ Address: 0000000000000000 Data: 0000 Capabilities: [b0] Subsystem: ATI Technologies Inc Device 0000 Capabilities: [b8] HyperTransport: MSI Mapping Enable+ Fixed+ Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?> Kernel driver in use: pcieport Kernel modules: shpchp 00:16.0 USB Controller: ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB OHCI0 Controller (prog-if 10 [OHCI]) Subsystem: ASUSTeK Computer Inc. Device 8496 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap- 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 32, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 18 Region 0: Memory at feb49000 (32-bit, non-prefetchable) [size=4K] Kernel driver in use: ohci_hcd 00:16.2 USB Controller: ATI Technologies Inc SB7x0/SB8x0/SB9x0 USB EHCI Controller (prog-if 20 [EHCI]) Subsystem: ASUSTeK Computer Inc. 
Device 8496 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 32, Cache Line Size: 64 bytes Interrupt: pin B routed to IRQ 17 Region 0: Memory at feb48000 (32-bit, non-prefetchable) [size=256] Capabilities: [c0] Power Management version 2 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold-) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- Bridge: PM- B3+ Capabilities: [e4] Debug port: BAR=1 offset=00e0 Kernel driver in use: ehci_hcd 00:18.0 Host bridge: Advanced Micro Devices [AMD] Family 12h/14h Processor Function 0 (rev 43) Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- 00:18.1 Host bridge: Advanced Micro Devices [AMD] Family 12h/14h Processor Function 1 Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- 00:18.2 Host bridge: Advanced Micro Devices [AMD] Family 12h/14h Processor Function 2 Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- 00:18.3 Host bridge: Advanced Micro Devices [AMD] Family 12h/14h Processor Function 3 Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Capabilities: [f0] Secure device <?> Kernel driver in use: k10temp Kernel modules: k10temp 00:18.4 Host bridge: Advanced Micro Devices [AMD] Family 12h/14h Processor Function 4 Control: I/O- Mem- 
BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- 00:18.5 Host bridge: Advanced Micro Devices [AMD] Family 12h/14h Processor Function 6 Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- 00:18.6 Host bridge: Advanced Micro Devices [AMD] Family 12h/14h Processor Function 5 Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- 00:18.7 Host bridge: Advanced Micro Devices [AMD] Family 12h/14h Processor Function 7 Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- 03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 06) Subsystem: ASUSTeK Computer Inc. 
P8P67 and other motherboards Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 40 Region 0: I/O ports at e000 [size=256] Region 2: Memory at d0004000 (64-bit, prefetchable) [size=4K] Region 4: Memory at d0000000 (64-bit, prefetchable) [size=16K] Capabilities: [40] Power Management version 3 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+) Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+ Address: 00000000fee0100c Data: 4159 Capabilities: [70] Express (v2) Endpoint, MSI 01 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- MaxPayload 128 bytes, MaxReadReq 4096 bytes DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend- LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Latency L0 <512ns, L1 <64us ClockPM+ Surprise- LLActRep- BwNot- LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Not Supported, TimeoutDis+ DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB LnkSta2: Current De-emphasis Level: -6dB Capabilities: [b0] MSI-X: Enable- Count=4 Masked- Vector table: BAR=4 offset=00000000 PBA: BAR=4 offset=00000800 Capabilities: [d0] Vital Product Data Unknown small resource type 00, will not 
decode more. Capabilities: [100 v1] Advanced Error Reporting UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol- UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol- CESta: RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+ CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+ AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn- Capabilities: [140 v1] Virtual Channel Caps: LPEVC=0 RefClk=100ns PATEntryBits=1 Arb: Fixed- WRR32- WRR64- WRR128- Ctrl: ArbSelect=Fixed Status: InProgress- VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans- Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256- Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01 Status: NegoPending- InProgress- Capabilities: [160 v1] Device Serial Number 43-00-00-00-68-4c-e0-00 Kernel driver in use: r8169 Kernel modules: r8169 04:00.0 PCI bridge: ASMedia Technology Inc. Device 1080 (rev 01) (prog-if 01 [Subtractive decode]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Bus: primary=04, secondary=05, subordinate=05, sec-latency=32 I/O behind bridge: 0000d000-0000dfff Memory behind bridge: fea00000-feafffff Secondary status: 66MHz+ FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR+ BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B- PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn- Capabilities: [c0] Subsystem: ASUSTeK Computer Inc. 
Device 8489 05:01.0 Ethernet controller: Intel Corporation 82541PI Gigabit Ethernet Controller (rev 05) Subsystem: Intel Corporation PRO/1000 GT Desktop Adapter Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Interrupt: pin A routed to IRQ 19 Region 0: Memory at fea40000 (32-bit, non-prefetchable) [size=128K] Region 1: Memory at fea20000 (32-bit, non-prefetchable) [size=128K] Region 2: I/O ports at d080 [size=64] Expansion ROM at fea00000 [disabled] [size=128K] Capabilities: [dc] Power Management version 2 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME- Capabilities: [e4] PCI-X non-bridge device Command: DPERE- ERO+ RBC=512 OST=1 Status: Dev=00:00.0 64bit- 133MHz- SCD- USC- DC=simple DMMRBC=2048 DMOST=1 DMCRS=8 RSCEM- 266MHz- 533MHz- Kernel modules: e1000 05:02.0 FireWire (IEEE 1394): VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller (rev c0) (prog-if 10 [OHCI]) Subsystem: ASUSTeK Computer Inc. M4A series motherboard Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 32 (8000ns max), Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 16 Region 0: Memory at fea60000 (32-bit, non-prefetchable) [size=2K] Region 1: I/O ports at d000 [size=128] Capabilities: [50] Power Management version 2 Flags: PMEClk- DSI- D1- D2+ AuxCurrent=0mA PME(D0-,D1-,D2+,D3hot+,D3cold+) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- Kernel driver in use: firewire_ohci Kernel modules: firewire-ohci 06:00.0 USB Controller: ASMedia Technology Inc. ASM1042 SuperSpeed USB Host Controller (prog-if 30 [XHCI]) Subsystem: ASUSTeK Computer Inc. 
Device 8488 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx- Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 19 Region 0: Memory at fe900000 (64-bit, non-prefetchable) [size=32K] Capabilities: [50] MSI: Enable- Count=1/8 Maskable- 64bit+ Address: 0000000000000000 Data: 0000 Capabilities: [68] MSI-X: Enable+ Count=8 Masked- Vector table: BAR=0 offset=00002000 PBA: BAR=0 offset=00002080 Capabilities: [78] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=55mA PME(D0-,D1-,D2-,D3hot+,D3cold+) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- Capabilities: [80] Express (v2) Legacy Endpoint, MSI 00 DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <64ns, L1 <2us ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported- RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ MaxPayload 128 bytes, MaxReadReq 512 bytes DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend- LnkCap: Port #1, Speed 5GT/s, Width x1, ASPM L0s L1, Latency L0 unlimited, L1 unlimited ClockPM- Surprise- LLActRep- BwNot- LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Not Supported, TimeoutDis- DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS- Compliance De-emphasis: -6dB LnkSta2: Current De-emphasis Level: -3.5dB Capabilities: [100 v1] Virtual Channel Caps: LPEVC=0 RefClk=100ns PATEntryBits=1 Arb: Fixed- WRR32- WRR64- WRR128- Ctrl: ArbSelect=Fixed Status: InProgress- VC0: Caps: PATOffset=00 
MaxTimeSlots=1 RejSnoopTrans- Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256- Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01 Status: NegoPending- InProgress- Kernel driver in use: xhci_hcd Kernel modules: xhci-hcd 2011/12/1 Huang, Shane <Shane.Huang@amd.com>: > Boris, > >> Shane, can you guys take a look at this, sounds like some unfortunate >> sharing of AHCI and network IRQ numbers. > > I'm adding Dong who might help on this. > > > Thanks, > Shane > > ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Unhandled IRQs on AMD E-450
  2011-12-03 20:36 ` Jeroen Van den Keybus
@ 2011-12-04 12:48 ` Clemens Ladisch
  2011-12-04 13:36 ` Jeroen Van den Keybus
  0 siblings, 1 reply; 40+ messages in thread
From: Clemens Ladisch @ 2011-12-04 12:48 UTC (permalink / raw)
  To: Jeroen Van den Keybus
  Cc: Huang, Shane, Borislav Petkov, Nguyen, Dong, linux-kernel

Jeroen Van den Keybus wrote:
> [...]
> But to no avail. Both IRQ19 and IRQ16 keep becoming lost after a while.

You previously said that unloading e1000 made things better. Did this
affect both IRQs 16 and 19?

Can you check if this problem (on either 16 or 19) happens when you are
not using the e1000 port (i.e., unplugged)?

> I'm adding a full /proc/interrupts and lspci -vv output at the bottom,
> all from the 3.0.0 Ubuntu kernel. Feel free to mention any bad guys
> you recognize in this log.

The /proc/interrupts doesn't show e1000, but lspci does. ...?

Does the problem occur without fglrx?

To get the AHCI interrupt away from IRQ 19, try the patch below.
(But please don't show that ugly hack to any AMD guy. :)

> Is there any way of obtaining more output such as IO-APIC register
> states to verify that it is indeed a stuck IRQ input line and not an
> unsuccessful EOI ack?

In theory, lspci's "Status: ... INTx+" shows an active interrupt line.
Regards,
Clemens

--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -2906,6 +2906,48 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x65f8, quirk_intel_mc_errata);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x65f9, quirk_intel_mc_errata);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x65fa, quirk_intel_mc_errata);
 
+#if defined(CONFIG_PCI_MSI) && \
+	(defined(CONFIG_SATA_AHCI) || defined(CONFIG_SATA_AHCI_MODULE))
+static void __init sb7x0_ahci_msi_enable(struct pci_dev *dev)
+{
+	u8 rev, ptr;
+	int where;
+	u32 misc_control;
+
+	pci_bus_read_config_byte(dev->bus, PCI_DEVFN(0x14, 0),
+				 PCI_REVISION_ID, &rev);
+	if (rev < 0x3c) /* A14 */
+		return;
+
+	pci_read_config_byte(dev, 0x34, &ptr);
+	if (ptr == 0x70) {
+		where = 0x34;
+	} else {
+		pci_read_config_byte(dev, 0x61, &ptr);
+		if (ptr == 0x70)
+			where = 0x61;
+		else
+			return;
+	}
+
+	pci_read_config_byte(dev, 0x51, &ptr);
+	if (ptr != 0x70)
+		return;
+
+	pci_read_config_dword(dev, 0x40, &misc_control);
+	misc_control |= 1;
+	pci_write_config_dword(dev, 0x40, misc_control);
+
+	pci_write_config_byte(dev, where, 0x50);
+
+	misc_control &= ~1;
+	pci_write_config_dword(dev, 0x40, misc_control);
+
+	dev_dbg(&dev->dev, "AHCI: enabled MSI\n");
+}
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATI, 0x4391, sb7x0_ahci_msi_enable);
+#endif
+
 static void pci_do_fixups(struct pci_dev *dev, struct pci_fixup *f,
 			  struct pci_fixup *end)
 {

^ permalink raw reply [flat|nested] 40+ messages in thread
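The quirk above works on the PCI capability list, which is a singly linked list in config space: the byte at 0x34 points at the first capability, and each capability holds its ID at offset +0 and the pointer to the next one at offset +1. One reading of the patch (an interpretation, not something confirmed in this thread) is that it rewrites a pointer so that an MSI capability at 0x50, present but not linked, becomes reachable ahead of the capability at 0x70. The Python sketch below only illustrates that list walk on a synthetic 256-byte config space; `walk_capabilities` and `sample_config` are hypothetical names, and the sample layout is modelled loosely on the lspci output for 00:11.0, not read from real hardware.

```python
def walk_capabilities(cfg) -> list:
    """Return (offset, capability ID) pairs, following the list from 0x34."""
    caps = []
    seen = set()
    ptr = cfg[0x34] & 0xFC              # the bottom two pointer bits are reserved
    while ptr and ptr not in seen:      # stop at a null pointer or on a loop
        seen.add(ptr)
        caps.append((ptr, cfg[ptr]))    # byte +0 is the capability ID
        ptr = cfg[ptr + 1] & 0xFC       # byte +1 is the next pointer
    return caps


def sample_config() -> bytearray:
    """Synthetic config space (assumed layout, for illustration only)."""
    cfg = bytearray(256)
    cfg[0x34] = 0x70                    # list head points at 0x70
    cfg[0x70], cfg[0x71] = 0x12, 0xA4   # SATA capability, next is 0xa4
    cfg[0xA4], cfg[0xA5] = 0x13, 0x00   # Advanced Features, end of list
    cfg[0x50], cfg[0x51] = 0x05, 0x70   # MSI capability, not linked yet
    return cfg


if __name__ == "__main__":
    cfg = sample_config()
    print("before:", [(hex(o), hex(i)) for o, i in walk_capabilities(cfg)])
    cfg[0x34] = 0x50                    # the re-link step, done here on a copy
    print("after: ", [(hex(o), hex(i)) for o, i in walk_capabilities(cfg)])
```

On real hardware the pointer registers are normally read-only, which is presumably why the patch sets and then clears bit 0 of the register at 0x40 around the write.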
* Re: Unhandled IRQs on AMD E-450
  2011-12-04 12:48 ` Clemens Ladisch
@ 2011-12-04 13:36 ` Jeroen Van den Keybus
  2011-12-04 13:54 ` Jeroen Van den Keybus
  0 siblings, 1 reply; 40+ messages in thread
From: Jeroen Van den Keybus @ 2011-12-04 13:36 UTC (permalink / raw)
  To: Clemens Ladisch; +Cc: Huang, Shane, Borislav Petkov, Nguyen, Dong, linux-kernel

> You previously said that unloading e1000 made things better. Did this
> affect both IRQs 16 and 19?

No, this only affects IRQ 19. IRQ 16 usually dies within 15 minutes to
2 hours.

> Can you check if this problem (on either 16 or 19) happens when you are
> not using the e1000 port (i.e., unplugged)?

The problem occurs both with the e1000 idle (unplugged) and under heavy
usage (plugged). Time to failure is also of the same order of magnitude
(i.e. 1 to 30 minutes). So far, I have never had IRQ 19 disabled with
the e1000 removed.

The e1000 delivered with Ubuntu isn't particularly recent
(7.3.21-k8-NAPI). Before I suspected a kernel problem, I had already
tried the 8.0.35 driver compiled from source obtained from Intel.
Exactly the same result: IRQ 19 gets banned.

> The /proc/interrupts doesn't show e1000, but lspci does. ...?

You are right. I took that lspci after removing e1000, sorry for the
confusion. Please see the new /proc/interrupts below.

> Does the problem occur without fglrx?

Good question. I'll try that immediately. Stand by.

> To get the AHCI interrupt away from IRQ 19, try the patch below.
> (But please don't show that ugly hack to any AMD guy. :)

I'll try that next too.

>> Is there any way of obtaining more output such as IO-APIC register
>> states to verify that it is indeed a stuck IRQ input line and not an
>> unsuccessful EOI ack?

> In theory, lspci's "Status: ... INTx+" shows an active interrupt line.

OK. In that case, going by the lspci output taken from a failed system,
no listed device has INTx+.

Thanks,

J.
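The INTx+ check discussed above can be automated rather than done by eye. The following is a minimal Python sketch, assuming the `lspci -vv` text layout seen in this thread (device headers starting with a `bus:device.function` address without a domain prefix, and a `Status:` line carrying an `INTx+` or `INTx-` token). The function name and the sample text are hypothetical and shortened, not taken from the reporter's machine.

```python
import re

# A device header starts in column 0 with bus:device.function, e.g. "05:01.0 ".
_BDF = re.compile(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s")


def devices_with_intx_asserted(lspci_text: str) -> list:
    """Return the addresses of devices whose Status line shows INTx+."""
    asserted = []
    current = None
    for line in lspci_text.splitlines():
        match = _BDF.match(line)
        if match:
            current = match.group(1)    # remember which device we are inside
            continue
        fields = line.split()
        # Only the PCI Status line carries an INTx token; the power
        # management "Status: D0 ..." line does not, so it never matches.
        if current and fields and fields[0] == "Status:" and "INTx+" in fields:
            asserted.append(current)
    return asserted


# Hypothetical, shortened sample in the same layout as the dumps in this thread.
SAMPLE = """\
00:11.0 SATA controller: example device
\tControl: I/O+ Mem+ BusMaster+ DisINTx-
\tStatus: Cap+ 66MHz+ INTx-
05:01.0 Ethernet controller: example device
\tControl: I/O+ Mem+ BusMaster+ DisINTx-
\tStatus: Cap+ 66MHz+ INTx+
"""

if __name__ == "__main__":
    print(devices_with_intx_asserted(SAMPLE))
```

Keep in mind that lspci only gives a snapshot: a device that asserts its line briefly may well show INTx- by the time the register is read, so an empty result does not rule out a misbehaving device.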
$ cat /proc/interrupts (with e1000 (eth1) still loaded - this dump is after IRQ 19 is killed)
           CPU0       CPU1
  0:         45         26   IO-APIC-edge      timer
  1:          1          1   IO-APIC-edge      i8042
  5:          0          0   IO-APIC-edge      parport0
  7:          1          0   IO-APIC-edge
  8:          1          0   IO-APIC-edge      rtc0
  9:          0          0   IO-APIC-fasteoi   acpi
 12:          1          3   IO-APIC-edge      i8042
 16:        121        559   IO-APIC-fasteoi   firewire_ohci, hda_intel
 17:          3        110   IO-APIC-fasteoi   ehci_hcd:usb1, ehci_hcd:usb2, ehci_hcd:usb3
 18:          0          4   IO-APIC-fasteoi   ohci_hcd:usb4, ohci_hcd:usb5, ohci_hcd:usb6, ohci_hcd:usb7
 19:     198169      11097   IO-APIC-fasteoi   ahci, eth1
 40:       3601         71   PCI-MSI-edge      eth0
 41:          0          0   PCI-MSI-edge      xhci_hcd
 42:          0          0   PCI-MSI-edge      xhci_hcd
 43:          0          0   PCI-MSI-edge      xhci_hcd
 44:          4        298   PCI-MSI-edge      hda_intel
 45:          0          3   PCI-MSI-edge      fglrx[0]@PCI:0:1:0
NMI:          0          0   Non-maskable interrupts
LOC:     231521     231457   Local timer interrupts
SPU:          0          0   Spurious interrupts
PMI:          0          0   Performance monitoring interrupts
IWI:          0          0   IRQ work interrupts
RES:      37942      34198   Rescheduling interrupts
CAL:        256        225   Function call interrupts
TLB:        309        243   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
MCE:          0          0   Machine check exceptions
MCP:         26         26   Machine check polls
ERR:          1
MIS:          0

^ permalink raw reply [flat|nested] 40+ messages in thread
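Since both failing lines (16 and 19) are shared between several handlers, it can help to list the shared lines mechanically. Below is a small Python sketch that does this for text in the 3.0-era /proc/interrupts layout shown above (per-CPU counts, one chip/type column, then a comma-separated handler list). Newer kernels print extra columns, so the column arithmetic here is an assumption tied to that layout; `shared_irq_lines` and the shortened sample are illustrative names, with the sample rows copied from the dump above.

```python
def shared_irq_lines(text: str) -> dict:
    """Map IRQ number -> handler names, for lines with two or more handlers."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpu = len(lines[0].split())        # header row: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        if not label.isdigit():
            continue                    # skip NMI, LOC, ERR and similar rows
        fields = rest.split()
        # Layout assumed: ncpu counters, one chip/type field, then handlers.
        handlers = " ".join(fields[ncpu + 1:]).split(", ")
        handlers = [name for name in handlers if name]
        if len(handlers) > 1:
            shared[int(label)] = handlers
    return shared


# Shortened sample; rows copied from the dump in this message.
SAMPLE = """\
           CPU0       CPU1
 16:        121        559   IO-APIC-fasteoi   firewire_ohci, hda_intel
 19:     198169      11097   IO-APIC-fasteoi   ahci, eth1
 40:       3601         71   PCI-MSI-edge      eth0
NMI:          0          0   Non-maskable interrupts
"""

if __name__ == "__main__":
    for irq, names in sorted(shared_irq_lines(SAMPLE).items()):
        print(irq, names)
```

On a live system the input would come from reading /proc/interrupts directly; the MSI lines (40 and up) never show up as shared because each MSI vector has a single handler.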
* Re: Unhandled IRQs on AMD E-450 2011-12-04 13:36 ` Jeroen Van den Keybus @ 2011-12-04 13:54 ` Jeroen Van den Keybus 2011-12-04 14:08 ` Jeroen Van den Keybus 0 siblings, 1 reply; 40+ messages in thread From: Jeroen Van den Keybus @ 2011-12-04 13:54 UTC (permalink / raw) To: Clemens Ladisch; +Cc: Huang, Shane, Borislav Petkov, Nguyen, Dong, linux-kernel >> Does the problem occur without fglrx? > > Good question. I'll try that immediately. Stand by. I'm afraid it didn't matter. dmesg log: -- rmmod'ing e1000 in order not to get stuck while shutting down the X system [ 42.990418] e1000 0000:05:01.0: PCI INT A disabled -- Killed lightdm [ 102.250141] [fglrx] IRQ 45 Disabled [ 102.405031] HDMI hot plug event: Pin=3 Presence_Detect=1 ELD_Valid=1 [ 102.405063] HDMI status: Pin=3 Presence_Detect=1 ELD_Valid=1 -- rmmod'ed fglrx [ 142.964281] pci 0000:00:01.0: PCI INT A disabled [ 142.964323] [fglrx] module unloaded - fglrx 8.90.5 [Oct 12 2011] -- modprobe'd e1000 again [ 185.635457] e1000: Intel(R) PRO/1000 Network Driver - version 7.3.21-k8-NAPI [ 185.635469] e1000: Copyright (c) 1999-2006 Intel Corporation. 
[ 185.635612] e1000 0000:05:01.0: PCI INT A -> GSI 19 (level, low) -> IRQ 19 [ 186.213243] e1000 0000:05:01.0: eth1: (PCI:33MHz:32-bit) 00:0e:0c:d9:6f:ca [ 186.213263] e1000 0000:05:01.0: eth1: Intel(R) PRO/1000 Network Connection [ 186.248807] ADDRCONF(NETDEV_UP): eth1: link is not ready -- Lost IRQ 19 [ 354.446192] irq 19: nobody cared (try booting with the "irqpoll" option) [ 354.446343] Pid: 0, comm: swapper Tainted: P 3.0.0-13-generic #22-Ubuntu [ 354.446351] Call Trace: [ 354.446357] <IRQ> [<ffffffff810cf96d>] __report_bad_irq+0x3d/0xe0 [ 354.446385] [<ffffffff810cfd95>] note_interrupt+0x135/0x180 [ 354.446396] [<ffffffff810cdd89>] handle_irq_event_percpu+0xa9/0x220 [ 354.446406] [<ffffffff810cdf4e>] handle_irq_event+0x4e/0x80 [ 354.446417] [<ffffffff810d06c4>] handle_fasteoi_irq+0x64/0xf0 [ 354.446427] [<ffffffff8100c252>] handle_irq+0x22/0x40 [ 354.446438] [<ffffffff815f422a>] do_IRQ+0x5a/0xe0 [ 354.446447] [<ffffffff815ea913>] common_interrupt+0x13/0x13 [ 354.446453] <EOI> [<ffffffff813725fb>] ? arch_local_irq_enable+0x8/0xd [ 354.446476] [<ffffffff810887a5>] ? sched_clock_idle_wakeup_event+0x15/0x20 [ 354.446486] [<ffffffff813730ed>] acpi_idle_enter_simple+0xcc/0x102 [ 354.446497] [<ffffffff814ab5c2>] cpuidle_idle_call+0xa2/0x1d0 [ 354.446509] [<ffffffff8100920b>] cpu_idle+0xab/0x100 [ 354.446520] [<ffffffff815b858e>] rest_init+0x72/0x74 [ 354.446531] [<ffffffff81ad0c2b>] start_kernel+0x3d4/0x3df [ 354.446540] [<ffffffff81ad0388>] x86_64_start_reservations+0x132/0x136 [ 354.446552] [<ffffffff81ad0140>] ? 
early_idt_handlers+0x140/0x140 [ 354.446561] [<ffffffff81ad0459>] x86_64_start_kernel+0xcd/0xdc [ 354.446568] handlers: [ 354.446642] [<ffffffffa0001f40>] ahci_interrupt [ 354.446743] [<ffffffffa00496c0>] e1000_intr [ 354.446830] Disabling IRQ #19 /proc/interrupts is consistent (IRQ45 now gone): CPU0 CPU1 0: 45 3 IO-APIC-edge timer 1: 0 4 IO-APIC-edge i8042 5: 0 0 IO-APIC-edge parport0 7: 1 0 IO-APIC-edge 8: 1 0 IO-APIC-edge rtc0 9: 0 0 IO-APIC-fasteoi acpi 12: 0 6 IO-APIC-edge i8042 16: 11 559 IO-APIC-fasteoi firewire_ohci, hda_intel 17: 6 104 IO-APIC-fasteoi ehci_hcd:usb1, ehci_hcd:usb2, ehci_hcd:usb3 18: 0 4 IO-APIC-fasteoi ohci_hcd:usb4, ohci_hcd:usb5, ohci_hcd:usb6, ohci_hcd:usb7 19: 200703 10373 IO-APIC-fasteoi ahci, eth1 40: 1001 66 PCI-MSI-edge eth0 41: 0 0 PCI-MSI-edge xhci_hcd 42: 0 0 PCI-MSI-edge xhci_hcd 43: 0 0 PCI-MSI-edge xhci_hcd 44: 1 427 PCI-MSI-edge hda_intel NMI: 0 0 Non-maskable interrupts LOC: 12670 23434 Local timer interrupts SPU: 0 0 Spurious interrupts PMI: 0 0 Performance monitoring interrupts IWI: 0 0 IRQ work interrupts RES: 4824 3363 Rescheduling interrupts CAL: 317 240 Function call interrupts TLB: 388 264 TLB shootdowns TRM: 0 0 Thermal event interrupts THR: 0 0 Threshold APIC interrupts MCE: 0 0 Machine check exceptions MCP: 3 3 Machine check polls ERR: 1 MIS: 0 >> To get the AHCI interrupt away from IRQ 19, try the patch below. >> (But please don't show that ugly hack to any AMD guy. :) > I'll try that next too. Moving on to the patch... Rgds, J. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Unhandled IRQs on AMD E-450 2011-12-04 13:54 ` Jeroen Van den Keybus @ 2011-12-04 14:08 ` Jeroen Van den Keybus 2011-12-04 15:06 ` Jeroen Van den Keybus 0 siblings, 1 reply; 40+ messages in thread From: Jeroen Van den Keybus @ 2011-12-04 14:08 UTC (permalink / raw) To: Clemens Ladisch; +Cc: Huang, Shane, Borislav Petkov, Nguyen, Dong, linux-kernel Clemens, FYI, > In theory, lspci's "Status: ... INTx+" shows an active interrupt line. I have succeeded in catching a lspci on the SATA controller with INTx+ while IRQ 19 is disabled. 00:11.0 SATA controller: ATI Technologies Inc SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] (rev 40) (prog-if 01 [AHCI 1.0]) Subsystem: ASUSTeK Computer Inc. Device 8496 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx+ Latency: 32 Interrupt: pin A routed to IRQ 19 Region 0: I/O ports at f140 [size=8] Region 1: I/O ports at f130 [size=4] Region 2: I/O ports at f120 [size=8] Region 3: I/O ports at f110 [size=4] Region 4: I/O ports at f100 [size=16] Region 5: Memory at feb4f000 (32-bit, non-prefetchable) [size=1K] Capabilities: [70] SATA HBA v1.0 InCfgSpace Capabilities: [a4] PCI Advanced Features AFCap: TP+ FLR+ AFCtrl: FLR- AFStatus: TP- Kernel driver in use: ahci Kernel modules: ahci The fact that the next lspci's showed INTx- shows that its pin is definitely not stuck, does it not ? When I do e.g. $ du -h -x --max-depth=1 in a second terminal, it gets the line nicely back to INTx+ due to outstanding SATA commands. Cancelling the above du command (would otherwise take ages to complete) results in INTx-. Continuing with your patch... J. ^ permalink raw reply [flat|nested] 40+ messages in thread
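For readers following along: the INTx+/INTx- flag that lspci prints on the Status line corresponds, as far as I know, to the Interrupt Status bit (bit 3) of the PCI Status register, and DisINTx on the Control line to the Interrupt Disable bit (bit 10) of the Command register. A minimal C sketch of that decoding follows. The struct and function names are hypothetical, and reading the two registers out of config space is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit masks as defined for PCI 2.3 and later config space. */
#define PCI_COMMAND_INTX_DISABLE 0x0400u /* Command register, bit 10 */
#define PCI_STATUS_INTERRUPT     0x0008u /* Status register, bit 3   */

/* Hypothetical snapshot of the two 16-bit registers of one function. */
struct cfg_snapshot {
	uint16_t command;
	uint16_t status;
};

/* True if the function reports a pending INTx ("INTx+" in lspci). */
static bool intx_pending(const struct cfg_snapshot *cfg)
{
	assert(cfg != NULL);
	return (cfg->status & PCI_STATUS_INTERRUPT) != 0;
}

/*
 * True if the pending interrupt is actually signalled on INTx, i.e.
 * it is pending and INTx has not been disabled ("DisINTx-").
 */
static bool intx_asserted(const struct cfg_snapshot *cfg)
{
	return intx_pending(cfg) &&
	       !(cfg->command & PCI_COMMAND_INTX_DISABLE);
}
```

Applied to a snapshot with DisINTx- and INTx+, as in the SATA dump above, `intx_asserted()` returns true. That only says the device is requesting an interrupt at that instant; a single reading cannot tell a stuck line from a busy one, which is why the later INTx- readings matter.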
* Re: Unhandled IRQs on AMD E-450
  2011-12-04 14:08 ` Jeroen Van den Keybus
@ 2011-12-04 15:06 ` Jeroen Van den Keybus
  2011-12-04 16:59 ` Clemens Ladisch
  0 siblings, 1 reply; 40+ messages in thread
From: Jeroen Van den Keybus @ 2011-12-04 15:06 UTC (permalink / raw)
To: Clemens Ladisch; +Cc: Huang, Shane, Borislav Petkov, Nguyen, Dong, linux-kernel

I finished patching and testing kernel 3.2.0-rc2. I also added
#define DEBUG and am running with the cmdline "debug apic=debug".

The AHCI interrupt was successfully moved to IRQ 40. Good patchwork!

After only 10 min. of uptime, IRQ 19 has not been revoked.

However, I already lost IRQ 16 after 2 minutes. This kernel doesn't
have support for any audio, so there was only firewire_ohci on this
line. However, lspci for this device shows a firm INTx+. I cleared
that by rmmod'ing and modprobe'ing firewire_ohci again.

After 20 min. IRQ 19 was lost again.

Now _I_ am lost. The only thing that IRQ 16 and IRQ 19 have in common
is that there are devices on them that do have an INTx line but do not
use it (MSI instead). However, I ran this kernel with pci=nomsi
(earlier post) and IRQs 16 and 19 went down as well.

IRQs 17 and 18 were never revoked.

Yours truly puzzled,

J.

$ uname -r 3.2.0-rc2 $ cat /proc/interrupts CPU0 CPU1 0: 44 2 IO-APIC-edge timer 1: 0 0 IO-APIC-edge i8042 7: 1 0 IO-APIC-edge 8: 0 1 IO-APIC-edge rtc0 9: 0 0 IO-APIC-fasteoi acpi 12: 0 2 IO-APIC-edge i8042 16: 15 99986 IO-APIC-fasteoi firewire_ohci 17: 0 110 IO-APIC-fasteoi ehci_hcd:usb6, ehci_hcd:usb8, ehci_hcd:usb9 18: 3 37 IO-APIC-fasteoi ohci_hcd:usb3, ohci_hcd:usb4, ohci_hcd:usb5, ohci_hcd:usb7 19: 156 21 IO-APIC-fasteoi eth1 40: 26986 6951 PCI-MSI-edge ahci 41: 1268 125 PCI-MSI-edge eth0 42: 0 0 PCI-MSI-edge xhci_hcd 43: 0 0 PCI-MSI-edge xhci_hcd 44: 0 0 PCI-MSI-edge xhci_hcd NMI: 0 0 Non-maskable interrupts LOC: 16629 27721 Local timer interrupts SPU: 0 0 Spurious interrupts PMI: 0 0 Performance monitoring interrupts IWI: 0 0 IRQ work interrupts RES: 5957 3060 Rescheduling interrupts CAL: 138 174 Function call interrupts TLB: 345 234 TLB shootdowns THR: 0 0 Threshold APIC interrupts MCE: 0 0 Machine check exceptions MCP: 2 2 Machine check polls ERR: 1 MIS: 0 $ sudo lspci -vv -s05:02.0 05:02.0 FireWire (IEEE 1394): VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller (rev c0) (prog-if 10 [OHCI]) Subsystem: ASUSTeK Computer Inc. 
M4A series motherboard Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx+ Latency: 32 (8000ns max), Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 16 Region 0: Memory at fea60000 (32-bit, non-prefetchable) [size=2K] Region 1: I/O ports at d000 [size=128] Capabilities: [50] Power Management version 2 Flags: PMEClk- DSI- D1- D2+ AuxCurrent=0mA PME(D0-,D1-,D2+,D3hot+,D3cold+) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- Kernel driver in use: firewire_ohci Kernel modules: firewire-ohci dmesg log from the patch: [ 0.279002] pci 0000:00:11.0: [1002:4391] type 0 class 0x000106 [ 0.279021] pci 0000:00:11.0: calling quirk_no_ata_d3+0x0/0x1b [ 0.279030] pci 0000:00:11.0: calling quirk_mmio_always_on+0x0/0x17 [ 0.279053] pci 0000:00:11.0: reg 10: [io 0xf140-0xf147] [ 0.279072] pci 0000:00:11.0: reg 14: [io 0xf130-0xf133] [ 0.279091] pci 0000:00:11.0: reg 18: [io 0xf120-0xf127] [ 0.279110] pci 0000:00:11.0: reg 1c: [io 0xf110-0xf113] [ 0.279129] pci 0000:00:11.0: reg 20: [io 0xf100-0xf10f] [ 0.279148] pci 0000:00:11.0: reg 24: [mem 0xfeb4f000-0xfeb4f3ff] [ 0.279174] pci 0000:00:11.0: calling sb7x0_ahci_msi_enable+0x0/0x112 [ 0.279195] pci 0000:00:11.0: AHCI: enabled MSI [ 0.279204] pci 0000:00:11.0: calling quirk_resource_alignment+0x0/0x16b dmesg log from lost IRQ 16: [ 104.618738] irq 16: nobody cared (try booting with the "irqpoll" option) [ 104.618750] Pid: 0, comm: kworker/0:0 Not tainted 3.2.0-rc2 #7 [ 104.618754] Call Trace: [ 104.618757] <IRQ> [<ffffffff810bb9cd>] __report_bad_irq+0x3d/0xe0 [ 104.618774] [<ffffffff810bbe0d>] note_interrupt+0x14d/0x210 [ 104.618780] [<ffffffff810b98a4>] handle_irq_event_percpu+0xc4/0x290 [ 104.618786] [<ffffffff810b9ab8>] handle_irq_event+0x48/0x70 [ 104.618792] [<ffffffff810bc7fa>] handle_fasteoi_irq+0x5a/0xe0 [ 104.618799] [<ffffffff81004012>] handle_irq+0x22/0x40 
[ 104.618805] [<ffffffff81506baa>] do_IRQ+0x5a/0xd0 [ 104.618812] [<ffffffff814fe76b>] common_interrupt+0x6b/0x6b [ 104.618815] <EOI> [<ffffffff81009906>] ? native_sched_clock+0x26/0x70 [ 104.618832] [<ffffffffa00c50d3>] ? acpi_idle_enter_simple+0xc5/0x102 [processor] [ 104.618840] [<ffffffffa00c50ce>] ? acpi_idle_enter_simple+0xc0/0x102 [processor] [ 104.618848] [<ffffffff814223b8>] cpuidle_idle_call+0xb8/0x230 [ 104.618855] [<ffffffff81001215>] cpu_idle+0xc5/0x130 [ 104.618861] [<ffffffff814efa56>] start_secondary+0x1ed/0x1f4 [ 104.618866] handlers: [ 104.618874] [<ffffffffa00ad280>] irq_handler [ 104.618878] Disabling IRQ #16 dmesg log from firewire_ohci reloading: [ 1041.257730] firewire_ohci 0000:05:02.0: PCI INT A disabled [ 1041.257737] firewire_ohci: Removed fw-ohci device. [ 1062.915595] firewire_ohci 0000:05:02.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16 [ 1062.915610] firewire_ohci 0000:05:02.0: calling quirk_via_vlink+0x0/0xd0 [ 1062.980387] firewire_ohci: Added fw-ohci device 0000:05:02.0, OHCI v1.10, 4 IR + 8 IT contexts, quirks 0x11 [ 1063.481956] firewire_core: created device fw0: GUID 001e8c0000509146, S400 dmesg log from lost IRQ 19: [ 1205.490580] irq 19: nobody cared (try booting with the "irqpoll" option) [ 1205.490592] Pid: 0, comm: swapper Not tainted 3.2.0-rc2 #7 [ 1205.490596] Call Trace: [ 1205.490599] <IRQ> [<ffffffff810bb9cd>] __report_bad_irq+0x3d/0xe0 [ 1205.490616] [<ffffffff810bbe0d>] note_interrupt+0x14d/0x210 [ 1205.490623] [<ffffffff810b98a4>] handle_irq_event_percpu+0xc4/0x290 [ 1205.490629] [<ffffffff810b9ab8>] handle_irq_event+0x48/0x70 [ 1205.490635] [<ffffffff810bc7fa>] handle_fasteoi_irq+0x5a/0xe0 [ 1205.490642] [<ffffffff81004012>] handle_irq+0x22/0x40 [ 1205.490649] [<ffffffff81506baa>] do_IRQ+0x5a/0xd0 [ 1205.490655] [<ffffffff814fe76b>] common_interrupt+0x6b/0x6b [ 1205.490658] <EOI> [<ffffffff81009906>] ? native_sched_clock+0x26/0x70 [ 1205.490677] [<ffffffffa00c50d3>] ? 
acpi_idle_enter_simple+0xc5/0x102 [processor] [ 1205.490684] [<ffffffffa00c50ce>] ? acpi_idle_enter_simple+0xc0/0x102 [processor] [ 1205.490692] [<ffffffff814223b8>] cpuidle_idle_call+0xb8/0x230 [ 1205.490699] [<ffffffff81001215>] cpu_idle+0xc5/0x130 [ 1205.490706] [<ffffffff814e2370>] rest_init+0x94/0xa4 [ 1205.490713] [<ffffffff81aafba4>] start_kernel+0x3a7/0x3b4 [ 1205.490719] [<ffffffff81aaf322>] x86_64_start_reservations+0x132/0x136 [ 1205.490725] [<ffffffff81aaf416>] x86_64_start_kernel+0xf0/0xf7 [ 1205.490729] handlers: [ 1205.490736] [<ffffffffa01164f0>] e1000_intr [ 1205.490740] Disabling IRQ #19 ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Unhandled IRQs on AMD E-450
  2011-12-04 15:06 ` Jeroen Van den Keybus
@ 2011-12-04 16:59 ` Clemens Ladisch
  2011-12-06  0:06 ` Jeroen Van den Keybus
  0 siblings, 1 reply; 40+ messages in thread
From: Clemens Ladisch @ 2011-12-04 16:59 UTC (permalink / raw)
To: Jeroen Van den Keybus
Cc: Huang, Shane, Borislav Petkov, Nguyen, Dong, linux-kernel, linux1394-devel

Jeroen Van den Keybus wrote:
> The problem occurs with the e1000 idle (unplugged) and under heavy
> usage (plugged). Time to failure is also in the same order of
> magnitude (i.e. 1..30 minutes). As of now, I never had IRQ 19 disabled
> with the e1000 removed. The e1000 delivered with Ubuntu isn't
> particularly recent (7.3.21-k8-NAPI).

That version number doesn't mean much; there have been many changes to
the kernel driver since it was last updated.

One interesting patch is <http://git.kernel.org/linus/4c11b8adbc48>;
please check if you have it (the file was recently moved into
drivers/net/ethernet/intel/e1000/). But it's from January, so your
3.2-rc* should already have it.

> I have succeeded in catching a lspci on the SATA controller with INTx+
> while IRQ 19 is disabled. [...]
> The fact that the next lspci's showed INTx- shows that its pin is
> definitely not stuck, does it not ?

Indeed; that SATA controller appears to work fine.

> I already lost IRQ 16 after 2 minutes. This kernel doesn't have
> support for any audio, so there was only firewire_ohci on this line.
> However, lspci for this device shows a firm INTx+.

Your VT6308 is a widely-used chip, and there are no known
interrupt-related problems with it.

This PCI status register is part of the device itself, i.e., the
FireWire controller chip; there is nothing in the rest of the system,
hardware or software, that could affect this INTx value. This means
that the controller itself thinks that there is some FireWire-related
reason for the interrupts.

To instruct the firewire-ohci driver to log all interrupts and what the
device thinks the reason for them is, please run:

  echo 4 > /sys/module/firewire_ohci/parameters/debug

As long as there is nothing connected, there should be nothing but
a timing interrupt every 64 seconds, like this:

  firewire_ohci: IRQ 00200000 cycle64Seconds

> After 20 min. IRQ 19 was lost again.
>
> Now _I_ am lost. The only thing that IRQ 16 and IRQ 19 have in common
> is that there are devices on them that do have an INTx line but do not
> use it (MSI instead). However, I ran this kernel with pci=nomsi
> (earlier post) and IRQs 16 and 19 went down as well.

From the information available so far, it appears that you have two
similar but _independent_ problems with the e1000 and firewire devices.

(It might be possible that static electricity zapped both your PCI card
and the FireWire controller (which is directly near the first PCI slot),
or something like that.)

Regards,
Clemens

* Re: Unhandled IRQs on AMD E-450
  2011-12-04 16:59 ` Clemens Ladisch
@ 2011-12-06  0:06 ` Jeroen Van den Keybus
  2011-12-08 11:33 ` Jeroen Van den Keybus
  0 siblings, 1 reply; 40+ messages in thread
From: Jeroen Van den Keybus @ 2011-12-06 0:06 UTC (permalink / raw)
To: Clemens Ladisch
Cc: Huang, Shane, Borislav Petkov, Nguyen, Dong, linux-kernel, linux1394-devel

> As long as there is nothing connected, there should be nothing but
> a timing interrupt every 64 seconds, like this:
> firewire_ohci: IRQ 00200000 cycle64Seconds

That is correct. I see those messages indeed. Until now, however, I
have not been able (or lucky enough) to witness another IRQ 16
banning. Still running the test.

But...

I have also been looking into the e1000 driver. What I did was add
printk's on every invocation of e1000_intr(). I used
printk_ratelimit(), as well as a local occurrence counter. There are
three places where I did the check, and this is what I wrote to the
log at each:

1. Right after determining that the Interrupt Cause Register (ICR) is
   zero. That means the interrupt was not meant for or caused by the
   e1000 (barring a hardware failure).
   ==> e1000: not ours
2. Right after determining that the ICR is set, but the driver is not
   active.
   ==> e1000: ours, but down
3. At the end of e1000_intr().
   ==> e1000: ours

The result: [113757.420967] e1000: ours (240) [113759.424936] e1000: ours (241) [113761.428516] e1000: ours (242) [113761.428528] e1000: not ours (0) [113761.428536] e1000: not ours (1) [113761.428543] e1000: not ours (2) [113761.428551] e1000: not ours (3) [113761.428558] e1000: not ours (4) [113761.428566] e1000: not ours (5) [113761.428579] e1000: not ours (6) [113762.676114] irq 19: nobody cared (try booting with the "irqpoll" option) [113762.676126] Pid: 0, comm: swapper Not tainted 3.2.0-rc2 #7 [113762.676130] Call Trace: [113762.676133] <IRQ> [<ffffffff810bb9cd>] __report_bad_irq+0x3d/0xe0 [113762.676151] [<ffffffff810bbe0d>] note_interrupt+0x14d/0x210 [113762.676157] [<ffffffff810b98a4>] handle_irq_event_percpu+0xc4/0x290 [113762.676164] [<ffffffff810b9ab8>] handle_irq_event+0x48/0x70 [113762.676170] [<ffffffff810bc7fa>] handle_fasteoi_irq+0x5a/0xe0 [113762.676177] [<ffffffff81004012>] handle_irq+0x22/0x40 [113762.676183] [<ffffffff81506baa>] do_IRQ+0x5a/0xd0 [113762.676189] [<ffffffff814fe76b>] common_interrupt+0x6b/0x6b [113762.676192] <EOI> [<ffffffff81009906>] ? native_sched_clock+0x26/0x70 [113762.676211] [<ffffffffa00c50d3>] ? acpi_idle_enter_simple+0xc5/0x102 [processor] [113762.676219] [<ffffffffa00c50ce>] ? 
acpi_idle_enter_simple+0xc0/0x102 [processor]
[113762.676227] [<ffffffff814223b8>] cpuidle_idle_call+0xb8/0x230
[113762.676234] [<ffffffff81001215>] cpu_idle+0xc5/0x130
[113762.676241] [<ffffffff814e2370>] rest_init+0x94/0xa4
[113762.676248] [<ffffffff81aafba4>] start_kernel+0x3a7/0x3b4
[113762.676254] [<ffffffff81aaf322>] x86_64_start_reservations+0x132/0x136
[113762.676260] [<ffffffff81aaf416>] x86_64_start_kernel+0xf0/0xf7
[113762.676264] handlers:
[113762.676271] [<ffffffffa01164f0>] e1000_intr
[113762.676275] Disabling IRQ #19
[113768.766055] firewire_ohci: IRQ 00200000 cycle64Seconds
[113832.768181] firewire_ohci: IRQ 00200000 cycle64Seconds
[113896.770536] firewire_ohci: IRQ 00200000 cycle64Seconds
[113960.772976] firewire_ohci: IRQ 00200000 cycle64Seconds
[114024.775340] firewire_ohci: IRQ 00200000 cycle64Seconds
[114088.776662] firewire_ohci: IRQ 00200000 cycle64Seconds
[114152.778105] firewire_ohci: IRQ 00200000 cycle64Seconds
[114200.220155] e1000 0000:05:01.0: PCI INT A disabled
[114216.779703] firewire_ohci: IRQ 00200000 cycle64Seconds
[114265.335175] e1000: Intel(R) PRO/1000 Network Driver - version 7.3.21-k8-NAPI
[114265.335185] e1000: Copyright (c) 1999-2006 Intel Corporation.
[114265.335268] e1000 0000:05:01.0: PCI INT A -> GSI 19 (level, low) -> IRQ 19
[114265.931952] e1000 0000:05:01.0: eth1: (PCI:33MHz:32-bit) 00:0e:0c:d9:6f:ca
[114265.931977] e1000 0000:05:01.0: eth1: Intel(R) PRO/1000 Network Connection
[114265.947250] e1000_intr: 199750 callbacks suppressed
[114265.947257] e1000: ours (0)
[114265.948433] e1000: ours (1)
[114267.952645] e1000: ours (2)
[114269.956659] e1000: ours (3)
[114271.960528] e1000: ours (4)
[114273.964811] e1000: ours (5)

The e1000 chip raises the IRQ every 2 seconds. The e1000 driver sees
it ([...] e1000: ours) and, by reading the ICR, clears the IRQ line.

At ours (242) the interrupt arrives exactly at its expected time.
However, 12 microseconds later, e1000_intr() is invoked again. Now the
ICR is empty, so e1000_intr() returns IRQ_NONE. From then on,
e1000_intr() is overwhelmed by interrupts that are apparently not
caused by the e1000 (and since the ICR is read on every invocation,
any IRQ it had raised would have been cleared anyway). I suspect that
the IRQ is simply not properly acknowledged.

(Only 7 occurrences of 'not ours' were logged as a result of the use
of printk_ratelimit(). After unloading and loading the modified
e1000.ko, ratelimit reports that nearly 200k messages have been
suppressed.)

I will now be checking this again on a fresh build (to ensure I
haven't forgotten to unpatch anything). I will also install a new
e1000 card, although I doubt that it is defective.

J.

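The three log points described in this message map onto the three exits of a handler on a shared, level-triggered line. The following is a toy model only, not the e1000 driver: `sim_nic`, `sim_intr()` and the counters are hypothetical names, and the read-to-clear register merely imitates the ICR behaviour mentioned in the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's IRQ_NONE / IRQ_HANDLED. */
enum sim_irqreturn { SIM_IRQ_NONE, SIM_IRQ_HANDLED };

/* Toy device: icr holds pending cause bits and is cleared by reading. */
struct sim_nic {
	uint32_t icr;
	bool down;           /* models the driver's "going down" flag */
	unsigned not_ours;   /* log point 1 */
	unsigned ours_down;  /* log point 2 */
	unsigned ours;       /* log point 3 */
};

/* Read-to-clear access, like the ICR read in the message. */
static uint32_t sim_read_icr(struct sim_nic *nic)
{
	uint32_t cause = nic->icr;

	nic->icr = 0;
	return cause;
}

static enum sim_irqreturn sim_intr(struct sim_nic *nic)
{
	uint32_t icr;

	assert(nic != NULL);
	icr = sim_read_icr(nic);

	if (!icr) {
		/* No cause bits: some other device, or nobody at all. */
		nic->not_ours++;
		return SIM_IRQ_NONE;
	}
	if (nic->down) {
		/* Our interrupt, already cleared by the read above. */
		nic->ours_down++;
		return SIM_IRQ_HANDLED;
	}
	nic->ours++;
	return SIM_IRQ_HANDLED;
}
```

Calling `sim_intr()` twice after a single cause bit was set yields one "ours" followed by one "not ours", which is the same pattern as ours (242) followed by not ours (0) in the log: the second call finds nothing to clear, so if the line is still seen as active, the cause lies outside the device's own cause register.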
* Re: Unhandled IRQs on AMD E-450 2011-12-06 0:06 ` Jeroen Van den Keybus @ 2011-12-08 11:33 ` Jeroen Van den Keybus 2011-12-08 12:45 ` Clemens Ladisch 0 siblings, 1 reply; 40+ messages in thread From: Jeroen Van den Keybus @ 2011-12-08 11:33 UTC (permalink / raw) To: Clemens Ladisch Cc: Huang, Shane, Borislav Petkov, Nguyen, Dong, linux-kernel, linux1394-devel > I will now be checking this again on a fresh build (to ensure I > haven't forgotten to unpatch anything). I will also install a new > e1000 card although I doubt that it is defective. I made a fresh 3.2.0-rc2 build and the problems are still there (the previous kernel still had IRQ_FORCED_THREADING disabled). I both modified the IRQ handlers in the firewire-ohci and e1000 driver to log, when they are called, whether they think the interrupt was meant for them ('ours') or not ('not ours'). The results (for IRQ16 it took a rather long while to obtain) are listed below. I have the impression that I see the same failure mechanism for both IRQs. All goes well for a while, until an IRQ storm starts right (e1000: 19 us, firewire-ohci: 39 us) after a valid IRQ. Therefore there is a strong correlation between the arrival of the spurious interrupt, alledgedly caused by a mystery device, and the earlier arrival of a valid interrupt for a device. Combined with the fact that it happens on 2 different IRQs pretty much rules out the possibilty for me that there is either a mystery device at all, or that the existing devices would both be defective, does it not ? I also do not understand, if there would be a stuck IRQ line, why I can unload and reload e1000 and firewire-ohci without immediately getting the same IRQ storm. Are there any tools suitable for tracing the handling of the last valid interrupt ? Thanks for any tips (and again for Clemens for providing a hack making it possible for me to keep the disk IRQs out of the danger zone). J. dmesg logs for IRQ 16 and IRQ 19 getting banned. ... 
[67962.892870] e1000: ours (271) [67964.897018] e1000: ours (272) [67966.900981] e1000: ours (273) [67968.904908] e1000: ours (274) [67970.908794] e1000: ours (275) [67970.908813] e1000: not ours (0) [67970.908825] e1000: not ours (1) [67970.908835] e1000: not ours (2) [67970.908845] e1000: not ours (3) [67970.908855] e1000: not ours (4) [67970.908865] e1000: not ours (5) [67970.908877] e1000: not ours (6) [67970.908887] e1000: not ours (7) [67970.908895] e1000: not ours (8) [67970.908907] e1000: not ours (9) [67970.908917] e1000: not ours (10) [67970.908927] e1000: not ours (11) [67970.908936] e1000: not ours (12) [67970.908945] e1000: not ours (13) [67970.908954] e1000: not ours (14) [67970.908964] e1000: not ours (15) [67971.904010] e1000_intr: 152423 callbacks suppressed [67971.904013] e1000: not ours (16) [67971.904021] e1000: not ours (17) [67971.904030] e1000: not ours (18) [67971.904039] e1000: not ours (19) [67971.904047] e1000: not ours (20) [67971.904056] e1000: not ours (21) [67971.904065] e1000: not ours (22) [67971.904075] e1000: not ours (23) [67971.904084] e1000: not ours (24) [67971.904093] e1000: not ours (25) [67971.904102] e1000: not ours (26) [67971.904112] e1000: not ours (27) [67971.904121] e1000: not ours (28) [67971.904128] e1000: not ours (29) [67971.904137] e1000: not ours (30) [67971.904147] e1000: not ours (31) [67971.904156] e1000: not ours (32) [67971.904165] e1000: not ours (33) [67971.904174] e1000: not ours (34) [67971.904184] e1000: not ours (35) [67972.210296] irq 19: nobody cared (try booting with the "irqpoll" option) [67972.210305] Pid: 0, comm: swapper Not tainted 3.2.0-rc2 #2 [67972.210309] Call Trace: [67972.210312] <IRQ> [<ffffffff810bbafd>] __report_bad_irq+0x3d/0xe0 [67972.210329] [<ffffffff810bbf3d>] note_interrupt+0x14d/0x210 [67972.210335] [<ffffffff810b98c4>] handle_irq_event_percpu+0xc4/0x290 [67972.210342] [<ffffffff810b9ad8>] handle_irq_event+0x48/0x70 [67972.210348] [<ffffffff810bc92a>] 
handle_fasteoi_irq+0x5a/0xe0 [67972.210354] [<ffffffff81004012>] handle_irq+0x22/0x40 [67972.210361] [<ffffffff81506caa>] do_IRQ+0x5a/0xd0 [67972.210367] [<ffffffff814fe86b>] common_interrupt+0x6b/0x6b [67972.210370] <EOI> [<ffffffff81009906>] ? native_sched_clock+0x26/0x70 [67972.210387] [<ffffffffa00c50d3>] ? acpi_idle_enter_simple+0xc5/0x102 [processor] [67972.210395] [<ffffffffa00c50ce>] ? acpi_idle_enter_simple+0xc0/0x102 [processor] [67972.210403] [<ffffffff814224e8>] cpuidle_idle_call+0xb8/0x230 [67972.210409] [<ffffffff81001215>] cpu_idle+0xc5/0x130 [67972.210416] [<ffffffff814e24a0>] rest_init+0x94/0xa4 [67972.210423] [<ffffffff81aafba4>] start_kernel+0x3a7/0x3b4 [67972.210429] [<ffffffff81aaf322>] x86_64_start_reservations+0x132/0x136 [67972.210435] [<ffffffff81aaf416>] x86_64_start_kernel+0xf0/0xf7 [67972.210438] handlers: [67972.210445] [<ffffffffa008e4f0>] e1000_intr [67972.210449] Disabling IRQ #19 [67992.794771] irq_handler: 47265 callbacks suppressed [67992.794783] firewire_ohci: ours (14) [68056.795654] firewire_ohci: ours (15) ... ... 
[158362.106314] firewire_ohci: ours (1426) [158426.107131] firewire_ohci: ours (1427) [158490.107972] firewire_ohci: ours (1428) [158554.108857] firewire_ohci: ours (1429) [158618.109671] firewire_ohci: ours (1430) [158682.110521] firewire_ohci: ours (1431) [158746.111369] firewire_ohci: ours (1432) [158746.111408] firewire_ohci: not ours (0) [158746.111421] firewire_ohci: not ours (1) [158746.111432] firewire_ohci: not ours (2) [158746.111444] firewire_ohci: not ours (3) [158746.111461] firewire_ohci: not ours (4) [158746.111473] firewire_ohci: not ours (5) [158746.111484] firewire_ohci: not ours (6) [158746.111495] firewire_ohci: not ours (7) [158746.111502] firewire_ohci: not ours (8) [158746.111510] firewire_ohci: not ours (9) [158746.111518] firewire_ohci: not ours (10) [158746.111526] firewire_ohci: not ours (11) [158746.111534] firewire_ohci: not ours (12) [158746.111542] firewire_ohci: not ours (13) [158746.111550] firewire_ohci: not ours (14) [158746.111558] firewire_ohci: not ours (15) [158746.111565] firewire_ohci: not ours (16) [158746.111573] firewire_ohci: not ours (17) [158746.111581] firewire_ohci: not ours (18) [158747.362748] irq 16: nobody cared (try booting with the "irqpoll" option) [158747.362757] Pid: 0, comm: kworker/0:0 Not tainted 3.2.0-rc2 #2 [158747.362761] Call Trace: [158747.362764] <IRQ> [<ffffffff810bbafd>] __report_bad_irq+0x3d/0xe0 [158747.362782] [<ffffffff810bbf3d>] note_interrupt+0x14d/0x210 [158747.362788] [<ffffffff810b98c4>] handle_irq_event_percpu+0xc4/0x290 [158747.362794] [<ffffffff810b9ad8>] handle_irq_event+0x48/0x70 [158747.362800] [<ffffffff810bc92a>] handle_fasteoi_irq+0x5a/0xe0 [158747.362807] [<ffffffff81004012>] handle_irq+0x22/0x40 [158747.362814] [<ffffffff81506caa>] do_IRQ+0x5a/0xd0 [158747.362820] [<ffffffff814fe86b>] common_interrupt+0x6b/0x6b [158747.362823] <EOI> [<ffffffff81009906>] ? native_sched_clock+0x26/0x70 [158747.362840] [<ffffffffa00c50d3>] ? 
acpi_idle_enter_simple+0xc5/0x102 [processor] [158747.362848] [<ffffffffa00c50ce>] ? acpi_idle_enter_simple+0xc0/0x102 [processor] [158747.362856] [<ffffffff814224e8>] cpuidle_idle_call+0xb8/0x230 [158747.362862] [<ffffffff81001215>] cpu_idle+0xc5/0x130 [158747.362869] [<ffffffff814efb86>] start_secondary+0x1ed/0x1f4 [158747.362873] handlers: [158747.362879] [<ffffffffa00b2100>] irq_handler [158747.362883] Disabling IRQ #16 ... ^ permalink raw reply [flat|nested] 40+ messages in thread
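The "nobody cared" message and the subsequent "Disabling IRQ" come from the kernel's spurious-interrupt accounting. As a rough sketch, assuming the thresholds I recall from kernel/irq/spurious.c of that era (a window of 100000 interrupts, with the line disabled if more than 99900 of them went unhandled) and ignoring the time-based reset of the unhandled counter, the bookkeeping looks roughly like this. All names below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed thresholds; check them against your kernel's spurious.c. */
#define SIM_WINDOW        100000u
#define SIM_UNHANDLED_MAX  99900u

struct sim_irq_stats {
	unsigned irq_count;       /* interrupts seen in current window */
	unsigned irqs_unhandled;  /* of those, how many nobody claimed  */
	bool disabled;            /* line has been shut off             */
};

/*
 * Account for one interrupt. `handled` is true if at least one
 * handler on the line claimed it. Returns true if this call caused
 * the line to be disabled.
 */
static bool sim_note_interrupt(struct sim_irq_stats *s, bool handled)
{
	assert(s != NULL);
	if (s->disabled)
		return false;

	if (!handled)
		s->irqs_unhandled++;

	if (++s->irq_count < SIM_WINDOW)
		return false;

	/* Window complete: judge it, then start a new one. */
	s->irq_count = 0;
	if (s->irqs_unhandled > SIM_UNHANDLED_MAX)
		s->disabled = true;
	s->irqs_unhandled = 0;
	return s->disabled;
}
```

Under these assumptions, a line that sees a few hundred handled interrupts and then an uninterrupted storm is only disabled at the end of the second window, because the first window still contains too many handled interrupts. That would put the shutdown on the order of 10^5 unhandled calls after the storm begins, which is at least the same order of magnitude as the suppressed-callback counts printed by printk_ratelimit() in this thread.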
* Re: Unhandled IRQs on AMD E-450
  2011-12-08 11:33 ` Jeroen Van den Keybus
@ 2011-12-08 12:45 ` Clemens Ladisch
  2011-12-08 21:27 ` Jeroen Van den Keybus
  0 siblings, 1 reply; 40+ messages in thread
From: Clemens Ladisch @ 2011-12-08 12:45 UTC (permalink / raw)
To: Jeroen Van den Keybus
Cc: Huang, Shane, Borislav Petkov, Nguyen, Dong, linux-kernel, linux1394-devel

Jeroen Van den Keybus wrote:
> I have the impression that I see the same failure mechanism for both
> IRQs. All goes well for a while, until an IRQ storm starts right
> (e1000: 19 us, firewire-ohci: 39 us) after a valid IRQ.
>
> Therefore there is a strong correlation between the arrival of the
> spurious interrupt, allegedly caused by a mystery device, and the
> earlier arrival of a valid interrupt for a device. Combined with the
> fact that it happens on 2 different IRQs pretty much rules out the
> possibility for me that there is either a mystery device at all, or
> that the existing devices would both be defective, does it not ?

There appears to be a problem with the interrupt handling.

In PCI, interrupts are level-triggered, which means that the interrupt
line (INTx) is active when it's at level 0 and inactive when it's at
level 1. When a device wants to trigger an interrupt, it outputs zero
on its interrupt output. The level doesn't get reset to 1 until the
driver acknowledges the interrupt (in e1000, read of the ICR; in
firewire-ohci, write of IntEventClear). As long as the line stays at 0,
all interrupt handlers will continue being called. This mechanism
allows multiple devices to share one interrupt line.

In PCI Express, there are only one-to-one connections, and there are no
separate interrupt lines. A device raises an interrupt by sending an
interrupt message, which could be understood as a memory write to
a special address at the interrupt controller. Nothing needs to be done
to deactivate the interrupt; if the device has another reason for an
interrupt, it just sends another interrupt message.

When a PCI device is connected to a PCI Express system, the old INTx
interrupt line must be converted to PCI Express messages. This is done
with _two_ special messages, Assert_INTx and Deassert_INTx. The first
tells the interrupt controller that some INTx line went from 1 to 0,
the second tells it that it went from 0 back to 1; this allows the
interrupt controller to implement the level-triggered behaviour.

It appears that some Deassert_INTx messages get lost on your system.
There are no indications of any other missing PCIe packets, so this
looks like a problem with the interrupt handling in your PCI/PCIe
bridge, the ASM1083 chip.

> I also do not understand, if there would be a stuck IRQ line, why I
> can unload and reload e1000 and firewire-ohci without immediately
> getting the same IRQ storm.

Linux will reenable the interrupt line when a new driver attaches to it.
At this point, it's still stuck, but the device initialization will
trigger some actual interrupts, and after the first assert/deassert
pair, the line will be unstuck.

Regards,
Clemens

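The assert/deassert mechanism described in this message can be sketched as a tiny state machine. This is an illustrative model under simplifying assumptions (one device per line, no sharing, messages either delivered or silently dropped); the type and function names are hypothetical and do not correspond to any kernel or ASM1083 interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum intx_msg { ASSERT_INTX, DEASSERT_INTX };

struct intx_line {
	bool device_level;  /* what the device really drives (true = asserted) */
	bool apic_level;    /* what the interrupt controller believes          */
};

/* The interrupt controller only learns the level through messages. */
static void deliver(struct intx_line *l, enum intx_msg m)
{
	l->apic_level = (m == ASSERT_INTX);
}

/*
 * The device changes its INTx output. The bridge turns each edge into
 * one message; `drop` simulates that message being lost on the way.
 */
static void device_set_level(struct intx_line *l, bool asserted, bool drop)
{
	assert(l != NULL);
	if (l->device_level == asserted)
		return;  /* no edge, so no message */
	l->device_level = asserted;
	if (!drop)
		deliver(l, asserted ? ASSERT_INTX : DEASSERT_INTX);
}

/* Controller still sees an active line although the device let go. */
static bool line_stuck(const struct intx_line *l)
{
	assert(l != NULL);
	return l->apic_level && !l->device_level;
}
```

In this model, a dropped Deassert_INTx leaves `line_stuck()` true, so every handler would keep being called and keep answering "not ours". A later, fully delivered assert/deassert pair clears the condition again, which matches the explanation of why reloading a driver makes the storm go away for a while.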
* Re: Unhandled IRQs on AMD E-450
  2011-12-08 12:45 ` Clemens Ladisch
@ 2011-12-08 21:27 ` Jeroen Van den Keybus
  2011-12-09  8:22 ` Clemens Ladisch
  0 siblings, 1 reply; 40+ messages in thread
From: Jeroen Van den Keybus @ 2011-12-08 21:27 UTC (permalink / raw)
To: Clemens Ladisch
Cc: Huang, Shane, Borislav Petkov, Nguyen, Dong, linux-kernel, linux1394-devel

Thanks for explaining the PCI to PCIe bridge architecture. Of course,
the ASM1083 can only be the cause if the FireWire controller is also
on that bus, which I don't know.

> It appears that some Deassert_INTx messages get lost on your system.
> There are no indications of any other missing PCIe packets, so this
> looks like a problem with the interrupt handling in your PCI/PCIe
> bridge, the ASM1083 chip.

Assuming this is the case, I modified the e1000 driver to explicitly
raise its IRQ line after it has had to return IRQ_NONE 5 times
(e1000_intr() code at the end of this post).

The result of this test is that the IRQ line is indeed set (in the
next invocation, the ISR sees the forced RXT0 interrupt, clears the
IRQ line and returns IRQ_HANDLED). But alas, the storm is not silenced
at all.

If the ASM108x were the problem, I would expect explicitly raising and
clearing the interrupt to have retriggered the Assert_INTx and
Deassert_INTx messages. That would mean the bridge isn't the problem.

@ Clemens: If I understand correctly, the IO-APIC is not even used in
this case ? (IRQ requests from e1000 all going through PCIe) Or is
there also a virtual IO-APIC monitoring Assert and Deassert messages ?
Is the BIOS responsible for writing a mapping for the PCI IRQs to MSIs
into the ASM108x ? (And BTW, should linux1394-devel still be copied ?)

I'm thinking of immediately re-enabling the IRQs after they've been
disabled in spurious.c.

I also think that the following posts may refer to the same problem:

http://ubuntuforums.org/showthread.php?t=1883854
https://lkml.org/lkml/2011/6/30/197
https://lkml.org/lkml/2011/10/14/146

Rgds,

J.

dmesg log: [247181.656647] e1000: ours (60) [247183.660996] e1000: ours (61) [247185.664907] e1000: ours (62) [247185.664926] e1000: not ours (0) [247185.664937] e1000: not ours (1) [247185.664948] e1000: not ours (2) [247185.664958] e1000: not ours (3) [247185.664968] e1000: not ours (4) [247185.664982] e1000: sending RXT0 interrupt (mask=0x00000000) [247185.664997] e1000: ours (63) [247185.665009] e1000: not ours (0) [247185.665024] e1000: not ours (1) [247185.665034] e1000: not ours (2) [247185.665041] e1000: not ours (3) [247185.665053] e1000: not ours (4) [247185.665065] e1000: sending RXT0 interrupt (mask=0x0000009d) [247185.665077] e1000: ours (64) [247185.665085] e1000: not ours (0) [247185.665095] e1000: not ours (1) [247185.665105] e1000: not ours (2) [247186.319878] irq 19: nobody cared (try booting with the "irqpoll" option) [247186.319887] Pid: 0, comm: swapper Not tainted 3.2.0-rc2 #2 [247186.319891] Call Trace: [247186.319894] <IRQ> [<ffffffff810bbafd>] __report_bad_irq+0x3d/0xe0 [247186.319912] [<ffffffff810bbf3d>] note_interrupt+0x14d/0x210 [247186.319918] [<ffffffff810b98c4>] handle_irq_event_percpu+0xc4/0x290 [247186.319924] [<ffffffff810b9ad8>] handle_irq_event+0x48/0x70 [247186.319930] [<ffffffff810bc92a>] handle_fasteoi_irq+0x5a/0xe0 [247186.319937] [<ffffffff81004012>] handle_irq+0x22/0x40 [247186.319943] [<ffffffff81506caa>] do_IRQ+0x5a/0xd0 [247186.319950] [<ffffffff814fe86b>] common_interrupt+0x6b/0x6b [247186.319953] <EOI> [<ffffffff81009906>] ? native_sched_clock+0x26/0x70 [247186.319970] [<ffffffffa00c50d3>] ? acpi_idle_enter_simple+0xc5/0x102 [processor] [247186.319978] [<ffffffffa00c50ce>] ? 
acpi_idle_enter_simple+0xc0/0x102 [processor] [247186.319986] [<ffffffff814224e8>] cpuidle_idle_call+0xb8/0x230 [247186.319992] [<ffffffff81001215>] cpu_idle+0xc5/0x130 [247186.319999] [<ffffffff814e24a0>] rest_init+0x94/0xa4 [247186.320006] [<ffffffff81aafba4>] start_kernel+0x3a7/0x3b4 [247186.320013] [<ffffffff81aaf322>] x86_64_start_reservations+0x132/0x136 [247186.320018] [<ffffffff81aaf416>] x86_64_start_kernel+0xf0/0xf7 [247186.320022] handlers: [247186.320030] [<ffffffffa008e4f0>] e1000_intr [247186.320034] Disabling IRQ #19 The modified e1000 interrupt handler: static irqreturn_t e1000_intr(int irq, void *data) { struct net_device *netdev = data; struct e1000_adapter *adapter = netdev_priv(netdev); struct e1000_hw *hw = &adapter->hw; u32 icr = er32(ICR); static int i_not_ours = 0; if (unlikely((!icr))) { if (i_not_ours < 5) { if (printk_ratelimit()) printk("e1000: not ours (%d)\n", i_not_ours++); } else { if (printk_ratelimit()) printk("e1000: sending RXT0 interrupt (mask=0x%08x)\n", er32(IMS)); ew32(ICS, E1000_ICS_RXT0); } return IRQ_NONE; /* Not our interrupt */ } /* * we might have caused the interrupt, but the above * read cleared it, and just in case the driver is * down there is nothing to do so return handled */ if (unlikely(test_bit(__E1000_DOWN, &adapter->flags))) { static int i = 0; if (printk_ratelimit()) printk("e1000: ours, but down (%d)\n", i++); return IRQ_HANDLED; } if (unlikely(icr & (E1000_ICR_RXSEQ | E1000_ICR_LSC))) { hw->get_link_status = 1; /* guard against interrupt when we're going down */ if (!test_bit(__E1000_DOWN, &adapter->flags)) schedule_delayed_work(&adapter->watchdog_task, 1); } /* disable interrupts, without the synchronize_irq bit */ ew32(IMC, ~0); E1000_WRITE_FLUSH(); if (likely(napi_schedule_prep(&adapter->napi))) { adapter->total_tx_bytes = 0; adapter->total_tx_packets = 0; adapter->total_rx_bytes = 0; adapter->total_rx_packets = 0; __napi_schedule(&adapter->napi); } else { /* this really should not happen! 
if it does it is basically a * bug, but not a hard error, so enable ints and continue */ if (!test_bit(__E1000_DOWN, &adapter->flags)) e1000_irq_enable(adapter); } { static int i = 0; if (printk_ratelimit()) printk("e1000: ours (%d)\n", i++); i_not_ours = 0; } return IRQ_HANDLED; } ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Unhandled IRQs on AMD E-450 2011-12-08 21:27 ` Jeroen Van den Keybus @ 2011-12-09 8:22 ` Clemens Ladisch 2011-12-09 11:17 ` Jeroen Van den Keybus 0 siblings, 1 reply; 40+ messages in thread From: Clemens Ladisch @ 2011-12-09 8:22 UTC (permalink / raw) To: Jeroen Van den Keybus Cc: Huang, Shane, Borislav Petkov, Nguyen, Dong, linux-kernel Jeroen Van den Keybus wrote: > the ASM1083 can only be the cause if the Firewire controller is also > on that bus. The VT6308 is PCI, and you have only one bus. >> It appears that some Deassert_INTx messages get lost on your system. >> There are no indications of any other missing PCIe packets, so this >> looks like a problem with the interrupt handling in your PCI/PCIe >> bridge, the ASM1083 chip. > > Assuming this is the case, I modified the e1000 driver to explicitly > set its IRQ line after 5 times having to send IRQ_NONE. (e1000_intr() > code at end of this post). The result of this test is that the IRQ > line indeed is set (in the next invocation, the ISR sees the forced > RXT0 interrupt, clears the IRQ line and sends IRQ_HANDLED). But alas, > the storm is not silenced at all. > > If the ASM108x was the problem, I suspect that explicitly raising and > clearing the interrupt would have retriggered the INTx_Assert and > INTx_Deassert messages ? Yes. > Meaning the bridge wouldn't be the problem. It's possible that 1) the ASM1083 does not react to changes of the PCI interrupt line, or 2) the interrupt controller ignores INTx_Deassert messages. I'm wondering what the difference between triggering an interrupt and reloading the driver is that makes it work again. I'd guess that reattaching the driver reinitializes the interrupt, which would point to 2). > If I understand correctly, the IO-APIC is not even used in this case ? > (IRQ requests from e1000 all going through PCIe) Or is there also > a virtual IO-APIC monitoring Assert and Deassert messages. 
All PCI interrupts (whether 'real' lines in hardware or emulated with PCIe messages) end up at the I/O-APIC. > Is the BIOS responsible for writing a mapping for the PCI IRQs to MSIs > into the ASM108x ? MSIs are edge-triggered; their message is different from the (de)assert messages used for PCI level-triggered interrupts. AFAIK the interrupt handling in a PCI/PCIe bridge should work transparently, i.e., the bridge does not need to be configured by software. The BIOS is responsible for telling the kernel about all interrupt mappings, and ACPI takes part in initializing the I/O-APIC. Check if a newer BIOS exists. > (And BTW, should the linux1394-devel still be posted ?) Your trigger-interrupt patch would also be possible with firewire-ohci. > I'm thinking of immediately re-enabling the irqs after they've been > disabled in spurious.c. You could try free_irq/request_irq, but I guess you cannot do this directly from inside an interrupt handler. > I also think that the following posts may refer to the same problem: > > http://ubuntuforums.org/showthread.php?t=1883854 > https://lkml.org/lkml/2011/6/30/197 > https://lkml.org/lkml/2011/10/14/146 That's similar symptoms, but completely different hardware. Regards, Clemens ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Unhandled IRQs on AMD E-450 2011-12-09 8:22 ` Clemens Ladisch @ 2011-12-09 11:17 ` Jeroen Van den Keybus 2011-12-09 12:55 ` Clemens Ladisch 0 siblings, 1 reply; 40+ messages in thread From: Jeroen Van den Keybus @ 2011-12-09 11:17 UTC (permalink / raw) To: Clemens Ladisch; +Cc: Huang, Shane, Borislav Petkov, Nguyen, Dong, linux-kernel > The VT6308 is PCI, and you have only one bus. There's also a bridge on the FCH: bus 0 dev. 14 fn. 4, but I see that the memory and IO regions of the e1000 and the VT630x are within the range of the ASM108x bridge. Thanks for pointing that out. > It's possible that > 1) the ASM1083 does not react to changes of the PCI interrupt line, or > 2) the interrupt controller ignores INTx_Deassert messages. I'm a bit puzzled. The IOAPIC operates on external IRQ lines. So that would mean the LAPICs ? > I'm wondering what the difference between triggering an interrupt and > reloading the driver is that makes it work again. I'd guess that > reattaching the driver reinitializes the interrupt, which would point > to 2). I tried irq_disable(); irq_enable(); in spurious.c. That didn't change anything. Storm continues. Also important: from my logs it appears that when a driver is reloaded, there is indeed no storm at all. In my log posted on Dec. 6, that is clearly visible. > All PCI interrupts (whether 'real' lines in hardware or emulated with > PCIe messages) end up at the I/O-APIC. That would mean that the IO-APIC would decode MSI messages. I don't think it can do that. Would it not be possible that the PCI bus IRQ lines are directly connected to the FCH IO-APIC inputs (and that the ASM1083 INTx lines are simply not connected) ? (Makes me wonder why Asus did not simply use the existing PCI bridge on the FCH, which BTW also seems to depend on the use of the external INTx lines.) > Check if a newer BIOS exists. Did that. No newer version is available. > Your trigger-interrupt patch would also be possible with firewire-ohci. 
If this worked, you'd have to patch the drivers of all PCI devices. I'd much rather do it by modifying something in the IRQ handling code. >> I'm thinking of immediately re-enabling the irqs after they've been >> disabled in spurious.c. > You could try free_irq/request_irq, but I guess you cannot do this > directly from inside an interrupt handler. No, I did irq_disable/irq_enable, which should directly call the mask/unmask methods of the chip handler. I still must check that, though, and especially if the correct handler is used. But as I said before, it didn't seem to do the trick. >> I also think that the following posts may refer to the same problem: << > That's similar symptoms, but completely different hardware. At first sight, yes, but they still share some of the problem areas. I just wanted to point out possibly similar cases. Never mind. ( >> http://ubuntuforums.org/showthread.php?t=1883854 Asus board with ASM1083 (same bridge). >> https://lkml.org/lkml/2011/6/30/197 >> https://lkml.org/lkml/2011/10/14/146 Has device 1b21:1080 (same bridge) in its Asus system. https://lkml.org/lkml/2011/10/22/157 Has an E35M1-M board (essentially the same board as the E45M1-M with the AMD E350 instead of the E450) ) J. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Unhandled IRQs on AMD E-450 2011-12-09 11:17 ` Jeroen Van den Keybus @ 2011-12-09 12:55 ` Clemens Ladisch 2011-12-10 12:10 ` Jeroen Van den Keybus 0 siblings, 1 reply; 40+ messages in thread From: Clemens Ladisch @ 2011-12-09 12:55 UTC (permalink / raw) To: Jeroen Van den Keybus Cc: Huang, Shane, Borislav Petkov, Nguyen, Dong, linux-kernel Jeroen Van den Keybus wrote: >> I'm wondering what the difference between triggering an interrupt and >> reloading the driver is that makes it work again. I'd guess that >> reattaching the driver reinitializes the interrupt, which would point >> to 2). > > I tried irq_disable(); irq_enable(); in spurious.c. That didn't change > anything. Storm continues. Also important: from my logs it appears > that when a driver is reloaded, there is indeed no storm at all. Temporarily disabling an irq is different from completely shutting it down. >> All PCI interrupts (whether 'real' lines in hardware or emulated with >> PCIe messages) end up at the I/O-APIC. > > That would mean that the IO-APIC would decode MSI messages. PCI interrupt messages (INTx_(de)assert) are special messages, while MSI messages are just normal memory writes. PCI interrupts (whether external or emulated) are always handled by the I/O-APIC. MSI interrupts usually go to some LAPIC (see the MSI address in the lspci output; the I/O-APIC is at FEC00000, the LAPICs are at FEE00000). > Would it not be possible that the PCI bus IRQ lines are directly > connected to the FCH IO-APIC inputs (and that the ASM1083 INTx lines > are simply not connected ? > > (Makes me wonder why Asus did not simply use the existing PCI bridge > on the FCH, which BTW also seems to depend on the use of the external > INTx lines.) Device 14.4 would be the PCI bridge on AMD southbridges, but your model, the A50M, does not have PCI support. The interrupt lines are correctly connected to the ASM1083; otherwise, they wouldn't work at all. Also see "lspci -t". 
>>> I also think that the following posts may refer to the same problem: >> >> That's similar symptoms, but completely different hardware. > > At first sight, yes, but they still share some of the problem areas. I > just wanted to point out possibly similar cases. Indeed, I didn't realize they all have an ASM1083 bridge. So it appears that this chip is just buggy. Regards, Clemens ^ permalink raw reply [flat|nested] 40+ messages in thread
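The address hint in the message above (I/O-APIC at FEC00000, LAPICs at FEE00000) can be turned into a small check on the MSI address that lspci prints. This is only a userspace sketch: it assumes the usual x86 layout in which MSI addresses carry 0xFEE in bits 31:20 and the destination APIC ID in bits 19:12, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if the address lies in the 0xFEExxxxx window used for MSI on x86. */
bool msi_targets_lapic(uint32_t addr)
{
	return (addr >> 20) == 0xFEEu;
}

/* Destination APIC ID encoded in bits 19:12 of an MSI address. */
unsigned int msi_dest_id(uint32_t addr)
{
	assert(msi_targets_lapic(addr));	/* only meaningful for MSI addresses */
	return (addr >> 12) & 0xFFu;
}
```

An address in the FEC00000 range fails msi_targets_lapic(), which matches the point made above: INTx assert/deassert messages are a different mechanism and are not ordinary memory writes to an MSI address.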
* Re: Unhandled IRQs on AMD E-450 2011-12-09 12:55 ` Clemens Ladisch @ 2011-12-10 12:10 ` Jeroen Van den Keybus 2011-12-10 17:58 ` Clemens Ladisch 0 siblings, 1 reply; 40+ messages in thread From: Jeroen Van den Keybus @ 2011-12-10 12:10 UTC (permalink / raw) To: Clemens Ladisch; +Cc: Huang, Shane, Borislav Petkov, Nguyen, Dong, linux-kernel > Temporarily disabling an irq is different from completely shutting it > down. Yes, but when unloading / reloading the driver, it seems that, at least for the I/O-APIC, no more than that actually happens. (I've read the original Intel I/O-APIC datasheet and the only way to clear a pending IRQ is to do it by means of a message from the (local) APIC - apart from some dirty trickery with modifying the IRQ type entry) >>> All PCI interrupts (whether 'real' lines in hardware or emulated with >>> PCIe messages) end up at the I/O-APIC. >> >> That would mean that the IO-APIC would decode MSI messages. I was wrong there. Indeed the INTx Assert/Deassert messages are entirely different and picked up by the I/O-APIC. The fact that, without legacy PCI interrupts, IRQs 17 and 18 have never failed indicates that the problem lies with the PCI/PCIe bridge. I'm thinking of the following scenario: - PCI device raises IRQ line. - Bridge sees the transition and signals Assert. - Assert travels through the PCIe fabric and arrives at the I/O-APIC. - CPU services the IRQ, and does at least one (slow) PCI read to have the device deassert its IRQ line. In practice, more PCI read/writes are needed, requiring the bridge to do some PCIe traffic generation. - Bridge sees the IRQ line transition and signals Deassert. This message has only a few usecs to arrive at the I/O-APIC. - _However_ the CPU has by and large already handled the IRQ and gets interrupted again before the Deassert ever gets out. The resulting PCI bus traffic further delays the Deassert message (due to e.g. PCIe transmit credit exhaustion). 
 Another scenario is an electrical problem such as insufficient margin for high INTx signal detection. But I'll have to wire the setup to test that. My idea is that if we would not immediately hammer the bridge with PCIe transactions, the Deassert message may eventually arrive ? Also, is there any control by Linux of the credits issued ? I therefore patched the polling system by detecting a stuck IRQ already after 10 unserviced IRQs. Then the polling system will take over for 50 cycles (5 seconds), after which the IRQ is reenabled. The 10 cycles may not seem like much, but usually there are no unserviced IRQs at all, and I reenable the IRQ anyway after 5 seconds. And the storms, if the IRQ is really stuck, are very small bursts of 10 IRQs, causing no significant overhead. The alternative is a system unable to run Linux. It is not very elegant, but it gets the job done and allows the kernel to recover from a single upset. Also, interrupt storms lasting over 1 second are avoided. Results so far: dmesg log: ... [ 25.605552] init: plymouth-upstart-bridge main process (508) killed by TERM signal [ 26.641229] EXT4-fs (sda1): re-mounted. Opts: errors=remount-ro,commit=0 [ 1607.941232] irq 19: nobody cared (try booting with the "irqpoll" option) [ 1607.941244] Pid: 0, comm: swapper Not tainted 3.2.0-rc4 #5 [ 1607.941248] Call Trace: [ 1607.941252] <IRQ> [<ffffffff810bbe9d>] __report_bad_irq+0x3d/0xe0 [ 1607.941269] [<ffffffff810bc147>] note_interrupt+0x157/0x200 [ 1607.941276] [<ffffffff810b9a54>] handle_irq_event_percpu+0xc4/0x290 [ 1607.941282] [<ffffffff810b9c68>] handle_irq_event+0x48/0x70 [ 1607.941288] [<ffffffff810bcb1a>] handle_fasteoi_irq+0x5a/0xe0 [ 1607.941295] [<ffffffff81004012>] handle_irq+0x22/0x40 [ 1607.941301] [<ffffffff8150712a>] do_IRQ+0x5a/0xd0 [ 1607.941311] [<ffffffff814feceb>] common_interrupt+0x6b/0x6b [ 1607.941314] <EOI> [<ffffffff81423b10>] ? ladder_select_state+0x180/0x180 [ 1607.941325] [<ffffffff81422904>] ? 
cpuidle_idle_call+0xf4/0x230 [ 1607.941331] [<ffffffff814228c8>] ? cpuidle_idle_call+0xb8/0x230 [ 1607.941338] [<ffffffff81001215>] cpu_idle+0xc5/0x130 [ 1607.941345] [<ffffffff814e2910>] rest_init+0x94/0xa4 [ 1607.941350] [<ffffffff81aafba4>] start_kernel+0x3a7/0x3b4 [ 1607.941357] [<ffffffff81aaf322>] x86_64_start_reservations+0x132/0x136 [ 1607.941363] [<ffffffff81aaf416>] x86_64_start_kernel+0xf0/0xf7 [ 1607.941367] handlers: [ 1607.941378] [<ffffffffa006f4f0>] e1000_intr [ 1607.941382] Disabling IRQ 19 [ 1608.040189] Polling IRQ. [ 1608.140227] Polling IRQ. ... [ 1612.840243] Polling IRQ. [ 1612.940039] Polling IRQ. [ 1613.040185] Reenabling IRQ. [ 1908.541558] irq 19: nobody cared (try booting with the "irqpoll" option) [ 1908.541570] Pid: 0, comm: swapper Not tainted 3.2.0-rc4 #5 [ 1908.541574] Call Trace: [ 1908.541578] <IRQ> [<ffffffff810bbe9d>] __report_bad_irq+0x3d/0xe0 [ 1908.541595] [<ffffffff810bc147>] note_interrupt+0x157/0x200 [ 1908.541602] [<ffffffff810b9a54>] handle_irq_event_percpu+0xc4/0x290 [ 1908.541608] [<ffffffff810b9c68>] handle_irq_event+0x48/0x70 [ 1908.541614] [<ffffffff810bcb1a>] handle_fasteoi_irq+0x5a/0xe0 [ 1908.541620] [<ffffffff81004012>] handle_irq+0x22/0x40 [ 1908.541627] [<ffffffff8150712a>] do_IRQ+0x5a/0xd0 [ 1908.541633] [<ffffffff814feceb>] common_interrupt+0x6b/0x6b [ 1908.541637] <EOI> [<ffffffff81423b34>] ? menu_reflect+0x24/0x50 [ 1908.541647] [<ffffffff810011da>] ? cpu_idle+0x8a/0x130 [ 1908.541652] [<ffffffff81001215>] ? cpu_idle+0xc5/0x130 [ 1908.541659] [<ffffffff814e2910>] rest_init+0x94/0xa4 [ 1908.541665] [<ffffffff81aafba4>] start_kernel+0x3a7/0x3b4 [ 1908.541672] [<ffffffff81aaf322>] x86_64_start_reservations+0x132/0x136 [ 1908.541678] [<ffffffff81aaf416>] x86_64_start_kernel+0xf0/0xf7 [ 1908.541681] handlers: [ 1908.541695] [<ffffffffa006f4f0>] e1000_intr [ 1908.541699] Disabling IRQ 19 [ 1908.640189] Polling IRQ. [ 1908.740186] Polling IRQ. ... [ 1913.440205] Polling IRQ. [ 1913.540205] Polling IRQ. 
[ 1913.640088] Reenabling IRQ. [ 2319.361659] irq 19: nobody cared (try booting with the "irqpoll" option) [ 2319.361671] Pid: 0, comm: swapper Not tainted 3.2.0-rc4 #5 [ 2319.361675] Call Trace: [ 2319.361679] <IRQ> [<ffffffff810bbe9d>] __report_bad_irq+0x3d/0xe0 [ 2319.361696] [<ffffffff810bc147>] note_interrupt+0x157/0x200 [ 2319.361702] [<ffffffff810b9a54>] handle_irq_event_percpu+0xc4/0x290 [ 2319.361709] [<ffffffff810b9c68>] handle_irq_event+0x48/0x70 [ 2319.361715] [<ffffffff810bcb1a>] handle_fasteoi_irq+0x5a/0xe0 [ 2319.361721] [<ffffffff81004012>] handle_irq+0x22/0x40 [ 2319.361727] [<ffffffff8150712a>] do_IRQ+0x5a/0xd0 [ 2319.361734] [<ffffffff814feceb>] common_interrupt+0x6b/0x6b [ 2319.361737] <EOI> [<ffffffff81009926>] ? native_sched_clock+0x26/0x70 [ 2319.361754] [<ffffffffa00cd0de>] ? acpi_idle_enter_simple+0xd0/0x102 [processor] [ 2319.361762] [<ffffffffa00cd0ce>] ? acpi_idle_enter_simple+0xc0/0x102 [processor] [ 2319.361769] [<ffffffff814228c8>] cpuidle_idle_call+0xb8/0x230 [ 2319.361776] [<ffffffff81001215>] cpu_idle+0xc5/0x130 [ 2319.361783] [<ffffffff814e2910>] rest_init+0x94/0xa4 [ 2319.361789] [<ffffffff81aafba4>] start_kernel+0x3a7/0x3b4 [ 2319.361796] [<ffffffff81aaf322>] x86_64_start_reservations+0x132/0x136 [ 2319.361802] [<ffffffff81aaf416>] x86_64_start_kernel+0xf0/0xf7 [ 2319.361806] handlers: [ 2319.361816] [<ffffffffa006f4f0>] e1000_intr [ 2319.361820] Disabling IRQ 19 [ 2319.460205] Polling IRQ. [ 2319.560207] Polling IRQ. ... [ 2324.260030] Polling IRQ. [ 2324.360118] Polling IRQ. [ 2324.460064] Reenabling IRQ. 
[ 2782.285470] irq 19: nobody cared (try booting with the "irqpoll" option) [ 2782.285482] Pid: 0, comm: swapper Not tainted 3.2.0-rc4 #5 [ 2782.285486] Call Trace: [ 2782.285490] <IRQ> [<ffffffff810bbe9d>] __report_bad_irq+0x3d/0xe0 [ 2782.285507] [<ffffffff810bc147>] note_interrupt+0x157/0x200 [ 2782.285514] [<ffffffff810b9a54>] handle_irq_event_percpu+0xc4/0x290 [ 2782.285520] [<ffffffff810b9c68>] handle_irq_event+0x48/0x70 [ 2782.285526] [<ffffffff810bcb1a>] handle_fasteoi_irq+0x5a/0xe0 [ 2782.285532] [<ffffffff81004012>] handle_irq+0x22/0x40 [ 2782.285539] [<ffffffff8150712a>] do_IRQ+0x5a/0xd0 [ 2782.285545] [<ffffffff814feceb>] common_interrupt+0x6b/0x6b [ 2782.285548] <EOI> [<ffffffff81009926>] ? native_sched_clock+0x26/0x70 [ 2782.285566] [<ffffffffa00cd0d3>] ? acpi_idle_enter_simple+0xc5/0x102 [processor] [ 2782.285574] [<ffffffffa00cd0ce>] ? acpi_idle_enter_simple+0xc0/0x102 [processor] [ 2782.285581] [<ffffffff814228c8>] cpuidle_idle_call+0xb8/0x230 [ 2782.285588] [<ffffffff81001215>] cpu_idle+0xc5/0x130 [ 2782.285594] [<ffffffff814e2910>] rest_init+0x94/0xa4 [ 2782.285600] [<ffffffff81aafba4>] start_kernel+0x3a7/0x3b4 [ 2782.285607] [<ffffffff81aaf322>] x86_64_start_reservations+0x132/0x136 [ 2782.285613] [<ffffffff81aaf416>] x86_64_start_kernel+0xf0/0xf7 [ 2782.285617] handlers: [ 2782.285627] [<ffffffffa006f4f0>] e1000_intr [ 2782.285631] Disabling IRQ 19 [ 2782.384226] Polling IRQ. [ 2782.484041] Polling IRQ. ... [ 2787.184224] Polling IRQ. [ 2787.284223] Polling IRQ. [ 2787.384222] Reenabling IRQ. 
[ 3485.689347] irq 19: nobody cared (try booting with the "irqpoll" option) [ 3485.689360] Pid: 0, comm: swapper Not tainted 3.2.0-rc4 #5 [ 3485.689364] Call Trace: [ 3485.689368] <IRQ> [<ffffffff810bbe9d>] __report_bad_irq+0x3d/0xe0 [ 3485.689385] [<ffffffff810bc147>] note_interrupt+0x157/0x200 [ 3485.689392] [<ffffffff810b9a54>] handle_irq_event_percpu+0xc4/0x290 [ 3485.689398] [<ffffffff810b9c68>] handle_irq_event+0x48/0x70 [ 3485.689404] [<ffffffff810bcb1a>] handle_fasteoi_irq+0x5a/0xe0 [ 3485.689411] [<ffffffff81004012>] handle_irq+0x22/0x40 [ 3485.689417] [<ffffffff8150712a>] do_IRQ+0x5a/0xd0 [ 3485.689424] [<ffffffff814feceb>] common_interrupt+0x6b/0x6b [ 3485.689427] <EOI> [<ffffffff81009926>] ? native_sched_clock+0x26/0x70 [ 3485.689444] [<ffffffffa00cd0d3>] ? acpi_idle_enter_simple+0xc5/0x102 [processor] [ 3485.689452] [<ffffffffa00cd0ce>] ? acpi_idle_enter_simple+0xc0/0x102 [processor] [ 3485.689459] [<ffffffff814228c8>] cpuidle_idle_call+0xb8/0x230 [ 3485.689466] [<ffffffff81001215>] cpu_idle+0xc5/0x130 [ 3485.689472] [<ffffffff814e2910>] rest_init+0x94/0xa4 [ 3485.689478] [<ffffffff81aafba4>] start_kernel+0x3a7/0x3b4 [ 3485.689485] [<ffffffff81aaf322>] x86_64_start_reservations+0x132/0x136 [ 3485.689491] [<ffffffff81aaf416>] x86_64_start_kernel+0xf0/0xf7 [ 3485.689495] handlers: [ 3485.689505] [<ffffffffa006f4f0>] e1000_intr [ 3485.689509] Disabling IRQ 19 [ 3485.788062] Polling IRQ. [ 3485.888240] Polling IRQ. ... [ 3490.588069] Polling IRQ. [ 3490.688209] Polling IRQ. [ 3490.788079] Reenabling IRQ. [ 3810.336883] irq 19: nobody cared (try booting with the "irqpoll" option) [ 3810.336896] Pid: 1764, comm: sshd Not tainted 3.2.0-rc4 #5 [ 3810.336900] Call Trace: [ 3810.336904] <IRQ> [<ffffffff810bbe9d>] __report_bad_irq+0x3d/0xe0 [ 3810.336921] [<ffffffff810bc147>] note_interrupt+0x157/0x200 [ 3810.336927] [<ffffffff810b9a54>] handle_irq_event_percpu+0xc4/0x290 [ 3810.336935] [<ffffffff810500e5>] ? 
__local_bh_enable+0x35/0x90 [ 3810.336941] [<ffffffff810b9c68>] handle_irq_event+0x48/0x70 [ 3810.336947] [<ffffffff810bcb1a>] handle_fasteoi_irq+0x5a/0xe0 [ 3810.336954] [<ffffffff81004012>] handle_irq+0x22/0x40 [ 3810.336960] [<ffffffff8150712a>] do_IRQ+0x5a/0xd0 [ 3810.336966] [<ffffffff814feceb>] common_interrupt+0x6b/0x6b [ 3810.336969] <EOI> [<ffffffff810310a3>] ? __wake_up+0x53/0x70 [ 3810.336979] [<ffffffff81154780>] ? fget_light+0xa0/0x100 [ 3810.336985] [<ffffffff81166516>] do_select+0x336/0x6e0 [ 3810.336991] [<ffffffff81165ee0>] ? poll_freewait+0xe0/0xe0 [ 3810.336996] [<ffffffff81165fd0>] ? __pollwait+0xf0/0xf0 [ 3810.337001] [<ffffffff81165fd0>] ? __pollwait+0xf0/0xf0 [ 3810.337006] [<ffffffff81165fd0>] ? __pollwait+0xf0/0xf0 [ 3810.337011] [<ffffffff81165fd0>] ? __pollwait+0xf0/0xf0 [ 3810.337016] [<ffffffff81165fd0>] ? __pollwait+0xf0/0xf0 [ 3810.337020] [<ffffffff81165fd0>] ? __pollwait+0xf0/0xf0 [ 3810.337025] [<ffffffff81165fd0>] ? __pollwait+0xf0/0xf0 [ 3810.337030] [<ffffffff81165fd0>] ? __pollwait+0xf0/0xf0 [ 3810.337035] [<ffffffff81165fd0>] ? __pollwait+0xf0/0xf0 [ 3810.337040] [<ffffffff81166a91>] core_sys_select+0x1d1/0x320 [ 3810.337047] [<ffffffff8103ee11>] ? get_parent_ip+0x11/0x50 [ 3810.337053] [<ffffffff81501bfd>] ? sub_preempt_count+0x9d/0xd0 [ 3810.337060] [<ffffffff810728a1>] ? __srcu_read_unlock+0x41/0x70 [ 3810.337065] [<ffffffff81190f82>] ? fsnotify+0x1c2/0x2a0 [ 3810.337071] [<ffffffff81166c9b>] sys_select+0xbb/0x100 [ 3810.337078] [<ffffffff8115338a>] ? sys_write+0x4a/0x90 [ 3810.337083] [<ffffffff8150582b>] system_call_fastpath+0x16/0x1b [ 3810.337087] handlers: [ 3810.337102] [<ffffffffa006f4f0>] e1000_intr [ 3810.337106] Disabling IRQ 19 [ 3810.436188] Polling IRQ. [ 3810.536226] Polling IRQ. ... ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Unhandled IRQs on AMD E-450 2011-12-10 12:10 ` Jeroen Van den Keybus @ 2011-12-10 17:58 ` Clemens Ladisch 2011-12-11 15:28 ` Jeroen Van den Keybus 0 siblings, 1 reply; 40+ messages in thread From: Clemens Ladisch @ 2011-12-10 17:58 UTC (permalink / raw) To: Jeroen Van den Keybus Cc: Huang, Shane, Borislav Petkov, Nguyen, Dong, linux-kernel Jeroen Van den Keybus wrote: > [...] > - CPU services the IRQ, and does at least one (slow) PCI read to have > the device deassert its IRQ line. In practice, more PCI read/writes > are needed, requiring the bridge to do some PCIe traffic generation. > - Bridge sees the IRQ line transition and signals Deassert. This > message has only a few usecs to arrive at the I/O-APIC. > - _However_ the CPU has by and large already handled the IRQ and gets > interrupted again before the Deassert ever gets out. The resulting PCI > bus traffic further delays the Deassert message (due to e.g. PCIe > transmit credit exhaustion). > > My idea is that if we would not immediately hammer the bridge with > PCIe transactions, the Deassert message may eventually arrive ? PCIe messages are somewhat ordered; posted memory writes are allowed, but IIRC a read transaction serializes all previous and following transactions, assuming that all involved devices work correctly. > Also, is there any control by Linux of the credits issued ? I don't think these can be controlled by software. The hardware is supposed to get them correct. > I therefore patched the polling system by detecting a stuck IRQ > already after 10 unserviced IRQs. Then the polling system will take > over for 50 cycles (5 seconds), after which the IRQ is reenabled. > > [ 1607.941232] irq 19: nobody cared (try booting with the "irqpoll" option) > [ 1613.040185] Reenabling IRQ. > [ 1908.541558] irq 19: nobody cared (try booting with the "irqpoll" option) > [ 1913.640088] Reenabling IRQ. > [ 2319.361659] irq 19: nobody cared (try booting with the "irqpoll" option) > [ 2324.460064] Reenabling IRQ. 
> [ 2782.285470] irq 19: nobody cared (try booting with the "irqpoll" option) > [ 2787.384222] Reenabling IRQ. > [ 3485.689347] irq 19: nobody cared (try booting with the "irqpoll" option) > [ 3490.788079] Reenabling IRQ. > [ 3810.336883] irq 19: nobody cared (try booting with the "irqpoll" option) So the IRQ _does_ get unstuck eventually; I didn't expect that. So either the ASM1083 delays its Deassert messages, or it is just way too slow to react to changes in its PCI interrupt line inputs. I'd guess that you can make the polling time shorter; a few milliseconds should be enough. Your patch might be useful to others afflicted with this chip. Could you publish it? Regards, Clemens ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Unhandled IRQs on AMD E-450 2011-12-10 17:58 ` Clemens Ladisch @ 2011-12-11 15:28 ` Jeroen Van den Keybus 0 siblings, 0 replies; 40+ messages in thread From: Jeroen Van den Keybus @ 2011-12-11 15:28 UTC (permalink / raw) To: Clemens Ladisch; +Cc: Huang, Shane, Borislav Petkov, Nguyen, Dong, linux-kernel > So the IRQ _does_ get unstuck eventually; I didn't expect that. It would nevertheless make sense that the designer of the I/O-APIC would have implemented a reasonable timeout for INTx-Deassert reception. Perhaps that's what we see. > So either the ASM1083 delays its Deassert messages, or it is just way > too slow to react to changes in its PCI interrupt line inputs. I'm afraid the only sensible way to find this out would be to somehow monitor the PCIe link traffic into the FCH from this ASM1083. Maybe someone from AMD knows if this can be done ? Let's not forget that the board seems to run fine under the Windows 7 O/S and maybe Linux simply doesn't do a special trick with the bridge or the chipset that Windows does. So, without further evidence, I would not (yet) blame the bridge. > I'd guess that you can make the polling time shorter; a few milliseconds > should be enough. I have tested the patch for a while now. I indeed decreased the polling interval to 10 ms (100 Hz), and the IRQ is already re-enabled after 1 second (100 cycles). It works to a degree that the computer system actually becomes useful. Under heavy use, the patch kicks in up to 10 times a minute. Otherwise it is only required a few times per hour. I also turn off polling entirely when it is no longer needed. Specifically for the Asus E45M1-M PRO I would recommend: 1. The IRQ bug manifests itself when using any device behind the ASM1083 bridge. That includes the 2 PCI slots on the motherboard, as well as the Firewire interface. Avoid their use. Preferably use the PCIe x1 slot. 2. 
An important problem is that, when IRQ 16..19 goes down, an integrated device, which otherwise works flawlessly, goes along with it. This includes the SATA, USB and both audio (HDMI / Analog) subsystems. If possible, enable the use of MSI for these devices. Clemens's patch for AHCI MSI is a real help here. 3. Step 1 above will practically eliminate the occurrence of the IRQ bug. If the PCI bus really is needed, the patch below must be used (with the kernel irqpoll command line option turned on, of course). > Your patch might be useful to others afflicted with this chip. Could > you publish it? No problem, but I've never done this before. Is the result of diff below ok ? Could someone specialized also have a look into the thread-safety ? J. (Begin of patch for kernel/irq/spurious.c) 21c21 < #define POLL_SPURIOUS_IRQ_INTERVAL (HZ/10) --- > #define POLL_SPURIOUS_IRQ_INTERVAL (HZ/100) 144c144 < int i; --- > int i, poll_again; 149a150 > poll_again = 0; /* Will stay false as long as no polling candidate is found */ 151c152 < unsigned int state; --- > unsigned int state, irq; 161,164c162,182 < < local_irq_disable(); < try_one_irq(i, desc, true); < local_irq_enable(); --- > > /* We end up here with a disabled spurious interrupt. 
> desc->irqs_unhandled now tracks the number of times > the interrupt has been polled */ > > irq = desc->irq_data.irq; > if (desc->irqs_unhandled < 100) { /* 1 second delay with poll frequency 100 Hz */ > if (desc->irqs_unhandled == 0) > printk("Polling IRQ %d\n", irq); > local_irq_disable(); > try_one_irq(i, desc, true); > local_irq_enable(); > desc->irqs_unhandled++; > poll_again = 1; > } else { > printk("Reenabling IRQ %d\n", irq); > irq_enable(desc); /* Reenable the interrupt line */ > desc->depth--; > desc->istate &= (~IRQS_SPURIOUS_DISABLED); > desc->irqs_unhandled = 0; > } 165a184,186 > if (poll_again) > mod_timer(&poll_spurious_irq_timer, > jiffies + POLL_SPURIOUS_IRQ_INTERVAL); 168,169d188 < mod_timer(&poll_spurious_irq_timer, < jiffies + POLL_SPURIOUS_IRQ_INTERVAL); 180c199 < * If 99,900 of the previous 100,000 interrupts have not been handled --- > * If 9 of the previous 10 interrupts have not been handled 184c203,211 < * (The other 100-of-100,000 interrupts may have been a correctly --- > * Although this may cause early deactivation of a sporadically > * malfunctioning IRQ line, the poll system will: > * a) Poll it for 100 cycles at a 100 Hz rate > * b) Reenable it afterwards > * > * In worst case, with current settings, this will cause short bursts > * of 10 interrupts every second. > * > * (The other single interrupt may have been a correctly 305c332 < if (likely(desc->irq_count < 100000)) --- > if (likely(desc->irq_count < 10)) 309c336 < if (unlikely(desc->irqs_unhandled > 99900)) { --- > if (unlikely(desc->irqs_unhandled >= 9)) { 313c340 < __report_bad_irq(irq, desc, action_ret); --- > /* __report_bad_irq(irq, desc, action_ret); */ 317c344 < printk(KERN_EMERG "Disabling IRQ #%d\n", irq); --- > printk(KERN_EMERG "Disabling IRQ %d\n", irq); (End of patch) ^ permalink raw reply [flat|nested] 40+ messages in thread
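The policy implemented by the patch above can be sketched as a small userspace model: declare an IRQ stuck when at least 9 of the last 10 interrupts went unhandled, poll it for 100 timer ticks, then re-enable it. This is a simplification and not the kernel code. struct irq_model and both functions are hypothetical names for this example, and the time-based reset of the unhandled counter in the real note_interrupt() is left out.

```c
#include <assert.h>
#include <stdbool.h>

#define IRQ_WINDOW       10	/* interrupts per evaluation window */
#define UNHANDLED_LIMIT   9	/* unhandled count that marks the line as stuck */
#define POLL_CYCLES     100	/* timer ticks to poll before re-enabling */

/* Hypothetical per-IRQ state, loosely mirroring the fields the patch uses. */
struct irq_model {
	unsigned int irq_count;
	unsigned int irqs_unhandled;
	unsigned int polls;
	bool disabled;
};

/* Account for one hardware interrupt; handled says whether a handler claimed it. */
void model_note_interrupt(struct irq_model *m, bool handled)
{
	assert(m);
	if (m->disabled)
		return;			/* a disabled line delivers nothing */
	if (!handled)
		m->irqs_unhandled++;
	if (++m->irq_count < IRQ_WINDOW)
		return;

	/* End of window: decide whether the line looks stuck. */
	m->irq_count = 0;
	if (m->irqs_unhandled >= UNHANDLED_LIMIT) {
		m->disabled = true;
		m->polls = 0;
	}
	m->irqs_unhandled = 0;
}

/* One poll timer tick; returns true if the timer should be re-armed. */
bool model_poll_tick(struct irq_model *m)
{
	assert(m);
	if (!m->disabled)
		return false;
	if (m->polls < POLL_CYCLES) {
		m->polls++;		/* handlers would be polled here */
		return true;
	}
	m->disabled = false;		/* give the line another chance */
	m->polls = 0;
	return false;
}
```

In this model a burst of 10 unhandled interrupts disables the line, 100 ticks of polling follow (1 second at 100 Hz), and the next tick re-enables the line and lets the timer lapse. That matches the worst case the patch comment describes: short bursts of 10 interrupts roughly once per second.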
* Re: Unhandled IRQs on AMD E-450 [not found] ` <fa.Vmg5vDod2/oKvwyy9BcalhoT+Lo@ifi.uio.no> @ 2012-04-25 8:35 ` andymatei 2012-04-25 8:48 ` Clemens Ladisch 0 siblings, 1 reply; 40+ messages in thread From: andymatei @ 2012-04-25 8:35 UTC (permalink / raw) To: fa.linux.kernel Cc: Clemens Ladisch, Huang, Shane, Borislav Petkov, Nguyen, Dong, linux-kernel Hello all, I am not as experienced with Linux as you guys are, but here goes. I have bought the Asus E45M1-M PRO to use in my home storage. I had 2 PCI Intel NICs that I wanted to use for iSCSI traffic that goes to 2 ESXi servers, plus 4 Seagate 2TB disks and one WD disk for the OS. I use Openfiler as a storage appliance. The problem appears 20-30 minutes after starting the PC. I get "Disabling IRQ 16" and "Disabling IRQ 19". The SATA disks drop from 130MB/s to 2MB/s and the web interface stops working. After a reboot everything works, but it goes down again after 30 minutes. Is there a patch or something that resolves this issue yet? Thanks, Andrei ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Unhandled IRQs on AMD E-450 2012-04-25 8:35 ` andymatei @ 2012-04-25 8:48 ` Clemens Ladisch 2012-04-27 8:22 ` Borislav Petkov 0 siblings, 1 reply; 40+ messages in thread From: Clemens Ladisch @ 2012-04-25 8:48 UTC (permalink / raw) To: andymatei; +Cc: Huang, Shane, Borislav Petkov, Nguyen, Dong, linux-kernel andymatei@gmail.com wrote: > I have bought the Asus E45M1-M PRO to use in my home storage. [...] > I get "Disabling IRQ 16" and "Disabling IRQ 19". [...] > Is there a patch or something that resolves this issue yet? As far as I remember, this is a hardware bug in the ASMedia PCIe/PCI bridge. This can neither easily nor completely be worked around in software. Regards, Clemens ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Unhandled IRQs on AMD E-450 2012-04-25 8:48 ` Clemens Ladisch @ 2012-04-27 8:22 ` Borislav Petkov 2012-04-27 8:29 ` Andrei Matei 2012-04-27 11:46 ` Jeroen Van den Keybus 0 siblings, 2 replies; 40+ messages in thread From: Borislav Petkov @ 2012-04-27 8:22 UTC (permalink / raw) To: Clemens Ladisch Cc: andymatei, Huang, Shane, Borislav Petkov, linux-kernel, Jeroen Van den Keybus On Wed, Apr 25, 2012 at 10:48:32AM +0200, Clemens Ladisch wrote: > andymatei@gmail.com wrote: > > I have bought the Asus E45M1-M PRO to use in my home storage. [...] > > I get "Disabling IRQ 16" and "Disabling IRQ 19". [...] > > Is there a patch or something that resolves this issue yet? > > As far as I remember, this is a hardware bug in the ASMedia PCIe/PCI > bridge. This can neither easily nor completely be worked around in > software. So, I was made aware of this discussion: https://lkml.org/lkml/2012/1/30/216 So why aren't you guys producing a proper patch for people to test? Simply get every one of the bug reporters to give it a try and if all is well, Linus said he'll take it. Clemens, Jeroen? -- Regards/Gruss, Boris. Advanced Micro Devices GmbH Einsteinring 24, 85609 Dornach GM: Alberto Bozzo Reg: Dornach, Landkreis Muenchen HRB Nr. 43632 WEEE Registernr: 129 19551 ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Unhandled IRQs on AMD E-450 2012-04-27 8:22 ` Borislav Petkov @ 2012-04-27 8:29 ` Andrei Matei 2012-04-27 11:46 ` Jeroen Van den Keybus 1 sibling, 0 replies; 40+ messages in thread From: Andrei Matei @ 2012-04-27 8:29 UTC (permalink / raw) To: Borislav Petkov Cc: Clemens Ladisch, Huang, Shane, Borislav Petkov, linux-kernel, Jeroen Van den Keybus How come ASUS is treating the buyers of its expensive boards like this? This issue has been going on for months now and nothing happens. I would really like my money back for this board. On Fri, Apr 27, 2012 at 11:22 AM, Borislav Petkov <borislav.petkov@amd.com> wrote: > On Wed, Apr 25, 2012 at 10:48:32AM +0200, Clemens Ladisch wrote: >> andymatei@gmail.com wrote: >> > I have bought the Asus E45M1-M PRO to use in my home storage. [...] >> > I get "Disabling IRQ 16" and "Disabling IRQ 19". [...] >> > Is there a patch or something that resolves this issue yet? >> >> As far as I remember, this is a hardware bug in the ASMedia PCIe/PCI >> bridge. This can neither easily nor completely be worked around in >> software. > > So, I got noted of this discussion: https://lkml.org/lkml/2012/1/30/216 > > So why aren't you guys producing a proper patch for people to test? > Simply get everyone of the bugreporters to give it a try and if all is > well, Linus said he'll take it. > > Clemens, Jeroen? > > -- > Regards/Gruss, > Boris. > > Advanced Micro Devices GmbH > Einsteinring 24, 85609 Dornach > GM: Alberto Bozzo > Reg: Dornach, Landkreis Muenchen > HRB Nr. 43632 WEEE Registernr: 129 19551 > -- Andrei Matei IT Manager @ Daromex SRL 0744 700 445 andrei@matei.ro andymatei@gmail.com ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Unhandled IRQs on AMD E-450 2012-04-27 8:22 ` Borislav Petkov 2012-04-27 8:29 ` Andrei Matei @ 2012-04-27 11:46 ` Jeroen Van den Keybus 2012-04-27 13:06 ` Josh Boyer 1 sibling, 1 reply; 40+ messages in thread From: Jeroen Van den Keybus @ 2012-04-27 11:46 UTC (permalink / raw) To: Borislav Petkov Cc: Clemens Ladisch, andymatei, Huang, Shane, Borislav Petkov, linux-kernel > So why aren't you guys producing a proper patch for people to test? I have been working on a patch for 3.4-rc2, in which I also implemented another mechanism to detect a stuck interrupt (using a level mechanism: an unhandled IRQ increases the level by, say, 10, and a handled one reduces the level by 1. It is considered stuck when the level reaches e.g. 100). It also implements proper dmesg reporting. I also looked for x86'isms as suggested by Alan, but I think there's no problem. However, while testing, I noticed that the stuck interrupt detection also fired on MSI interrupts in the 3.4 kernel under heavy load, suggesting that these would also be suffering from being unhandled. The patch also includes Clemens' patch to put the disk IRQ on MSI, which makes the E450 board actually usable. I haven't had time to debug the new issue (false stuck irq detection in 3.4 - not too harmful since the interrupt is reenabled a second later, but wrong nevertheless). If any of you is willing to help and have a look at the code, I have attached it below. (There's also a small patch that accidentally got in there to reimplement cpu_possible_map, as it was needed by the fglrx driver I'm using. Please disregard. Or use.) Another problem is the arming of the timer (mod_timer), which may cause an incorrect poll interval when an IRQ gets stuck while another one is being polled (either because it's also stuck and polling or because forced polling is on for that particular one). Jeroen. 
Only in linux-3.4-rc2/arch/x86/include: generated diff -upr linux-3.4-rc2.orig/drivers/pci/quirks.c linux-3.4-rc2/drivers/pci/quirks.c --- linux-3.4-rc2.orig/drivers/pci/quirks.c 2012-04-08 03:30:41.000000000 +0200 +++ linux-3.4-rc2/drivers/pci/quirks.c 2012-04-11 11:33:42.821707398 +0200 @@ -2917,6 +2917,48 @@ static void __devinit disable_igfx_irq(s DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x0102, disable_igfx_irq); DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x010a, disable_igfx_irq); +#if defined(CONFIG_PCI_MSI) && \ + (defined(CONFIG_SATA_AHCI) || defined(CONFIG_SATA_AHCI_MODULE)) +static void __init sb7x0_ahci_msi_enable(struct pci_dev *dev) +{ + u8 rev, ptr; + int where; + u32 misc_control; + + pci_bus_read_config_byte(dev->bus, PCI_DEVFN(0x14, 0), + PCI_REVISION_ID, &rev); + if (rev < 0x3c) /* A14 */ + return; + + pci_read_config_byte(dev, 0x34, &ptr); + if (ptr == 0x70) { + where = 0x34; + } else { + pci_read_config_byte(dev, 0x61, &ptr); + if (ptr == 0x70) + where = 0x61; + else + return; + } + + pci_read_config_byte(dev, 0x51, &ptr); + if (ptr != 0x70) + return; + + pci_read_config_dword(dev, 0x40, &misc_control); + misc_control |= 1; + pci_write_config_dword(dev, 0x40, misc_control); + + pci_write_config_byte(dev, where, 0x50); + + misc_control &= ~1; + pci_write_config_dword(dev, 0x40, misc_control); + + dev_dbg(&dev->dev, "AHCI: enabled MSI\n"); +} +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_ATI, 0x4391, sb7x0_ahci_msi_enable); +#endif + static void pci_do_fixups(struct pci_dev *dev, struct pci_fixup *f, struct pci_fixup *end) { Only in linux-3.4-rc2/drivers/pci: quirks.c.orig Only in linux-3.4-rc2/include: config Only in linux-3.4-rc2/include: generated diff -upr linux-3.4-rc2.orig/include/linux/cpumask.h linux-3.4-rc2/include/linux/cpumask.h --- linux-3.4-rc2.orig/include/linux/cpumask.h 2012-04-08 03:30:41.000000000 +0200 +++ linux-3.4-rc2/include/linux/cpumask.h 2012-04-15 21:05:55.733994054 +0200 @@ -764,6 +764,7 @@ static inline const struct 
cpumask *get_ * */ #ifndef CONFIG_DISABLE_OBSOLETE_CPUMASK_FUNCTIONS +#define cpu_possible_map (*(cpumask_t *)cpu_possible_mask) #define cpumask_of_cpu(cpu) (*get_cpu_mask(cpu)) #define CPU_MASK_LAST_WORD BITMAP_LAST_WORD_MASK(NR_CPUS) diff -upr linux-3.4-rc2.orig/include/linux/irqdesc.h linux-3.4-rc2/include/linux/irqdesc.h --- linux-3.4-rc2.orig/include/linux/irqdesc.h 2012-04-08 03:30:41.000000000 +0200 +++ linux-3.4-rc2/include/linux/irqdesc.h 2012-04-15 15:26:07.776394544 +0200 @@ -14,28 +14,27 @@ struct timer_rand_state; struct module; /** * struct irq_desc - interrupt descriptor - * @irq_data: per irq and chip data passed down to chip functions - * @timer_rand_state: pointer to timer rand state struct - * @kstat_irqs: irq stats per cpu - * @handle_irq: highlevel irq-events handler - * @preflow_handler: handler called before the flow handler (currently used by sparc) - * @action: the irq action chain - * @status: status information - * @core_internal_state__do_not_mess_with_it: core internal status information - * @depth: disable-depth, for nested irq_disable() calls - * @wake_depth: enable depth, for multiple irq_set_irq_wake() callers - * @irq_count: stats field to detect stalled irqs - * @last_unhandled: aging timer for unhandled count - * @irqs_unhandled: stats field for spurious unhandled interrupts - * @lock: locking for SMP - * @affinity_hint: hint to user space for preferred irq affinity - * @affinity_notify: context for notification of affinity changes - * @pending_mask: pending rebalanced interrupts - * @threads_oneshot: bitfield to handle shared oneshot threads - * @threads_active: number of irqaction threads currently running - * @wait_for_threads: wait queue for sync_irq to wait for threaded handlers - * @dir: /proc/irq/ procfs entry - * @name: flow handler name for /proc/interrupts output + * @irq_data: per irq and chip data passed down to chip functions + * @timer_rand_state: pointer to timer rand state struct + * @kstat_irqs: irq stats per cpu 
+ * @handle_irq: highlevel irq-events handler + * @preflow_handler: handler called before the flow handler (currently used by sparc) + * @action: the irq action chain + * @status: status information + * @istate: core internal status information + * @depth: disable-depth, for nested irq_disable() calls + * @wake_depth: enable depth, for multiple irq_set_irq_wake() callers + * @irqs_unhandled_level: stats field for unhandled interrupt detection and tracking poll cycle count + * @irqs_unhandled_count: stats field for counting unhandled interrupt detections + * @lock: locking for SMP + * @affinity_hint: hint to user space for preferred irq affinity + * @affinity_notify: context for notification of affinity changes + * @pending_mask: pending rebalanced interrupts + * @threads_oneshot: bitfield to handle shared oneshot threads + * @threads_active: number of irqaction threads currently running + * @wait_for_threads: wait queue for sync_irq to wait for threaded handlers + * @dir: /proc/irq/ procfs entry + * @name: flow handler name for /proc/interrupts output */ struct irq_desc { struct irq_data irq_data; @@ -47,12 +46,11 @@ struct irq_desc { #endif struct irqaction *action; /* IRQ action list */ unsigned int status_use_accessors; - unsigned int core_internal_state__do_not_mess_with_it; + unsigned int istate; unsigned int depth; /* nested irq disables */ unsigned int wake_depth; /* nested wake enables */ - unsigned int irq_count; /* For detecting broken IRQs */ - unsigned long last_unhandled; /* Aging timer for unhandled count */ - unsigned int irqs_unhandled; + unsigned int irqs_unhandled_level; + unsigned int irqs_unhandled_count; raw_spinlock_t lock; struct cpumask *percpu_enabled; #ifdef CONFIG_SMP Only in linux-3.4-rc2/include/linux: version.h diff -upr linux-3.4-rc2.orig/kernel/irq/debug.h linux-3.4-rc2/kernel/irq/debug.h --- linux-3.4-rc2.orig/kernel/irq/debug.h 2012-04-08 03:30:41.000000000 +0200 +++ linux-3.4-rc2/kernel/irq/debug.h 2012-04-15 15:23:36.963893571 
+0200 @@ -11,8 +11,8 @@ static inline void print_irq_desc(unsigned int irq, struct irq_desc *desc) { - printk("irq %d, desc: %p, depth: %d, count: %d, unhandled: %d\n", - irq, desc, desc->depth, desc->irq_count, desc->irqs_unhandled); + printk("irq %d, desc: %p, depth: %d, unhandled_level: %d, unhandled_count: %d\n", + irq, desc, desc->depth, desc->irqs_unhandled_level, desc->irqs_unhandled_count); printk("->handle_irq(): %p, ", desc->handle_irq); print_symbol("%s\n", (unsigned long)desc->handle_irq); printk("->irq_data.chip(): %p, ", desc->irq_data.chip); diff -upr linux-3.4-rc2.orig/kernel/irq/internals.h linux-3.4-rc2/kernel/irq/internals.h --- linux-3.4-rc2.orig/kernel/irq/internals.h 2012-04-08 03:30:41.000000000 +0200 +++ linux-3.4-rc2/kernel/irq/internals.h 2012-04-15 14:02:57.203854533 +0200 @@ -13,8 +13,6 @@ # define IRQ_BITMAP_BITS NR_IRQS #endif -#define istate core_internal_state__do_not_mess_with_it - extern bool noirqdebug; /* diff -upr linux-3.4-rc2.orig/kernel/irq/irqdesc.c linux-3.4-rc2/kernel/irq/irqdesc.c --- linux-3.4-rc2.orig/kernel/irq/irqdesc.c 2012-04-08 03:30:41.000000000 +0200 +++ linux-3.4-rc2/kernel/irq/irqdesc.c 2012-04-15 15:28:20.552831286 +0200 @@ -84,8 +84,8 @@ static void desc_set_defaults(unsigned i irqd_set(&desc->irq_data, IRQD_IRQ_DISABLED); desc->handle_irq = handle_bad_irq; desc->depth = 1; - desc->irq_count = 0; - desc->irqs_unhandled = 0; + desc->irqs_unhandled_level = 0; + desc->irqs_unhandled_count = 0; desc->name = NULL; desc->owner = owner; for_each_possible_cpu(cpu) diff -upr linux-3.4-rc2.orig/kernel/irq/manage.c linux-3.4-rc2/kernel/irq/manage.c --- linux-3.4-rc2.orig/kernel/irq/manage.c 2012-04-08 03:30:41.000000000 +0200 +++ linux-3.4-rc2/kernel/irq/manage.c 2012-04-15 15:29:30.169058703 +0200 @@ -1086,8 +1086,8 @@ __setup_irq(unsigned int irq, struct irq *old_ptr = new; /* Reset broken irq detection when installing new handler */ - desc->irq_count = 0; - desc->irqs_unhandled = 0; + desc->irqs_unhandled_level = 0; 
+ desc->irqs_unhandled_count = 0; /* * Check whether we disabled the irq via the spurious handler diff -upr linux-3.4-rc2.orig/kernel/irq/proc.c linux-3.4-rc2/kernel/irq/proc.c --- linux-3.4-rc2.orig/kernel/irq/proc.c 2012-04-08 03:30:41.000000000 +0200 +++ linux-3.4-rc2/kernel/irq/proc.c 2012-04-15 15:36:37.734432932 +0200 @@ -248,9 +248,8 @@ static int irq_spurious_proc_show(struct { struct irq_desc *desc = irq_to_desc((long) m->private); - seq_printf(m, "count %u\n" "unhandled %u\n" "last_unhandled %u ms\n", - desc->irq_count, desc->irqs_unhandled, - jiffies_to_msecs(desc->last_unhandled)); + seq_printf(m, "unhandled_level %u\n" "unhandled_count %u\n", + desc->irqs_unhandled_level, desc->irqs_unhandled_count); return 0; } diff -upr linux-3.4-rc2.orig/kernel/irq/spurious.c linux-3.4-rc2/kernel/irq/spurious.c --- linux-3.4-rc2.orig/kernel/irq/spurious.c 2012-04-08 03:30:41.000000000 +0200 +++ linux-3.4-rc2/kernel/irq/spurious.c 2012-04-15 15:40:33.947176490 +0200 @@ -18,7 +18,12 @@ static int irqfixup __read_mostly; -#define POLL_SPURIOUS_IRQ_INTERVAL (HZ/10) +#define SPURIOUS_IRQ_PENALTY 10 +#define SPURIOUS_IRQ_TRIGGER 100 +#define SPURIOUS_IRQ_REPORT_COUNT 5 +#define SPURIOUS_IRQ_POLL_CYCLES 100 +#define SPURIOUS_IRQ_POLL_INTERVAL (HZ/100) + static void poll_spurious_irqs(unsigned long dummy); static DEFINE_TIMER(poll_spurious_irq_timer, poll_spurious_irqs, 0, 0); static int irq_poll_cpu; @@ -141,14 +146,15 @@ out: static void poll_spurious_irqs(unsigned long dummy) { struct irq_desc *desc; - int i; + int i, poll_again; if (atomic_inc_return(&irq_poll_active) != 1) goto out; irq_poll_cpu = smp_processor_id(); + poll_again = 0; /* Will stay false as long as no polling candidate is found */ for_each_irq_desc(i, desc) { - unsigned int state; + unsigned int state, irq; if (!i) continue; @@ -158,15 +164,38 @@ static void poll_spurious_irqs(unsigned barrier(); if (!(state & IRQS_SPURIOUS_DISABLED)) continue; - - local_irq_disable(); - try_one_irq(i, desc, true); - 
local_irq_enable(); + + /* We end up here with a disabled spurious interrupt. + desc->irqs_unhandled_level now tracks the number of times + the interrupt has been polled */ + + irq = desc->irq_data.irq; + if (desc->irqs_unhandled_level < SPURIOUS_IRQ_POLL_CYCLES) { + if (unlikely(desc->irqs_unhandled_count <= SPURIOUS_IRQ_REPORT_COUNT)) + if (desc->irqs_unhandled_level == 0) + printk(KERN_EMERG "Polling handlers for IRQ %d.\n", irq); + local_irq_disable(); + try_one_irq(i, desc, true); + local_irq_enable(); + desc->irqs_unhandled_level++; + poll_again = 1; + } else { + if (unlikely(desc->irqs_unhandled_count <= SPURIOUS_IRQ_REPORT_COUNT)) { + printk(KERN_EMERG "Reenabling IRQ %d.\n", irq); + if (desc->irqs_unhandled_count == SPURIOUS_IRQ_REPORT_COUNT) + printk(KERN_EMERG "No more stuck interrupt occurrences will be reported for IRQ %d.\n", irq); + } + irq_enable(desc); /* Reenable the interrupt line */ + desc->depth--; + desc->istate &= (~IRQS_SPURIOUS_DISABLED); + desc->irqs_unhandled_level = 0; + } } + if (poll_again) + mod_timer(&poll_spurious_irq_timer, + jiffies + SPURIOUS_IRQ_POLL_INTERVAL); out: atomic_dec(&irq_poll_active); - mod_timer(&poll_spurious_irq_timer, - jiffies + POLL_SPURIOUS_IRQ_INTERVAL); } static inline int bad_action_ret(irqreturn_t action_ret) @@ -176,14 +205,6 @@ static inline int bad_action_ret(irqretu return 1; } -/* - * If 99,900 of the previous 100,000 interrupts have not been handled - * then assume that the IRQ is stuck in some manner. Drop a diagnostic - * and try to turn the IRQ off. 
- * - * (The other 100-of-100,000 interrupts may have been a correctly - * functioning device sharing an IRQ with the failing one) - */ static void __report_bad_irq(unsigned int irq, struct irq_desc *desc, irqreturn_t action_ret) @@ -272,7 +293,6 @@ void note_interrupt(unsigned int irq, st if (desc->istate & IRQS_POLL_INPROGRESS) return; - /* we get here again via the threaded handler */ if (action_ret == IRQ_WAKE_THREAD) return; @@ -281,48 +301,36 @@ void note_interrupt(unsigned int irq, st return; } - if (unlikely(action_ret == IRQ_NONE)) { - /* - * If we are seeing only the odd spurious IRQ caused by - * bus asynchronicity then don't eventually trigger an error, - * otherwise the counter becomes a doomsday timer for otherwise - * working systems - */ - if (time_after(jiffies, desc->last_unhandled + HZ/10)) - desc->irqs_unhandled = 1; - else - desc->irqs_unhandled++; - desc->last_unhandled = jiffies; - } - - if (unlikely(try_misrouted_irq(irq, desc, action_ret))) { - int ok = misrouted_irq(irq); - if (action_ret == IRQ_NONE) - desc->irqs_unhandled -= ok; - } - - desc->irq_count++; - if (likely(desc->irq_count < 100000)) - return; + /* Adjust action_ret if an optional poll was successful. + (See inlined try_misrouted_irq() for conditions (depending + on 'irqfixup' and 'irqpoll'), and 'noirqdebug' must not + be set, since we wouldn't be here (note_interrupt()) + at all in that case.) 
*/ + if (unlikely(try_misrouted_irq(irq, desc, action_ret))) + if (misrouted_irq(irq)) + action_ret = IRQ_HANDLED; - desc->irq_count = 0; - if (unlikely(desc->irqs_unhandled > 99900)) { - /* - * The interrupt is stuck - */ - __report_bad_irq(irq, desc, action_ret); - /* - * Now kill the IRQ - */ - printk(KERN_EMERG "Disabling IRQ #%d\n", irq); - desc->istate |= IRQS_SPURIOUS_DISABLED; - desc->depth++; - irq_disable(desc); - - mod_timer(&poll_spurious_irq_timer, - jiffies + POLL_SPURIOUS_IRQ_INTERVAL); + if (unlikely(action_ret == IRQ_NONE)) { + desc->irqs_unhandled_level += SPURIOUS_IRQ_PENALTY; + if (desc->irqs_unhandled_level >= SPURIOUS_IRQ_TRIGGER) { /* The interrupt is stuck */ + desc->irqs_unhandled_count++; + if (desc->irqs_unhandled_count <= SPURIOUS_IRQ_REPORT_COUNT) { + __report_bad_irq(irq, desc, action_ret); + printk(KERN_EMERG "Disabling IRQ %d.\n", irq); + } + desc->istate |= IRQS_SPURIOUS_DISABLED; + desc->depth++; + irq_disable(desc); + /* TODO: Do a safe access to the timer. Now we may be extending a deadline + for a polling system already running for another interrupt. 
*/ + mod_timer(&poll_spurious_irq_timer, + jiffies + SPURIOUS_IRQ_POLL_INTERVAL); /* Schedule a poll cycle */ + desc->irqs_unhandled_level = 0; + } } - desc->irqs_unhandled = 0; + else + if (unlikely(desc->irqs_unhandled_level > 0)) + desc->irqs_unhandled_level--; } bool noirqdebug __read_mostly; Only in linux-3.4-rc2/kernel/irq: spurious.c.k32 Only in linux-3.4-rc2/kernel/irq: spurious.c.k32.orig Only in linux-3.4-rc2/kernel/irq: spurious.c.orig Only in linux-3.4-rc2/scripts/basic: fixdep Only in linux-3.4-rc2/scripts: conmakehash Only in linux-3.4-rc2/scripts/genksyms: genksyms Only in linux-3.4-rc2/scripts/genksyms: keywords.hash.c Only in linux-3.4-rc2/scripts/genksyms: lex.lex.c Only in linux-3.4-rc2/scripts/genksyms: parse.tab.c Only in linux-3.4-rc2/scripts/genksyms: parse.tab.h Only in linux-3.4-rc2/scripts: kallsyms Only in linux-3.4-rc2/scripts/kconfig: conf Only in linux-3.4-rc2/scripts/kconfig: zconf.hash.c Only in linux-3.4-rc2/scripts/kconfig: zconf.lex.c Only in linux-3.4-rc2/scripts/kconfig: zconf.tab.c Only in linux-3.4-rc2/scripts/mod: elfconfig.h Only in linux-3.4-rc2/scripts/mod: mk_elfconfig Only in linux-3.4-rc2/scripts/mod: modpost Only in linux-3.4-rc2/scripts: recordmcount Only in linux-3.4-rc2/scripts/selinux/genheaders: genheaders Only in linux-3.4-rc2/scripts/selinux/mdp: mdp Only in linux-3.4-rc2/security/tomoyo: builtin-policy.h Only in linux-3.4-rc2/security/tomoyo: policy ^ permalink raw reply [flat|nested] 40+ messages in thread
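The level mechanism described in the message above (an unhandled IRQ adds a penalty, a handled one subtracts 1, and the line counts as stuck once a trigger level is reached) can be sketched in user space as follows. This is only an illustration under assumptions: `struct irq_level` and `irq_level_note()` are hypothetical names, not kernel API, and the constants take the "say, 10" and "e.g. 100" values from the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SPURIOUS_IRQ_PENALTY 10   /* added per unhandled interrupt */
#define SPURIOUS_IRQ_TRIGGER 100  /* level at which the line counts as stuck */

/* Hypothetical stand-in for the two per-IRQ counters used by the patch. */
struct irq_level {
	unsigned int level;        /* current "badness" level */
	unsigned int stuck_count;  /* how often the trigger was reached */
};

/*
 * Record one interrupt. Returns true if this interrupt pushed the level
 * to the trigger, i.e. the line would now be disabled and polled.
 */
static bool irq_level_note(struct irq_level *t, bool handled)
{
	assert(t != NULL);

	if (!handled) {
		t->level += SPURIOUS_IRQ_PENALTY;
		if (t->level >= SPURIOUS_IRQ_TRIGGER) {
			t->stuck_count++;
			t->level = 0;  /* start over once the line goes to the poller */
			return true;
		}
	} else if (t->level > 0) {
		t->level--;
	}
	return false;
}
```

With these values the level only keeps growing when more than roughly 1 in 11 interrupts goes unhandled, since each unhandled interrupt has to be offset by ten handled ones.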
* Re: Unhandled IRQs on AMD E-450 2012-04-27 11:46 ` Jeroen Van den Keybus @ 2012-04-27 13:06 ` Josh Boyer 2012-04-27 13:28 ` Jeroen Van den Keybus 0 siblings, 1 reply; 40+ messages in thread From: Josh Boyer @ 2012-04-27 13:06 UTC (permalink / raw) To: Jeroen Van den Keybus Cc: Borislav Petkov, Clemens Ladisch, andymatei, Huang, Shane, Borislav Petkov, linux-kernel On Fri, Apr 27, 2012 at 7:46 AM, Jeroen Van den Keybus <jeroen.vandenkeybus@gmail.com> wrote: >> So why aren't you guys producing a proper patch for people to test? > > I have been working on a patch for 3.4-rc2, in which I also > implemented another mechanism to detect a stuck interrupt (using a > level mechanism: an unhandled IRQ increases the level by, say, 10, and > a handled one reduces the level by 1. It is considered stuck when the > level reaches e.g. 100). It also implements proper dmesg reporting. I > also looked for x86'isms as suggested by Alan, but I think there's no > problem. > > However, while testing, I noticed that the stuck interrupt detection > fired also on MSI interrupts in the 3.4 kernel under heavy load, > suggesting that these would also be suffering from being unhandled. > > The patch also includes Clemens' patch to put the disk IRQ on MSI, > which makes the E450 board actually usable. > > I haven't had time to debug the new issue (false stuck irq detection > in 3.4 - not too harmful since the interrupt is reenabled a second > later, but wrong nevertheless). If any of you is willing to help and > have a look into the code, I have attached it below. > > (There's also a small patch that accidentally got in there to > reimplement cpu_possible_map, as it was needed by the fglrx driver I'm > using. Please disregard. Or use.) 
> > Another problem is the arming of the timer (mod_timer), which may > cause an incorrect poll interval when a IRQ gets stuck while anotther > one is being polled (either because it's also stuck and polling or > because forced polling is on for that particular one). We've had a few bugs reported with this ASM108x chip in Fedora. I took your original patch you submitted quite a while ago, and tweaked it a little. We've been carrying it in Fedora and it makes things somewhat usable, but it's far from perfect. Essentially, it adds a PCI quirk to detect the buggy bridge and only changes the irqpolling mechanism if that bridge is detected. That means that people without the buggy hardware aren't impacted by this at all, which was the first problem we hit. Apparently there are some pieces of hardware that generate a small number of spurious IRQs "normally" and lowering the threshold to such a small value caused those machines to kick into polling mode when they really didn't need to. The other two issues we've run into with this patch are: 1) While the quirk helps shield people without the buggy bridge, it doesn't help the case where people have the bridge, but they have no devices actually behind it. That means such setups hit the polling mode when they don't really need to as described above. 2) People, rightfully, complain that it makes inter-activity on their desktop pretty laggy. The mouse pointer jumps around a lot and key strokes are often missed. For a server class machine, I doubt it would matter much but Fedora is essentially a desktop distro so that tends to be a high priority. I've attached the minor rework of the original patch below for reference. josh ---- It seems that some motherboard designs using the ASM1083 PCI/PCIe bridge (PCI device ID 1b21:1080, Rev. 01) suffer from stuck IRQ lines on the PCI bus (causing the kernel to emit 'IRQxx: nobody cared' and disable the IRQ). 
The following patch is an attempt to mitigate the serious impact of permanently disabling an IRQ in that case and actually make PCI devices more usable on this platform. It seems that the bridge fails to issue an IRQ deassertion message on the PCIe bus, when the relevant driver causes the interrupting PCI device to deassert its IRQ line. To solve this issue, we tried to re-issue an IRQ on a PCI device able to do so (e1000 in this case), but we suspect that the attempt to re-assert/deassert may have occurred too soon after the initial IRQ for the ASM1083. Anyway, it didn't work but if, after some delay, a new IRQ occurred, the related IRQ deassertion message eventually did clear the IOAPIC IRQ. It would be useful to re-enable the IRQ here. Therefore the patch below to poll_spurious_irqs() in spurious.c is proposed. It does the following: 1. lets the kernel decide that an IRQ is unhandled after only 10 positives (instead of 100,000); 2. briefly (a few seconds or so, currently 1 s) switches to polling IRQ at a higher rate than usual (100..1,000Hz instead of 10Hz, currently 100Hz), but not too high to avoid excessive CPU load. Any device drivers 'see' their interrupts handled with a higher latency than usual, but they will still operate properly; 3. afterwards, simply reenables the IRQ. If proper operation of the PCIe legacy IRQ line emulation is restored after step 3, the system operates again at normal performance. If the IRQ is still stuck after this procedure, the sequence repeats. If a genuinely stuck IRQ is used with this solution, the system would simply sustain short bursts of 10 unhandled IRQs per second, and use polling mode indefinitely at a moderate 100Hz rate. It seemed a good alternative to the default irqpoll behaviour to me, which is why I left it in poll_spurious_irqs() (instead of creating a new kernel option). Additionally, if any device happens to share an IRQ with a faulty one, that device is no longer banned forever.
Debugging output is still present and may be removed. Bad IRQ reporting is also commented out now. I have now tried it for about 2 months and I can conclude the following: 1. The patch works and, judging from my Firewire card interrupt on IRQ16, which repeats every 64 secs, I can confirm that the IRQ usually gets reset when a new IRQ arrives (polling mode runs for 64 seconds every time). 2. When testing a SiL-3114 SATA PCI card behind the ASM1083, I could keep this running at fairly high speeds (50..70MB/s) for an hour or so, but eventually the SiL driver crashed. In such conditions the PCI system had to deal with a few hundred IRQs per second / polling mode kicking in every 5..10 seconds). I would like to thank Clemens Ladisch for his invaluable help in finding a solution (and providing a patch to avoid my SATA going down every time during debugging). Signed-off-by: Jeroen Van den Keybus <jeroen.vandenkeybus@gmail.com> Make it less chatty. Only kick it in if we detect an ASM1083 PCI bridge. Josh Boyer <jwboyer@redhat.com> ====== --- linux-2.6.orig/kernel/irq/spurious.c +++ linux-2.6/kernel/irq/spurious.c @@ -18,6 +18,8 @@ static int irqfixup __read_mostly; +int irq_poll_and_retry = 0; + #define POLL_SPURIOUS_IRQ_INTERVAL (HZ/10) static void poll_spurious_irqs(unsigned long dummy); static DEFINE_TIMER(poll_spurious_irq_timer, poll_spurious_irqs, 0, 0); @@ -141,12 +143,13 @@ out: static void poll_spurious_irqs(unsigned long dummy) { struct irq_desc *desc; - int i; + int i, poll_again; if (atomic_inc_return(&irq_poll_active) != 1) goto out; irq_poll_cpu = smp_processor_id(); + poll_again = 0; /* Will stay false as long as no polling candidate is found */ for_each_irq_desc(i, desc) { unsigned int state; @@ -159,14 +162,33 @@ static void poll_spurious_irqs(unsigned if (!(state & IRQS_SPURIOUS_DISABLED)) continue; - local_irq_disable(); - try_one_irq(i, desc, true); - local_irq_enable(); + /* We end up here with a disabled spurious interrupt. 
+ desc->irqs_unhandled now tracks the number of times + the interrupt has been polled */ + if (irq_poll_and_retry) { + if (desc->irqs_unhandled < 100) { /* 1 second delay with poll frequency 100 Hz */ + local_irq_disable(); + try_one_irq(i, desc, true); + local_irq_enable(); + desc->irqs_unhandled++; + poll_again = 1; + } else { + irq_enable(desc); /* Reenable the interrupt line */ + desc->depth--; + desc->istate &= (~IRQS_SPURIOUS_DISABLED); + desc->irqs_unhandled = 0; + } + } else { + local_irq_disable(); + try_one_irq(i, desc, true); + local_irq_enable(); + } } + if (poll_again) + mod_timer(&poll_spurious_irq_timer, + jiffies + POLL_SPURIOUS_IRQ_INTERVAL); out: atomic_dec(&irq_poll_active); - mod_timer(&poll_spurious_irq_timer, - jiffies + POLL_SPURIOUS_IRQ_INTERVAL); } static inline int bad_action_ret(irqreturn_t action_ret) @@ -177,11 +199,19 @@ static inline int bad_action_ret(irqretu } /* - * If 99,900 of the previous 100,000 interrupts have not been handled + * If 9 of the previous 10 interrupts have not been handled * then assume that the IRQ is stuck in some manner. Drop a diagnostic * and try to turn the IRQ off. * - * (The other 100-of-100,000 interrupts may have been a correctly + * Although this may cause early deactivation of a sporadically + * malfunctioning IRQ line, the poll system will: + * a) Poll it for 100 cycles at a 100 Hz rate + * b) Reenable it afterwards + * + * In worst case, with current settings, this will cause short bursts + * of 10 interrupts every second. 
+ * + * (The other single interrupt may have been a correctly * functioning device sharing an IRQ with the failing one) */ static void @@ -269,6 +299,8 @@ try_misrouted_irq(unsigned int irq, stru void note_interrupt(unsigned int irq, struct irq_desc *desc, irqreturn_t action_ret) { + int unhandled_thresh = 999000; + if (desc->istate & IRQS_POLL_INPROGRESS) return; @@ -302,19 +334,31 @@ void note_interrupt(unsigned int irq, st } desc->irq_count++; - if (likely(desc->irq_count < 100000)) - return; + if (!irq_poll_and_retry) + if (likely(desc->irq_count < 100000)) + return; + else + if (likely(desc->irq_count < 10)) + return; desc->irq_count = 0; - if (unlikely(desc->irqs_unhandled > 99900)) { + if (irq_poll_and_retry) + unhandled_thresh = 9; + + if (unlikely(desc->irqs_unhandled >= unhandled_thresh)) { /* - * The interrupt is stuck + * The interrupt might be stuck */ - __report_bad_irq(irq, desc, action_ret); + if (!irq_poll_and_retry) { + __report_bad_irq(irq, desc, action_ret); + printk(KERN_EMERG "Disabling IRQ %d\n", irq); + } else { + printk(KERN_INFO "IRQ %d might be stuck. Polling\n", + irq); + } /* * Now kill the IRQ */ - printk(KERN_EMERG "Disabling IRQ #%d\n", irq); desc->istate |= IRQS_SPURIOUS_DISABLED; desc->depth++; irq_disable(desc); --- linux-2.6.orig/drivers/pci/quirks.c +++ linux-2.6/drivers/pci/quirks.c @@ -1677,6 +1677,22 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_IN DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x260a, quirk_intel_pcie_pm); DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x260b, quirk_intel_pcie_pm); +/* ASM108x transparent PCI bridges apparently have broken IRQ deassert + * handling. This causes interrupts to get "stuck" and eventually disabled. + * However, the interrupts are often shared and disabling them is fairly bad. + * It's been somewhat successful to switch to polling mode and retry after + * a bit, so let's do that. 
+ */ +extern int irq_poll_and_retry; +static void quirk_asm108x_poll_interrupts(struct pci_dev *dev) +{ + dev_info(&dev->dev, "Buggy bridge found [%04x:%04x]\n", + dev->vendor, dev->device); + dev_info(&dev->dev, "Stuck interrupts will be polled and retried\n"); + irq_poll_and_retry = 1; +} +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_ASMEDIA, 0x1080, quirk_asm108x_poll_interrupts); + #ifdef CONFIG_X86_IO_APIC /* * Boot interrupts on some chipsets cannot be turned off. For these chipsets, ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Unhandled IRQs on AMD E-450
  2012-04-27 13:06 ` Josh Boyer
@ 2012-04-27 13:28 ` Jeroen Van den Keybus
  2012-04-27 13:49   ` Josh Boyer
  0 siblings, 1 reply; 40+ messages in thread
From: Jeroen Van den Keybus @ 2012-04-27 13:28 UTC (permalink / raw)
  To: Josh Boyer
  Cc: Borislav Petkov, Clemens Ladisch, andymatei, Huang, Shane,
      Borislav Petkov, linux-kernel

> Apparently there are
> some pieces of hardware that generate a small number of spurious IRQs
> "normally" and lowering the threshold to such a small value caused
> those machines to kick into polling mode when they really didn't need
> to.

Hm. I'd really expect a spurious IRQ to happen as its name suggests:
spuriously. What kind of hardware behaves like this ?

> 1) While the quirk helps shield people without the buggy bridge, it
> doesn't help the case where people have the bridge, but they have no
> devices actually behind it. That means such setups hit the polling
> mode when they don't really need to as described above.

Curious, the polling mode is left until a new spurious IRQ is detected.

> 2) People, rightfully, complain that it makes inter-activity on their
> desktop pretty laggy. The mouse pointer jumps around a lot and key
> strokes are often missed. For a server class machine, I doubt it
> would matter much but Fedora is essentially a desktop distro so that
> tends to be a high priority.

Again, I am a bit surprised. However, according to your patch:

> +       if (!irq_poll_and_retry)
> +               if (likely(desc->irq_count < 100000))
> +                       return;
> +       else
> +               if (likely(desc->irq_count < 10))
> +                       return;

Don't you mean :

+       if (!irq_poll_and_retry) {
+               if (likely(desc->irq_count < 100000))
+                       return;
+       }
+       else {
+               if (likely(desc->irq_count < 10))
+                       return;
+       }

Jeroen.

^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Unhandled IRQs on AMD E-450
  2012-04-27 13:28 ` Jeroen Van den Keybus
@ 2012-04-27 13:49 ` Josh Boyer
  2012-04-30  8:29   ` Jeroen Van den Keybus
  0 siblings, 1 reply; 40+ messages in thread
From: Josh Boyer @ 2012-04-27 13:49 UTC (permalink / raw)
  To: Jeroen Van den Keybus
  Cc: Borislav Petkov, Clemens Ladisch, andymatei, Huang, Shane,
      Borislav Petkov, linux-kernel

On Fri, Apr 27, 2012 at 9:28 AM, Jeroen Van den Keybus
<jeroen.vandenkeybus@gmail.com> wrote:
>> Apparently there are
>> some pieces of hardware that generate a small number of spurious IRQs
>> "normally" and lowering the threshold to such a small value caused
>> those machines to kick into polling mode when they really didn't need
>> to.
>
> Hm. I'd really expect a spurious IRQ to happen as its name suggests:
> spuriously. What kind of hardware behaves like this ?

I'd have to go back and look through all the bug reports. Essentially,
it wound up being "this used to work fine, now I get stuck in polling
mode". When we dropped the patch in those cases, it went back to
working fine. I'm not sure if it was a matter of a shared IRQ hitting
quickly in succession, or if it was really small bursts of spurious
IRQs.

>> 1) While the quirk helps shield people without the buggy bridge, it
>> doesn't help the case where people have the bridge, but they have no
>> devices actually behind it. That means such setups hit the polling
>> mode when they don't really need to as described above.
>
> Curious, the polling mode is left until a new spurious IRQ is detected.

Yes, except the threshold was lowered from 100000 to 10 in the original
patch. Apparently that's too low.

>> 2) People, rightfully, complain that it makes inter-activity on their
>> desktop pretty laggy. The mouse pointer jumps around a lot and key
>> strokes are often missed. For a server class machine, I doubt it
>> would matter much but Fedora is essentially a desktop distro so that
>> tends to be a high priority.
>
> Again, I am a bit surprised. However, according to your patch:
>
>> +       if (!irq_poll_and_retry)
>> +               if (likely(desc->irq_count < 100000))
>> +                       return;
>> +       else
>> +               if (likely(desc->irq_count < 10))
>> +                       return;
>
> Don't you mean :
>
> +       if (!irq_poll_and_retry) {
> +               if (likely(desc->irq_count < 100000))
> +                       return;
> +       }
> +       else {
> +               if (likely(desc->irq_count < 10))
> +                       return;
> +       }

Indeed. Probably do.

josh

^ permalink raw reply [flat|nested] 40+ messages in thread
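The exchange above turns on C's dangling-else rule: without braces, an
else pairs with the nearest unmatched if, whatever the indentation
suggests. A minimal user-space sketch of the two readings follows. The
function names and the boolean return convention (true meaning that
note_interrupt() would take the early return) are illustrative
assumptions, not kernel code; only the two thresholds are taken from the
quoted patch. The first function is written with explicit braces so that
it shows how a compiler parses the unbraced original.

```c
#include <stdbool.h>

/*
 * How the unbraced version from the patch is actually parsed: the
 * 'else' belongs to the inner 'if'. With poll_and_retry set, neither
 * branch is evaluated and the function never returns early.
 */
bool returns_early_as_parsed(bool poll_and_retry, unsigned int irq_count)
{
	if (!poll_and_retry) {
		if (irq_count < 100000)
			return true;
		else if (irq_count < 10)	/* unreachable in practice */
			return true;
	}
	return false;
}

/* The braced version proposed in the reply: one threshold per mode. */
bool returns_early_as_intended(bool poll_and_retry, unsigned int irq_count)
{
	if (!poll_and_retry) {
		if (irq_count < 100000)
			return true;
	} else {
		if (irq_count < 10)
			return true;
	}
	return false;
}
```

In this sketch the two functions agree whenever poll_and_retry is false
and differ only when it is true and irq_count is below 10, which is the
case the quirk was meant to cover.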
* Re: Unhandled IRQs on AMD E-450
  2012-04-27 13:49 ` Josh Boyer
@ 2012-04-30  8:29 ` Jeroen Van den Keybus
  2012-04-30  9:57   ` Clemens Ladisch
  2012-04-30 10:21   ` Borislav Petkov
  0 siblings, 2 replies; 40+ messages in thread
From: Jeroen Van den Keybus @ 2012-04-30 8:29 UTC (permalink / raw)
  To: Josh Boyer
  Cc: Borislav Petkov, Clemens Ladisch, andymatei, Huang, Shane,
      Borislav Petkov, linux-kernel

> I'd have to go back and look through all the bug reports. Essentially,
> it wound up being "this used to work fine, now I get stuck in polling
> mode". When we dropped the patch in those cases, it went back to
> working fine.

>>> +       int unhandled_thresh = 999000;

>>> +       if (!irq_poll_and_retry)
>>> +               if (likely(desc->irq_count < 100000))
>>> +                       return;
>>> +       else
>>> +               if (likely(desc->irq_count < 10))
>>> +                       return;

Given these 2 bugs I wouldn't rule out that this code would have
malfunctioned, causing the problematic behaviour you describe. Are
these problematic machines still available ?

I have attached a patch for 3.2.16. It is essential that it is also
tested by people NOT having any problems. What the patch does:

- Detect stuck interrupts after a number (5) of consecutive unhandled
IRQs within a specified time frame (100ms).
- Start polling the handler for this interrupt at a preset (100Hz)
rate for a preset number of cycles (100).
- Retry and enable the interrupt line and start the cycle over.
- Report the first occurrences of this process for a preset number of times (5).
- Expose the state of stuck_count in /proc/irq/<irq>/spurious. It is
the number of times an interrupt was deemed stuck.
- Expose the state of stuck_level_max in /proc/irq/<irq>/spurious. It
indicates how many consecutive unhandled interrupts were detected
since the last reset (system startup or reenabling of the interrupt
line). It is intended for interrupt system diagnostics.

Using '$ cat /proc/irq/*/spurious' one can get a quick overview of the operating state of the patch: - Normal interrupts will have count=0 and level_max=0. Under some circumstances, a driver may have already performed work belonging to a later occurrence of an interrupt. In this case it may return reporting an unhandled event so level_max = 1. - The interrupts suffering from the problem allegedly caused by the ASM1083 PCI bridge will have a count > 0, but not increasing more rapidly than 1 / sec. - The interrupts which are genuinely stuck will show a steady (1 / sec) increase of count. This patch does not include modifications to drivers/pci/quirks.c to set the SATA interrupt mode of the Asus E450-M1 PRO to MSI. See earlier patches for that. Please test and/or comment. Signed-off-by: Jeroen Van den Keybus <jeroen.vandenkeybus@gmail.com> --- diff -upr linux-3.2.16.orig/include/linux/irqdesc.h linux-3.2.16.new/include/linux/irqdesc.h --- linux-3.2.16.orig/include/linux/irqdesc.h 2012-04-23 00:31:32.000000000 +0200 +++ linux-3.2.16.new/include/linux/irqdesc.h 2012-04-29 16:33:48.142332693 +0200 @@ -14,28 +14,29 @@ struct timer_rand_state; struct module; /** * struct irq_desc - interrupt descriptor - * @irq_data: per irq and chip data passed down to chip functions - * @timer_rand_state: pointer to timer rand state struct - * @kstat_irqs: irq stats per cpu - * @handle_irq: highlevel irq-events handler - * @preflow_handler: handler called before the flow handler (currently used by sparc) - * @action: the irq action chain - * @status: status information - * @core_internal_state__do_not_mess_with_it: core internal status information - * @depth: disable-depth, for nested irq_disable() calls - * @wake_depth: enable depth, for multiple irq_set_irq_wake() callers - * @irq_count: stats field to detect stalled irqs - * @last_unhandled: aging timer for unhandled count - * @irqs_unhandled: stats field for spurious unhandled interrupts - * @lock: locking for SMP - * @affinity_hint: 
hint to user space for preferred irq affinity - * @affinity_notify: context for notification of affinity changes - * @pending_mask: pending rebalanced interrupts - * @threads_oneshot: bitfield to handle shared oneshot threads - * @threads_active: number of irqaction threads currently running - * @wait_for_threads: wait queue for sync_irq to wait for threaded handlers - * @dir: /proc/irq/ procfs entry - * @name: flow handler name for /proc/interrupts output + * @irq_data: per irq and chip data passed down to chip functions + * @timer_rand_state: pointer to timer rand state struct + * @kstat_irqs: irq stats per cpu + * @handle_irq: highlevel irq-events handler + * @preflow_handler: handler called before the flow handler (currently used by sparc) + * @action: the irq action chain + * @status: status information + * @istate: core internal status information + * @depth: disable-depth, for nested irq_disable() calls + * @wake_depth: enable depth, for multiple irq_set_irq_wake() callers + * @irqs_stuck_count: stuck interrupt occurrence counter + * @irqs_stuck_level: used for stuck interrupt line detection and tracking poll cycle count + * @irqs_stuck_level_max: indicates the maximum irqs_stuck_level since last stuck interrupt occurrence + * @irqs_stuck_timeout: deadline for resetting irqs_stuck_level + * @lock: locking for SMP + * @affinity_hint: hint to user space for preferred irq affinity + * @affinity_notify: context for notification of affinity changes + * @pending_mask: pending rebalanced interrupts + * @threads_oneshot: bitfield to handle shared oneshot threads + * @threads_active: number of irqaction threads currently running + * @wait_for_threads: wait queue for sync_irq to wait for threaded handlers + * @dir: /proc/irq/ procfs entry + * @name: flow handler name for /proc/interrupts output */ struct irq_desc { struct irq_data irq_data; @@ -47,12 +48,13 @@ struct irq_desc { #endif struct irqaction *action; /* IRQ action list */ unsigned int status_use_accessors; - 
unsigned int core_internal_state__do_not_mess_with_it; + unsigned int istate; unsigned int depth; /* nested irq disables */ unsigned int wake_depth; /* nested wake enables */ - unsigned int irq_count; /* For detecting broken IRQs */ - unsigned long last_unhandled; /* Aging timer for unhandled count */ - unsigned int irqs_unhandled; + unsigned int irqs_stuck_count; + unsigned int irqs_stuck_level; + unsigned int irqs_stuck_level_max; + unsigned long irqs_stuck_timeout; raw_spinlock_t lock; struct cpumask *percpu_enabled; #ifdef CONFIG_SMP diff -upr linux-3.2.16.orig/kernel/irq/debug.h linux-3.2.16.new/kernel/irq/debug.h --- linux-3.2.16.orig/kernel/irq/debug.h 2012-04-23 00:31:32.000000000 +0200 +++ linux-3.2.16.new/kernel/irq/debug.h 2012-04-29 16:35:33.782919592 +0200 @@ -11,8 +11,9 @@ static inline void print_irq_desc(unsigned int irq, struct irq_desc *desc) { - printk("irq %d, desc: %p, depth: %d, count: %d, unhandled: %d\n", - irq, desc, desc->depth, desc->irq_count, desc->irqs_unhandled); + printk("irq %d, desc: %p, depth: %d, stuck_count: %d, stuck_level: %d, stuck_level_max: %d, stuck_timeout: %lu\n", + irq, desc, desc->depth, desc->irqs_stuck_count, desc->irqs_stuck_level, desc->irqs_stuck_level_max, + desc->irqs_stuck_timeout); printk("->handle_irq(): %p, ", desc->handle_irq); print_symbol("%s\n", (unsigned long)desc->handle_irq); printk("->irq_data.chip(): %p, ", desc->irq_data.chip); diff -upr linux-3.2.16.orig/kernel/irq/internals.h linux-3.2.16.new/kernel/irq/internals.h --- linux-3.2.16.orig/kernel/irq/internals.h 2012-04-23 00:31:32.000000000 +0200 +++ linux-3.2.16.new/kernel/irq/internals.h 2012-04-28 22:01:15.124279391 +0200 @@ -13,9 +13,7 @@ # define IRQ_BITMAP_BITS NR_IRQS #endif -#define istate core_internal_state__do_not_mess_with_it - -extern int noirqdebug; +extern bool noirqdebug; /* * Bits used by threaded handlers: diff -upr linux-3.2.16.orig/kernel/irq/irqdesc.c linux-3.2.16.new/kernel/irq/irqdesc.c --- 
linux-3.2.16.orig/kernel/irq/irqdesc.c 2012-04-23 00:31:32.000000000 +0200 +++ linux-3.2.16.new/kernel/irq/irqdesc.c 2012-04-29 16:00:01.792104426 +0200 @@ -84,8 +84,10 @@ static void desc_set_defaults(unsigned i irqd_set(&desc->irq_data, IRQD_IRQ_DISABLED); desc->handle_irq = handle_bad_irq; desc->depth = 1; - desc->irq_count = 0; - desc->irqs_unhandled = 0; + desc->irqs_stuck_count = 0; + desc->irqs_stuck_level = 0; + desc->irqs_stuck_level_max = 0; + desc->irqs_stuck_timeout = jiffies; desc->name = NULL; desc->owner = owner; for_each_possible_cpu(cpu) diff -upr linux-3.2.16.orig/kernel/irq/manage.c linux-3.2.16.new/kernel/irq/manage.c --- linux-3.2.16.orig/kernel/irq/manage.c 2012-04-23 00:31:32.000000000 +0200 +++ linux-3.2.16.new/kernel/irq/manage.c 2012-04-30 09:26:26.395627416 +0200 @@ -1087,8 +1087,10 @@ __setup_irq(unsigned int irq, struct irq *old_ptr = new; /* Reset broken irq detection when installing new handler */ - desc->irq_count = 0; - desc->irqs_unhandled = 0; + desc->irqs_stuck_count = 0; + desc->irqs_stuck_level = 0; + desc->irqs_stuck_level_max = 0; + desc->irqs_stuck_timeout = jiffies; /* * Check whether we disabled the irq via the spurious handler diff -upr linux-3.2.16.orig/kernel/irq/proc.c linux-3.2.16.new/kernel/irq/proc.c --- linux-3.2.16.orig/kernel/irq/proc.c 2012-04-23 00:31:32.000000000 +0200 +++ linux-3.2.16.new/kernel/irq/proc.c 2012-04-29 16:34:17.642434577 +0200 @@ -248,9 +248,9 @@ static int irq_spurious_proc_show(struct { struct irq_desc *desc = irq_to_desc((long) m->private); - seq_printf(m, "count %u\n" "unhandled %u\n" "last_unhandled %u ms\n", - desc->irq_count, desc->irqs_unhandled, - jiffies_to_msecs(desc->last_unhandled)); + seq_printf(m, "irq=%3d stuck_count=%3u stuck_level_max=%3u\n", + desc->irq_data.irq, + desc->irqs_stuck_count, desc->irqs_stuck_level_max); return 0; } diff -upr linux-3.2.16.orig/kernel/irq/spurious.c linux-3.2.16.new/kernel/irq/spurious.c --- linux-3.2.16.orig/kernel/irq/spurious.c 2012-04-23 
00:31:32.000000000 +0200 +++ linux-3.2.16.new/kernel/irq/spurious.c 2012-04-29 18:26:59.848127250 +0200 @@ -18,7 +18,12 @@ static int irqfixup __read_mostly; -#define POLL_SPURIOUS_IRQ_INTERVAL (HZ/10) +#define SPURIOUS_IRQ_TIMEOUT_INTERVAL (HZ/10) +#define SPURIOUS_IRQ_TRIGGER 5 +#define SPURIOUS_IRQ_REPORT_COUNT 5 +#define SPURIOUS_IRQ_POLL_CYCLES 100 +#define SPURIOUS_IRQ_POLL_INTERVAL (HZ/100) + static void poll_spurious_irqs(unsigned long dummy); static DEFINE_TIMER(poll_spurious_irq_timer, poll_spurious_irqs, 0, 0); static int irq_poll_cpu; @@ -141,14 +146,15 @@ out: static void poll_spurious_irqs(unsigned long dummy) { struct irq_desc *desc; - int i; + int i, poll_again; if (atomic_inc_return(&irq_poll_active) != 1) goto out; irq_poll_cpu = smp_processor_id(); + poll_again = 0; /* Will stay false as long as no polling candidate is found */ for_each_irq_desc(i, desc) { - unsigned int state; + unsigned int state, irq; if (!i) continue; @@ -158,15 +164,38 @@ static void poll_spurious_irqs(unsigned barrier(); if (!(state & IRQS_SPURIOUS_DISABLED)) continue; - - local_irq_disable(); - try_one_irq(i, desc, true); - local_irq_enable(); + + /* We end up here with a disabled stuck interrupt. 
+ desc->irqs_stuck_level now tracks the number of times + the interrupt has been polled */ + + irq = desc->irq_data.irq; + if (unlikely(desc->irqs_stuck_level == 1)) + if (desc->irqs_stuck_count <= SPURIOUS_IRQ_REPORT_COUNT) + printk(KERN_EMERG "Polling handlers for IRQ %d.\n", irq); + if (desc->irqs_stuck_level < SPURIOUS_IRQ_POLL_CYCLES) { + local_irq_disable(); + try_one_irq(i, desc, true); + local_irq_enable(); + desc->irqs_stuck_level++; + poll_again = 1; + } else { + if (unlikely(desc->irqs_stuck_count <= SPURIOUS_IRQ_REPORT_COUNT)) { + printk(KERN_EMERG "Reenabling IRQ %d.\n", irq); + if (desc->irqs_stuck_count >= SPURIOUS_IRQ_REPORT_COUNT) + printk(KERN_EMERG "No more stuck interrupt reports for IRQ %d.\n", irq); + } + irq_enable(desc); /* Reenable the interrupt line */ + desc->depth--; + desc->istate &= (~IRQS_SPURIOUS_DISABLED); + desc->irqs_stuck_level = 0; + } } + if (poll_again) + mod_timer(&poll_spurious_irq_timer, + jiffies + SPURIOUS_IRQ_POLL_INTERVAL); out: atomic_dec(&irq_poll_active); - mod_timer(&poll_spurious_irq_timer, - jiffies + POLL_SPURIOUS_IRQ_INTERVAL); } static inline int bad_action_ret(irqreturn_t action_ret) @@ -176,14 +205,6 @@ static inline int bad_action_ret(irqretu return 1; } -/* - * If 99,900 of the previous 100,000 interrupts have not been handled - * then assume that the IRQ is stuck in some manner. Drop a diagnostic - * and try to turn the IRQ off. 
- * - * (The other 100-of-100,000 interrupts may have been a correctly - * functioning device sharing an IRQ with the failing one) - */ static void __report_bad_irq(unsigned int irq, struct irq_desc *desc, irqreturn_t action_ret) @@ -272,7 +293,6 @@ void note_interrupt(unsigned int irq, st if (desc->istate & IRQS_POLL_INPROGRESS) return; - /* we get here again via the threaded handler */ if (action_ret == IRQ_WAKE_THREAD) return; @@ -281,55 +301,47 @@ void note_interrupt(unsigned int irq, st return; } - if (unlikely(action_ret == IRQ_NONE)) { - /* - * If we are seeing only the odd spurious IRQ caused by - * bus asynchronicity then don't eventually trigger an error, - * otherwise the counter becomes a doomsday timer for otherwise - * working systems - */ - if (time_after(jiffies, desc->last_unhandled + HZ/10)) - desc->irqs_unhandled = 1; - else - desc->irqs_unhandled++; - desc->last_unhandled = jiffies; - } - - if (unlikely(try_misrouted_irq(irq, desc, action_ret))) { - int ok = misrouted_irq(irq); - if (action_ret == IRQ_NONE) - desc->irqs_unhandled -= ok; + /* Adjust action_ret if an optional poll was successful. + (See inlined try_misrouted_irq() for conditions (depending + on 'irqfixup' and 'irqpoll'), and 'noirqdebug' must not + be set, since we wouldn't be here (note_interrupt()) + at all in that case.) 
*/ + if (unlikely(try_misrouted_irq(irq, desc, action_ret))) + if (misrouted_irq(irq)) + action_ret = IRQ_HANDLED; + + if (unlikely(action_ret == IRQ_NONE) && time_before(jiffies, desc->irqs_stuck_timeout)) { + desc->irqs_stuck_level++; + if (desc->irqs_stuck_level > desc->irqs_stuck_level_max) + desc->irqs_stuck_level_max = desc->irqs_stuck_level; + if (desc->irqs_stuck_level >= SPURIOUS_IRQ_TRIGGER) { /* The interrupt is stuck */ + desc->irqs_stuck_count++; /* TODO: Prevent hypothetical overflow */ + if (desc->irqs_stuck_count <= SPURIOUS_IRQ_REPORT_COUNT) { + __report_bad_irq(irq, desc, action_ret); + printk(KERN_EMERG "Disabling IRQ %d.\n", irq); + } + desc->istate |= IRQS_SPURIOUS_DISABLED; + desc->depth++; + irq_disable(desc); + /* TODO: Do a safe access to the timer. Now we may be extending a deadline + for a polling system already running for another interrupt. */ + mod_timer(&poll_spurious_irq_timer, + jiffies + SPURIOUS_IRQ_POLL_INTERVAL); /* Schedule a poll cycle */ + desc->irqs_stuck_level = 1; + desc->irqs_stuck_level_max = 0; + } } - - desc->irq_count++; - if (likely(desc->irq_count < 100000)) - return; - - desc->irq_count = 0; - if (unlikely(desc->irqs_unhandled > 99900)) { - /* - * The interrupt is stuck - */ - __report_bad_irq(irq, desc, action_ret); - /* - * Now kill the IRQ - */ - printk(KERN_EMERG "Disabling IRQ #%d\n", irq); - desc->istate |= IRQS_SPURIOUS_DISABLED; - desc->depth++; - irq_disable(desc); - - mod_timer(&poll_spurious_irq_timer, - jiffies + POLL_SPURIOUS_IRQ_INTERVAL); + else { + desc->irqs_stuck_timeout = jiffies + SPURIOUS_IRQ_TIMEOUT_INTERVAL; + desc->irqs_stuck_level = 0; } - desc->irqs_unhandled = 0; } -int noirqdebug __read_mostly; +bool noirqdebug __read_mostly; int noirqdebug_setup(char *str) { - noirqdebug = 1; + noirqdebug = true; printk(KERN_INFO "IRQ lockup detection disabled\n"); return 1; ^ permalink raw reply [flat|nested] 40+ messages in thread
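The detection rule in the note_interrupt() hunk above can be modelled in
user space as follows. This is a sketch under stated assumptions, not
the patch itself: the struct and function names are hypothetical, time
is a plain tick counter at one tick per millisecond, a plain '<' stands
in for time_before() (so counter wraparound is ignored), and the kernel
side effects (irq_disable(), mod_timer(), printk()) are left out.

```c
#include <stdbool.h>

#define MODEL_TIMEOUT_TICKS 100	/* stands in for SPURIOUS_IRQ_TIMEOUT_INTERVAL */
#define MODEL_TRIGGER       5	/* stands in for SPURIOUS_IRQ_TRIGGER */

/* Hypothetical mirror of the four irqs_stuck_* fields added to irq_desc. */
struct stuck_model {
	unsigned int stuck_count;
	unsigned int stuck_level;
	unsigned int stuck_level_max;
	unsigned long stuck_timeout;
};

/*
 * Feed one interrupt into the model. 'handled' is false when every
 * handler returned IRQ_NONE. Returns true when the line is declared
 * stuck, i.e. where the patch would disable it and start polling.
 */
bool model_note_interrupt(struct stuck_model *m, unsigned long now,
			  bool handled)
{
	if (!handled && now < m->stuck_timeout) {
		m->stuck_level++;
		if (m->stuck_level > m->stuck_level_max)
			m->stuck_level_max = m->stuck_level;
		if (m->stuck_level >= MODEL_TRIGGER) {
			m->stuck_count++;
			m->stuck_level = 1;	/* now counts poll cycles */
			m->stuck_level_max = 0;
			return true;
		}
	} else {
		/* Handled, or window expired: open a fresh window. */
		m->stuck_timeout = now + MODEL_TIMEOUT_TICKS;
		m->stuck_level = 0;
	}
	return false;
}
```

In this model the window is opened by the last handled interrupt, or by
the first unhandled one that arrives after the previous window expired,
and it is not extended by later unhandled interrupts. Five unhandled
interrupts must therefore all fall inside that single 100-tick window
for the line to be declared stuck.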
* Re: Unhandled IRQs on AMD E-450
  2012-04-30  8:29 ` Jeroen Van den Keybus
@ 2012-04-30  9:57 ` Clemens Ladisch
  2012-04-30 10:41   ` Jeroen Van den Keybus
  2012-04-30 10:21 ` Borislav Petkov
  1 sibling, 1 reply; 40+ messages in thread
From: Clemens Ladisch @ 2012-04-30 9:57 UTC (permalink / raw)
  To: Jeroen Van den Keybus
  Cc: Josh Boyer, Borislav Petkov, andymatei, Huang, Shane,
      Borislav Petkov, linux-kernel

Jeroen Van den Keybus wrote:
> - Detect stuck interrupts after a number (5) of consecutive unhandled
> IRQs within a specified time frame (100ms).

Why 5? This threshold is likely to be too low; fast consecutive interrupts
can easily happen more often with a very busy device, while an actual stuck
interrupt will call the handler in an endless loop and very quickly result
in many thousands of calls.

> --- linux-3.2.16.orig/include/linux/irqdesc.h 2012-04-23
> 00:31:32.000000000 +0200

Your mailer wraps lines; see Documentation/email-clients.txt.

Regards,
Clemens

^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Unhandled IRQs on AMD E-450
  2012-04-30  9:57 ` Clemens Ladisch
@ 2012-04-30 10:41 ` Jeroen Van den Keybus
  2012-04-30 12:47   ` Clemens Ladisch
  2012-05-29 22:20   ` Grant Likely
  0 siblings, 2 replies; 40+ messages in thread
From: Jeroen Van den Keybus @ 2012-04-30 10:41 UTC (permalink / raw)
  To: Clemens Ladisch
  Cc: Josh Boyer, Borislav Petkov, andymatei, Huang, Shane,
      Borislav Petkov, linux-kernel, Linus Torvalds, Thomas Gleixner

> Why 5? This threshold is likely to be too low; fast consecutive interrupts
> can easily happen more often with a very busy device, while an actual stuck
> interrupt will call the handler in an endless loop and very quickly result
> in many thousands of calls.

Well, 5 works fine on any machine I have tested so far. I'd like to
keep this number as low as possible in case a genuine stuck interrupt
is encountered. Computers are powerful, but I'm reluctant to spill
cycles and power.

Also, on an unshared interrupt line, unhandled IRQs should never
happen in succession. No work to be done by a handler should be the
result of acknowledging early and getting a new interrupt when work
grows in the meantime. After the resulting idle run there's no way a
properly working driver could end up being interrupted again for no
reason (aside from broken drivers and broken hardware, i.e. hardware
emitting MSIs without getting acknowledgement). Am I right ?

For shared IRQs unhandled IRQs may indeed be encountered. For this
reason, I set SPURIOUS_IRQ_TRIGGER to 5. Of course, even if it
misfires, we're back on track in a second.

On the other hand, setting it temporarily to a high value has the
benefit of being able to look at /proc/irq/.../spurious and see how
high level_max has gotten on a variety of machines. What would then be
a sensible number here ?

Also, FYI, here's the result of '$ cat /proc/irq/*/spurious' on the
E45M1-M PRO. IRQ45 is the AHCI handler and IRQ16 belongs to a device
behind the ASM1083. It is the Firewire chip emitting an interrupt
roughly every minute. When it misses, it is clearly seen how a new
PCIe assert/deassert message pair manages to reset the stuck line. In
this case, the system has switched 81 times in succession to polling
mode.

irq=  0 stuck_count=  0 stuck_level_max=  0
irq= 10 stuck_count=  0 stuck_level_max=  0
irq= 11 stuck_count=  0 stuck_level_max=  0
irq= 12 stuck_count=  0 stuck_level_max=  0
irq= 13 stuck_count=  0 stuck_level_max=  0
irq= 14 stuck_count=  0 stuck_level_max=  0
irq= 15 stuck_count=  0 stuck_level_max=  0
irq= 16 stuck_count= 81 stuck_level_max=  0
irq= 17 stuck_count=  0 stuck_level_max=  0
irq= 18 stuck_count=  0 stuck_level_max=  0
irq= 19 stuck_count=  0 stuck_level_max=  0
irq=  1 stuck_count=  0 stuck_level_max=  0
irq=  2 stuck_count=  0 stuck_level_max=  0
irq=  3 stuck_count=  0 stuck_level_max=  0
irq= 40 stuck_count=  0 stuck_level_max=  0
irq= 41 stuck_count=  0 stuck_level_max=  0
irq= 42 stuck_count=  0 stuck_level_max=  0
irq= 43 stuck_count=  0 stuck_level_max=  0
irq= 44 stuck_count=  0 stuck_level_max=  0
irq= 45 stuck_count=  0 stuck_level_max=  1
irq= 46 stuck_count=  0 stuck_level_max=  0
irq= 47 stuck_count=  0 stuck_level_max=  0
irq=  4 stuck_count=  0 stuck_level_max=  0
irq=  5 stuck_count=  0 stuck_level_max=  0
irq=  6 stuck_count=  0 stuck_level_max=  0
irq=  7 stuck_count=  0 stuck_level_max=  0
irq=  8 stuck_count=  0 stuck_level_max=  0
irq=  9 stuck_count=  0 stuck_level_max=  0

>> --- linux-3.2.16.orig/include/linux/irqdesc.h 2012-04-23
>> 00:31:32.000000000 +0200
>
> Your mailer wraps lines; see Documentation/email-clients.txt.

Great. I only have gmail accounts. Documentation states it won't work
with gmail. Any suggestions ?

Jeroen.

^ permalink raw reply [flat|nested] 40+ messages in thread
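For scripting against output like the listing above, a single line can
be split into its three fields with sscanf(). A minimal sketch, assuming
the line layout produced by the seq_printf() format string in the patch
("irq=%3d stuck_count=%3u stuck_level_max=%3u"); the struct and function
names are hypothetical and not part of the patch.

```c
#include <stdio.h>

/* Hypothetical holder for one line of /proc/irq/<irq>/spurious. */
struct spurious_line {
	int irq;
	unsigned int stuck_count;
	unsigned int stuck_level_max;
};

/*
 * Parse one line in the patched format. The %d and %u conversions skip
 * the padding spaces emitted by the %3d/%3u output fields.
 * Returns 0 on success, -1 if the line does not match.
 */
int parse_spurious_line(const char *line, struct spurious_line *out)
{
	int n = sscanf(line, "irq=%d stuck_count=%u stuck_level_max=%u",
		       &out->irq, &out->stuck_count, &out->stuck_level_max);

	return n == 3 ? 0 : -1;
}
```

Going by the patch description, a line with stuck_count above zero marks
an interrupt that was deemed stuck at least once, so a caller would
typically flag those lines and watch whether the count keeps rising.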
* Re: Unhandled IRQs on AMD E-450
  2012-04-30 10:41 ` Jeroen Van den Keybus
@ 2012-04-30 12:47 ` Clemens Ladisch
  2012-05-29 22:20 ` Grant Likely
  1 sibling, 0 replies; 40+ messages in thread
From: Clemens Ladisch @ 2012-04-30 12:47 UTC (permalink / raw)
  To: Jeroen Van den Keybus
  Cc: Josh Boyer, Borislav Petkov, andymatei, Huang, Shane,
      Borislav Petkov, linux-kernel, Linus Torvalds, Thomas Gleixner

Jeroen Van den Keybus wrote:
>> Why 5? This threshold is likely to be too low; fast consecutive interrupts
>> can easily happen more often with a very busy device, while an actual stuck
>> interrupt will call the handler in an endless loop and very quickly result
>> in many thousands of calls.
>
> Well, 5 works fine on any machine I have tested so far. I'd like to
> keep this number as low as possible in case a genuine stuck interrupt
> is encountered. Computers are powerful, but I'm reluctant to spill
> cycles and power.

A stuck interrupt is the consequence of a bug, there is no need to
compromise just to optimize for this situation (especially as even
with your broken hardware and the patch, stuck interrupts happen no
more than once per second).

> Also, on an unshared interrupt line, unhandled IRQs should never
> happen in succession.

Indeed. But this is because pending interrupts are not queued but
simply noted with a boolean.

> ... broken hardware, i.e. hardware emitting MSIs without getting
> acknowledgement). Am I right ?

Level-triggered interrupts would need acknowledgements to deactivate
the interrupt line, but MSIs do not. Hardware that is designed to take
advantage of this indeed works this way; this avoids the need for any
MMIO accesses in the interrupt handler.

Regards,
Clemens

^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Unhandled IRQs on AMD E-450
  2012-04-30 10:41 ` Jeroen Van den Keybus
  2012-04-30 12:47   ` Clemens Ladisch
@ 2012-05-29 22:20 ` Grant Likely
  1 sibling, 0 replies; 40+ messages in thread
From: Grant Likely @ 2012-05-29 22:20 UTC (permalink / raw)
  To: Jeroen Van den Keybus, Clemens Ladisch
  Cc: Josh Boyer, Borislav Petkov, andymatei, Huang, Shane,
      Borislav Petkov, linux-kernel, Linus Torvalds, Thomas Gleixner

On Mon, 30 Apr 2012 12:41:24 +0200, Jeroen Van den Keybus
<jeroen.vandenkeybus@gmail.com> wrote:
> >> --- linux-3.2.16.orig/include/linux/irqdesc.h 2012-04-23
> >> 00:31:32.000000000 +0200
> >
> > Your mailer wraps lines; see Documentation/email-clients.txt.
>
> Great. I only have gmail accounts. Documentation states it won't work
> with gmail. Any suggestions ?

Only the web interface is the problem. You need to send your mail
using SMTP from a mail client. The best way is to use the git
send-email tool. I've not tried the following, but it looks like it
should work (I personally use a postfix spooler so I can queue up sent
mail while I'm offline, but that's more complex to set up):

http://morefedora.blogspot.com/2009/02/configuring-git-send-email-to-use-gmail.html

g.

^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Unhandled IRQs on AMD E-450 2012-04-30 8:29 ` Jeroen Van den Keybus 2012-04-30 9:57 ` Clemens Ladisch @ 2012-04-30 10:21 ` Borislav Petkov 2012-04-30 11:35 ` Jeroen Van den Keybus 1 sibling, 1 reply; 40+ messages in thread From: Borislav Petkov @ 2012-04-30 10:21 UTC (permalink / raw) To: Jeroen Van den Keybus, Linus Torvalds, Thomas Gleixner Cc: Josh Boyer, Clemens Ladisch, andymatei, Huang, Shane, linux-kernel Before you guys get more involved in this, it would be probably prudent to run this approach by tglx and Linus to check its general sanity :-). Added. On Mon, Apr 30, 2012 at 10:29:19AM +0200, Jeroen Van den Keybus wrote: > > I'd have to go back and look through all the bug reports. Essentially, > > it wound up being "this used to work fine, now I get stuck in polling > > mode". When we dropped the patch in those cases, it went back to > > working fine. > > >>> + int unhandled_thresh = 999000; > > >>> + if (!irq_poll_and_retry) > >>> + if (likely(desc->irq_count < 100000)) > >>> + return; > >>> + else > >>> + if (likely(desc->irq_count < 10)) > >>> + return; > > Given these 2 bugs I wouldn't rule out that this code would have > malfunctioned, causing the problematic behaviour you describe. Are > these problematic machines still available ? > > I have attached a patch for 3.2.16. It is essential that it is also > tested by people NOT having any problems. What the patch does: > > - Detect stuck interrupts after a number (5) of consecutive unhandled > IRQs within a specified time frame (100ms). > - Start polling the handler for this interrupt at a preset (100Hz) > rate for a preset number of cycles (100). > - Retry and enable the interrupt line and start the cycle over. > - Report the first occurrences of this process for a preset number of times (5). > - Expose the state of stuck_count in /proc/irq/<irq>/spurious. It is > the number of times an interrupt was deemed stuck. > - Expose the state of stuck_level_max in /proc/irq/<irq>/spurious. 
It > indicates how many consecutive unhandled interrupts were detected > since the last reset (system startup or reenabling of the interrupt > line). It is intended for interrupt system diagnostics. > > Using '$ cat /proc/irq/*/spurious' one can get a quick overview of the > operating state of the patch: > > - Normal interrupts will have count=0 and level_max=0. Under some > circumstances, a driver may have already performed work belonging to a > later occurrence of an interrupt. In this case it may return reporting > an unhandled event so level_max = 1. > - The interrupts suffering from the problem allegedly caused by the > ASM1083 PCI bridge will have a count > 0, but not increasing more > rapidly than 1 / sec. > - The interrupts which are genuinely stuck will show a steady (1 / > sec) increase of count. > > This patch does not include modifications to drivers/pci/quirks.c to > set the SATA interrupt mode of the Asus E450-M1 PRO to MSI. See > earlier patches for that. > > Please test and/or comment. 
> > > Signed-off-by: Jeroen Van den Keybus <jeroen.vandenkeybus@gmail.com> > --- > diff -upr linux-3.2.16.orig/include/linux/irqdesc.h > linux-3.2.16.new/include/linux/irqdesc.h > --- linux-3.2.16.orig/include/linux/irqdesc.h 2012-04-23 > 00:31:32.000000000 +0200 > +++ linux-3.2.16.new/include/linux/irqdesc.h 2012-04-29 16:33:48.142332693 +0200 > @@ -14,28 +14,29 @@ struct timer_rand_state; > struct module; > /** > * struct irq_desc - interrupt descriptor > - * @irq_data: per irq and chip data passed down to chip functions > - * @timer_rand_state: pointer to timer rand state struct > - * @kstat_irqs: irq stats per cpu > - * @handle_irq: highlevel irq-events handler > - * @preflow_handler: handler called before the flow handler > (currently used by sparc) > - * @action: the irq action chain > - * @status: status information > - * @core_internal_state__do_not_mess_with_it: core internal status information > - * @depth: disable-depth, for nested irq_disable() calls > - * @wake_depth: enable depth, for multiple irq_set_irq_wake() callers > - * @irq_count: stats field to detect stalled irqs > - * @last_unhandled: aging timer for unhandled count > - * @irqs_unhandled: stats field for spurious unhandled interrupts > - * @lock: locking for SMP > - * @affinity_hint: hint to user space for preferred irq affinity > - * @affinity_notify: context for notification of affinity changes > - * @pending_mask: pending rebalanced interrupts > - * @threads_oneshot: bitfield to handle shared oneshot threads > - * @threads_active: number of irqaction threads currently running > - * @wait_for_threads: wait queue for sync_irq to wait for threaded handlers > - * @dir: /proc/irq/ procfs entry > - * @name: flow handler name for /proc/interrupts output > + * @irq_data: per irq and chip data passed down to chip functions > + * @timer_rand_state: pointer to timer rand state struct > + * @kstat_irqs: irq stats per cpu > + * @handle_irq: highlevel irq-events handler > + * @preflow_handler: handler 
called before the flow handler > (currently used by sparc) > + * @action: the irq action chain > + * @status: status information > + * @istate: core internal status information > + * @depth: disable-depth, for nested irq_disable() calls > + * @wake_depth: enable depth, for multiple irq_set_irq_wake() callers > + * @irqs_stuck_count: stuck interrupt occurrence counter > + * @irqs_stuck_level: used for stuck interrupt line detection and > tracking poll cycle count > + * @irqs_stuck_level_max: indicates the maximum irqs_stuck_level > since last stuck interrupt occurrence > + * @irqs_stuck_timeout: deadline for resetting irqs_stuck_level > + * @lock: locking for SMP > + * @affinity_hint: hint to user space for preferred irq affinity > + * @affinity_notify: context for notification of affinity changes > + * @pending_mask: pending rebalanced interrupts > + * @threads_oneshot: bitfield to handle shared oneshot threads > + * @threads_active: number of irqaction threads currently running > + * @wait_for_threads: wait queue for sync_irq to wait for threaded handlers > + * @dir: /proc/irq/ procfs entry > + * @name: flow handler name for /proc/interrupts output > */ > struct irq_desc { > struct irq_data irq_data; > @@ -47,12 +48,13 @@ struct irq_desc { > #endif > struct irqaction *action; /* IRQ action list */ > unsigned int status_use_accessors; > - unsigned int core_internal_state__do_not_mess_with_it; > + unsigned int istate; > unsigned int depth; /* nested irq disables */ > unsigned int wake_depth; /* nested wake enables */ > - unsigned int irq_count; /* For detecting broken IRQs */ > - unsigned long last_unhandled; /* Aging timer for unhandled count */ > - unsigned int irqs_unhandled; > + unsigned int irqs_stuck_count; > + unsigned int irqs_stuck_level; > + unsigned int irqs_stuck_level_max; > + unsigned long irqs_stuck_timeout; > raw_spinlock_t lock; > struct cpumask *percpu_enabled; > #ifdef CONFIG_SMP > diff -upr linux-3.2.16.orig/kernel/irq/debug.h > 
linux-3.2.16.new/kernel/irq/debug.h > --- linux-3.2.16.orig/kernel/irq/debug.h 2012-04-23 00:31:32.000000000 +0200 > +++ linux-3.2.16.new/kernel/irq/debug.h 2012-04-29 16:35:33.782919592 +0200 > @@ -11,8 +11,9 @@ > > static inline void print_irq_desc(unsigned int irq, struct irq_desc *desc) > { > - printk("irq %d, desc: %p, depth: %d, count: %d, unhandled: %d\n", > - irq, desc, desc->depth, desc->irq_count, desc->irqs_unhandled); > + printk("irq %d, desc: %p, depth: %d, stuck_count: %d, stuck_level: > %d, stuck_level_max: %d, stuck_timeout: %lu\n", > + irq, desc, desc->depth, desc->irqs_stuck_count, > desc->irqs_stuck_level, desc->irqs_stuck_level_max, > + desc->irqs_stuck_timeout); > printk("->handle_irq(): %p, ", desc->handle_irq); > print_symbol("%s\n", (unsigned long)desc->handle_irq); > printk("->irq_data.chip(): %p, ", desc->irq_data.chip); > diff -upr linux-3.2.16.orig/kernel/irq/internals.h > linux-3.2.16.new/kernel/irq/internals.h > --- linux-3.2.16.orig/kernel/irq/internals.h 2012-04-23 00:31:32.000000000 +0200 > +++ linux-3.2.16.new/kernel/irq/internals.h 2012-04-28 22:01:15.124279391 +0200 > @@ -13,9 +13,7 @@ > # define IRQ_BITMAP_BITS NR_IRQS > #endif > > -#define istate core_internal_state__do_not_mess_with_it > - > -extern int noirqdebug; > +extern bool noirqdebug; > > /* > * Bits used by threaded handlers: > diff -upr linux-3.2.16.orig/kernel/irq/irqdesc.c > linux-3.2.16.new/kernel/irq/irqdesc.c > --- linux-3.2.16.orig/kernel/irq/irqdesc.c 2012-04-23 00:31:32.000000000 +0200 > +++ linux-3.2.16.new/kernel/irq/irqdesc.c 2012-04-29 16:00:01.792104426 +0200 > @@ -84,8 +84,10 @@ static void desc_set_defaults(unsigned i > irqd_set(&desc->irq_data, IRQD_IRQ_DISABLED); > desc->handle_irq = handle_bad_irq; > desc->depth = 1; > - desc->irq_count = 0; > - desc->irqs_unhandled = 0; > + desc->irqs_stuck_count = 0; > + desc->irqs_stuck_level = 0; > + desc->irqs_stuck_level_max = 0; > + desc->irqs_stuck_timeout = jiffies; > desc->name = NULL; > desc->owner = 
owner; > for_each_possible_cpu(cpu) > diff -upr linux-3.2.16.orig/kernel/irq/manage.c > linux-3.2.16.new/kernel/irq/manage.c > --- linux-3.2.16.orig/kernel/irq/manage.c 2012-04-23 00:31:32.000000000 +0200 > +++ linux-3.2.16.new/kernel/irq/manage.c 2012-04-30 09:26:26.395627416 +0200 > @@ -1087,8 +1087,10 @@ __setup_irq(unsigned int irq, struct irq > *old_ptr = new; > > /* Reset broken irq detection when installing new handler */ > - desc->irq_count = 0; > - desc->irqs_unhandled = 0; > + desc->irqs_stuck_count = 0; > + desc->irqs_stuck_level = 0; > + desc->irqs_stuck_level_max = 0; > + desc->irqs_stuck_timeout = jiffies; > > /* > * Check whether we disabled the irq via the spurious handler > diff -upr linux-3.2.16.orig/kernel/irq/proc.c linux-3.2.16.new/kernel/irq/proc.c > --- linux-3.2.16.orig/kernel/irq/proc.c 2012-04-23 00:31:32.000000000 +0200 > +++ linux-3.2.16.new/kernel/irq/proc.c 2012-04-29 16:34:17.642434577 +0200 > @@ -248,9 +248,9 @@ static int irq_spurious_proc_show(struct > { > struct irq_desc *desc = irq_to_desc((long) m->private); > > - seq_printf(m, "count %u\n" "unhandled %u\n" "last_unhandled %u ms\n", > - desc->irq_count, desc->irqs_unhandled, > - jiffies_to_msecs(desc->last_unhandled)); > + seq_printf(m, "irq=%3d stuck_count=%3u stuck_level_max=%3u\n", > + desc->irq_data.irq, > + desc->irqs_stuck_count, desc->irqs_stuck_level_max); > return 0; > } > > diff -upr linux-3.2.16.orig/kernel/irq/spurious.c > linux-3.2.16.new/kernel/irq/spurious.c > --- linux-3.2.16.orig/kernel/irq/spurious.c 2012-04-23 00:31:32.000000000 +0200 > +++ linux-3.2.16.new/kernel/irq/spurious.c 2012-04-29 18:26:59.848127250 +0200 > @@ -18,7 +18,12 @@ > > static int irqfixup __read_mostly; > > -#define POLL_SPURIOUS_IRQ_INTERVAL (HZ/10) > +#define SPURIOUS_IRQ_TIMEOUT_INTERVAL (HZ/10) > +#define SPURIOUS_IRQ_TRIGGER 5 > +#define SPURIOUS_IRQ_REPORT_COUNT 5 > +#define SPURIOUS_IRQ_POLL_CYCLES 100 > +#define SPURIOUS_IRQ_POLL_INTERVAL (HZ/100) > + > static void 
poll_spurious_irqs(unsigned long dummy); > static DEFINE_TIMER(poll_spurious_irq_timer, poll_spurious_irqs, 0, 0); > static int irq_poll_cpu; > @@ -141,14 +146,15 @@ out: > static void poll_spurious_irqs(unsigned long dummy) > { > struct irq_desc *desc; > - int i; > + int i, poll_again; > > if (atomic_inc_return(&irq_poll_active) != 1) > goto out; > irq_poll_cpu = smp_processor_id(); > > + poll_again = 0; /* Will stay false as long as no polling candidate is found */ > for_each_irq_desc(i, desc) { > - unsigned int state; > + unsigned int state, irq; > > if (!i) > continue; > @@ -158,15 +164,38 @@ static void poll_spurious_irqs(unsigned > barrier(); > if (!(state & IRQS_SPURIOUS_DISABLED)) > continue; > - > - local_irq_disable(); > - try_one_irq(i, desc, true); > - local_irq_enable(); > + > + /* We end up here with a disabled stuck interrupt. > + desc->irqs_stuck_level now tracks the number of times > + the interrupt has been polled */ > + > + irq = desc->irq_data.irq; > + if (unlikely(desc->irqs_stuck_level == 1)) > + if (desc->irqs_stuck_count <= SPURIOUS_IRQ_REPORT_COUNT) > + printk(KERN_EMERG "Polling handlers for IRQ %d.\n", irq); > + if (desc->irqs_stuck_level < SPURIOUS_IRQ_POLL_CYCLES) { > + local_irq_disable(); > + try_one_irq(i, desc, true); > + local_irq_enable(); > + desc->irqs_stuck_level++; > + poll_again = 1; > + } else { > + if (unlikely(desc->irqs_stuck_count <= SPURIOUS_IRQ_REPORT_COUNT)) { > + printk(KERN_EMERG "Reenabling IRQ %d.\n", irq); > + if (desc->irqs_stuck_count >= SPURIOUS_IRQ_REPORT_COUNT) > + printk(KERN_EMERG "No more stuck interrupt reports for IRQ %d.\n", irq); > + } > + irq_enable(desc); /* Reenable the interrupt line */ > + desc->depth--; > + desc->istate &= (~IRQS_SPURIOUS_DISABLED); > + desc->irqs_stuck_level = 0; > + } > } > + if (poll_again) > + mod_timer(&poll_spurious_irq_timer, > + jiffies + SPURIOUS_IRQ_POLL_INTERVAL); > out: > atomic_dec(&irq_poll_active); > - mod_timer(&poll_spurious_irq_timer, > - jiffies + 
POLL_SPURIOUS_IRQ_INTERVAL); > } > > static inline int bad_action_ret(irqreturn_t action_ret) > @@ -176,14 +205,6 @@ static inline int bad_action_ret(irqretu > return 1; > } > > -/* > - * If 99,900 of the previous 100,000 interrupts have not been handled > - * then assume that the IRQ is stuck in some manner. Drop a diagnostic > - * and try to turn the IRQ off. > - * > - * (The other 100-of-100,000 interrupts may have been a correctly > - * functioning device sharing an IRQ with the failing one) > - */ > static void > __report_bad_irq(unsigned int irq, struct irq_desc *desc, > irqreturn_t action_ret) > @@ -272,7 +293,6 @@ void note_interrupt(unsigned int irq, st > if (desc->istate & IRQS_POLL_INPROGRESS) > return; > > - /* we get here again via the threaded handler */ > if (action_ret == IRQ_WAKE_THREAD) > return; > > @@ -281,55 +301,47 @@ void note_interrupt(unsigned int irq, st > return; > } > > - if (unlikely(action_ret == IRQ_NONE)) { > - /* > - * If we are seeing only the odd spurious IRQ caused by > - * bus asynchronicity then don't eventually trigger an error, > - * otherwise the counter becomes a doomsday timer for otherwise > - * working systems > - */ > - if (time_after(jiffies, desc->last_unhandled + HZ/10)) > - desc->irqs_unhandled = 1; > - else > - desc->irqs_unhandled++; > - desc->last_unhandled = jiffies; > - } > - > - if (unlikely(try_misrouted_irq(irq, desc, action_ret))) { > - int ok = misrouted_irq(irq); > - if (action_ret == IRQ_NONE) > - desc->irqs_unhandled -= ok; > + /* Adjust action_ret if an optional poll was successful. > + (See inlined try_misrouted_irq() for conditions (depending > + on 'irqfixup' and 'irqpoll'), and 'noirqdebug' must not > + be set, since we wouldn't be here (note_interrupt()) > + at all in that case.) 
*/ > + if (unlikely(try_misrouted_irq(irq, desc, action_ret))) > + if (misrouted_irq(irq)) > + action_ret = IRQ_HANDLED; > + > + if (unlikely(action_ret == IRQ_NONE) && time_before(jiffies, > desc->irqs_stuck_timeout)) { > + desc->irqs_stuck_level++; > + if (desc->irqs_stuck_level > desc->irqs_stuck_level_max) > + desc->irqs_stuck_level_max = desc->irqs_stuck_level; > + if (desc->irqs_stuck_level >= SPURIOUS_IRQ_TRIGGER) { /* The > interrupt is stuck */ > + desc->irqs_stuck_count++; /* TODO: Prevent hypothetical overflow */ > + if (desc->irqs_stuck_count <= SPURIOUS_IRQ_REPORT_COUNT) { > + __report_bad_irq(irq, desc, action_ret); > + printk(KERN_EMERG "Disabling IRQ %d.\n", irq); > + } > + desc->istate |= IRQS_SPURIOUS_DISABLED; > + desc->depth++; > + irq_disable(desc); > + /* TODO: Do a safe access to the timer. Now we may be extending a deadline > + for a polling system already running for another interrupt. */ > + mod_timer(&poll_spurious_irq_timer, > + jiffies + SPURIOUS_IRQ_POLL_INTERVAL); /* Schedule a poll cycle */ > + desc->irqs_stuck_level = 1; > + desc->irqs_stuck_level_max = 0; > + } > } > - > - desc->irq_count++; > - if (likely(desc->irq_count < 100000)) > - return; > - > - desc->irq_count = 0; > - if (unlikely(desc->irqs_unhandled > 99900)) { > - /* > - * The interrupt is stuck > - */ > - __report_bad_irq(irq, desc, action_ret); > - /* > - * Now kill the IRQ > - */ > - printk(KERN_EMERG "Disabling IRQ #%d\n", irq); > - desc->istate |= IRQS_SPURIOUS_DISABLED; > - desc->depth++; > - irq_disable(desc); > - > - mod_timer(&poll_spurious_irq_timer, > - jiffies + POLL_SPURIOUS_IRQ_INTERVAL); > + else { > + desc->irqs_stuck_timeout = jiffies + SPURIOUS_IRQ_TIMEOUT_INTERVAL; > + desc->irqs_stuck_level = 0; > } > - desc->irqs_unhandled = 0; > } > > -int noirqdebug __read_mostly; > +bool noirqdebug __read_mostly; > > int noirqdebug_setup(char *str) > { > - noirqdebug = 1; > + noirqdebug = true; > printk(KERN_INFO "IRQ lockup detection disabled\n"); > > return 1; 
>

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632
WEEE Registernr: 129 19551
* Re: Unhandled IRQs on AMD E-450 2012-04-30 10:21 ` Borislav Petkov @ 2012-04-30 11:35 ` Jeroen Van den Keybus 2012-05-29 23:36 ` Grant Likely 2012-05-30 0:07 ` Thomas Gleixner 0 siblings, 2 replies; 40+ messages in thread From: Jeroen Van den Keybus @ 2012-04-30 11:35 UTC (permalink / raw) To: Borislav Petkov Cc: Linus Torvalds, Thomas Gleixner, Josh Boyer, Clemens Ladisch, andymatei, Huang, Shane, linux-kernel (another try at supplying a noncorrupt patch with alpine/Gmail - sorry for the inconvenience) --- Signed-off-by: Jeroen Van den Keybus <jeroen.vandenkeybus@gmail.com> diff -upr linux-3.2.16.orig/include/linux/irqdesc.h linux-3.2.16.new/include/linux/irqdesc.h --- linux-3.2.16.orig/include/linux/irqdesc.h 2012-04-23 00:31:32.000000000 +0200 +++ linux-3.2.16.new/include/linux/irqdesc.h 2012-04-29 16:33:48.142332693 +0200 @@ -14,28 +14,29 @@ struct timer_rand_state; struct module; /** * struct irq_desc - interrupt descriptor - * @irq_data: per irq and chip data passed down to chip functions - * @timer_rand_state: pointer to timer rand state struct - * @kstat_irqs: irq stats per cpu - * @handle_irq: highlevel irq-events handler - * @preflow_handler: handler called before the flow handler (currently used by sparc) - * @action: the irq action chain - * @status: status information - * @core_internal_state__do_not_mess_with_it: core internal status information - * @depth: disable-depth, for nested irq_disable() calls - * @wake_depth: enable depth, for multiple irq_set_irq_wake() callers - * @irq_count: stats field to detect stalled irqs - * @last_unhandled: aging timer for unhandled count - * @irqs_unhandled: stats field for spurious unhandled interrupts - * @lock: locking for SMP - * @affinity_hint: hint to user space for preferred irq affinity - * @affinity_notify: context for notification of affinity changes - * @pending_mask: pending rebalanced interrupts - * @threads_oneshot: bitfield to handle shared oneshot threads - * @threads_active: number of irqaction 
threads currently running - * @wait_for_threads: wait queue for sync_irq to wait for threaded handlers - * @dir: /proc/irq/ procfs entry - * @name: flow handler name for /proc/interrupts output + * @irq_data: per irq and chip data passed down to chip functions + * @timer_rand_state: pointer to timer rand state struct + * @kstat_irqs: irq stats per cpu + * @handle_irq: highlevel irq-events handler + * @preflow_handler: handler called before the flow handler (currently used by sparc) + * @action: the irq action chain + * @status: status information + * @istate: core internal status information + * @depth: disable-depth, for nested irq_disable() calls + * @wake_depth: enable depth, for multiple irq_set_irq_wake() callers + * @irqs_stuck_count: stuck interrupt occurrence counter + * @irqs_stuck_level: used for stuck interrupt line detection and tracking poll cycle count + * @irqs_stuck_level_max: indicates the maximum irqs_stuck_level since last stuck interrupt occurrence + * @irqs_stuck_timeout: deadline for resetting irqs_stuck_level + * @lock: locking for SMP + * @affinity_hint: hint to user space for preferred irq affinity + * @affinity_notify: context for notification of affinity changes + * @pending_mask: pending rebalanced interrupts + * @threads_oneshot: bitfield to handle shared oneshot threads + * @threads_active: number of irqaction threads currently running + * @wait_for_threads: wait queue for sync_irq to wait for threaded handlers + * @dir: /proc/irq/ procfs entry + * @name: flow handler name for /proc/interrupts output */ struct irq_desc { struct irq_data irq_data; @@ -47,12 +48,13 @@ struct irq_desc { #endif struct irqaction *action; /* IRQ action list */ unsigned int status_use_accessors; - unsigned int core_internal_state__do_not_mess_with_it; + unsigned int istate; unsigned int depth; /* nested irq disables */ unsigned int wake_depth; /* nested wake enables */ - unsigned int irq_count; /* For detecting broken IRQs */ - unsigned long last_unhandled; 
/* Aging timer for unhandled count */ - unsigned int irqs_unhandled; + unsigned int irqs_stuck_count; + unsigned int irqs_stuck_level; + unsigned int irqs_stuck_level_max; + unsigned long irqs_stuck_timeout; raw_spinlock_t lock; struct cpumask *percpu_enabled; #ifdef CONFIG_SMP diff -upr linux-3.2.16.orig/kernel/irq/debug.h linux-3.2.16.new/kernel/irq/debug.h --- linux-3.2.16.orig/kernel/irq/debug.h 2012-04-23 00:31:32.000000000 +0200 +++ linux-3.2.16.new/kernel/irq/debug.h 2012-04-29 16:35:33.782919592 +0200 @@ -11,8 +11,9 @@ static inline void print_irq_desc(unsigned int irq, struct irq_desc *desc) { - printk("irq %d, desc: %p, depth: %d, count: %d, unhandled: %d\n", - irq, desc, desc->depth, desc->irq_count, desc->irqs_unhandled); + printk("irq %d, desc: %p, depth: %d, stuck_count: %d, stuck_level: %d, stuck_level_max: %d, stuck_timeout: %lu\n", + irq, desc, desc->depth, desc->irqs_stuck_count, desc->irqs_stuck_level, desc->irqs_stuck_level_max, + desc->irqs_stuck_timeout); printk("->handle_irq(): %p, ", desc->handle_irq); print_symbol("%s\n", (unsigned long)desc->handle_irq); printk("->irq_data.chip(): %p, ", desc->irq_data.chip); diff -upr linux-3.2.16.orig/kernel/irq/internals.h linux-3.2.16.new/kernel/irq/internals.h --- linux-3.2.16.orig/kernel/irq/internals.h 2012-04-23 00:31:32.000000000 +0200 +++ linux-3.2.16.new/kernel/irq/internals.h 2012-04-28 22:01:15.124279391 +0200 @@ -13,9 +13,7 @@ # define IRQ_BITMAP_BITS NR_IRQS #endif -#define istate core_internal_state__do_not_mess_with_it - -extern int noirqdebug; +extern bool noirqdebug; /* * Bits used by threaded handlers: diff -upr linux-3.2.16.orig/kernel/irq/irqdesc.c linux-3.2.16.new/kernel/irq/irqdesc.c --- linux-3.2.16.orig/kernel/irq/irqdesc.c 2012-04-23 00:31:32.000000000 +0200 +++ linux-3.2.16.new/kernel/irq/irqdesc.c 2012-04-29 16:00:01.792104426 +0200 @@ -84,8 +84,10 @@ static void desc_set_defaults(unsigned i irqd_set(&desc->irq_data, IRQD_IRQ_DISABLED); desc->handle_irq = handle_bad_irq; 
desc->depth = 1; - desc->irq_count = 0; - desc->irqs_unhandled = 0; + desc->irqs_stuck_count = 0; + desc->irqs_stuck_level = 0; + desc->irqs_stuck_level_max = 0; + desc->irqs_stuck_timeout = jiffies; desc->name = NULL; desc->owner = owner; for_each_possible_cpu(cpu) diff -upr linux-3.2.16.orig/kernel/irq/manage.c linux-3.2.16.new/kernel/irq/manage.c --- linux-3.2.16.orig/kernel/irq/manage.c 2012-04-23 00:31:32.000000000 +0200 +++ linux-3.2.16.new/kernel/irq/manage.c 2012-04-30 09:26:26.395627416 +0200 @@ -1087,8 +1087,10 @@ __setup_irq(unsigned int irq, struct irq *old_ptr = new; /* Reset broken irq detection when installing new handler */ - desc->irq_count = 0; - desc->irqs_unhandled = 0; + desc->irqs_stuck_count = 0; + desc->irqs_stuck_level = 0; + desc->irqs_stuck_level_max = 0; + desc->irqs_stuck_timeout = jiffies; /* * Check whether we disabled the irq via the spurious handler diff -upr linux-3.2.16.orig/kernel/irq/proc.c linux-3.2.16.new/kernel/irq/proc.c --- linux-3.2.16.orig/kernel/irq/proc.c 2012-04-23 00:31:32.000000000 +0200 +++ linux-3.2.16.new/kernel/irq/proc.c 2012-04-29 16:34:17.642434577 +0200 @@ -248,9 +248,9 @@ static int irq_spurious_proc_show(struct { struct irq_desc *desc = irq_to_desc((long) m->private); - seq_printf(m, "count %u\n" "unhandled %u\n" "last_unhandled %u ms\n", - desc->irq_count, desc->irqs_unhandled, - jiffies_to_msecs(desc->last_unhandled)); + seq_printf(m, "irq=%3d stuck_count=%3u stuck_level_max=%3u\n", + desc->irq_data.irq, + desc->irqs_stuck_count, desc->irqs_stuck_level_max); return 0; } diff -upr linux-3.2.16.orig/kernel/irq/spurious.c linux-3.2.16.new/kernel/irq/spurious.c --- linux-3.2.16.orig/kernel/irq/spurious.c 2012-04-23 00:31:32.000000000 +0200 +++ linux-3.2.16.new/kernel/irq/spurious.c 2012-04-30 13:29:01.107319326 +0200 @@ -18,7 +18,12 @@ static int irqfixup __read_mostly; -#define POLL_SPURIOUS_IRQ_INTERVAL (HZ/10) +#define SPURIOUS_IRQ_TIMEOUT_INTERVAL (HZ/10) +#define SPURIOUS_IRQ_TRIGGER 5 +#define 
SPURIOUS_IRQ_REPORT_COUNT 5 +#define SPURIOUS_IRQ_POLL_CYCLES 100 +#define SPURIOUS_IRQ_POLL_INTERVAL (HZ/100) + static void poll_spurious_irqs(unsigned long dummy); static DEFINE_TIMER(poll_spurious_irq_timer, poll_spurious_irqs, 0, 0); static int irq_poll_cpu; @@ -141,14 +146,15 @@ out: static void poll_spurious_irqs(unsigned long dummy) { struct irq_desc *desc; - int i; + int i, poll_again; if (atomic_inc_return(&irq_poll_active) != 1) goto out; irq_poll_cpu = smp_processor_id(); + poll_again = 0; /* Will stay false as long as no polling candidate is found */ for_each_irq_desc(i, desc) { - unsigned int state; + unsigned int state, irq; if (!i) continue; @@ -158,15 +164,38 @@ static void poll_spurious_irqs(unsigned barrier(); if (!(state & IRQS_SPURIOUS_DISABLED)) continue; - - local_irq_disable(); - try_one_irq(i, desc, true); - local_irq_enable(); + + /* We end up here with a disabled stuck interrupt. + desc->irqs_stuck_level now tracks the number of times + the interrupt has been polled */ + + irq = desc->irq_data.irq; + if (unlikely(desc->irqs_stuck_level == 1)) + if (desc->irqs_stuck_count <= SPURIOUS_IRQ_REPORT_COUNT) + printk(KERN_EMERG "Polling handlers for IRQ %d.\n", irq); + if (desc->irqs_stuck_level < SPURIOUS_IRQ_POLL_CYCLES) { + local_irq_disable(); + try_one_irq(i, desc, true); + local_irq_enable(); + desc->irqs_stuck_level++; + poll_again = 1; + } else { + if (unlikely(desc->irqs_stuck_count <= SPURIOUS_IRQ_REPORT_COUNT)) { + printk(KERN_EMERG "Reenabling IRQ %d.\n", irq); + if (desc->irqs_stuck_count >= SPURIOUS_IRQ_REPORT_COUNT) + printk(KERN_EMERG "No more stuck interrupt reports for IRQ %d.\n", irq); + } + irq_enable(desc); /* Reenable the interrupt line */ + desc->depth--; + desc->istate &= (~IRQS_SPURIOUS_DISABLED); + desc->irqs_stuck_level = 0; + } } + if (poll_again) + mod_timer(&poll_spurious_irq_timer, + jiffies + SPURIOUS_IRQ_POLL_INTERVAL); out: atomic_dec(&irq_poll_active); - mod_timer(&poll_spurious_irq_timer, - jiffies + 
POLL_SPURIOUS_IRQ_INTERVAL); } static inline int bad_action_ret(irqreturn_t action_ret) @@ -176,14 +205,6 @@ static inline int bad_action_ret(irqretu return 1; } -/* - * If 99,900 of the previous 100,000 interrupts have not been handled - * then assume that the IRQ is stuck in some manner. Drop a diagnostic - * and try to turn the IRQ off. - * - * (The other 100-of-100,000 interrupts may have been a correctly - * functioning device sharing an IRQ with the failing one) - */ static void __report_bad_irq(unsigned int irq, struct irq_desc *desc, irqreturn_t action_ret) @@ -272,7 +293,6 @@ void note_interrupt(unsigned int irq, st if (desc->istate & IRQS_POLL_INPROGRESS) return; - /* we get here again via the threaded handler */ if (action_ret == IRQ_WAKE_THREAD) return; @@ -281,55 +301,47 @@ void note_interrupt(unsigned int irq, st return; } - if (unlikely(action_ret == IRQ_NONE)) { - /* - * If we are seeing only the odd spurious IRQ caused by - * bus asynchronicity then don't eventually trigger an error, - * otherwise the counter becomes a doomsday timer for otherwise - * working systems - */ - if (time_after(jiffies, desc->last_unhandled + HZ/10)) - desc->irqs_unhandled = 1; - else - desc->irqs_unhandled++; - desc->last_unhandled = jiffies; - } - - if (unlikely(try_misrouted_irq(irq, desc, action_ret))) { - int ok = misrouted_irq(irq); - if (action_ret == IRQ_NONE) - desc->irqs_unhandled -= ok; + /* Adjust action_ret if an optional poll was successful. + (See inlined try_misrouted_irq() for conditions (depending + on 'irqfixup' and 'irqpoll'), and 'noirqdebug' must not + be set, since we wouldn't be here (note_interrupt()) + at all in that case.) 
*/ + if (unlikely(try_misrouted_irq(irq, desc, action_ret))) + if (misrouted_irq(irq)) + action_ret = IRQ_HANDLED; + + if (unlikely((action_ret == IRQ_NONE) && time_before(jiffies, desc->irqs_stuck_timeout))) { + desc->irqs_stuck_level++; + if (desc->irqs_stuck_level > desc->irqs_stuck_level_max) + desc->irqs_stuck_level_max = desc->irqs_stuck_level; + if (desc->irqs_stuck_level >= SPURIOUS_IRQ_TRIGGER) { /* The interrupt is stuck */ + desc->irqs_stuck_count++; /* TODO: Prevent hypothetical overflow */ + if (desc->irqs_stuck_count <= SPURIOUS_IRQ_REPORT_COUNT) { + __report_bad_irq(irq, desc, action_ret); + printk(KERN_EMERG "Disabling IRQ %d.\n", irq); + } + desc->istate |= IRQS_SPURIOUS_DISABLED; + desc->depth++; + irq_disable(desc); + /* TODO: Do a safe access to the timer. Now we may be extending a deadline + for a polling system already running for another interrupt. */ + mod_timer(&poll_spurious_irq_timer, + jiffies + SPURIOUS_IRQ_POLL_INTERVAL); /* Schedule a poll cycle */ + desc->irqs_stuck_level = 1; + desc->irqs_stuck_level_max = 0; + } } - - desc->irq_count++; - if (likely(desc->irq_count < 100000)) - return; - - desc->irq_count = 0; - if (unlikely(desc->irqs_unhandled > 99900)) { - /* - * The interrupt is stuck - */ - __report_bad_irq(irq, desc, action_ret); - /* - * Now kill the IRQ - */ - printk(KERN_EMERG "Disabling IRQ #%d\n", irq); - desc->istate |= IRQS_SPURIOUS_DISABLED; - desc->depth++; - irq_disable(desc); - - mod_timer(&poll_spurious_irq_timer, - jiffies + POLL_SPURIOUS_IRQ_INTERVAL); + else { + desc->irqs_stuck_timeout = jiffies + SPURIOUS_IRQ_TIMEOUT_INTERVAL; + desc->irqs_stuck_level = 0; } - desc->irqs_unhandled = 0; } -int noirqdebug __read_mostly; +bool noirqdebug __read_mostly; int noirqdebug_setup(char *str) { - noirqdebug = 1; + noirqdebug = true; printk(KERN_INFO "IRQ lockup detection disabled\n"); return 1; ^ permalink raw reply [flat|nested] 40+ messages in thread
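For readers following the thread, the detection and polling logic in the patch above can be modelled in user space. What follows is only a rough sketch under stated assumptions, not kernel code: `struct stuck_state`, `stuck_init()`, `stuck_note()`, `stuck_poll()` and `before()` are hypothetical names invented for illustration, HZ is assumed to be 1000, and locking, the timer and the real handler calls are left out. The constants are copied from the patch.

```c
#include <stdbool.h>

/* Assumption for this model only: 1000 jiffies per second. */
#define HZ 1000
#define SPURIOUS_IRQ_TIMEOUT_INTERVAL (HZ / 10)
#define SPURIOUS_IRQ_TRIGGER 5
#define SPURIOUS_IRQ_POLL_CYCLES 100

/* Hypothetical stand-in for the irqs_stuck_* fields added to struct irq_desc. */
struct stuck_state {
    unsigned int stuck_count;     /* how often the line was declared stuck */
    unsigned int stuck_level;     /* unhandled run length, later poll cycle count */
    unsigned int stuck_level_max; /* highest level since the last stuck event */
    unsigned long stuck_timeout;  /* deadline after which stuck_level restarts */
    bool disabled;                /* models IRQS_SPURIOUS_DISABLED */
};

/* Wrap-safe "a is earlier than b", in the spirit of the kernel's time_before(). */
static bool before(unsigned long a, unsigned long b)
{
    return (long)(a - b) < 0;
}

static void stuck_init(struct stuck_state *s, unsigned long now)
{
    s->stuck_count = 0;
    s->stuck_level = 0;
    s->stuck_level_max = 0;
    s->stuck_timeout = now;
    s->disabled = false;
}

/*
 * Models the note_interrupt() path of the patch.
 * Returns true when this call declares the line stuck and disables it.
 */
static bool stuck_note(struct stuck_state *s, unsigned long now, bool handled)
{
    if (!handled && before(now, s->stuck_timeout)) {
        s->stuck_level++;
        if (s->stuck_level > s->stuck_level_max)
            s->stuck_level_max = s->stuck_level;
        if (s->stuck_level >= SPURIOUS_IRQ_TRIGGER) {
            s->stuck_count++;
            s->disabled = true;
            s->stuck_level = 1;     /* from here on it counts poll cycles */
            s->stuck_level_max = 0;
            return true;
        }
    } else {
        /* Handled, or the window expired: open a new window and restart. */
        s->stuck_timeout = now + SPURIOUS_IRQ_TIMEOUT_INTERVAL;
        s->stuck_level = 0;
    }
    return false;
}

/*
 * Models one pass of poll_spurious_irqs() for a single line.
 * Returns true when another poll cycle should be scheduled.
 */
static bool stuck_poll(struct stuck_state *s)
{
    if (!s->disabled)
        return false;
    if (s->stuck_level < SPURIOUS_IRQ_POLL_CYCLES) {
        s->stuck_level++;           /* the kernel would poll the handlers here */
        return true;
    }
    s->disabled = false;            /* poll budget used up: re-enable the line */
    s->stuck_level = 0;
    return false;
}
```

In this model the first unhandled interrupt after a quiet period only opens the HZ/10 window; the line is disabled once SPURIOUS_IRQ_TRIGGER further unhandled interrupts arrive before the window closes. After that the line is polled until stuck_level reaches SPURIOUS_IRQ_POLL_CYCLES and is then re-enabled.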
* Re: Unhandled IRQs on AMD E-450 2012-04-30 11:35 ` Jeroen Van den Keybus @ 2012-05-29 23:36 ` Grant Likely 2012-05-30 0:07 ` Thomas Gleixner 1 sibling, 0 replies; 40+ messages in thread From: Grant Likely @ 2012-05-29 23:36 UTC (permalink / raw) To: Jeroen Van den Keybus, Borislav Petkov Cc: Linus Torvalds, Thomas Gleixner, Josh Boyer, Clemens Ladisch, andymatei, Huang, Shane, linux-kernel

Hi Jeroen,

Comments below. I've written a lot of things below, some of which are trivial. The trivial stuff won't prevent me from acking a patch, but sorting those issues out also smooths the path for getting a change merged.

On Mon, 30 Apr 2012 13:35:47 +0200 (CEST), Jeroen Van den Keybus <jeroen.vandenkeybus@gmail.com> wrote:
> (another try at supplying a noncorrupt patch with alpine/Gmail - sorry
> for the inconvenience)
>
>
> ---

First and foremost: every patch needs a description of the changes. What is the problem? How is it solved? What testing has been done? Are there any details about this change that a reader should be looking out for? etc. All of those details help me to understand what the patch is supposed to be doing, and whether the code matches what you describe.

> Signed-off-by: Jeroen Van den Keybus <jeroen.vandenkeybus@gmail.com>

The S-o-b line must appear above the '---' line so that the patch-apply tools pick it up. It also helps to have a diffstat below the '---' line so that I've got an at-a-glance view of the size of the change.
(git send-email will do this for you automatically) > > diff -upr linux-3.2.16.orig/include/linux/irqdesc.h linux-3.2.16.new/include/linux/irqdesc.h > --- linux-3.2.16.orig/include/linux/irqdesc.h 2012-04-23 00:31:32.000000000 +0200 > +++ linux-3.2.16.new/include/linux/irqdesc.h 2012-04-29 16:33:48.142332693 +0200 > @@ -14,28 +14,29 @@ struct timer_rand_state; > struct module; > /** > * struct irq_desc - interrupt descriptor > - * @irq_data: per irq and chip data passed down to chip functions > - * @timer_rand_state: pointer to timer rand state struct > - * @kstat_irqs: irq stats per cpu > - * @handle_irq: highlevel irq-events handler > - * @preflow_handler: handler called before the flow handler (currently used by sparc) > - * @action: the irq action chain > - * @status: status information > - * @core_internal_state__do_not_mess_with_it: core internal status information > - * @depth: disable-depth, for nested irq_disable() calls > - * @wake_depth: enable depth, for multiple irq_set_irq_wake() callers > - * @irq_count: stats field to detect stalled irqs > - * @last_unhandled: aging timer for unhandled count > - * @irqs_unhandled: stats field for spurious unhandled interrupts > - * @lock: locking for SMP > - * @affinity_hint: hint to user space for preferred irq affinity > - * @affinity_notify: context for notification of affinity changes > - * @pending_mask: pending rebalanced interrupts > - * @threads_oneshot: bitfield to handle shared oneshot threads > - * @threads_active: number of irqaction threads currently running > - * @wait_for_threads: wait queue for sync_irq to wait for threaded handlers > - * @dir: /proc/irq/ procfs entry > - * @name: flow handler name for /proc/interrupts output > + * @irq_data: per irq and chip data passed down to chip functions > + * @timer_rand_state: pointer to timer rand state struct > + * @kstat_irqs: irq stats per cpu > + * @handle_irq: highlevel irq-events handler > + * @preflow_handler: handler called before the flow handler 
(currently used by sparc) > + * @action: the irq action chain > + * @status: status information > + * @istate: core internal status information > + * @depth: disable-depth, for nested irq_disable() calls > + * @wake_depth: enable depth, for multiple irq_set_irq_wake() callers > + * @irqs_stuck_count: stuck interrupt occurrence counter > + * @irqs_stuck_level: used for stuck interrupt line detection and tracking poll cycle count > + * @irqs_stuck_level_max: indicates the maximum irqs_stuck_level since last stuck interrupt occurrence > + * @irqs_stuck_timeout: deadline for resetting irqs_stuck_level > + * @lock: locking for SMP > + * @affinity_hint: hint to user space for preferred irq affinity > + * @affinity_notify: context for notification of affinity changes > + * @pending_mask: pending rebalanced interrupts > + * @threads_oneshot: bitfield to handle shared oneshot threads > + * @threads_active: number of irqaction threads currently running > + * @wait_for_threads: wait queue for sync_irq to wait for threaded handlers > + * @dir: /proc/irq/ procfs entry > + * @name: flow handler name for /proc/interrupts output As commented on in my previous email, this hunk adds a lot of noise. By changing the whitespace on every line I have no idea what has actually been changed in this block without doing extra work. I did notice the rename of core_internal_state__do_not_mess_with_it, mostly because it is a really long line, but I cannot tell at a glance if anything else has changed, been added or been removed. > */ > struct irq_desc { > struct irq_data irq_data; > @@ -47,12 +48,13 @@ struct irq_desc { > #endif > struct irqaction *action; /* IRQ action list */ > unsigned int status_use_accessors; > - unsigned int core_internal_state__do_not_mess_with_it; > + unsigned int istate; As already commented by tglx; don't do this. It was named like this on purpose. 
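To illustrate the arrangement being defended here: in the unpatched tree the struct field carries the long name, and kernel/irq/internals.h provides the short alias through `#define istate core_internal_state__do_not_mess_with_it`, so only core IRQ code gets to write `desc->istate`. A minimal sketch of that pattern follows; `struct irq_desc_model`, the two helper functions and the bit value chosen for IRQS_SPURIOUS_DISABLED are assumptions made up for this example and are not taken from the kernel headers.

```c
/* "Public" view: the field keeps its deliberately awkward name. */
struct irq_desc_model {
    unsigned int core_internal_state__do_not_mess_with_it;
    unsigned int depth;
};

/* "Core-private" view: the short alias, as in kernel/irq/internals.h. */
#define istate core_internal_state__do_not_mess_with_it

/* Illustrative bit value only; the kernel's real definition may differ. */
#define IRQS_SPURIOUS_DISABLED 0x1u

/* Hypothetical helper: mark a line as disabled by the spurious detector. */
static void mark_spurious_disabled(struct irq_desc_model *desc)
{
    desc->istate |= IRQS_SPURIOUS_DISABLED; /* expands to the long field name */
    desc->depth++;
}

/* Hypothetical helper: test the flag through the same alias. */
static int is_spurious_disabled(const struct irq_desc_model *desc)
{
    return (desc->istate & IRQS_SPURIOUS_DISABLED) != 0;
}
```

Code that does not include the private header has to spell out the long name, which presumably is the point of the naming: any outside access stands out in review.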
> unsigned int depth; /* nested irq disables */ > unsigned int wake_depth; /* nested wake enables */ > - unsigned int irq_count; /* For detecting broken IRQs */ I think irq_count is useful on its own apart from detecting broken irqs, and including its removal adds a bit of collateral damage to this patch. If you do think it needs to be dropped, then split this change into another patch. > - unsigned long last_unhandled; /* Aging timer for unhandled count */ > - unsigned int irqs_unhandled; > + unsigned int irqs_stuck_count; > + unsigned int irqs_stuck_level; > + unsigned int irqs_stuck_level_max; > + unsigned long irqs_stuck_timeout; > raw_spinlock_t lock; > struct cpumask *percpu_enabled; > #ifdef CONFIG_SMP > diff -upr linux-3.2.16.orig/kernel/irq/debug.h linux-3.2.16.new/kernel/irq/debug.h > --- linux-3.2.16.orig/kernel/irq/debug.h 2012-04-23 00:31:32.000000000 +0200 > +++ linux-3.2.16.new/kernel/irq/debug.h 2012-04-29 16:35:33.782919592 +0200 > @@ -11,8 +11,9 @@ > > static inline void print_irq_desc(unsigned int irq, struct irq_desc *desc) > { > - printk("irq %d, desc: %p, depth: %d, count: %d, unhandled: %d\n", > - irq, desc, desc->depth, desc->irq_count, desc->irqs_unhandled); > + printk("irq %d, desc: %p, depth: %d, stuck_count: %d, stuck_level: %d, stuck_level_max: %d, stuck_timeout: %lu\n", > + irq, desc, desc->depth, desc->irqs_stuck_count, desc->irqs_stuck_level, desc->irqs_stuck_level_max, > + desc->irqs_stuck_timeout); > printk("->handle_irq(): %p, ", desc->handle_irq); > print_symbol("%s\n", (unsigned long)desc->handle_irq); > printk("->irq_data.chip(): %p, ", desc->irq_data.chip); > diff -upr linux-3.2.16.orig/kernel/irq/internals.h linux-3.2.16.new/kernel/irq/internals.h > --- linux-3.2.16.orig/kernel/irq/internals.h 2012-04-23 00:31:32.000000000 +0200 > +++ linux-3.2.16.new/kernel/irq/internals.h 2012-04-28 22:01:15.124279391 +0200 > @@ -13,9 +13,7 @@ > # define IRQ_BITMAP_BITS NR_IRQS > #endif > > -#define istate
core_internal_state__do_not_mess_with_it > - > -extern int noirqdebug; > +extern bool noirqdebug; This is a discrete change; split into a separate patch with the associated fixups. > > /* > * Bits used by threaded handlers: > diff -upr linux-3.2.16.orig/kernel/irq/irqdesc.c linux-3.2.16.new/kernel/irq/irqdesc.c > --- linux-3.2.16.orig/kernel/irq/irqdesc.c 2012-04-23 00:31:32.000000000 +0200 > +++ linux-3.2.16.new/kernel/irq/irqdesc.c 2012-04-29 16:00:01.792104426 +0200 > @@ -84,8 +84,10 @@ static void desc_set_defaults(unsigned i > irqd_set(&desc->irq_data, IRQD_IRQ_DISABLED); > desc->handle_irq = handle_bad_irq; > desc->depth = 1; > - desc->irq_count = 0; > - desc->irqs_unhandled = 0; > + desc->irqs_stuck_count = 0; > + desc->irqs_stuck_level = 0; > + desc->irqs_stuck_level_max = 0; > + desc->irqs_stuck_timeout = jiffies; > desc->name = NULL; > desc->owner = owner; > for_each_possible_cpu(cpu) > diff -upr linux-3.2.16.orig/kernel/irq/manage.c linux-3.2.16.new/kernel/irq/manage.c > --- linux-3.2.16.orig/kernel/irq/manage.c 2012-04-23 00:31:32.000000000 +0200 > +++ linux-3.2.16.new/kernel/irq/manage.c 2012-04-30 09:26:26.395627416 +0200 > @@ -1087,8 +1087,10 @@ __setup_irq(unsigned int irq, struct irq > *old_ptr = new; > > /* Reset broken irq detection when installing new handler */ > - desc->irq_count = 0; > - desc->irqs_unhandled = 0; > + desc->irqs_stuck_count = 0; > + desc->irqs_stuck_level = 0; > + desc->irqs_stuck_level_max = 0; > + desc->irqs_stuck_timeout = jiffies; > > /* > * Check whether we disabled the irq via the spurious handler > diff -upr linux-3.2.16.orig/kernel/irq/proc.c linux-3.2.16.new/kernel/irq/proc.c > --- linux-3.2.16.orig/kernel/irq/proc.c 2012-04-23 00:31:32.000000000 +0200 > +++ linux-3.2.16.new/kernel/irq/proc.c 2012-04-29 16:34:17.642434577 +0200 > @@ -248,9 +248,9 @@ static int irq_spurious_proc_show(struct > { > struct irq_desc *desc = irq_to_desc((long) m->private); > > - seq_printf(m, "count %u\n" "unhandled %u\n" "last_unhandled %u 
ms\n", > - desc->irq_count, desc->irqs_unhandled, > - jiffies_to_msecs(desc->last_unhandled)); > + seq_printf(m, "irq=%3d stuck_count=%3u stuck_level_max=%3u\n", > + desc->irq_data.irq, > + desc->irqs_stuck_count, desc->irqs_stuck_level_max); > return 0; > } > > diff -upr linux-3.2.16.orig/kernel/irq/spurious.c linux-3.2.16.new/kernel/irq/spurious.c > --- linux-3.2.16.orig/kernel/irq/spurious.c 2012-04-23 00:31:32.000000000 +0200 > +++ linux-3.2.16.new/kernel/irq/spurious.c 2012-04-30 13:29:01.107319326 +0200 > @@ -18,7 +18,12 @@ > > static int irqfixup __read_mostly; > > -#define POLL_SPURIOUS_IRQ_INTERVAL (HZ/10) > +#define SPURIOUS_IRQ_TIMEOUT_INTERVAL (HZ/10) > +#define SPURIOUS_IRQ_TRIGGER 5 > +#define SPURIOUS_IRQ_REPORT_COUNT 5 > +#define SPURIOUS_IRQ_POLL_CYCLES 100 > +#define SPURIOUS_IRQ_POLL_INTERVAL (HZ/100) > + > static void poll_spurious_irqs(unsigned long dummy); > static DEFINE_TIMER(poll_spurious_irq_timer, poll_spurious_irqs, 0, 0); > static int irq_poll_cpu; > @@ -141,14 +146,15 @@ out: > static void poll_spurious_irqs(unsigned long dummy) > { > struct irq_desc *desc; > - int i; > + int i, poll_again; > > if (atomic_inc_return(&irq_poll_active) != 1) > goto out; > irq_poll_cpu = smp_processor_id(); > > + poll_again = 0; /* Will stay false as long as no polling candidate is found */ > for_each_irq_desc(i, desc) { > - unsigned int state; > + unsigned int state, irq; Don't need to do this; the irq number is already stored in 'i' by for_each_irq_desc. > > if (!i) > continue; > @@ -158,15 +164,38 @@ static void poll_spurious_irqs(unsigned > barrier(); > if (!(state & IRQS_SPURIOUS_DISABLED)) > continue; > - > - local_irq_disable(); > - try_one_irq(i, desc, true); > - local_irq_enable(); > + > + /* We end up here with a disabled stuck interrupt. > + desc->irqs_stuck_level now tracks the number of times > + the interrupt has been polled */ "... now tracks ..."? Does it track something different elsewhere in the code? 
> +
> +        irq = desc->irq_data.irq;
> +        if (unlikely(desc->irqs_stuck_level == 1))
> +            if (desc->irqs_stuck_count <= SPURIOUS_IRQ_REPORT_COUNT)

It isn't obvious how the irqs_stuck_* values work together. Can you add a comment blurb to the code describing it? I believe you described how it works in a previous email, but future readers probably won't see that. It needs to be documented here.

> +                printk(KERN_EMERG "Polling handlers for IRQ %d.\n", irq);

Nit: pr_emerg()

> +        if (desc->irqs_stuck_level < SPURIOUS_IRQ_POLL_CYCLES) {
> +            local_irq_disable();
> +            try_one_irq(i, desc, true);
> +            local_irq_enable();
> +            desc->irqs_stuck_level++;
> +            poll_again = 1;
> +        } else {
> +            if (unlikely(desc->irqs_stuck_count <= SPURIOUS_IRQ_REPORT_COUNT)) {
> +                printk(KERN_EMERG "Reenabling IRQ %d.\n", irq);
> +                if (desc->irqs_stuck_count >= SPURIOUS_IRQ_REPORT_COUNT)
> +                    printk(KERN_EMERG "No more stuck interrupt reports for IRQ %d.\n", irq);
> +            }
> +            irq_enable(desc); /* Reenable the interrupt line */
> +            desc->depth--;
> +            desc->istate &= (~IRQS_SPURIOUS_DISABLED);
> +            desc->irqs_stuck_level = 0;
> +        }
>      }
> +    if (poll_again)
> +        mod_timer(&poll_spurious_irq_timer,
> +            jiffies + SPURIOUS_IRQ_POLL_INTERVAL);

Is there a reason that the mod_timer is moved out of the out: path? Won't this mean that if other code already has irq_poll_active incremented (misrouted_irq()) that the timer won't get reset and poll_spurious_irqs() will stall?

>  out:
>      atomic_dec(&irq_poll_active);
> -    mod_timer(&poll_spurious_irq_timer,
> -        jiffies + POLL_SPURIOUS_IRQ_INTERVAL);
>  }
>
>  static inline int bad_action_ret(irqreturn_t action_ret)
> @@ -176,14 +205,6 @@ static inline int bad_action_ret(irqretu
>      return 1;
>  }
>
> -/*
> - * If 99,900 of the previous 100,000 interrupts have not been handled
> - * then assume that the IRQ is stuck in some manner. Drop a diagnostic
> - * and try to turn the IRQ off.
> - *
> - * (The other 100-of-100,000 interrupts may have been a correctly
> - * functioning device sharing an IRQ with the failing one)
> - */
>  static void
>  __report_bad_irq(unsigned int irq, struct irq_desc *desc,
>      irqreturn_t action_ret)
> @@ -272,7 +293,6 @@ void note_interrupt(unsigned int irq, st
>      if (desc->istate & IRQS_POLL_INPROGRESS)
>          return;
>
> -    /* we get here again via the threaded handler */
>      if (action_ret == IRQ_WAKE_THREAD)
>          return;
>
> @@ -281,55 +301,47 @@ void note_interrupt(unsigned int irq, st
>          return;
>      }
>
> -    if (unlikely(action_ret == IRQ_NONE)) {
> -        /*
> -         * If we are seeing only the odd spurious IRQ caused by
> -         * bus asynchronicity then don't eventually trigger an error,
> -         * otherwise the counter becomes a doomsday timer for otherwise
> -         * working systems
> -         */
> -        if (time_after(jiffies, desc->last_unhandled + HZ/10))
> -            desc->irqs_unhandled = 1;
> -        else
> -            desc->irqs_unhandled++;
> -        desc->last_unhandled = jiffies;
> -    }
> -
> -    if (unlikely(try_misrouted_irq(irq, desc, action_ret))) {
> -        int ok = misrouted_irq(irq);
> -        if (action_ret == IRQ_NONE)
> -            desc->irqs_unhandled -= ok;
> +    /* Adjust action_ret if an optional poll was successful.
> +       (See inlined try_misrouted_irq() for conditions (depending
> +       on 'irqfixup' and 'irqpoll'), and 'noirqdebug' must not
> +       be set, since we wouldn't be here (note_interrupt())
> +       at all in that case.) */

Nit: kernel style for multiline blocks is:

/*
 * ...
 * ...
 */

> +    if (unlikely(try_misrouted_irq(irq, desc, action_ret)))
> +        if (misrouted_irq(irq))
> +            action_ret = IRQ_HANDLED;

Tip: if the braces around this block had been left as-is, then the diff would have kept the related lines of code together.
> +
> +    if (unlikely((action_ret == IRQ_NONE) && time_before(jiffies, desc->irqs_stuck_timeout))) {
> +        desc->irqs_stuck_level++;
> +        if (desc->irqs_stuck_level > desc->irqs_stuck_level_max)
> +            desc->irqs_stuck_level_max = desc->irqs_stuck_level;
> +        if (desc->irqs_stuck_level >= SPURIOUS_IRQ_TRIGGER) { /* The interrupt is stuck */
> +            desc->irqs_stuck_count++; /* TODO: Prevent hypothetical overflow */
> +            if (desc->irqs_stuck_count <= SPURIOUS_IRQ_REPORT_COUNT) {
> +                __report_bad_irq(irq, desc, action_ret);
> +                printk(KERN_EMERG "Disabling IRQ %d.\n", irq);
> +            }
> +            desc->istate |= IRQS_SPURIOUS_DISABLED;
> +            desc->depth++;
> +            irq_disable(desc);
> +            /* TODO: Do a safe access to the timer. Now we may be extending a deadline
> +               for a polling system already running for another interrupt. */
> +            mod_timer(&poll_spurious_irq_timer,
> +                jiffies + SPURIOUS_IRQ_POLL_INTERVAL); /* Schedule a poll cycle */

This isn't any different from where the existing code sets up the timer. What is your concern here about extending the deadline? Is it just because the bad irq condition gets hit much earlier with this code (5 instead of 99900)?
> +            desc->irqs_stuck_level = 1;
> +            desc->irqs_stuck_level_max = 0;
> +        }
>      }
> -
> -    desc->irq_count++;
> -    if (likely(desc->irq_count < 100000))
> -        return;
> -
> -    desc->irq_count = 0;
> -    if (unlikely(desc->irqs_unhandled > 99900)) {
> -        /*
> -         * The interrupt is stuck
> -         */
> -        __report_bad_irq(irq, desc, action_ret);
> -        /*
> -         * Now kill the IRQ
> -         */
> -        printk(KERN_EMERG "Disabling IRQ #%d\n", irq);
> -        desc->istate |= IRQS_SPURIOUS_DISABLED;
> -        desc->depth++;
> -        irq_disable(desc);
> -
> -        mod_timer(&poll_spurious_irq_timer,
> -            jiffies + POLL_SPURIOUS_IRQ_INTERVAL);
> +    else {
> +        desc->irqs_stuck_timeout = jiffies + SPURIOUS_IRQ_TIMEOUT_INTERVAL;
> +        desc->irqs_stuck_level = 0;
>      }
> -    desc->irqs_unhandled = 0;
>  }
>
> -int noirqdebug __read_mostly;
> +bool noirqdebug __read_mostly;
>
>  int noirqdebug_setup(char *str)
>  {
> -    noirqdebug = 1;
> +    noirqdebug = true;
>      printk(KERN_INFO "IRQ lockup detection disabled\n");
>
>      return 1;
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Grant Likely, B.Sc, P.Eng.
Secret Lab Technologies, Ltd.

^ permalink raw reply [flat|nested] 40+ messages in thread
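The review above asks how the irqs_stuck_* fields work together. One way to make that concrete is a small user-space model of the note_interrupt() hunk quoted in this message. This is only a sketch under stated assumptions, not the patch itself: `struct stuck_model`, `model_time_before()` and `model_note_interrupt()` are hypothetical names, jiffies is passed in as a plain argument, HZ is assumed to be 1000, and disabling the line is reduced to a flag. The polling side (poll_spurious_irqs()) is not modelled.

```c
#include <stdbool.h>

#define HZ 1000UL                              /* assumption: tick rate of the model */
#define SPURIOUS_IRQ_TIMEOUT_INTERVAL (HZ / 10)
#define SPURIOUS_IRQ_TRIGGER 5

/* Hypothetical stand-in for the four fields the patch adds to struct irq_desc. */
struct stuck_model {
	unsigned int stuck_count;      /* how often the line was declared stuck */
	unsigned int stuck_level;      /* consecutive unhandled irqs inside the window */
	unsigned int stuck_level_max;  /* highest level seen since the last trigger */
	unsigned long stuck_timeout;   /* end of the current detection window */
	bool disabled;                 /* stands in for IRQS_SPURIOUS_DISABLED */
};

/* Wrap-safe "a is before b", the same idea as the kernel's time_before(). */
static bool model_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Mirrors the control flow of the patched note_interrupt() hunk. */
static void model_note_interrupt(struct stuck_model *m, bool handled,
				 unsigned long now)
{
	if (!handled && model_time_before(now, m->stuck_timeout)) {
		m->stuck_level++;
		if (m->stuck_level > m->stuck_level_max)
			m->stuck_level_max = m->stuck_level;
		if (m->stuck_level >= SPURIOUS_IRQ_TRIGGER) {
			/* The line is declared stuck and handed to the poller. */
			m->stuck_count++;
			m->disabled = true;
			m->stuck_level = 1;
			m->stuck_level_max = 0;
		}
	} else {
		/* Handled, or outside the window: open a fresh window. */
		m->stuck_timeout = now + SPURIOUS_IRQ_TIMEOUT_INTERVAL;
		m->stuck_level = 0;
	}
}
```

Starting from a zeroed model, the first unhandled interrupt only opens the HZ/10 window, the next four raise stuck_level to 4, and the one after that trips the trigger. At that point stuck_count is incremented while stuck_level_max is reset to 0, so in this model a large stuck_count next to stuck_level_max of 0 is exactly the state right after a trigger.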
* Re: Unhandled IRQs on AMD E-450
  2012-04-30 11:35 ` Jeroen Van den Keybus
  2012-05-29 23:36   ` Grant Likely
@ 2012-05-30 0:07   ` Thomas Gleixner
  2012-05-30 10:44     ` Borislav Petkov
  1 sibling, 1 reply; 40+ messages in thread
From: Thomas Gleixner @ 2012-05-30 0:07 UTC (permalink / raw)
  To: Jeroen Van den Keybus
  Cc: Borislav Petkov, Linus Torvalds, Josh Boyer, Clemens Ladisch, andymatei, Huang, Shane, linux-kernel

On Mon, 30 Apr 2012, Jeroen Van den Keybus wrote:

> (another try at supplying a noncorrupt patch with alpine/Gmail - sorry
> for the inconvenience)
>
>
> ---

This is missing a proper changelog. And a very lengthy one.

> Signed-off-by: Jeroen Van den Keybus <jeroen.vandenkeybus@gmail.com>
>
> diff -upr linux-3.2.16.orig/include/linux/irqdesc.h linux-3.2.16.new/include/linux/irqdesc.h
> --- linux-3.2.16.orig/include/linux/irqdesc.h 2012-04-23 00:31:32.000000000 +0200
> +++ linux-3.2.16.new/include/linux/irqdesc.h 2012-04-29 16:33:48.142332693 +0200
> @@ -14,28 +14,29 @@ struct timer_rand_state;
>  struct module;
>  /**
>   * struct irq_desc - interrupt descriptor
> - * @irq_data: per irq and chip data passed down to chip functions
> - * @timer_rand_state: pointer to timer rand state struct
> - * @kstat_irqs: irq stats per cpu
> - * @handle_irq: highlevel irq-events handler
> - * @preflow_handler: handler called before the flow handler (currently used by sparc)
> - * @action: the irq action chain
> - * @status: status information
> - * @core_internal_state__do_not_mess_with_it: core internal status information
> - * @depth: disable-depth, for nested irq_disable() calls
> - * @wake_depth: enable depth, for multiple irq_set_irq_wake() callers
> - * @irq_count: stats field to detect stalled irqs
> - * @last_unhandled: aging timer for unhandled count
> - * @irqs_unhandled: stats field for spurious unhandled interrupts
> - * @lock: locking for SMP
> - * @affinity_hint: hint to user space for preferred irq affinity
> - * @affinity_notify: context for notification of affinity changes
> - * @pending_mask: pending rebalanced interrupts
> - * @threads_oneshot: bitfield to handle shared oneshot threads
> - * @threads_active: number of irqaction threads currently running
> - * @wait_for_threads: wait queue for sync_irq to wait for threaded handlers
> - * @dir: /proc/irq/ procfs entry
> - * @name: flow handler name for /proc/interrupts output
> + * @irq_data: per irq and chip data passed down to chip functions
> + * @timer_rand_state: pointer to timer rand state struct
> + * @kstat_irqs: irq stats per cpu
> + * @handle_irq: highlevel irq-events handler
> + * @preflow_handler: handler called before the flow handler (currently used by sparc)
> + * @action: the irq action chain
> + * @status: status information
> + * @istate: core internal status information
> + * @depth: disable-depth, for nested irq_disable() calls
> + * @wake_depth: enable depth, for multiple irq_set_irq_wake() callers
> + * @irqs_stuck_count: stuck interrupt occurrence counter
> + * @irqs_stuck_level: used for stuck interrupt line detection and tracking poll cycle count
> + * @irqs_stuck_level_max: indicates the maximum irqs_stuck_level since last stuck interrupt occurrence
> + * @irqs_stuck_timeout: deadline for resetting irqs_stuck_level

I can only guesstimate what this is for due to the lack of a proper changelog which explains how this new mechanism is supposed to work.
> + * @lock: locking for SMP
> + * @affinity_hint: hint to user space for preferred irq affinity
> + * @affinity_notify: context for notification of affinity changes
> + * @pending_mask: pending rebalanced interrupts
> + * @threads_oneshot: bitfield to handle shared oneshot threads
> + * @threads_active: number of irqaction threads currently running
> + * @wait_for_threads: wait queue for sync_irq to wait for threaded handlers
> + * @dir: /proc/irq/ procfs entry
> + * @name: flow handler name for /proc/interrupts output
>   */

If you want to move the column of the comment, then please make this a separate patch and add your members on top.

>  struct irq_desc {
>      struct irq_data irq_data;
> @@ -47,12 +48,13 @@ struct irq_desc {
>  #endif
>      struct irqaction *action; /* IRQ action list */
>      unsigned int status_use_accessors;
> -    unsigned int core_internal_state__do_not_mess_with_it;
> +    unsigned int istate;

What's the point of this change? How is this related to your problem ?

> diff -upr linux-3.2.16.orig/kernel/irq/internals.h linux-3.2.16.new/kernel/irq/internals.h
> --- linux-3.2.16.orig/kernel/irq/internals.h 2012-04-23 00:31:32.000000000 +0200
> +++ linux-3.2.16.new/kernel/irq/internals.h 2012-04-28 22:01:15.124279391 +0200
> @@ -13,9 +13,7 @@
>  # define IRQ_BITMAP_BITS NR_IRQS
>  #endif
>
> -#define istate core_internal_state__do_not_mess_with_it
> -

No. See above.

> -extern int noirqdebug;
> +extern bool noirqdebug;

Unrelated. Please do not mix functional changes with random cleanups.
> @@ -248,9 +248,9 @@ static int irq_spurious_proc_show(struct
>  {
>      struct irq_desc *desc = irq_to_desc((long) m->private);
>
> -    seq_printf(m, "count %u\n" "unhandled %u\n" "last_unhandled %u ms\n",
> -        desc->irq_count, desc->irqs_unhandled,
> -        jiffies_to_msecs(desc->last_unhandled));
> +    seq_printf(m, "irq=%3d stuck_count=%3u stuck_level_max=%3u\n",
> +        desc->irq_data.irq,
> +        desc->irqs_stuck_count, desc->irqs_stuck_level_max);

So this changes the output of /proc/irq/*/spurious from:

count 85
unhandled 0
last_unhandled 0 ms

to

irq= 16 stuck_count= 81 stuck_level_max= 0

I don't know whether we have tools depending on that, but you discard valuable information:

- The number of handled interrupts vs. the number of unhandled ones
- The last occurrence of an unhandled one

And you add redundant information:

- The irq number, which is already part of the file path

Even if no tool depends on that this wants to be documented somewhere.

Aside of that I really can't understand the output:

irq= 16 stuck_count= 81 stuck_level_max= 0

It got stuck 81 times, and the max level is 0 ?
>      return 0;
>  }
> diff -upr linux-3.2.16.orig/kernel/irq/spurious.c linux-3.2.16.new/kernel/irq/spurious.c
> --- linux-3.2.16.orig/kernel/irq/spurious.c 2012-04-23 00:31:32.000000000 +0200
> +++ linux-3.2.16.new/kernel/irq/spurious.c 2012-04-30 13:29:01.107319326 +0200
> @@ -18,7 +18,12 @@
>
>  static int irqfixup __read_mostly;
>
> -#define POLL_SPURIOUS_IRQ_INTERVAL (HZ/10)
> +#define SPURIOUS_IRQ_TIMEOUT_INTERVAL (HZ/10)
> +#define SPURIOUS_IRQ_TRIGGER 5
> +#define SPURIOUS_IRQ_REPORT_COUNT 5
> +#define SPURIOUS_IRQ_POLL_CYCLES 100
> +#define SPURIOUS_IRQ_POLL_INTERVAL (HZ/100)
> +
>  static void poll_spurious_irqs(unsigned long dummy);
>  static DEFINE_TIMER(poll_spurious_irq_timer, poll_spurious_irqs, 0, 0);
>  static int irq_poll_cpu;
> @@ -141,14 +146,15 @@ out:
>  static void poll_spurious_irqs(unsigned long dummy)
>  {
>      struct irq_desc *desc;
> -    int i;
> +    int i, poll_again;
>
>      if (atomic_inc_return(&irq_poll_active) != 1)
>          goto out;
>      irq_poll_cpu = smp_processor_id();
>
> +    poll_again = 0; /* Will stay false as long as no polling candidate is found */

Please do not put comments at the end of the code line. That really makes it hard to read. And this comment is pretty pointless. I wish the real functionality would have been commented properly.

>      for_each_irq_desc(i, desc) {
> -        unsigned int state;
> +        unsigned int state, irq;
>
>          if (!i)
>              continue;
> @@ -158,15 +164,38 @@ static void poll_spurious_irqs(unsigned
>          barrier();
>          if (!(state & IRQS_SPURIOUS_DISABLED))
>              continue;
> -
> -        local_irq_disable();
> -        try_one_irq(i, desc, true);
> -        local_irq_enable();
> +
> +        /* We end up here with a disabled stuck interrupt.
> +           desc->irqs_stuck_level now tracks the number of times
> +           the interrupt has been polled */

Please follow the coding style. Multiline comments are

/*
 * .....
 */

> +        irq = desc->irq_data.irq;

Huch? "i" has already the irq number.
> +        if (unlikely(desc->irqs_stuck_level == 1))
> +            if (desc->irqs_stuck_count <= SPURIOUS_IRQ_REPORT_COUNT)
> +                printk(KERN_EMERG "Polling handlers for IRQ %d.\n", irq);

This is definitely not an emergency message. And why do we want to print that over and over?

> +        if (desc->irqs_stuck_level < SPURIOUS_IRQ_POLL_CYCLES) {
> +            local_irq_disable();
> +            try_one_irq(i, desc, true);
> +            local_irq_enable();
> +            desc->irqs_stuck_level++;
> +            poll_again = 1;

So now we poll 100 times.

> +        } else {
> +            if (unlikely(desc->irqs_stuck_count <= SPURIOUS_IRQ_REPORT_COUNT)) {
> +                printk(KERN_EMERG "Reenabling IRQ %d.\n", irq);
> +                if (desc->irqs_stuck_count >= SPURIOUS_IRQ_REPORT_COUNT)
> +                    printk(KERN_EMERG "No more stuck interrupt reports for IRQ %d.\n", irq);

I really can't grok that logic. For the first five rounds where we polled 100 times and started over we print "Reenabling ..." On the 5th round we report: "No more stuck interrupt ...."

What's the point of this? Random debugging leftovers? So the last message in dmesg is: "No more stuck interrupt reports..." while the interrupt still is in polling mode or happily bouncing back and forth.

> +            }

And then we unconditionally reenable the interrupts after 100 polls to start over?

> +            irq_enable(desc); /* Reenable the interrupt line */

So now the interrupt can come in, or other functions can fiddle with istate / depth. What protects the access to the following two fields ?

> +            desc->depth--;
> +            desc->istate &= (~IRQS_SPURIOUS_DISABLED);

On the disable path we are protected, but not here.

> +            desc->irqs_stuck_level = 0;

And that one is not protected against another interrupt coming in and setting it to 1.
> +        }
>      }
> +    if (poll_again)
> +        mod_timer(&poll_spurious_irq_timer,
> +            jiffies + SPURIOUS_IRQ_POLL_INTERVAL);
>  out:
>      atomic_dec(&irq_poll_active);
> -    mod_timer(&poll_spurious_irq_timer,
> -        jiffies + POLL_SPURIOUS_IRQ_INTERVAL);

So in case the above

    if (atomic_inc_return(&irq_poll_active) != 1)
        goto out;

code path is taken, then we don't rearm the timer. This can happen when one CPU runs misrouted_irq() and the other runs the timer callback. If the misrouted_irq() call succeeds and there are still interrupts which depend on polling then nothing rearms the timer.

>  }
>
>  static inline int bad_action_ret(irqreturn_t action_ret)
> @@ -176,14 +205,6 @@ static inline int bad_action_ret(irqretu
>      return 1;
>  }
>
> -/*
> - * If 99,900 of the previous 100,000 interrupts have not been handled
> - * then assume that the IRQ is stuck in some manner. Drop a diagnostic
> - * and try to turn the IRQ off.
> - *
> - * (The other 100-of-100,000 interrupts may have been a correctly
> - * functioning device sharing an IRQ with the failing one)
> - */

Instead of removing the comment there should be one which is explaining what the new rules are.

>  static void
>  __report_bad_irq(unsigned int irq, struct irq_desc *desc,
>      irqreturn_t action_ret)
> @@ -272,7 +293,6 @@ void note_interrupt(unsigned int irq, st
>      if (desc->istate & IRQS_POLL_INPROGRESS)
>          return;
>
> -    /* we get here again via the threaded handler */

The comment is not brilliant, but removing it in the context of this change is just wrong.

>      if (action_ret == IRQ_WAKE_THREAD)
>          return;
>
> @@ -281,55 +301,47 @@ void note_interrupt(unsigned int irq, st
>          return;
>      }
>
> -    if (unlikely(action_ret == IRQ_NONE)) {
> -        /*
> -         * If we are seeing only the odd spurious IRQ caused by
> -         * bus asynchronicity then don't eventually trigger an error,
> -         * otherwise the counter becomes a doomsday timer for otherwise
> -         * working systems
> -         */

Again, this is useful information. Why are you removing it ?
> -        if (time_after(jiffies, desc->last_unhandled + HZ/10))
> -            desc->irqs_unhandled = 1;
> -        else
> -            desc->irqs_unhandled++;
> -        desc->last_unhandled = jiffies;
> -    }
> -
> -    if (unlikely(try_misrouted_irq(irq, desc, action_ret))) {
> -        int ok = misrouted_irq(irq);
> -        if (action_ret == IRQ_NONE)
> -            desc->irqs_unhandled -= ok;
> +    /* Adjust action_ret if an optional poll was successful.
> +       (See inlined try_misrouted_irq() for conditions (depending
> +       on 'irqfixup' and 'irqpoll'), and 'noirqdebug' must not
> +       be set, since we wouldn't be here (note_interrupt())
> +       at all in that case.) */

The missing information is _WHY_ we adjust action_ret. The interested reader will look into try_misrouted_irq() and figure out when it returns != 0.

> +    if (unlikely(try_misrouted_irq(irq, desc, action_ret)))
> +        if (misrouted_irq(irq))
> +            action_ret = IRQ_HANDLED;
> +

    if (a) {
        if (b)
            bla;
    }

Please

> +    if (unlikely((action_ret == IRQ_NONE) && time_before(jiffies, desc->irqs_stuck_timeout))) {

All this likely/unlikely making is just wrong. And please keep the code within 80 chars.

> +        desc->irqs_stuck_level++;

What's level doing ?

> +        if (desc->irqs_stuck_level > desc->irqs_stuck_level_max)
> +            desc->irqs_stuck_level_max = desc->irqs_stuck_level;

That's purely stats, right ?

> +        if (desc->irqs_stuck_level >= SPURIOUS_IRQ_TRIGGER) { /* The interrupt is stuck */

Please don't comment the obvious instead of commenting the interesting bits.

> +            desc->irqs_stuck_count++; /* TODO: Prevent hypothetical overflow */

Why should this overflow? Because this thing is bouncing back and forth from poll mode to interrupt mode forever ?

> +            if (desc->irqs_stuck_count <= SPURIOUS_IRQ_REPORT_COUNT) {
> +                __report_bad_irq(irq, desc, action_ret);
> +                printk(KERN_EMERG "Disabling IRQ %d.\n", irq);
> +            }

We report the interrupt as bad 5 times. Then we stop, but what tells us that we disabled it forever. Or don't we do that anymore? No we don't, AFAICT. That would mean that on a machine which has a real poll issue this bounces back and forth for ever. Not a real improvement.

> +            desc->istate |= IRQS_SPURIOUS_DISABLED;
> +            desc->depth++;
> +            irq_disable(desc);
> +            /* TODO: Do a safe access to the timer. Now we may be extending a deadline
> +               for a polling system already running for another interrupt. */

And who cares ?

> +            mod_timer(&poll_spurious_irq_timer,
> +                jiffies + SPURIOUS_IRQ_POLL_INTERVAL); /* Schedule a poll cycle */

We know that we schedule a poll cycle by arming the timer.

> +            desc->irqs_stuck_level = 1;
> +            desc->irqs_stuck_level_max = 0;

I'm confused. What's the rule for this "max" statistics?

> +    else {
> +        desc->irqs_stuck_timeout = jiffies + SPURIOUS_IRQ_TIMEOUT_INTERVAL;
> +        desc->irqs_stuck_level = 0;
>      }
> -    desc->irqs_unhandled = 0;
>  }
>
> -int noirqdebug __read_mostly;
> +bool noirqdebug __read_mostly;

See above.

I have no objections to improve the spurious/poll handling but this has to be done more carefully.

Also I'm very interested to figure out the root cause of this problem first. You provided quite some information already, but I'm still waiting for some more.

Aside of that it would be helpful if the AMD folks could try to get some more information about this issue on the documentation/errata level. Boris ???

Thanks,

	tglx

^ permalink raw reply [flat|nested] 40+ messages in thread
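Both reviews point at the same hazard: once mod_timer() moves in front of the out: label, a callback that loses the irq_poll_active check leaves the timer unarmed. The sketch below models just that control flow in user space. It is an illustration under assumptions, not kernel code: `struct poll_model`, `poll_rearm_on_exit()` and `poll_rearm_before_out()` are hypothetical names, the atomic counter is a plain int, and "timer armed" is a flag rather than a real timer.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for irq_poll_active and the pending state of the timer. */
struct poll_model {
	int poll_active;   /* models the atomic irq_poll_active counter */
	bool timer_armed;  /* models poll_spurious_irq_timer being pending */
};

/* Flow of the unpatched callback: the timer is rearmed on the shared exit path. */
static void poll_rearm_on_exit(struct poll_model *m)
{
	m->timer_armed = false;            /* the timer has just fired */
	if (++m->poll_active != 1)
		goto out;                  /* someone else is already polling */
	/* ... handlers of the disabled lines would be polled here ... */
out:
	m->poll_active--;
	m->timer_armed = true;             /* rearmed even when polling was skipped */
}

/* Flow of the patched callback: the rearm sits before the out: label. */
static void poll_rearm_before_out(struct poll_model *m, bool still_needs_polling)
{
	m->timer_armed = false;            /* the timer has just fired */
	if (++m->poll_active != 1)
		goto out;                  /* skips the rearm below as well */
	if (still_needs_polling)
		m->timer_armed = true;
out:
	m->poll_active--;
}
```

With poll_active already at 1 (standing in for misrouted_irq() running on another CPU), poll_rearm_before_out() returns with the timer unarmed even though a line still needs polling, whereas poll_rearm_on_exit() always leaves it armed. Keeping an unconditional rearm on the exit path, or rearming from whichever path held the counter, would be two possible ways out; neither is proposed in the thread.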
* Re: Unhandled IRQs on AMD E-450
  2012-05-30 0:07 ` Thomas Gleixner
@ 2012-05-30 10:44   ` Borislav Petkov
  0 siblings, 0 replies; 40+ messages in thread
From: Borislav Petkov @ 2012-05-30 10:44 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Jeroen Van den Keybus, Borislav Petkov, Linus Torvalds, Josh Boyer, Clemens Ladisch, andymatei, Huang, Shane, linux-kernel, Müller Keve

On Wed, May 30, 2012 at 02:07:20AM +0200, Thomas Gleixner wrote:
> I have no objections to improve the spurious/poll handling but this
> has to be done more carefully.
>
> Also I'm very interested to figure out the root cause of this problem
> first. You provided quite some information already, but I'm still
> waiting for some more.
>
> Aside of that it would be helpful if the AMD folks could try to get
> some more information about this issue on the documentation/errata
> level. Boris ???

Hmm, from reading what people have debugged so far, this looks like a problem with the Asmedia ASM1083 PCIe-PCI bridge:

http://marc.info/?l=linux-kernel&m=132793745319418

and both Intel and AMD boards which are integrating this thing are affected. There's even a great message from Asmedia at the end of the thread, it is a lot of fun reading.

As always, I'll try to ask around but can't promise anything. We probably need to get some Intel folk involved too.

@Müller: can you send me the mail address of that Asmedia contact privately? If ASUS are on CC, them too.

Thanks.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632
WEEE Registernr: 129 19551

^ permalink raw reply [flat|nested] 40+ messages in thread
[parent not found: <fa.Tzg9rJm1oEMGIL8eap99R7gLU4Q@ifi.uio.no>]
[parent not found: <fa.Yw7gRhZrXlfCxofC1BHK22C+oTk@ifi.uio.no>]
[parent not found: <fa.l7CBcHbzr+l317AuKP87w9mccUk@ifi.uio.no>]
[parent not found: <fa.VXfk4ts2TBVKqgBQtfGn6RQHemg@ifi.uio.no>]
* Re: Unhandled IRQs on AMD E-450
  [not found] ` <fa.VXfk4ts2TBVKqgBQtfGn6RQHemg@ifi.uio.no>
@ 2012-04-25 8:48 ` andymatei
  0 siblings, 0 replies; 40+ messages in thread
From: andymatei @ 2012-04-25 8:48 UTC (permalink / raw)
  To: fa.linux.kernel
  Cc: andymatei, Huang, Shane, Borislav Petkov, Nguyen, Dong, linux-kernel

Hmm... this means changing the motherboard... great, a lot of money for a shitty Asus board.

^ permalink raw reply [flat|nested] 40+ messages in thread
end of thread, other threads:[~2012-05-30 10:43 UTC | newest]
Thread overview: 40+ messages
2011-11-29 21:44 Unhandled IRQs on AMD E-450 Jeroen Van den Keybus
2011-11-30 8:30 ` Clemens Ladisch
2011-11-30 15:44 ` Borislav Petkov
2011-12-01 8:01 ` Huang, Shane
2011-12-03 20:36 ` Jeroen Van den Keybus
2011-12-04 12:48 ` Clemens Ladisch
2011-12-04 13:36 ` Jeroen Van den Keybus
2011-12-04 13:54 ` Jeroen Van den Keybus
2011-12-04 14:08 ` Jeroen Van den Keybus
2011-12-04 15:06 ` Jeroen Van den Keybus
2011-12-04 16:59 ` Clemens Ladisch
2011-12-06 0:06 ` Jeroen Van den Keybus
2011-12-08 11:33 ` Jeroen Van den Keybus
2011-12-08 12:45 ` Clemens Ladisch
2011-12-08 21:27 ` Jeroen Van den Keybus
2011-12-09 8:22 ` Clemens Ladisch
2011-12-09 11:17 ` Jeroen Van den Keybus
2011-12-09 12:55 ` Clemens Ladisch
2011-12-10 12:10 ` Jeroen Van den Keybus
2011-12-10 17:58 ` Clemens Ladisch
2011-12-11 15:28 ` Jeroen Van den Keybus
[not found] <fa.CZQqvHf3CBfYWzhSDPNOWxTTD9w@ifi.uio.no>
[not found] ` <fa.Vmg5vDod2/oKvwyy9BcalhoT+Lo@ifi.uio.no>
2012-04-25 8:35 ` andymatei
2012-04-25 8:48 ` Clemens Ladisch
2012-04-27 8:22 ` Borislav Petkov
2012-04-27 8:29 ` Andrei Matei
2012-04-27 11:46 ` Jeroen Van den Keybus
2012-04-27 13:06 ` Josh Boyer
2012-04-27 13:28 ` Jeroen Van den Keybus
2012-04-27 13:49 ` Josh Boyer
2012-04-30 8:29 ` Jeroen Van den Keybus
2012-04-30 9:57 ` Clemens Ladisch
2012-04-30 10:41 ` Jeroen Van den Keybus
2012-04-30 12:47 ` Clemens Ladisch
2012-05-29 22:20 ` Grant Likely
2012-04-30 10:21 ` Borislav Petkov
2012-04-30 11:35 ` Jeroen Van den Keybus
2012-05-29 23:36 ` Grant Likely
2012-05-30 0:07 ` Thomas Gleixner
2012-05-30 10:44 ` Borislav Petkov
[not found] <fa.Tzg9rJm1oEMGIL8eap99R7gLU4Q@ifi.uio.no>
[not found] ` <fa.Yw7gRhZrXlfCxofC1BHK22C+oTk@ifi.uio.no>
[not found] ` <fa.l7CBcHbzr+l317AuKP87w9mccUk@ifi.uio.no>
[not found] ` <fa.VXfk4ts2TBVKqgBQtfGn6RQHemg@ifi.uio.no>
2012-04-25 8:48 ` andymatei