* Timer interrupt lost on some x86_64 systems
From: Neil Horman @ 2007-11-07 14:00 UTC
To: kexec; +Cc: nhorman

Hey all-
I've been getting reports of some x86_64 systems that, on kdump kernel
boot get stuck in calibrate_delay(), in both RHEL kernels and upstream kernels.
The current thinking is that the lapic timer interrupt is no longer getting
delivered, likely because we handle a crash condition on a cpu that isn't the
boot cpu. One known offender is this motherboard:
http://www.supermicro.com/Aplus/motherboard/Opteron8000/MCP55/H8QM8-2.cfm
My current thought is that the TIMER_LVT entry is masked on all but the boot cpu
on this system (which is strange, as I was under the impression that the timer
interrupt was supposed to be enabled on all CPUs nominally). At any rate, I was
going to try to read/write the TIMER_LVT on the crashing processor before we
jump to purgatory, or in purgatory itself, to see if that fixes the problem, but
I wanted to report the issue here to see if anyone had any alternate thoughts.
I know that Intel ioapics had a timer related problem recently that caused the
same issue, but the fix doesn't seem to help in this case.

Thanks & Regards
Neil

--
/***************************************************
 *Neil Horman
 *Software Engineer
 *Red Hat, Inc.
 *nhorman@redhat.com
 *gpg keyid: 1024D / 0x92A74FA1
 *http://pgp.mit.edu
 ***************************************************/

_______________________________________________
kexec mailing list
kexec@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/kexec
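The read/modify/write of the timer LVT entry proposed above can be sketched in userspace as follows. This is only an illustration, not kernel code: it assumes the architectural layout of the local APIC LVT Timer register (vector in bits 0-7, mask in bit 16, periodic mode in bit 17), and the names `lvt_timer`, `decode_lvt_timer` and `lvt_timer_unmask` are hypothetical. In the kernel the raw value would come from something like apic_read(APIC_LVTT); here it is simply passed in.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit layout of the local APIC LVT Timer register. */
#define LVT_VECTOR_MASK     0x000000FFu /* bits 0-7: interrupt vector */
#define LVT_MASKED          (1u << 16)  /* bit 16: 1 = interrupt masked */
#define LVT_TIMER_PERIODIC  (1u << 17)  /* bit 17: 1 = periodic, 0 = one-shot */

/* Hypothetical decoded view of the register, for printing or comparing. */
struct lvt_timer {
    uint8_t vector;
    bool masked;
    bool periodic;
};

/* Split a raw LVT Timer value into its fields. */
struct lvt_timer decode_lvt_timer(uint32_t raw)
{
    struct lvt_timer t;

    t.vector = (uint8_t)(raw & LVT_VECTOR_MASK);
    t.masked = (raw & LVT_MASKED) != 0;
    t.periodic = (raw & LVT_TIMER_PERIODIC) != 0;
    return t;
}

/* Return the value to write back so that the timer entry is unmasked;
 * every other bit is left exactly as it was read. */
uint32_t lvt_timer_unmask(uint32_t raw)
{
    return raw & ~LVT_MASKED;
}
```

Decoding the register on the crashing cpu before the jump to purgatory would show whether the mask bit is really set there, and `lvt_timer_unmask` gives the value to write back if it is.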
* Re: Timer interrupt lost on some x86_64 systems
From: Vivek Goyal @ 2007-11-12 4:49 UTC
To: Neil Horman; +Cc: kexec

On Wed, Nov 07, 2007 at 09:00:06AM -0500, Neil Horman wrote:
> Hey all-
> I've been getting reports of some x86_64 systems that, on kdump kernel
> boot get stuck in calibrate_delay(), in both RHEL kernels and upstream kernels.
> The current thinking is that the lapic timer interrupt is no longer getting
> delivered, likely because we handle a crash condition on a cpu that isn't the
> boot cpu. One known offender is this motherboard:
> http://www.supermicro.com/Aplus/motherboard/Opteron8000/MCP55/H8QM8-2.cfm
> My current thought is that the TIMER_LVT entry is masked on all but the boot cpu
> on this system (which is strange, as I was under the impression that the timer
> interrupt was supposed to be enabled on all CPUs nominally).

I also thought that LAPIC timer interrupts are enabled on all cpus.

> At any rate, I was
> going to try to read/write the TIMER_LVT on the crashing processor before we
> jump to purgatory, or in purgatory itself, to see if that fixes the problem, but

I think calibrate_delay() depends on external timer interrupt coming and
not the local APIC timer interrupt. Generally it is the 8254 timer chip.
Nowadays motherboards seem to have HPET and I know somebody has reported
problems with HPET where HPET interrupts are not coming in second kernel and
system hangs in second kernel. I suspect that same might be the issue here.

Thanks
Vivek
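To illustrate why a missing tick interrupt shows up as a hang in calibrate_delay(): the calibration starts by spinning until the tick counter (jiffies) changes, and jiffies only advances when the timer interrupt is handled. The following is a userspace model of that wait, not the kernel routine. The names `read_ticks_fn`, `wait_for_tick`, `stuck_ticks` and `working_ticks` are hypothetical, and the spin limit is an addition so the example terminates; the kernel loop has no such limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for reading the kernel's jiffies counter. */
typedef uint64_t (*read_ticks_fn)(void);

/*
 * Spin until the tick counter advances, as the start of a delay
 * calibration does, but give up after max_spins reads instead of
 * spinning forever. Returns true if a tick was observed.
 */
bool wait_for_tick(read_ticks_fn read_ticks, uint64_t max_spins)
{
    assert(read_ticks != NULL);

    uint64_t start = read_ticks();

    for (uint64_t i = 0; i < max_spins; i++) {
        if (read_ticks() != start)
            return true;
    }
    return false; /* no tick arrived: the unbounded loop would hang here */
}

/* Tick source that never advances: models a lost timer interrupt. */
uint64_t stuck_ticks(void)
{
    return 0;
}

/* Tick source that advances once per 1000 reads: models a working timer. */
uint64_t working_ticks(void)
{
    static uint64_t reads;

    return ++reads / 1000;
}
```

With `stuck_ticks` the wait never succeeds, which matches the reported symptom regardless of whether the tick is supposed to come from the 8254, the HPET or the local APIC timer.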
* Re: Timer interrupt lost on some x86_64 systems
From: Neil Horman @ 2007-11-12 15:17 UTC
To: Vivek Goyal; +Cc: Neil Horman, kexec

On Mon, Nov 12, 2007 at 10:19:03AM +0530, Vivek Goyal wrote:
> On Wed, Nov 07, 2007 at 09:00:06AM -0500, Neil Horman wrote:
> > Hey all-
> > I've been getting reports of some x86_64 systems that, on kdump kernel
> > boot get stuck in calibrate_delay(), in both RHEL kernels and upstream kernels.
> > The current thinking is that the lapic timer interrupt is no longer getting
> > delivered, likely because we handle a crash condition on a cpu that isn't the
> > boot cpu. One known offender is this motherboard:
> > http://www.supermicro.com/Aplus/motherboard/Opteron8000/MCP55/H8QM8-2.cfm
> > My current thought is that the TIMER_LVT entry is masked on all but the boot cpu
> > on this system (which is strange, as I was under the impression that the timer
> > interrupt was supposed to be enabled on all CPUs nominally).
>
> I also thought that LAPIC timer interrupts are enabled on all cpus.
>
That doesn't appear to be the case. The configuration I've seen is that only
one lapic has timer interrupts enabled, and the interrupt handler for the timer
interrupt broadcasts the interrupt to all the other processors via IPI.

> > At any rate, I was
> > going to try to read/write the TIMER_LVT on the crashing processor before we
> > jump to purgatory, or in purgatory itself, to see if that fixes the problem, but
>
> I think calibrate_delay() depends on external timer interrupt coming and
> not the local APIC timer interrupt. Generally it is the 8254 timer chip.
> Nowadays motherboards seem to have HPET and I know somebody has reported
> problems with HPET where HPET interrupts are not coming in second kernel and
> system hangs in second kernel. I suspect that same might be the issue here.
>
Perhaps, do you have a pointer to any list discussions on the subject? I've not
seen any yet.

Thanks
Neil

> Thanks
> Vivek
* Re: Timer interrupt lost on some x86_64 systems
From: Neil Horman @ 2007-11-12 15:41 UTC
To: Neil Horman; +Cc: Vivek Goyal, kexec

On Mon, Nov 12, 2007 at 10:17:21AM -0500, Neil Horman wrote:
> On Mon, Nov 12, 2007 at 10:19:03AM +0530, Vivek Goyal wrote:
> > On Wed, Nov 07, 2007 at 09:00:06AM -0500, Neil Horman wrote:
> > > Hey all-
> > > I've been getting reports of some x86_64 systems that, on kdump kernel
> > > boot get stuck in calibrate_delay(), in both RHEL kernels and upstream kernels.
> > > The current thinking is that the lapic timer interrupt is no longer getting
> > > delivered, likely because we handle a crash condition on a cpu that isn't the
> > > boot cpu. One known offender is this motherboard:
> > > http://www.supermicro.com/Aplus/motherboard/Opteron8000/MCP55/H8QM8-2.cfm
> > > My current thought is that the TIMER_LVT entry is masked on all but the boot cpu
> > > on this system (which is strange, as I was under the impression that the timer
> > > interrupt was supposed to be enabled on all CPUs nominally).
> >
> > I also thought that LAPIC timer interrupts are enabled on all cpus.
> >
> That doesn't appear to be the case. The configuration I've seen is that only
> one lapic has timer interrupts enabled, and the interrupt handler for the timer
> interrupt broadcasts the interrupt to all the other processors via IPI.
>
> > > At any rate, I was
> > > going to try to read/write the TIMER_LVT on the crashing processor before we
> > > jump to purgatory, or in purgatory itself, to see if that fixes the problem, but
> >
> > I think calibrate_delay() depends on external timer interrupt coming and
> > not the local APIC timer interrupt. Generally it is the 8254 timer chip.
> > Nowadays motherboards seem to have HPET and I know somebody has reported
> > problems with HPET where HPET interrupts are not coming in second kernel and
> > system hangs in second kernel. I suspect that same might be the issue here.
> >
> Perhaps, do you have a pointer to any list discussions on the subject? I've not
> seen any yet.
>
> Thanks
> Neil
>
> > Thanks
> > Vivek
>
Although, as I look at it, it would appear that time_init from start_kernel does
seem to init the hpet if it's available, and it silently fails if that doesn't
work, moving on to the pmtimer and pit. I wonder if there is some extra magic
to resetting the hpet to run on a different cpu for some systems...

Neil
* Re: Timer interrupt lost on some x86_64 systems
From: Vivek Goyal @ 2007-11-13 7:31 UTC
To: Neil Horman; +Cc: kexec

On Mon, Nov 12, 2007 at 10:41:19AM -0500, Neil Horman wrote:

[..]
> >
> Although, as I look at it, it would appear that time_init from start_kernel does
> seem to init the hpet if it's available, and it silently fails if that doesn't
> work, moving on to the pmtimer and pit. I wonder if there is some extra magic
> to resetting the hpet to run on a different cpu for some systems...
> Neil
>

Any idea what kind of timer devices this motherboard has got? Which timer
device gets activated in first kernel? Then we can focus on why the
interrupts from same device are not coming in second kernel.

In the past I have found issues with interrupt routing on IOAPIC and
interrupt lockup on LAPIC. But these issues are already solved. I would
also think of printing LAPIC and IOAPIC entries to see how timer interrupt
routing changes from first kernel to second.

Thanks
Vivek
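For comparing IOAPIC entries between the first and second kernel, it helps to decode the raw 64-bit redirection table entry into the fields that decide routing. A minimal userspace sketch follows. It assumes the standard IOAPIC redirection entry layout (vector in bits 0-7, delivery mode in bits 8-10, destination mode in bit 11, mask in bit 16, destination in bits 56-63); the names `ioapic_rte`, `decode_rte` and `rte_routing_changed` are hypothetical, and the raw values would have to be dumped from each kernel by some other means.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded view of one IOAPIC redirection table entry. */
struct ioapic_rte {
    uint8_t vector;        /* bits 0-7 */
    uint8_t delivery_mode; /* bits 8-10: 0 = Fixed, 7 = ExtINT */
    bool logical_dest;     /* bit 11: 0 = physical, 1 = logical */
    bool masked;           /* bit 16: 1 = interrupt masked */
    uint8_t dest;          /* bits 56-63: destination APIC ID or set */
};

/* Split a raw 64-bit redirection entry into its routing fields. */
struct ioapic_rte decode_rte(uint64_t raw)
{
    struct ioapic_rte e;

    e.vector = (uint8_t)(raw & 0xFF);
    e.delivery_mode = (uint8_t)((raw >> 8) & 0x7);
    e.logical_dest = ((raw >> 11) & 1) != 0;
    e.masked = ((raw >> 16) & 1) != 0;
    e.dest = (uint8_t)(raw >> 56);
    return e;
}

/* True if any field that affects where the interrupt is delivered
 * differs between the entry seen in the first and second kernel. */
bool rte_routing_changed(uint64_t first_kernel, uint64_t second_kernel)
{
    struct ioapic_rte a = decode_rte(first_kernel);
    struct ioapic_rte b = decode_rte(second_kernel);

    return a.vector != b.vector ||
           a.delivery_mode != b.delivery_mode ||
           a.logical_dest != b.logical_dest ||
           a.masked != b.masked ||
           a.dest != b.dest;
}
```

Running the timer pin's entry from both kernels through `rte_routing_changed` would flag a changed destination or a mask bit that was left set.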
* Re: Timer interrupt lost on some x86_64 systems
From: Neil Horman @ 2007-11-13 14:33 UTC
To: Vivek Goyal; +Cc: Neil Horman, kexec

On Tue, Nov 13, 2007 at 01:01:28PM +0530, Vivek Goyal wrote:
> On Mon, Nov 12, 2007 at 10:41:19AM -0500, Neil Horman wrote:
>
> [..]
> > >
> > Although, as I look at it, it would appear that time_init from start_kernel does
> > seem to init the hpet if it's available, and it silently fails if that doesn't
> > work, moving on to the pmtimer and pit. I wonder if there is some extra magic
> > to resetting the hpet to run on a different cpu for some systems...
> > Neil
> >
>
> Any idea what kind of timer devices this motherboard has got? Which timer
> device gets activated in first kernel? Then we can focus on why the
> interrupts from same device are not coming in second kernel.
>
Not sure, that's a course of investigation I've got planned to pursue when our
on-site guy gets back from a conference next week.

> In the past I have found issues with interrupt routing on IOAPIC and
> interrupt lockup on LAPIC. But these issues are already solved. I would
> also think of printing LAPIC and IOAPIC entries to see how timer interrupt
> routing changes from first kernel to second.
>
I recently read the ioapic section in the opteron processor guide and noted the
ioapic routing field in the config registers, so I'll be looking at that. We
also note that in the failing case on the systems in question the boot cpu is
_not_ the cpu that boots the kdump kernel, and its APIC ID is 1 not 0, IIRC

Thanks
Neil

> Thanks
> Vivek
* Re: Timer interrupt lost on some x86_64 systems
From: Vivek Goyal @ 2007-11-14 6:39 UTC
To: Neil Horman; +Cc: kexec

On Tue, Nov 13, 2007 at 09:33:30AM -0500, Neil Horman wrote:

[..]
> > In the past I have found issues with interrupt routing on IOAPIC and
> > interrupt lockup on LAPIC. But these issues are already solved. I would
> > also think of printing LAPIC and IOAPIC entries to see how timer interrupt
> > routing changes from first kernel to second.
> >
> I recently read the ioapic section in the opteron processor guide and noted the
> ioapic routing field in the config registers, so I'll be looking at that. We
> also note that in the failing case on the systems in question the boot cpu is
> _not_ the cpu that boots the kdump kernel, and its APIC ID is 1 not 0, IIRC
>

Failing on non-boot cpu should not be an issue. I had fixed an issue in the
past where non-boot cpu was not receiving the timer interrupts because of
IOAPIC settings where timer interrupts were always routed to boot cpu (cpu0).

Now it has been modified and while going down we determine which cpu we
are crashing on and set up IOAPIC entry accordingly. See disable_IO_APIC().

Thanks
Vivek
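The idea of pointing the IOAPIC entry at the crashing cpu can be sketched as below. This is not the body of disable_IO_APIC(); it is a userspace illustration that assumes a virtual wire entry is programmed as ExtINT delivery, physical destination mode, edge-triggered, active-high, unmasked, with vector 0, and that the only per-crash input is the APIC ID of the cpu taking the crash. The names `make_virtual_wire_rte` and `rte_dest` are hypothetical.

```c
#include <stdint.h>

/* Assumed IOAPIC redirection entry fields used for a virtual wire entry. */
#define RTE_DELIVERY_EXTINT  (UINT64_C(7) << 8)  /* bits 8-10 = 111b: ExtINT */
#define RTE_MASKED           (UINT64_C(1) << 16) /* bit 16: 1 = masked */
#define RTE_DEST_SHIFT       56                  /* bits 56-63: destination */

/*
 * Build a redirection entry that forwards the legacy (8259) interrupt to
 * the cpu we are crashing on rather than unconditionally to the boot cpu.
 * All other fields stay zero: vector 0, physical destination mode,
 * edge-triggered, active-high, unmasked.
 */
uint64_t make_virtual_wire_rte(uint8_t crashing_apic_id)
{
    return RTE_DELIVERY_EXTINT |
           ((uint64_t)crashing_apic_id << RTE_DEST_SHIFT);
}

/* Extract the destination APIC ID, e.g. to verify what was programmed. */
uint8_t rte_dest(uint64_t rte)
{
    return (uint8_t)(rte >> RTE_DEST_SHIFT);
}
```

For the failing case described earlier in the thread, where the crashing cpu has APIC ID 1, the destination field should read 1; reading the entry back and finding 0, or finding the mask bit set, would point at the IOAPIC setup rather than the timer device.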
* Re: Timer interrupt lost on some x86_64 systems
From: Neil Horman @ 2007-11-14 11:51 UTC
To: Vivek Goyal; +Cc: Neil Horman, kexec

On Wed, Nov 14, 2007 at 12:09:39PM +0530, Vivek Goyal wrote:
> On Tue, Nov 13, 2007 at 09:33:30AM -0500, Neil Horman wrote:
>
> [..]
> > > In the past I have found issues with interrupt routing on IOAPIC and
> > > interrupt lockup on LAPIC. But these issues are already solved. I would
> > > also think of printing LAPIC and IOAPIC entries to see how timer interrupt
> > > routing changes from first kernel to second.
> > >
> > I recently read the ioapic section in the opteron processor guide and noted the
> > ioapic routing field in the config registers, so I'll be looking at that. We
> > also note that in the failing case on the systems in question the boot cpu is
> > _not_ the cpu that boots the kdump kernel, and its APIC ID is 1 not 0, IIRC
> >
>
> Failing on non-boot cpu should not be an issue. I had fixed an issue in the
> past where non-boot cpu was not receiving the timer interrupts because of
> IOAPIC settings where timer interrupts were always routed to boot cpu (cpu0).
>
> Now it has been modified and while going down we determine which cpu we
> are crashing on and set up IOAPIC entry accordingly. See disable_IO_APIC().
>

I see the call to it in machine_crash_shutdown, but for whatever reason, it
doesn't seem to be having the desired effect in this case....hmmmmm...

Thanks
Neil

> Thanks
> Vivek
* Re: Timer interrupt lost on some x86_64 systems
From: Eric W. Biederman @ 2007-11-22 20:04 UTC
To: Neil Horman; +Cc: Vivek Goyal, kexec

Neil Horman <nhorman@redhat.com> writes:

> On Wed, Nov 14, 2007 at 12:09:39PM +0530, Vivek Goyal wrote:
>> On Tue, Nov 13, 2007 at 09:33:30AM -0500, Neil Horman wrote:
>>
>> [..]
>> > > In the past I have found issues with interrupt routing on IOAPIC and
>> > > interrupt lockup on LAPIC. But these issues are already solved. I would
>> > > also think of printing LAPIC and IOAPIC entries to see how timer interrupt
>> > > routing changes from first kernel to second.
>> > >
>> > I recently read the ioapic section in the opteron processor guide and noted the
>> > ioapic routing field in the config registers, so I'll be looking at that. We
>> > also note that in the failing case on the systems in question the boot cpu is
>> > _not_ the cpu that boots the kdump kernel, and its APIC ID is 1 not 0, IIRC
>> >
>>
>> Failing on non-boot cpu should not be an issue. I had fixed an issue in the
>> past where non-boot cpu was not receiving the timer interrupts because of
>> IOAPIC settings where timer interrupts were always routed to boot cpu (cpu0).
>>
>> Now it has been modified and while going down we determine which cpu we
>> are crashing on and set up IOAPIC entry accordingly. See disable_IO_APIC().
>>
>
> I see the call to it in machine_crash_shutdown, but for whatever reason, it
> doesn't seem to be having the desired effect in this case....hmmmmm...

I don't know if anything has happened. However, a lot of this looks like
going back to the current todo list item of getting the kernel to come up
initially in ioapic mode.

That simultaneously removes the need for machine_kexec to reprogram
interrupts in virtual wire mode, and it should ultimately simplify irq
initialization and make it more robust. At the very least it reduces the
amount of magic in early irq processing.

Eric
* Re: Timer interrupt lost on some x86_64 systems
From: Vivek Goyal @ 2007-11-13 7:22 UTC
To: Neil Horman; +Cc: kexec

On Mon, Nov 12, 2007 at 10:17:21AM -0500, Neil Horman wrote:
> On Mon, Nov 12, 2007 at 10:19:03AM +0530, Vivek Goyal wrote:
> > On Wed, Nov 07, 2007 at 09:00:06AM -0500, Neil Horman wrote:
> > > Hey all-
> > > I've been getting reports of some x86_64 systems that, on kdump kernel
> > > boot get stuck in calibrate_delay(), in both RHEL kernels and upstream kernels.
> > > The current thinking is that the lapic timer interrupt is no longer getting
> > > delivered, likely because we handle a crash condition on a cpu that isn't the
> > > boot cpu. One known offender is this motherboard:
> > > http://www.supermicro.com/Aplus/motherboard/Opteron8000/MCP55/H8QM8-2.cfm
> > > My current thought is that the TIMER_LVT entry is masked on all but the boot cpu
> > > on this system (which is strange, as I was under the impression that the timer
> > > interrupt was supposed to be enabled on all CPUs nominally).
> >
> > I also thought that LAPIC timer interrupts are enabled on all cpus.
> >
> That doesn't appear to be the case. The configuration I've seen is that only
> one lapic has timer interrupts enabled, and the interrupt handler for the timer
> interrupt broadcasts the interrupt to all the other processors via IPI.
>
> > > At any rate, I was
> > > going to try to read/write the TIMER_LVT on the crashing processor before we
> > > jump to purgatory, or in purgatory itself, to see if that fixes the problem, but
> >
> > I think calibrate_delay() depends on external timer interrupt coming and
> > not the local APIC timer interrupt. Generally it is the 8254 timer chip.
> > Nowadays motherboards seem to have HPET and I know somebody has reported
> > problems with HPET where HPET interrupts are not coming in second kernel and
> > system hangs in second kernel. I suspect that same might be the issue here.
> >
> Perhaps, do you have a pointer to any list discussions on the subject? I've not
> seen any yet.
>

http://lkml.org/lkml/2007/8/20/155

Reading through the thread again, it looks like this guy faced the issue with
i386 machines and not x86_64 machines.

Thanks
Vivek