* 2.6.16 fails to resume after INIT in user space
@ 2006-04-05 4:28 Keith Owens
2006-04-05 12:16 ` Francois WELLENREITER
From: Keith Owens @ 2006-04-05 4:28 UTC
To: linux-ia64
2.6.16 on SN2, compiled with gcc 3.3.3, no KDB.
The SN2 controller 'NMI' command sends INIT to all processors, one as
monarch, the rest as slaves. If all the processors are in kernel space
(including idle) then INIT resumes after dumping the process list. If
any of the processors are in user space then INIT claims to resume but
gets something wrong, and the system becomes unresponsive.
Send first NMI
Entered OS INIT handler. PSP=ffe301a0 cpu=0 monarch=0
cpu 0, INIT occurred in user space, original stack not modified
Entered OS INIT handler. PSP=ffe301a0 cpu=3 monarch=0
Entered OS INIT handler. PSP=ffe301a0 cpu=2 monarch=0
Entered OS INIT handler. PSP=ffe301a0 cpu=1 monarch=1
Delaying for 5 seconds...
Processes interrupted by INIT - 0 (cpu 1 task 0xe00000b47a4b8000) 0 (cpu 2 task 0xe00000b47a4e8000) 0 (cpu 3 task 0xe00000b47a500000)
... process dump ...
INIT dump complete. Monarch on cpu 1 returning to normal service.
Slave on cpu 0 returning to normal service.
Slave on cpu 3 returning to normal service.
Slave on cpu 2 returning to normal service.
... No response ...
Send second NMI
Entered OS INIT handler. PSP=ffe301a0 cpu=3 monarch=0
Entered OS INIT handler. PSP=ffe301a0 cpu=0 monarch=0
cpu 0, INIT inconsistent previous current and r13, original stack not modified
Entered OS INIT handler. PSP=ffe301a0 cpu=2 monarch=0
Entered OS INIT handler. PSP=ffe301a0 cpu=1 monarch=1
Delaying for 5 seconds...
Processes interrupted by INIT - 0 (cpu 1 task 0xe00000b47a4b8000) 0 (cpu 2 task 0xe00000b47a4e8000) 0 (cpu 3 task 0xe00000b47a500000)
cpu 0 was running in user space during the first NMI, so the original
stack was not modified. On the second NMI, current for cpu 0 does not
match r13, which means that something went wrong when processing the
first NMI while the process was in user space.
I am still investigating this problem, but any other eyes on the code
would be appreciated.
* Re: 2.6.16 fails to resume after INIT in user space
From: Francois WELLENREITER @ 2006-04-05 12:16 UTC
To: linux-ia64
Hi Keith and all,
Concerning this issue, INIT works well on the Bull Novascale 5160.
However, have you tested the INIT feature with a 2.6.15 kernel?
Since that kernel version, I have noticed that on Intel Tiger machines
the behavior is exactly the same as the one you describe.
After a more detailed investigation with an ITP, I have seen that the
trouble always happens when executing the following code:
________________________________________
ia64_old_stack:
add regs=MCA_PT_REGS_OFFSET, r3
mov b0=r2 // save return address
GET_IA64_MCA_DATA(temp2)
LOAD_PHYSICAL(p0,temp1,1f)
;;
mov cr.ipsr=r0
mov cr.ifs=r0
mov cr.iip=temp1
;;
invala
rfi <---------------------------------------
________________________________________
After the rfi instruction, the kernel INIT handler is called again
instead of executing the code located at the "temp1" address.
Since we provide our own SAL version on NS5160 machines, I think that
the problem might be located at the SAL level.
My understanding is that the SAL may be mishandling INIT events: when
the psr.mc bit is forced back to 0, the previous INIT signal is no
longer filtered, and the entire INIT call chain is executed again.
But this is just a personal interpretation and I have no proof of it.
This point has been submitted to Intel gurus and is under investigation.
Best regards,
Francois WELLENREITER