* Re: 2.6.16 fails to resume after INIT in user space
2006-04-05 4:28 2.6.16 fails to resume after INIT in user space Keith Owens
@ 2006-04-05 12:16 ` Francois WELLENREITER
From: Francois WELLENREITER @ 2006-04-05 12:16 UTC
To: linux-ia64
Hi Keith and all,
Regarding this issue, it works well on the Bull Novascale 5160.
However, have you tested the INIT feature with a 2.6.15 kernel?
Since that kernel version, I have noticed that on Intel Tiger machines
the behavior is exactly the same as what you describe below.
After a more detailed investigation with an ITP, I have seen that the
trouble always happens when executing the following code:
________________________________________
ia64_old_stack:
add regs=MCA_PT_REGS_OFFSET, r3
mov b0=r2 // save return address
GET_IA64_MCA_DATA(temp2)
LOAD_PHYSICAL(p0,temp1,1f)
;;
mov cr.ipsr=r0
mov cr.ifs=r0
mov cr.iip=temp1
;;
invala
rfi <---------------------------------------
________________________________________
After the rfi instruction, the kernel INIT handler is called again instead
of executing the code located at the "temp1" address.
Since we provide our own SAL version on NS5160 machines, I think that
the problem might be located at the SAL level.
My understanding is that there might be a malfunction in the SAL's
INIT event management: when the psr.mc bit is forced to 0 again, the
previous INIT signal is no longer filtered, and the entire INIT call chain
is executed again. But this is just a personal interpretation and I have
no proof of it.
This point has been submitted to Intel gurus and is under investigation.
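To make the hypothesis above concrete, here is a toy Python model. It is
not kernel code and not a description of real SAL or Itanium behavior.
The function name rfi_target and the "init_still_pending" flag are
hypothetical; the only hardware detail used is that psr.mc is bit 35 of
the PSR. The assumption being modelled is that an INIT left pending by SAL
is held off only while psr.mc is 1.

```python
PSR_MC = 1 << 35  # psr.mc; bit 35, as in the Linux IA64_PSR_MC_BIT definition


def rfi_target(init_still_pending, ipsr=0, iip="temp1"):
    """Return where a toy CPU lands after rfi.

    Assumption (hypothetical model): an INIT that SAL left pending is
    held off only while psr.mc is 1 and is delivered as soon as it is 0.
    """
    psr = ipsr  # rfi restores psr from cr.ipsr; the code above wrote r0 (zero)
    if init_still_pending and not (psr & PSR_MC):
        return "init_handler"  # INIT delivered again: handler re-entered
    return iip  # otherwise execution continues at cr.iip


print(rfi_target(init_still_pending=False))  # SAL cleared the INIT
print(rfi_target(init_still_pending=True))   # SAL left the INIT pending
```

In this model the first call lands at "temp1" and the second re-enters
the INIT handler, which matches what the ITP shows on Tiger. Whether SAL
really leaves the INIT pending is exactly the unproven part.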
Best regards,
Francois WELLENREITER
>2.6.16 on SN2, compiled with gcc 3.3.3, no KDB.
>
>The SN2 controller 'NMI' command sends INIT to all processors, one as
>monarch, the rest as slaves. If all the processors are in kernel space
>(including idle) then INIT resumes after dumping the process list. If
>any of the processors are in user space then INIT claims to resume but
>gets something wrong, the system becomes dead.
>
>Send first NMI
>
> Entered OS INIT handler. PSP=ffe301a0 cpu=0 monarch=0
> cpu 0, INIT occurred in user space, original stack not modified
> Entered OS INIT handler. PSP=ffe301a0 cpu=3 monarch=0
> Entered OS INIT handler. PSP=ffe301a0 cpu=2 monarch=0
> Entered OS INIT handler. PSP=ffe301a0 cpu=1 monarch=1
> Delaying for 5 seconds...
> Processes interrupted by INIT - 0 (cpu 1 task 0xe00000b47a4b8000) 0 (cpu 2 task 0xe00000b47a4e8000) 0 (cpu 3 task 0xe00000b47a500000)
>
> ... process dump ...
>
> INIT dump complete. Monarch on cpu 1 returning to normal service.
> Slave on cpu 0 returning to normal service.
> Slave on cpu 3 returning to normal service.
> Slave on cpu 2 returning to normal service.
>
> ... No response ...
>
>Send second NMI
>
> Entered OS INIT handler. PSP=ffe301a0 cpu=3 monarch=0
> Entered OS INIT handler. PSP=ffe301a0 cpu=0 monarch=0
> cpu 0, INIT inconsistent previous current and r13, original stack not modified
> Entered OS INIT handler. PSP=ffe301a0 cpu=2 monarch=0
> Entered OS INIT handler. PSP=ffe301a0 cpu=1 monarch=1
> Delaying for 5 seconds...
> Processes interrupted by INIT - 0 (cpu 1 task 0xe00000b47a4b8000) 0 (cpu 2 task 0xe00000b47a4e8000) 0 (cpu 3 task 0xe00000b47a500000)
>
>cpu 0 was running in user space during the first NMI, so the original
>stack was not modified. On the second NMI, current for cpu 0 does not
>match r13. Which means that something went wrong when processing the
>first NMI while the process was in user space.
>
>I am still investigating this problem, but any other eyes on the code
>would be appreciated.
>