From mboxrd@z Thu Jan 1 00:00:00 1970
From: Francois WELLENREITER
Date: Wed, 05 Apr 2006 12:16:17 +0000
Subject: Re: 2.6.16 fails to resume after INIT in user space
Message-Id: <4433B511.1010301@bull.net>
References: <12848.1144211334@kao2.melbourne.sgi.com>
In-Reply-To: <12848.1144211334@kao2.melbourne.sgi.com>
To: linux-ia64@vger.kernel.org

Hi Keith and all,

Concerning this issue, resuming after INIT works well on Bull Novascale 5160.
However, have you tested the INIT feature with a 2.6.15 kernel?
Since that kernel version, I have noticed that on Intel Tiger machines
the behavior is exactly the same as the one you describe below.

After a more detailed investigation with an ITP, I have seen that the
trouble always happens when executing the following code:
________________________________________

ia64_old_stack:
        add regs=MCA_PT_REGS_OFFSET, r3
        mov b0=r2                       // save return address
        GET_IA64_MCA_DATA(temp2)
        LOAD_PHYSICAL(p0,temp1,1f)
        ;;
        mov cr.ipsr=r0
        mov cr.ifs=r0
        mov cr.iip=temp1
        ;;
        invala
        rfi   <---------------------------------------
________________________________________

After the rfi instruction, the kernel INIT handler is called again instead
of executing the code located at the "temp1" address.

Since we provide our own SAL version on NS5160 machines, I think that the
problem might be located at the SAL level. My understanding is that there
might be a malfunction in the SAL's INIT event management: when the psr.mc
bit is forced back to 0, the previous INIT signal is no longer filtered,
and the entire INIT call chain is executed again.
But this is just a personal interpretation and I have no proof of it.

This point has been submitted to Intel gurus and is under investigation.

Best regards,

    Francois WELLENREITER

>2.6.16 on SN2, compiled with gcc 3.3.3, no KDB.
>
>The SN2 controller 'NMI' command sends INIT to all processors, one as
>monarch, the rest as slaves.  If all the processors are in kernel space
>(including idle) then INIT resumes after dumping the process list.  If
>any of the processors are in user space then INIT claims to resume but
>gets something wrong, the system becomes dead.
>
>Send first NMI
>
>  Entered OS INIT handler. PSP=FFe301a0 cpu=0 monarch=0
>  cpu 0, INIT occurred in user space, original stack not modified
>  Entered OS INIT handler. PSP=FFe301a0 cpu=3 monarch=0
>  Entered OS INIT handler. PSP=FFe301a0 cpu=2 monarch=0
>  Entered OS INIT handler. PSP=FFe301a0 cpu=1 monarch=1
>  Delaying for 5 seconds...
>  Processes interrupted by INIT - 0 (cpu 1 task 0xe00000b47a4b8000) 0 (cpu 2 task 0xe00000b47a4e8000) 0 (cpu 3 task 0xe00000b47a500000)
>
>  ... process dump ...
>
>  INIT dump complete.  Monarch on cpu 1 returning to normal service.
>  Slave on cpu 0 returning to normal service.
>  Slave on cpu 3 returning to normal service.
>  Slave on cpu 2 returning to normal service.
>
>  ... No response ...
>
>Send second NMI
>
>  Entered OS INIT handler. PSP=FFe301a0 cpu=3 monarch=0
>  Entered OS INIT handler. PSP=FFe301a0 cpu=0 monarch=0
>  cpu 0, INIT inconsistent previous current and r13, original stack not modified
>  Entered OS INIT handler. PSP=FFe301a0 cpu=2 monarch=0
>  Entered OS INIT handler. PSP=FFe301a0 cpu=1 monarch=1
>  Delaying for 5 seconds...
>  Processes interrupted by INIT - 0 (cpu 1 task 0xe00000b47a4b8000) 0 (cpu 2 task 0xe00000b47a4e8000) 0 (cpu 3 task 0xe00000b47a500000)
>
>cpu 0 was running in user space during the first NMI, so the original
>stack was not modified.  On the second NMI, current for cpu 0 does not
>match r13.  Which means that something went wrong when processing the
>first NMI while the process was in user space.
>
>I am still investigating this problem, but any other eyes on the code
>would be appreciated.