From mboxrd@z Thu Jan 1 00:00:00 1970
From: Keith Owens
Date: Thu, 01 Sep 2005 03:20:46 +0000
Subject: Re: testing mca/init patch
Message-Id: <12725.1125544846@kao2.melbourne.sgi.com>
List-Id:
References: <200508312343.j7VNhFOZ012157@agluck-lia64.sc.intel.com>
In-Reply-To: <200508312343.j7VNhFOZ012157@agluck-lia64.sc.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
To: linux-ia64@vger.kernel.org

On Wed, 31 Aug 2005 16:43:15 -0700,
tony.luck@intel.com wrote:
>To make life easier for testers, I've applied Keith's patches
>and put them into my test tree.
>
>I locally applied one insignificant change in ia64_init_handler()
>to print the cpu number and the state of sos->monarch in the
>initial printk in that routine (because there was once a bug where
>Tiger SAL sent all cpus to the registered "master" INIT entry point.
>The output below shows that I have a good SAL that didn't do this).
>I haven't applied this change to my GIT tree.
>
>
>Here's a trimmed down version of the output after I hit the INIT
>button with some comments by me about timing inside {} brackets:
>
>{nothing for a few seconds ... felt like more than 5}

That delay is coming from your SAL, nothing I can do about it.

>CPU1: Entered OS INIT handler. PSP=FFe301a0 monarch=1
>Delaying for 5 seconds...
>{another delay, perhaps this one was 5 seconds}

Probably 10 seconds.  5 in ia64_init_handler(), then another 5 in
ia64_wait_for_slaves().  See below.

>Processes interrupted by INIT - 0 (cpu 1 task 0xe0000001ffe90000)

Only one cpu was entered for INIT, not good.

>INIT dump complete. Monarch on cpu 1 returning to normal service.
>{another several second delay}

This is wrong.  The slave INIT handler was not invoked when the monarch
was delivered, instead the slave events were delivered _after_ the
monarch returned to the interrupted context.
It works for me on SGI's SAL, all the cpus enter INIT at the same time,
without any noticeable delay.  There is no delay nor lockout in the
INIT handler code before it gets to the first printk, so all the delay
and out of order execution has to be coming from your SAL.

>CPU3: Entered OS INIT handler. PSP=FFe301a0 monarch=0
>{another several second delay}
>CPU2: Entered OS INIT handler. PSP=FFe301a0 monarch=0
>{another several second delay}
>CPU0: Entered OS INIT handler. PSP=FFe301a0 monarch=0
>{another several second delay}
>CPU1: Entered OS INIT handler. PSP=FFe301a0 monarch=0
>cpu 1, INIT inconsistent r12 and r13, original stack not modified

And why was cpu 1 entered again, this time as a slave and with wrong
registers?  Looks like another SAL error.

>{system hung}

Because all 4 cpus are driven as slaves.  All the slaves are waiting
for the monarch to arrive.

All of the above tells me that the OS code is working fine, SAL is not.
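
The diagnosis above can be restated as a small consistency check on the
monarch flags that SAL hands to the OS for one batch of INIT entries.
This is only an illustrative sketch, not code from the patch or from
mca.c: struct init_entry, enum init_outcome, classify_init_batch() and
the ncpus parameter are all hypothetical names made up for the example.
It assumes a healthy delivery means every online cpu enters together
with exactly one of them flagged as monarch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of one cpu entering the OS INIT handler. */
struct init_entry {
	int cpu;       /* cpu number printed in the "Entered OS INIT handler" line */
	bool monarch;  /* monarch flag as passed in by SAL */
};

/* Hypothetical classification of one batch of INIT entries. */
enum init_outcome {
	INIT_OK,              /* one monarch, every cpu present */
	INIT_NO_MONARCH,      /* only slaves: they wait for a monarch that never arrives */
	INIT_MULTI_MONARCH,   /* more than one cpu flagged as monarch */
	INIT_MISSING_SLAVES   /* one monarch, but some cpus never entered */
};

/*
 * Classify a batch of n entries on a system with ncpus online cpus.
 * Checks run in order of severity: no monarch first, since that is
 * the case that hangs the system.
 */
enum init_outcome classify_init_batch(const struct init_entry *e,
				      size_t n, size_t ncpus)
{
	size_t monarchs = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (e[i].monarch)
			monarchs++;

	if (monarchs == 0)
		return INIT_NO_MONARCH;
	if (monarchs > 1)
		return INIT_MULTI_MONARCH;
	if (n < ncpus)
		return INIT_MISSING_SLAVES;
	return INIT_OK;
}
```

Fed with the quoted log, the first delivery (cpu 1 alone, monarch=1, on
a 4 cpu system) would classify as INIT_MISSING_SLAVES, and the second
delivery (cpus 3, 2, 0 and 1, all monarch=0) as INIT_NO_MONARCH, which
is the case where every slave waits for a monarch and the system hangs.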