* Re: testing mca/init patch
2005-08-31 23:43 testing mca/init patch tony.luck
@ 2005-09-01 1:38 ` david mosberger
2005-09-01 3:20 ` Keith Owens
` (4 subsequent siblings)
5 siblings, 0 replies; 7+ messages in thread
From: david mosberger @ 2005-09-01 1:38 UTC (permalink / raw)
To: linux-ia64
On 8/31/05, tony.luck@intel.com <tony.luck@intel.com> wrote:
> To make life easier for testers, I've applied Keith's patches
> and put them into my test tree.
>
> [snip...]
>
> The above is at least as good as the previous behaviour, but something
> bad happened there at the end. There were also more delays than I
> was expecting.
I tried this quickly on a zx2000 and didn't see any additional delays.
Output looked like this:
---------------------------------------------------------
login: (
cli>toc s
Sending TOC/INIT.
Entered OS INIT handler. PSP=ffe301a0
Delaying for 5 seconds...
Processes interrupted by INIT - 0 (cpu 0 task 0xa000000100808000)
Backtrace of pid 1 (init)
Call Trace:
[<a0000001006bca70>] schedule+0x950/0x10c0
sp=e000000001167d90 bsp=e000000001161088
[<a0000001006bde10>] schedule_timeout+0xd0/0x1a0
sp=e000000001167d90 bsp=e000000001161058
[<a00000010015c000>] do_select+0x3e0/0x7c0
sp=e000000001167dd0 bsp=e000000001160f08
[<a00000010015d1d0>] sys_select+0x4f0/0x900
sp=e000000001167df0 bsp=e000000001160e40
[<a00000010000b320>] ia64_ret_from_syscall+0x0/0x20
sp=e000000001167e30 bsp=e000000001160e40
[<a000000000010620>] __start_ivt_text+0xffffffff00010620/0x400
sp=e000000001168000 bsp=e000000001160e40
[snip...]
Backtrace of pid 4352 (bash)
Call Trace:
[<a0000001006bca70>] schedule+0x950/0x10c0
sp=e00000404574fdb0 bsp=e000004045749308
[<a0000001006bde70>] schedule_timeout+0x130/0x1a0
sp=e00000404574fdb0 bsp=e0000040457492d8
[<a0000001003c2040>] read_chan+0x560/0x13e0
sp=e00000404574fdf0 bsp=e0000040457491a8
[<a0000001003b51b0>] tty_read+0x130/0x1a0
sp=e00000404574fe20 bsp=e000004045749158
[<a00000010012d160>] vfs_read+0x1e0/0x340
sp=e00000404574fe20 bsp=e000004045749100
[<a00000010012e3b0>] sys_read+0x70/0xe0
sp=e00000404574fe20 bsp=e000004045749088
[<a00000010000b320>] ia64_ret_from_syscall+0x0/0x20
sp=e00000404574fe30 bsp=e000004045749088
[<a000000000010620>] __start_ivt_text+0xffffffff00010620/0x400
sp=e000004045750000 bsp=e000004045749088
INIT dump complete. Monarch on cpu 0 returning to normal service.
---------------------------------------------------------
I didn't see any delays other than the expected 5 sec delay.
--david
--
Mosberger Consulting LLC, voice/fax: 510-744-9372,
http://www.mosberger-consulting.com/
35706 Runckel Lane, Fremont, CA 94536
* Re: testing mca/init patch
From: Keith Owens @ 2005-09-01 3:20 UTC (permalink / raw)
To: linux-ia64
On Wed, 31 Aug 2005 16:43:15 -0700,
tony.luck@intel.com wrote:
>To make life easier for testers, I've applied Keith's patches
>and put them into my test tree.
>
>I locally applied one insignificant change in ia64_init_handler()
>to print the cpu number and the state of sos->monarch in the
>initial printk in that routine (because there was once a bug where
>Tiger SAL sent all cpus to the registered "master" INIT entry point.
>The output below shows that I have a good SAL that didn't do this).
>I haven't applied this change to my GIT tree.
>
>
>Here's a trimmed down version of the output after I hit the INIT
>button with some comments by me about timing inside {} brackets:
>
>{nothing for a few seconds ... felt like more than 5}
That delay is coming from your SAL, nothing I can do about it.
>CPU1: Entered OS INIT handler. PSP=ffe301a0 monarch=1
>Delaying for 5 seconds...
>{another delay, perhaps this one was 5 seconds}
Probably 10 seconds. 5 in ia64_init_handler(), then another 5 in
ia64_wait_for_slaves(). See below.
>Processes interrupted by INIT - 0 (cpu 1 task 0xe0000001ffe90000)
Only one cpu was entered for INIT, not good.
>INIT dump complete. Monarch on cpu 1 returning to normal service.
>{another several second delay}
This is wrong. The slave INIT handler was not invoked when the monarch
was delivered, instead the slave events were delivered _after_ the
monarch returned to the interrupted context. It works for me on SGI's
SAL, all the cpus enter INIT at the same time, without any noticeable
delay. There is no delay nor lockout in the INIT handler code before
it gets to the first printk, so all the delay and out of order
execution has to be coming from your SAL.
>CPU3: Entered OS INIT handler. PSP=ffe301a0 monarch=0
>{another several second delay}
>CPU2: Entered OS INIT handler. PSP=ffe301a0 monarch=0
>{another several second delay}
>CPU0: Entered OS INIT handler. PSP=ffe301a0 monarch=0
>{another several second delay}
>CPU1: Entered OS INIT handler. PSP=ffe301a0 monarch=0
>cpu 1, INIT inconsistent r12 and r13, original stack not modified
And why was cpu 1 entered again, this time as a slave and with wrong
registers? Looks like another SAL error.
>{system hung}
Because all 4 cpus are driven as slaves. All the slaves are waiting
for the monarch to arrive. All of the above tells me that the OS code
is working fine, SAL is not.
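To make the hang concrete, here is a minimal sketch of the rendezvous rule described above. It is a simplified model, not the code from the patch: the struct, its fields, and the function name are hypothetical, and it assumes the handshake can only finish when exactly one cpu in a given INIT event is entered as monarch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of how SAL entered one cpu for an INIT event. */
struct init_entry {
    int cpu;        /* cpu number as printed in the log */
    bool monarch;   /* models the sos->monarch flag (assumption) */
};

/*
 * Simplified model: slaves wait for the monarch to arrive, so the
 * event can only complete if exactly one entry is the monarch.
 * With zero monarchs every slave waits forever.
 */
static bool init_rendezvous_completes(const struct init_entry *e, size_t n)
{
    size_t monarchs = 0;

    for (size_t i = 0; i < n; i++)
        if (e[i].monarch)
            monarchs++;
    return monarchs == 1;
}
```

Feeding it the second batch from the log (cpus 3, 2, 0, 1, all with monarch=0) gives false, which matches the observed hang.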
* Re: testing mca/init patch
From: david mosberger @ 2005-09-01 4:58 UTC (permalink / raw)
To: linux-ia64
Keith,
I took a closer look at your patch set and, by and large, I like it a
lot. I suppose there will be more debate as to whether treating
MCA/INIT events as an asynchronous task-switch is the right way to go
about it, but so far it looks to me that it does indeed simplify a lot
while creating only a manageable set of new headaches. I guess a key
point is whether or not the generic scheduler mods will be accepted...
Some other points:
- In several places there are checks of the form:
+ if ((r12 & -KERNEL_STACK_SIZE) != r13) {
I don't understand why you're doing this. You should check for (r12
- r13) < KERNEL_STACK_SIZE. That does the same without relying on
alignments (which is something we have been careful to avoid in all
other ia64 code). This should also let you drop the patch to
vmlinux.lds.S.
- In ia64_os_mca_virtual_begin(), I'd suggest to use the idiom:
.save rp, r0
to terminate the call-chain. That should simplify this function.
- I worry about copying machine-state from the MCA/INIT stack to the
"old" (process) stack. What if an MCA was caused by an ECC error on
that old stack? Wouldn't this prevent you from recovering from what
might otherwise be a recoverable error?
Thanks,
--david
--
Mosberger Consulting LLC, voice/fax: 510-744-9372,
http://www.mosberger-consulting.com/
35706 Runckel Lane, Fremont, CA 94536
* Re: testing mca/init patch
From: Keith Owens @ 2005-09-01 5:30 UTC (permalink / raw)
To: linux-ia64
On Wed, 31 Aug 2005 21:58:58 -0700,
david mosberger <dmosberger@gmail.com> wrote:
>Keith,
>
>I took a closer look at your patch set and, by and large, I like it a
>lot. I suppose there will be more debate as to whether treating
>MCA/INIT events as an asynchronous task-switch is the right way to go
>about it, but so far it looks to me that it does indeed simplify a lot
>while creating only a manageable set of new headaches. I guess a key
>point is whether or not the generic scheduler mods will be accepted...
>
>Some other points:
>
>- In several places there are checks of the form:
>
>+ if ((r12 & -KERNEL_STACK_SIZE) != r13) {
>
> I don't understand why you're doing this. You should check for (r12
>- r13) < KERNEL_STACK_SIZE. That does the same without relying on
>alignments (which is something we have been careful to avoid in all
>other ia64 code). This should also let you drop the patch to
>vmlinux.lds.S.
Paranoia. I have seen several MCA/INIT events fail because they were
delivered while the cpu was in PAL/SAL. r12 and r13 are preserved
around calls to PAL/SAL, but they are not preserved _within_ PAL/SAL.
I am trying to verify as much as possible of the original stack before
updating it, the alignment check is something useful that I can test
for.
>- In ia64_os_mca_virtual_begin(), I'd suggest to use the idiom:
>
> .save rp, r0
>
> to terminate the call-chain. That should simplify this function.
Good idea.
>- I worry about copying machine-state from the MCA/INIT stack to the
>"old" (process) stack. What if an MCA was caused by an ECC error on
>that old stack? Wouldn't this prevent you from recovering from what
>might otherwise be a recoverable error?
At the moment, nobody expects to recover from an error in kernel
memory, so recovery is not relevant here. Using the old stack is the
only way to get a backtrace of the failing process, all its state is on
the old stack.
* RE: testing mca/init patch
From: Luck, Tony @ 2005-09-01 16:43 UTC (permalink / raw)
To: linux-ia64
Keith wrote:
>That delay is coming from your SAL, nothing I can do about it.
>
...
>This is wrong. The slave INIT handler was not invoked when the monarch
>was delivered, instead the slave events were delivered _after_ the
>monarch returned to the interrupted context. It works for me on SGI's
>SAL, all the cpus enter INIT at the same time, without any noticeable
>delay. There is no delay nor lockout in the INIT handler code before
>it gets to the first printk, so all the delay and out of order
>execution has to be coming from your SAL.
...
>Because all 4 cpus are driven as slaves. All the slaves are waiting
>for the monarch to arrive. All of the above tells me that the OS code
>is working fine, SAL is not.
That sounds all too plausible :-(
I'll track down some SAL people and ask them what is going on.
-Tony
* Re: testing mca/init patch
From: david mosberger @ 2005-09-01 19:35 UTC (permalink / raw)
To: linux-ia64
Keith,
On 8/31/05, Keith Owens <kaos@sgi.com> wrote:
> On Wed, 31 Aug 2005 21:58:58 -0700,
> david mosberger <dmosberger@gmail.com> wrote:
> >- In several places there are checks of the form:
> >
> >+ if ((r12 & -KERNEL_STACK_SIZE) != r13) {
> >
> > I don't understand why you're doing this. You should check for (r12
> >- r13) < KERNEL_STACK_SIZE. That does the same without relying on
> >alignments (which is something we have been careful to avoid in all
> >other ia64 code). This should also let you drop the patch to
> >vmlinux.lds.S.
>
> Paranoia. I have seen several MCA/INIT events fail because they were
> delivered while the cpu was in PAL/SAL. r12 and r13 are preserved
> around calls to PAL/SAL, but they are not preserved _within_ PAL/SAL.
> I am trying to verify as much as possible of the original stack before
> updating it, the alignment check is something useful that I can test
> for.
I'm not suggesting getting rid of the test completely; I'm suggesting
replacing it with a range check of the form:
(r12 - r13) < KERNEL_STACK_SIZE
that should be about as tight a check as the original without making
alignment-assumptions.
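For comparison, here is a small user-space sketch of the two checks side by side. It is illustrative only: the helper names are made up, KERNEL_STACK_SIZE is assumed to be 32 KiB, and r12/r13 are passed as plain 64-bit values (stack pointer and current task pointer) rather than read from registers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value for illustration; the real one depends on the config. */
#define KERNEL_STACK_SIZE ((uint64_t)32 * 1024)

/* Form used in the patch: only passes if r13 is stack-size aligned. */
static bool stack_ok_mask(uint64_t r12, uint64_t r13)
{
    return (r12 & -KERNEL_STACK_SIZE) == r13;
}

/*
 * Range form: passes if r13 <= r12 < r13 + KERNEL_STACK_SIZE.
 * The subtraction is unsigned, so r12 < r13 wraps to a huge value
 * and fails the comparison.
 */
static bool stack_ok_range(uint64_t r12, uint64_t r13)
{
    return (r12 - r13) < KERNEL_STACK_SIZE;
}
```

With an aligned r13 the two agree for every r12. They differ only when r13 is not aligned to KERNEL_STACK_SIZE, where the mask form always fails and the range form still accepts an r12 inside the stack.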
--david
--
Mosberger Consulting LLC, voice/fax: 510-744-9372,
http://www.mosberger-consulting.com/
35706 Runckel Lane, Fremont, CA 94536