public inbox for linux-ia64@vger.kernel.org
* [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data
@ 2005-02-08  8:25 Keith Owens
  2005-02-08 18:11 ` Luck, Tony
                   ` (19 more replies)
  0 siblings, 20 replies; 21+ messages in thread
From: Keith Owens @ 2005-02-08  8:25 UTC (permalink / raw)
  To: linux-ia64

arch/ia64/kernel/mca_asm.S is treating per_cpu__ia64_mca_data as the
start of the mca data, instead of as a pointer to the mca data.  It
ends up overwriting the rest of the per cpu area with the MCA stack and
bspstore.  Since we dereference ia64_mca_data several times, make it a
macro.

Signed-off-by: Keith Owens <kaos@sgi.com>
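
As an illustration of the bug described above, here is a minimal C sketch
(not part of the patch). The struct names, sizes, and fill value are
hypothetical stand-ins; only the idea comes from the description: the
per-CPU variable holds a pointer to the MCA data, so using the variable's
own address as the base overwrites whatever follows it in the per-CPU area.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MCA_STACK_BYTES 64   /* hypothetical size, for illustration only */

/* Hypothetical stand-in for the separately allocated MCA save area. */
struct mca_area {
    unsigned char stack[MCA_STACK_BYTES];
};

/* Hypothetical stand-in for a per-CPU area: the pointer slot is followed
 * by unrelated per-CPU variables that must not be overwritten. */
struct percpu_area {
    struct mca_area *mca_data;              /* pointer to the MCA area */
    unsigned char other_vars[MCA_STACK_BYTES];
};

/* Buggy: uses the address of the pointer slot as if it were the MCA data,
 * so the write lands on the pointer itself and on other_vars. */
static void fill_stack_buggy(struct percpu_area *pc)
{
    unsigned char *base = (unsigned char *)&pc->mca_data;
    memset(base, 0xAA, MCA_STACK_BYTES);
}

/* Fixed: loads the pointer first, then writes into the area it points to. */
static void fill_stack_fixed(struct percpu_area *pc)
{
    assert(pc->mca_data != NULL);
    memset(pc->mca_data->stack, 0xAA, MCA_STACK_BYTES);
}
```

Calling fill_stack_fixed() leaves other_vars untouched, while
fill_stack_buggy() overwrites both the pointer and the start of other_vars.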

Index: linux/arch/ia64/kernel/mca_asm.S
=================================
--- linux.orig/arch/ia64/kernel/mca_asm.S	2005-02-08 18:02:43.000000000 +1100
+++ linux/arch/ia64/kernel/mca_asm.S	2005-02-08 19:08:25.000000000 +1100
@@ -101,6 +101,11 @@
 	ld8	tmp=[sal_to_os_handoff];;				\
 	st8     [os_to_sal_handoff]=tmp;;
 
+#define GET_IA64_MCA_DATA(reg)						\
+	GET_THIS_PADDR(reg, ia64_mca_data)				\
+	;;								\
+	ld8 reg=[reg]
+
 	.global ia64_os_mca_dispatch
 	.global ia64_os_mca_dispatch_end
 	.global ia64_sal_to_os_handoff_state
@@ -309,14 +314,14 @@ err:
 done_tlb_purge_and_reload:
 
 	// Setup new stack frame for OS_MCA handling
-	GET_THIS_PADDR(r2, ia64_mca_data)
+	GET_IA64_MCA_DATA(r2)
 	;;
 	add r3 = IA64_MCA_CPU_STACKFRAME_OFFSET, r2
 	add r2 = IA64_MCA_CPU_RBSTORE_OFFSET, r2
 	;;
 	rse_switch_context(r6,r3,r2);;	// RSC management in this new context
 
-	GET_THIS_PADDR(r2, ia64_mca_data)
+	GET_IA64_MCA_DATA(r2)
 	;;
 	add r2 = IA64_MCA_CPU_STACK_OFFSET+IA64_MCA_STACK_SIZE-16, r2
 	;;
@@ -336,7 +341,7 @@ ia64_os_mca_virtual_begin:
 ia64_os_mca_virtual_end:
 
 	// restore the original stack frame here
-	GET_THIS_PADDR(r2, ia64_mca_data)
+	GET_IA64_MCA_DATA(r2)
 	;;
 	add r2 = IA64_MCA_CPU_STACKFRAME_OFFSET, r2
 	;;
@@ -380,7 +385,7 @@ ia64_os_mca_dispatch_end:
 ia64_os_mca_proc_state_dump:
 // Save bank 1 GRs 16-31 which will be used by c-language code when we switch
 //  to virtual addressing mode.
-	GET_THIS_PADDR(r2, ia64_mca_data)
+	GET_IA64_MCA_DATA(r2)
 	;;
 	add r2 = IA64_MCA_CPU_PROC_STATE_DUMP_OFFSET, r2
 	;;
@@ -613,7 +618,7 @@ end_os_mca_dump:
 ia64_os_mca_proc_state_restore:
 
 // Restore bank1 GR16-31
-	GET_THIS_PADDR(r2, ia64_mca_data)
+	GET_IA64_MCA_DATA(r2)
 	;;
 	add r2 = IA64_MCA_CPU_PROC_STATE_DUMP_OFFSET, r2
 


* RE: [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data
  2005-02-08  8:25 [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data Keith Owens
@ 2005-02-08 18:11 ` Luck, Tony
  2005-02-08 18:14 ` David Mosberger
                   ` (18 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Luck, Tony @ 2005-02-08 18:11 UTC (permalink / raw)
  To: linux-ia64

Keith wrote:
>arch/ia64/kernel/mca_asm.S is treating per_cpu__ia64_mca_data as the
>start of the mca data, instead of as a pointer to the mca data.  It
>ends up overwriting the rest of the per cpu area with the MCA stack and
>bspstore.  Since we dereference ia64_mca_data several times, make it a
>macro.

That's the combination of my patch to (almost) correctly do the
dereference, and Russ's patch (which fixes the spot where I managed
to deref after adding the IA64_MCA_CPU_STACKFRAME_OFFSET).  The
macro is cleaner, and avoids the possibility of making a dumb
mistake (like mine :-( )
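
The ordering mistake mentioned above (dereferencing after adding the
offset) can be sketched in C as follows. This is an illustration under
assumptions, not code from either patch: STACKFRAME_OFFSET is a made-up
value standing in for IA64_MCA_CPU_STACKFRAME_OFFSET, and the slot array
stands in for the per-CPU area.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical offset; assumed to be a multiple of the pointer size. */
#define STACKFRAME_OFFSET 16

/* Correct order: load the pointer from the slot, then add the offset. */
static unsigned char *stackframe_addr(unsigned char *const *slot)
{
    assert(slot != NULL && *slot != NULL);
    return *slot + STACKFRAME_OFFSET;
}

/* Buggy order: add the offset to the slot's own address, then load.
 * This returns whatever pointer happens to live 16 bytes past the slot. */
static unsigned char *stackframe_addr_buggy(unsigned char *const *slot)
{
    const unsigned char *p = (const unsigned char *)slot + STACKFRAME_OFFSET;
    return *(unsigned char *const *)p;
}
```

Wrapping the load in one helper, as the macro does in the assembly, leaves
only one place where the order can be wrong.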

Applied, and pushed.

This just survived me inserting a TLB error ... the resulting MCA
was fixed and logged.  The system is still ticking on all four cpus.

But Russ's last report of running with effectively the same patch
wasn't positive:

> The patch (below) helps in that the system gets through the
> MCA code and back to the error injection app, but then
> the system dies.

Russ: Any detail on why the system died?

-Tony

* RE: [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data
  2005-02-08  8:25 [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data Keith Owens
  2005-02-08 18:11 ` Luck, Tony
@ 2005-02-08 18:14 ` David Mosberger
  2005-02-08 18:42 ` Luck, Tony
                   ` (17 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: David Mosberger @ 2005-02-08 18:14 UTC (permalink / raw)
  To: linux-ia64

>>>>> On Tue, 8 Feb 2005 10:11:24 -0800, "Luck, Tony" <tony.luck@intel.com> said:

  >> The patch (below) helps in that the system gets through the MCA
  >> code and back to the error injection app, but then the system
  >> dies.

  Tony> Russ: Any detail on why the system died?

I assume you already checked that I didn't introduce any other stupid
dereferencing bugs?  I'd do it myself, but I'm still catching up with
my email backlog...

	--david

* RE: [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data
  2005-02-08  8:25 [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data Keith Owens
  2005-02-08 18:11 ` Luck, Tony
  2005-02-08 18:14 ` David Mosberger
@ 2005-02-08 18:42 ` Luck, Tony
  2005-02-08 19:35 ` Robin Holt
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Luck, Tony @ 2005-02-08 18:42 UTC (permalink / raw)
  To: linux-ia64

>I assume you already checked that I didn't introduce any other stupid
>dereferencing bugs?  I'd do it myself, but I'm still catching up with
>my email backlog...

I haven't spotted any others yet.  Still looking.

-Tony

* Re: [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data
  2005-02-08  8:25 [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data Keith Owens
                   ` (2 preceding siblings ...)
  2005-02-08 18:42 ` Luck, Tony
@ 2005-02-08 19:35 ` Robin Holt
  2005-02-08 23:45 ` Luck, Tony
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Robin Holt @ 2005-02-08 19:35 UTC (permalink / raw)
  To: linux-ia64

> Russ: Any detail on why the system died?

Russ is out in the woods for the next couple days.  He should be
back on Thursday.

Robin

* RE: [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data
  2005-02-08  8:25 [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data Keith Owens
                   ` (3 preceding siblings ...)
  2005-02-08 19:35 ` Robin Holt
@ 2005-02-08 23:45 ` Luck, Tony
  2005-02-09  0:59 ` Grant Grundler
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Luck, Tony @ 2005-02-08 23:45 UTC (permalink / raw)
  To: linux-ia64

>I assume you already checked that I didn't introduce any other stupid
>dereferencing bugs?  I'd do it myself, but I'm still catching up with
>my email backlog...

Whatever bugs are left are subtle, not stupid.  I've been injecting
errors every 20 seconds[1].  I've seen 2 bad things happen so far:

1) My ethernet connection goes down within a dozen errors consistently.
But I think this is an artifact of the error injection technology ... the
same issue happens with 2.6.9

2) I saw one user process failure ... while stressing the system with a
"make -j16" kernel build, one of the compilations errored out with a
bizarre internal gcc error.  So possibly a register was corrupted???

-Tony

[1] I haven't been running the injector continuously all day at that
rate ... just on and off with different tests running.  But I have
seen a few hundred MCAs recovered in this time.

* Re: [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data
  2005-02-08  8:25 [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data Keith Owens
                   ` (4 preceding siblings ...)
  2005-02-08 23:45 ` Luck, Tony
@ 2005-02-09  0:59 ` Grant Grundler
  2005-02-10  0:18 ` Luck, Tony
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Grant Grundler @ 2005-02-09  0:59 UTC (permalink / raw)
  To: linux-ia64

On Tue, Feb 08, 2005 at 03:45:25PM -0800, Luck, Tony wrote:
> 2) I saw one user process failure ... while stressing the system with a
> "make -j16" kernel build, one of the compilations errored out with a
> bizarre internal gcc error.  So possibly a register was corrupted???

Likely yes.
That's one of two errors I see on parisc-linux when a register
is corrupted (segfaults are the symptom). And unfortunately
the context switching has a corner case that tickles that bug
on a regular basis. :^(

hth,
grant

* RE: [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data
  2005-02-08  8:25 [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data Keith Owens
                   ` (5 preceding siblings ...)
  2005-02-09  0:59 ` Grant Grundler
@ 2005-02-10  0:18 ` Luck, Tony
  2005-02-10  0:44 ` Luck, Tony
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Luck, Tony @ 2005-02-10  0:18 UTC (permalink / raw)
  To: linux-ia64

>Likely yes.
>That's one of two errors I see on parisc-linux when a register
>is corrupted (segfaults are the symptom). And unfortunately
>the context switching has a corner case that tickles that bug
>on a regular basis. :^(

I wrote a test program that loads up random values into registers
(just r1-r31, a bunch of stacked registers, and f2-f127 for now)
and then checks that all the registers haven't changed value a
few thousand times, before reloading with a new set of random
values.

I'm running that along with an MCA every 15 seconds.

It just went bang, saying that "f6" was corrupted ... but I'm
not doing the right things to tickle the corner case on a regular
basis ... so I only have the one data point so far.

-Tony
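
The checking loop described in this message can be sketched in portable C.
This is not the actual test program: portable C cannot pin values into
specific hardware registers, so a volatile array stands in for the register
set, and NREGS, the seed handling, and the function names are assumptions
made for the illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NREGS 32   /* hypothetical number of monitored "registers" */

/* Load the live set and a reference copy with the same random values. */
static void load_random(volatile uint64_t *live, uint64_t *ref, unsigned seed)
{
    srand(seed);
    for (int i = 0; i < NREGS; i++) {
        ref[i] = ((uint64_t)rand() << 32) ^ (uint64_t)rand();
        live[i] = ref[i];
    }
}

/* Re-check the live set `passes` times.  Returns the index of the first
 * slot whose value changed, or -1 if every pass saw the expected values. */
static int check_regs(const volatile uint64_t *live, const uint64_t *ref,
                      int passes)
{
    assert(passes > 0);
    for (int p = 0; p < passes; p++)
        for (int i = 0; i < NREGS; i++)
            if (live[i] != ref[i])
                return i;
    return -1;
}
```

A driver would alternate load_random() and check_regs() in a loop and
report the returned index when it is not -1.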

* RE: [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data
  2005-02-08  8:25 [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data Keith Owens
                   ` (6 preceding siblings ...)
  2005-02-10  0:18 ` Luck, Tony
@ 2005-02-10  0:44 ` Luck, Tony
  2005-02-10  0:54 ` David Mosberger
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Luck, Tony @ 2005-02-10  0:44 UTC (permalink / raw)
  To: linux-ia64

>It just went bang, saying that "f6" was corrupted ... but I'm
>not doing the right things to tickle the corner case on a regular
>basis ... so I only have the one data point so far.

Hmmm ... I'm not seeing anyplace where f6 is saved in the MCA path
before we call up to ia64_mca_ucmc_handler() ... which is bad as
f6 is designated as a scratch register.

-Tony

* RE: [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data
  2005-02-08  8:25 [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data Keith Owens
                   ` (7 preceding siblings ...)
  2005-02-10  0:44 ` Luck, Tony
@ 2005-02-10  0:54 ` David Mosberger
  2005-02-10  1:05 ` Luck, Tony
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: David Mosberger @ 2005-02-10  0:54 UTC (permalink / raw)
  To: linux-ia64

>>>>> On Wed, 9 Feb 2005 16:44:44 -0800, "Luck, Tony" <tony.luck@intel.com> said:

  >> It just went bang, saying that "f6" was corrupted ... but I'm not
  >> doing the right things to tickle the corner case on a regular
  >> basis ... so I only have the one data point so far.

  Tony> Hmmm ... I'm not seeing anyplace where f6 is saved in the MCA
  Tony> path before we call up to ia64_mca_ucmc_handler() ... which is
  Tony> bad as f6 is designated as a scratch register.

That certainly would do the trick.

Am I seeing this right: the path saves practically nothing
other than what is saved in the PAL min-state area?  The path
presumably also ought to switch the register-backing store (I think
Keith alluded to this previously).

	--david

* RE: [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data
  2005-02-08  8:25 [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data Keith Owens
                   ` (8 preceding siblings ...)
  2005-02-10  0:54 ` David Mosberger
@ 2005-02-10  1:05 ` Luck, Tony
  2005-02-10  1:13 ` David Mosberger
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Luck, Tony @ 2005-02-10  1:05 UTC (permalink / raw)
  To: linux-ia64

>Am I seeing this right: the path saves practically nothing
>other than what is saved in the PAL min-state area?  The path
>presumably also ought to switch the register-backing store (I think
>Keith alluded to this previously).

No ... we jump down to ia64_os_mca_proc_state_dump that saves a lot
of stuff ... but not apparently the right stuff.

-Tony

* RE: [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data
  2005-02-08  8:25 [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data Keith Owens
                   ` (9 preceding siblings ...)
  2005-02-10  1:05 ` Luck, Tony
@ 2005-02-10  1:13 ` David Mosberger
  2005-02-10 23:59 ` Russ Anderson
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: David Mosberger @ 2005-02-10  1:13 UTC (permalink / raw)
  To: linux-ia64

>>>>> On Wed, 9 Feb 2005 17:05:52 -0800, "Luck, Tony" <tony.luck@intel.com> said:

  >> Am I seeing this right: the path saves practically nothing
  >> other than what is saved in the PAL min-state area?  The path
  >> presumably also ought to switch the register-backing store (I
  >> think Keith alluded to this previously).

  Tony> No ... we jump down to ia64_os_mca_proc_state_dump that saves
  Tony> a lot of stuff ... but not apparently the right stuff.

Ah, yes, that looks (slightly) better.  That code is really hard to
follow and could use serious cleanup.

BTW: just as a heads-up: as I'm working on integrating libunwind, I'm
changing the way INIT stack-dumps are done.  The idea is to do things
in such a way that we avoid having to create a (dummy) switch-stack
structure before invoking the INIT-handler.  Basically, it'll just
simplify the assembly code.  Hopefully, this will also make it easier
to create a stack trace from an MCA-handler, but I'm not there yet.

	--david

* Re: [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data
  2005-02-08  8:25 [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data Keith Owens
                   ` (10 preceding siblings ...)
  2005-02-10  1:13 ` David Mosberger
@ 2005-02-10 23:59 ` Russ Anderson
  2005-02-11  6:57 ` Luck, Tony
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Russ Anderson @ 2005-02-10 23:59 UTC (permalink / raw)
  To: linux-ia64

Tony Luck wrote:
> 
> But Russ's last report of running with effectively the same patch
> wasn't positive:
> 
> > The patch (below) helps in that the system gets through the
> > MCA code and back to the error injection app, but then
> > the system dies.
> 
> Russ: Any detail on why the system died?

No real additional information.  It still fails with the latest 
release-2.6.11 tree.  It sometimes recovers once or twice, but
not more than that.  Something is getting trashed or not restored
correctly.  At least it is easily reproducible, which makes 
problems much easier to track down.

-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@sgi.com

* RE: [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data
  2005-02-08  8:25 [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data Keith Owens
                   ` (11 preceding siblings ...)
  2005-02-10 23:59 ` Russ Anderson
@ 2005-02-11  6:57 ` Luck, Tony
  2005-02-11  7:33 ` Keith Owens
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Luck, Tony @ 2005-02-11  6:57 UTC (permalink / raw)
  To: linux-ia64

[-- Attachment #1: Type: text/plain, Size: 503 bytes --]


>No real additional information.  It still fails with the latest 
>release-2.6.11 tree.  It sometimes recovers once or twice, but
>not more than that.  Something is getting trashed or not restored
>correctly.  At least it is easily reproducible, which makes 
>problems much easier to track down.

I've established that the scratch FP registers need to be saved, but
nobody is doing that.

Attached (untested) patch does that (not particularly elegantly, but
should be functional).

-Tony
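
The save/restore pairing for the scratch FP registers can be sketched in C.
This only models the layout (f6 through f11, one 16-byte slot each, cursor
advanced after every access); the type and function names are hypothetical,
and the sketch says nothing about whether the base register holds the right
address at the corresponding points in the real assembly.

```c
#include <assert.h>
#include <string.h>

#define FP_SPILL_SIZE 16   /* bytes per spilled FP register */
#define FIRST_SCRATCH 6    /* f6 */
#define NUM_SCRATCH   6    /* f6..f11 */

/* Hypothetical model of one FP register as its 16-byte spill image. */
struct fpreg {
    unsigned char bytes[FP_SPILL_SIZE];
};

/* Spill f6..f11 into consecutive slots; returns the advanced cursor. */
static unsigned char *spill_scratch(unsigned char *cursor,
                                    const struct fpreg *f)
{
    assert(cursor != NULL && f != NULL);
    for (int i = 0; i < NUM_SCRATCH; i++) {
        memcpy(cursor, f[FIRST_SCRATCH + i].bytes, FP_SPILL_SIZE);
        cursor += FP_SPILL_SIZE;
    }
    return cursor;
}

/* Fill f6..f11 from the same slots, in the same order as the spill. */
static const unsigned char *fill_scratch(const unsigned char *cursor,
                                         struct fpreg *f)
{
    assert(cursor != NULL && f != NULL);
    for (int i = 0; i < NUM_SCRATCH; i++) {
        memcpy(f[FIRST_SCRATCH + i].bytes, cursor, FP_SPILL_SIZE);
        cursor += FP_SPILL_SIZE;
    }
    return cursor;
}
```

The restore only recovers the original values if it starts from the same
base address the save started from, which is why both helpers take the
cursor explicitly.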

[-- Attachment #2: fpsave.patch --]
[-- Type: application/octet-stream, Size: 759 bytes --]

===== arch/ia64/kernel/mca_asm.S 1.20 vs edited =====
--- 1.20/arch/ia64/kernel/mca_asm.S	2005-02-08 09:57:59 -08:00
+++ edited/arch/ia64/kernel/mca_asm.S	2005-02-10 17:04:52 -08:00
@@ -599,6 +599,15 @@
 	add		r4=1,r4
 	br.cloop.sptk.few	cStRR
 	;;
+
+// save scratch FP regs
+	stf.spill [r2]=f6,16;;
+	stf.spill [r2]=f7,16;;
+	stf.spill [r2]=f8,16;;
+	stf.spill [r2]=f9,16;;
+	stf.spill [r2]=f10,16;;
+	stf.spill [r2]=f11,16;;
+
 end_os_mca_dump:
 	br	ia64_os_mca_done_dump;;
 
@@ -834,6 +843,15 @@
 	;;
 	mov		ar.lc=r5
 	;;
+
+// restore scratch FP regs
+	ldf.fill	f6=[r2],16;;
+	ldf.fill	f7=[r2],16;;
+	ldf.fill	f8=[r2],16;;
+	ldf.fill	f9=[r2],16;;
+	ldf.fill	f10=[r2],16;;
+	ldf.fill	f11=[r2],16;;
+
 end_os_mca_restore:
 	br	ia64_os_mca_done_restore;;
 

* Re: [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data
  2005-02-08  8:25 [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data Keith Owens
                   ` (12 preceding siblings ...)
  2005-02-11  6:57 ` Luck, Tony
@ 2005-02-11  7:33 ` Keith Owens
  2005-02-11 14:45 ` Luck, Tony
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Keith Owens @ 2005-02-11  7:33 UTC (permalink / raw)
  To: linux-ia64

On Thu, 10 Feb 2005 22:57:08 -0800, 
"Luck, Tony" <tony.luck@intel.com> wrote:
>I've established that the scratch FP registers need to be saved, but
>nobody is doing that.
>
>Attached (untested) patch does that (not particularly elegantly, but
>should be functional).

As davidm has pointed out, if the OS MCA handler saves the scratch and
preserved registers before calling C and restores the values before
returning to SAL then we have no problems.  By definition, the only
registers that matter are those in struct pt_regs.  Think of an MCA as
just another type of interrupt, with exactly the same requirements for
saving registers.

I am completely dropping the proc_state_dump data area and 90% of the
code in os_mca_dump/restore.  Instead of saving everything in sight, I
create a struct pt_regs from the current registers plus some data from
the min_state_save area.  That has three benefits - it gets rid of the
special case RSE stack frame, it gives a real pt_regs for unwinding
through the MCA handler and it guarantees that we save the required set
of registers.

Work in progress, I should have a patch by Monday.
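
The approach described above (treat the MCA like any other interrupt: build
a register frame, call the C handler, restore from the frame) can be
sketched in C. Everything here is a simplified assumption for illustration:
the structs are tiny stand-ins and not the real struct pt_regs or min-state
layout, and the names are invented.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NSCRATCH 8   /* hypothetical number of scratch registers tracked */

struct min_state  { uint64_t ip, psr; };              /* firmware-saved subset */
struct cpu_state  { uint64_t scratch[NSCRATCH]; };    /* live scratch registers */
struct regs_frame { uint64_t ip, psr; uint64_t scratch[NSCRATCH]; };

typedef void (*mca_handler_fn)(struct cpu_state *cpu,
                               const struct regs_frame *regs);

/* Build the frame from the live registers plus the firmware-saved data. */
static struct regs_frame build_frame(const struct cpu_state *cpu,
                                     const struct min_state *ms)
{
    struct regs_frame fr;
    fr.ip = ms->ip;
    fr.psr = ms->psr;
    memcpy(fr.scratch, cpu->scratch, sizeof fr.scratch);
    return fr;
}

/* Save, call the C handler, then restore the scratch registers. */
static void dispatch_mca(struct cpu_state *cpu, const struct min_state *ms,
                         mca_handler_fn handler)
{
    assert(cpu != NULL && ms != NULL && handler != NULL);
    struct regs_frame fr = build_frame(cpu, ms);
    handler(cpu, &fr);
    memcpy(cpu->scratch, fr.scratch, sizeof cpu->scratch);
}

/* Example handler that freely clobbers scratch registers, as C code may. */
static void clobbering_handler(struct cpu_state *cpu,
                               const struct regs_frame *regs)
{
    (void)regs;
    memset(cpu->scratch, 0xFF, sizeof cpu->scratch);
}
```

Because the frame is the single source for the restore, the set of saved
registers and the set of restored registers cannot drift apart.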


* RE: [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data
  2005-02-08  8:25 [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data Keith Owens
                   ` (13 preceding siblings ...)
  2005-02-11  7:33 ` Keith Owens
@ 2005-02-11 14:45 ` Luck, Tony
  2005-02-11 14:53 ` Russ Anderson
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Luck, Tony @ 2005-02-11 14:45 UTC (permalink / raw)
  To: linux-ia64

>I am completely dropping the proc_state_dump data area and 90% of the
>code in os_mca_dump/restore.  Instead of saving everything in sight, I
>create a struct pt_regs from the current registers plus some data from
>the min_state_save area.  That has three benefits - it gets rid of the
>special case RSE stack frame, it gives a real pt_regs for unwinding
>through the MCA handler and it guarantees that we save the required set
>of registers.
>
>Work in progress, I should have a patch by Monday.

I should have added that my ugly fp-saving band-aid patch was
just intended for temporary use (small enough to take in this
late stage of 2.6.11-rcN development).  Treating MCA as just
another interrupt sounds *much* cleaner.  I look forward
to seeing the patch.

-Tony

* Re: [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data
  2005-02-08  8:25 [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data Keith Owens
                   ` (14 preceding siblings ...)
  2005-02-11 14:45 ` Luck, Tony
@ 2005-02-11 14:53 ` Russ Anderson
  2005-03-01 23:32 ` Luck, Tony
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Russ Anderson @ 2005-02-11 14:53 UTC (permalink / raw)
  To: linux-ia64

Keith Owens wrote:
> 
> As davidm has pointed out, if the OS MCA handler saves the scratch and
> preserved registers before calling C and restores the values before
> returning to SAL then we have no problems.  By definition, the only
> registers that matter are those in struct pt_regs.  Think of an MCA as
> just another type of interrupt, with exactly the same requirements for
> saving registers.

That sounds like the right approach.

> I am completely dropping the proc_state_dump data area and 90% of the
> code in os_mca_dump/restore.  Instead of saving everything in sight, I
> create a struct pt_regs from the current registers plus some data from
> the min_state_save area.  That has three benefits - it gets rid of the
> special case RSE stack frame, it gives a real pt_regs for unwinding
> through the MCA handler and it guarantees that we save the required set
> of registers.

Thanks, Keith.

-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@sgi.com

* RE: [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data
  2005-02-08  8:25 [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data Keith Owens
                   ` (15 preceding siblings ...)
  2005-02-11 14:53 ` Russ Anderson
@ 2005-03-01 23:32 ` Luck, Tony
  2005-03-04 18:44 ` Russ Anderson
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 21+ messages in thread
From: Luck, Tony @ 2005-03-01 23:32 UTC (permalink / raw)
  To: linux-ia64

Back on February 9th, I wrote:
>I wrote a test program that loads up random values into registers
>(just r1-r31, a bunch of stacked registers, and f2-f127 for now)
>and then checks that all the registers haven't changed value a
>few thousand times, before reloading with a new set of random
>values.

A few people asked whether I could post the program ... it took
a while to get sign-off ... but that gave me time to add "branch",
"predicate" and half a dozen "application" registers to the mix,
plus make it print the name of the register that was nuked (instead
of a number that required manual translation).

I've tested it by using a debugger to zap one of each class of register
that is being monitored to check that it works.

http://www.kernel.org/pub/linux/kernel/people/aegl/ia64regcheck.tgz 

Usage ... compile, and run a few copies.  If they all "exit(0)" (which
may take a couple of days) the test passed.  Otherwise you should see
the name of the register printed to stderr, and exit code 1.

Apart from the MCA case, I haven't seen it report a problem yet ... but
I've only run a few hours.

-Tony


* Re: [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data
  2005-02-08  8:25 [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data Keith Owens
                   ` (16 preceding siblings ...)
  2005-03-01 23:32 ` Luck, Tony
@ 2005-03-04 18:44 ` Russ Anderson
  2005-03-04 18:55 ` Russ Anderson
  2005-03-04 19:28 ` Luck, Tony
  19 siblings, 0 replies; 21+ messages in thread
From: Russ Anderson @ 2005-03-04 18:44 UTC (permalink / raw)
  To: linux-ia64

Tony Luck wrote:
> 
> Back on February 9th, I wrote:
> >I wrote a test program that loads up random values into registers
> >(just r1-r31, a bunch of stacked registers, and f2-f127 for now)
> >and then checks that all the registers haven't changed value a
> >few thousand times, before reloading with a new set of random
> >values.
> 
> A few people asked whether I could post the program ... it took
> a while to get sign-off ... but that gave me time to add "branch",
> "predicate" and half a dozen "application" registers to the mix,
> plus make it print the name of the register that was nuked (instead
> of a number that required manual translation).
> 
> I've tested it by using a debugger to zap one of each class of register
> that is being monitored to check that it works.
> 
> http://www.kernel.org/pub/linux/kernel/people/aegl/ia64regcheck.tgz 
> 
> Usage ... compile, and run a few copies.  If they all "exit(0)" (which
> may take a couple of days) the test passed.  Otherwise you should see
> the name of the register printed to stderr, and exit code 1.
> 
> Apart from the MCA case, I haven't seen it report a problem yet ... but
> I've only run a few hours.

I've started running multiple copies of checker with the error injection
code and so far none have indicated any errors on Altix systems.


-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@sgi.com

* Re: [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data
  2005-02-08  8:25 [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data Keith Owens
                   ` (17 preceding siblings ...)
  2005-03-04 18:44 ` Russ Anderson
@ 2005-03-04 18:55 ` Russ Anderson
  2005-03-04 19:28 ` Luck, Tony
  19 siblings, 0 replies; 21+ messages in thread
From: Russ Anderson @ 2005-03-04 18:55 UTC (permalink / raw)
  To: linux-ia64

Tony Luck wrote:
> 
> I've established that the scratch FP registers need to be saved, but
> nobody is doing that.
> 
> Attached (untested) patch does that (not particularly elegantly, but
> should be functional).

When I try that patch, the system wedges in the MCA code (on Altix).


-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@sgi.com

* RE: [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data
  2005-02-08  8:25 [patch 2.6.11-rc3-bk4] Correctly dereference ia64_mca_data Keith Owens
                   ` (18 preceding siblings ...)
  2005-03-04 18:55 ` Russ Anderson
@ 2005-03-04 19:28 ` Luck, Tony
  19 siblings, 0 replies; 21+ messages in thread
From: Luck, Tony @ 2005-03-04 19:28 UTC (permalink / raw)
  To: linux-ia64

>I've started running multiple copies of checker with the error injection
>code and so far none have indicated any errors on Altix systems.

That's a promising start.  Which errors are you injecting ... I saw the
"f6" problem on a TLB error (which could well have been detected and
reported in the context of the checker program).

If you are still injecting 2xECC memory errors, then they will hit
either in the context of some other process, which will be killed, or
in the context of one of the "checker" programs, which will be killed
before it can see whether all the registers had the right value.  So
this might not be telling us much.

>> Attached (untested) patch does that (not particularly elegantly, but
>> should be functional).
>
>When I try that patch, the system wedges in the MCA code (on Altix).

:-(   Well I'll wait for Keith to do it properly with pt_regs etc.

-Tony
