* Re: [patch] Cache error recovery
2006-12-19 17:05 [patch] Cache error recovery Russ Anderson
@ 2006-12-20 1:33 ` Hidetoshi Seto
2006-12-20 17:32 ` Russ Anderson
2006-12-21 1:58 ` Hidetoshi Seto
2 siblings, 0 replies; 4+ messages in thread
From: Hidetoshi Seto @ 2006-12-20 1:33 UTC
To: linux-ia64
Russ Anderson wrote:
> @@ -688,11 +690,11 @@ recover_from_processor_error(int platfor
> * The cache check and bus check bits have four possible states
> * cc bc
> * 0 0 Weird record, not recovered
> - * 1 0 Cache error, not recovered
> + * 1 0 Cache error, attempt recovered
> * 0 1 I/O error, attempt recovery
> * 1 1 Memory error, attempt recovery
> */
Which is right, attempt-"recovered" or "recovery"?
> - if (psp->bc == 0 || pbci == NULL)
> + if (psp->cc == 0 && (psp->bc == 0 || pbci == NULL))
> return fatal_mca("No bus check");
The message should be replaced with a more appropriate one.
Perhaps "No recoverable check", or just "Weird record"?
Also, some comments need to be fixed because this patch
makes them incorrect, for example:
> /*
> * Well, here is only one bus error.
> */
Thanks,
H.Seto
* Re: [patch] Cache error recovery
From: Russ Anderson @ 2006-12-20 17:32 UTC
To: linux-ia64
Hidetoshi Seto wrote:
> Russ Anderson wrote:
> > @@ -688,11 +690,11 @@ recover_from_processor_error(int platfor
> > * The cache check and bus check bits have four possible states
> > * cc bc
> > * 0 0 Weird record, not recovered
> > - * 1 0 Cache error, not recovered
> > + * 1 0 Cache error, attempt recovered
> > * 0 1 I/O error, attempt recovery
> > * 1 1 Memory error, attempt recovery
> > */
>
> Which is right, attempt-"recovered" or "recovery"?
>
> > - if (psp->bc == 0 || pbci == NULL)
> > + if (psp->cc == 0 && (psp->bc == 0 || pbci == NULL))
> > return fatal_mca("No bus check");
>
> The message should be replaced with a more appropriate one.
> Perhaps "No recoverable check", or just "Weird record"?
>
> Also, some comments need to be fixed because this patch
> makes them incorrect, for example:
>
> > /*
> > * Well, here is only one bus error.
> > */
>
> Thanks,
> H.Seto
Here is an updated patch with comment clean up.
[patch] Cache error recovery
Similar to memory error recovery, when a cache error is consumed
by a user process, terminate that process instead of crashing the system.
Signed-off-by: Russ Anderson (rja@sgi.com)
---
arch/ia64/kernel/mca_drv.c | 32 +++++++++++---------------------
1 file changed, 11 insertions(+), 21 deletions(-)
Index: test/arch/ia64/kernel/mca_drv.c
===================================================================
--- test.orig/arch/ia64/kernel/mca_drv.c 2006-12-19 10:28:36.000000000 -0600
+++ test/arch/ia64/kernel/mca_drv.c 2006-12-20 11:16:19.091608933 -0600
@@ -602,6 +602,8 @@ recover_from_platform_error(slidx_table_
default:
break;
}
+ } else if (psp->cc && !psp->bc) { /* Cache error */
+ status = recover_from_read_error(slidx, peidx, pbci, sos);
}
return status;
@@ -645,13 +647,6 @@ recover_from_tlb_check(peidx_table_t *pe
* Return value:
* 1 on Success / 0 on Failure
*/
-/*
- * Later we try to recover when below all conditions are satisfied.
- * 1. Only one processor error section is exist.
- * 2. BUS_CHECK is exist and the others are not exist.(Except TLB_CHECK)
- * 3. The entry of BUS_CHECK_INFO is 1.
- * 4. "External bus error" flag is set and the others are not set.
- */
static int
recover_from_processor_error(int platform, slidx_table_t *slidx,
@@ -687,36 +682,31 @@ recover_from_processor_error(int platfor
/*
* The cache check and bus check bits have four possible states
* cc bc
- * 0 0 Weird record, not recovered
- * 1 0 Cache error, not recovered
- * 0 1 I/O error, attempt recovery
* 1 1 Memory error, attempt recovery
+ * 1 0 Cache error, attempt recovery
+ * 0 1 I/O error, attempt recovery
+ * 0 0 Other error type, not recovered
*/
- if (psp->bc == 0 || pbci == NULL)
- return fatal_mca("No bus check");
+ if (psp->cc == 0 && (psp->bc == 0 || pbci == NULL))
+ return fatal_mca("No cache or bus check");
/*
- * Sorry, we cannot handle so many.
+ * Cannot handle more than one bus check.
*/
if (peidx_bus_check_num(peidx) > 1)
return fatal_mca("Too many bus checks");
- /*
- * Well, here is only one bus error.
- */
+
if (pbci->ib)
return fatal_mca("Internal Bus error");
- if (pbci->cc)
- return fatal_mca("Cache-cache error");
if (pbci->eb && pbci->bsi > 0)
return fatal_mca("External bus check fatal status");
/*
- * This is a local MCA and estimated as recoverble external bus error.
- * (e.g. a load from poisoned memory)
- * This means "there are some platform errors".
+ * This is a local MCA and estimated as a recoverable error.
*/
if (platform)
return recover_from_platform_error(slidx, peidx, pbci, sos);
+
/*
* On account of strange SAL error record, we cannot recover.
*/
--
Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc rja@sgi.com
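
For readers following the thread, the four cc/bc states from the comment table and the reworked guard can be sketched as a small standalone C fragment. This is only an illustration under stated assumptions: the enum, `classify_check()` and `is_fatal_no_check()` are hypothetical names that do not appear in mca_drv.c, and the real code reads the bits from the processor state parameter (`psp`) and the bus check info (`pbci`) rather than taking plain integers.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical classification of the cache check (cc) and bus check (bc)
 * bits, following the comment table in the patch. Names are illustrative. */
enum mca_check_class {
	MCA_CLASS_OTHER,	/* cc=0 bc=0: other error type, not recovered */
	MCA_CLASS_IO,		/* cc=0 bc=1: I/O error, attempt recovery */
	MCA_CLASS_CACHE,	/* cc=1 bc=0: cache error, attempt recovery */
	MCA_CLASS_MEMORY	/* cc=1 bc=1: memory error, attempt recovery */
};

static enum mca_check_class classify_check(unsigned int cc, unsigned int bc)
{
	if (cc && bc)
		return MCA_CLASS_MEMORY;
	if (cc)
		return MCA_CLASS_CACHE;
	if (bc)
		return MCA_CLASS_IO;
	return MCA_CLASS_OTHER;
}

/* Mirrors the condition added by the patch:
 *   psp->cc == 0 && (psp->bc == 0 || pbci == NULL)
 * Returns true when the record would be treated as fatal at this point. */
static bool is_fatal_no_check(unsigned int cc, unsigned int bc,
			      const void *pbci)
{
	return cc == 0 && (bc == 0 || pbci == NULL);
}
```

With the old condition (`bc == 0 || pbci == NULL`) a cache-only record (cc=1, bc=0) was always fatal; with the new one it passes the guard and falls through to the later checks.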
* Re: [patch] Cache error recovery
From: Hidetoshi Seto @ 2006-12-21 1:58 UTC
To: linux-ia64
Looks good. Thanks!
Acked-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Russ Anderson wrote:
> Here is an updated patch with comment clean up.
>
> [patch] Cache error recovery
>
> Similar to memory error recovery, when a cache error is consumed
> by a user process, terminate that process instead of crashing the system.
>
> Signed-off-by: Russ Anderson (rja@sgi.com)
>
> ---
> arch/ia64/kernel/mca_drv.c | 32 +++++++++++---------------------
> 1 file changed, 11 insertions(+), 21 deletions(-)
>
> Index: test/arch/ia64/kernel/mca_drv.c
> ===================================================================
> --- test.orig/arch/ia64/kernel/mca_drv.c 2006-12-19 10:28:36.000000000 -0600
> +++ test/arch/ia64/kernel/mca_drv.c 2006-12-20 11:16:19.091608933 -0600
> @@ -602,6 +602,8 @@ recover_from_platform_error(slidx_table_
> default:
> break;
> }
> + } else if (psp->cc && !psp->bc) { /* Cache error */
> + status = recover_from_read_error(slidx, peidx, pbci, sos);
> }
>
> return status;
> @@ -645,13 +647,6 @@ recover_from_tlb_check(peidx_table_t *pe
> * Return value:
> * 1 on Success / 0 on Failure
> */
> -/*
> - * Later we try to recover when below all conditions are satisfied.
> - * 1. Only one processor error section is exist.
> - * 2. BUS_CHECK is exist and the others are not exist.(Except TLB_CHECK)
> - * 3. The entry of BUS_CHECK_INFO is 1.
> - * 4. "External bus error" flag is set and the others are not set.
> - */
>
> static int
> recover_from_processor_error(int platform, slidx_table_t *slidx,
> @@ -687,36 +682,31 @@ recover_from_processor_error(int platfor
> /*
> * The cache check and bus check bits have four possible states
> * cc bc
> - * 0 0 Weird record, not recovered
> - * 1 0 Cache error, not recovered
> - * 0 1 I/O error, attempt recovery
> * 1 1 Memory error, attempt recovery
> + * 1 0 Cache error, attempt recovery
> + * 0 1 I/O error, attempt recovery
> + * 0 0 Other error type, not recovered
> */
> - if (psp->bc == 0 || pbci == NULL)
> - return fatal_mca("No bus check");
> + if (psp->cc == 0 && (psp->bc == 0 || pbci == NULL))
> + return fatal_mca("No cache or bus check");
>
> /*
> - * Sorry, we cannot handle so many.
> + * Cannot handle more than one bus check.
> */
> if (peidx_bus_check_num(peidx) > 1)
> return fatal_mca("Too many bus checks");
> - /*
> - * Well, here is only one bus error.
> - */
> +
> if (pbci->ib)
> return fatal_mca("Internal Bus error");
> - if (pbci->cc)
> - return fatal_mca("Cache-cache error");
> if (pbci->eb && pbci->bsi > 0)
> return fatal_mca("External bus check fatal status");
>
> /*
> - * This is a local MCA and estimated as recoverble external bus error.
> - * (e.g. a load from poisoned memory)
> - * This means "there are some platform errors".
> + * This is a local MCA and estimated as a recoverable error.
> */
> if (platform)
> return recover_from_platform_error(slidx, peidx, pbci, sos);
> +
> /*
> * On account of strange SAL error record, we cannot recover.
> */
>