public inbox for linux-ia64@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 2.4] salinfo patch
@ 2003-12-23  1:04 Ben Woodard
  2003-12-23  2:12 ` Keith Owens
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Ben Woodard @ 2003-12-23  1:04 UTC (permalink / raw)
  To: linux-ia64

It looks to me like the salinfo.c and mca.c code can try to clear CPEs
and CMCs twice the first time in ia64_mca_log_sal_error_record and then
later when the clear message is sent to the /proc/sal/{cpe,cmc}/data
file. Did I miss something?

Here is a trivially little patch that fixes it one way. 

diff -u -r1.1.24.1.68.1 salinfo.c
--- salinfo.c   16 Dec 2003 22:14:31 -0000      1.1.24.1.68.1
+++ salinfo.c   23 Dec 2003 00:51:27 -0000
@@ -448,7 +450,10 @@
                data->saved_num = 0;
                spin_unlock_irqrestore(&data_saved_lock, flags);
        }
-       call_on_cpu(cpu, salinfo_log_clear_cpu, data);
+
+       if (!data->type = SAL_INFO_TYPE_CPE && !data->type = SAL_INFO_TYPE_CMC)
+         /*  ia64_mca_log_sal_error_record already cleared CPE and CMC errors */
+         call_on_cpu(cpu, salinfo_log_clear_cpu, data);
  
        /* clearing a record may make a new record visible */
        salinfo_log_new_read(cpu, data);

Another way to fix it would be to remove the clearing the SAL error in
ia64_mca_log_sal_error_record. Though, I have to say I am extremely
uncomfortable depending on and waiting for something in user space to
acknowledge system problems before other errors can be reported.

While I'm at it, something that I've been completely unable to
understand is why salinfo.c provides two distinct files, event and data.
It seems like the event file is virtually useless. Why not just have the
same blocking read on the data file?

-ben



^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH 2.4] salinfo patch
  2003-12-23  1:04 [PATCH 2.4] salinfo patch Ben Woodard
@ 2003-12-23  2:12 ` Keith Owens
  2003-12-24  1:37 ` Ben Woodard
  2003-12-24  6:26 ` Keith Owens
  2 siblings, 0 replies; 4+ messages in thread
From: Keith Owens @ 2003-12-23  2:12 UTC (permalink / raw)
  To: linux-ia64

On Mon, 22 Dec 2003 17:04:38 -0800, 
Ben Woodard <ben@zork.net> wrote:
>It looks to me like the salinfo.c and mca.c code can try to clear CPEs
>and CMCs twice the first time in ia64_mca_log_sal_error_record and then
>later when the clear message is sent to the /proc/sal/{cpe,cmc}/data
>file. Did I miss something?
>
>Here is a trivially little patch that fixes it one way. 
>
>diff -u -r1.1.24.1.68.1 salinfo.c
>--- salinfo.c   16 Dec 2003 22:14:31 -0000      1.1.24.1.68.1
>+++ salinfo.c   23 Dec 2003 00:51:27 -0000
>@@ -448,7 +450,10 @@
>                data->saved_num = 0;
>                spin_unlock_irqrestore(&data_saved_lock, flags);
>        }
>-       call_on_cpu(cpu, salinfo_log_clear_cpu, data);
>+
>+       if (!data->type = SAL_INFO_TYPE_CPE && !data->type = SAL_INFO_TYPE_CMC)
>+         /*  ia64_mca_log_sal_error_record already cleared CPE and CMC errors */
>+         call_on_cpu(cpu, salinfo_log_clear_cpu, data);
>  
>        /* clearing a record may make a new record visible */
>        salinfo_log_new_read(cpu, data);

It was originally that way, but broke when a machine generated a
CMC/CPE record but mca.c did not have a chance to clear it due to other
problems.  On reboot, there was already a CMC/CPE record in prom, SAL
does not generate an interrupt for that case so mca.c was not called
and did not clear the CMC/CPE record.  Without the second clear in
salinfo, it could not read any more records.

Calling ia64_sal_clear_state_info() twice runs the small risk of
clearing two records instead of one and losing data.  That will only
occur if there are two CMC/CPE events very close together on the same
cpu, in which case the first is more useful anyway.  I did the double
call to ensure that records were always cleared and salinfo could
proceed, accepting a small risk of lost and non-critical data.

>Another way to fix it would be to remove the clearing the SAL error in
>ia64_mca_log_sal_error_record. Though, I have to say I am extremely
>uncomfortable depending on and waiting for something in user space to
>acknowledge system problems before other errors can be reported.

CMC/CPE records have to be copied and cleared before returning from the
interrupt handler.  When the record is not cleared in the interrupt
handler, I have seen the CMC/CPE interrupt immediately reasserted
again, for the same record.

>While I'm at it, something that I've been completely unable to
>understand is why salinfo.c provides two distinct files, event and data.
>It seems like the event file is virtually useless. Why not just have the
>same blocking read on the data file?

Having separate files simplifies the state transitions.  Try coding
salinfo.c to use a single file, and worry about when to return data,
when to return EOF, and when to block.  Do not forget to cater for code
that does lseek() on the data file after hitting EOF.  Separate event
and data source follows the poll() model, which also separates events
and data.


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH 2.4] salinfo patch
  2003-12-23  1:04 [PATCH 2.4] salinfo patch Ben Woodard
  2003-12-23  2:12 ` Keith Owens
@ 2003-12-24  1:37 ` Ben Woodard
  2003-12-24  6:26 ` Keith Owens
  2 siblings, 0 replies; 4+ messages in thread
From: Ben Woodard @ 2003-12-24  1:37 UTC (permalink / raw)
  To: linux-ia64

Thank you for explaining this to me.

On Mon, 2003-12-22 at 18:12, Keith Owens wrote:
> On Mon, 22 Dec 2003 17:04:38 -0800, 
> Ben Woodard <ben@zork.net> wrote:
> >It looks to me like the salinfo.c and mca.c code can try to clear CPEs
> >and CMCs twice the first time in ia64_mca_log_sal_error_record and then
> >later when the clear message is sent to the /proc/sal/{cpe,cmc}/data
> >file. Did I miss something?
> >
> >Here is a trivially little patch that fixes it one way. 
> >
> >diff -u -r1.1.24.1.68.1 salinfo.c
> >--- salinfo.c   16 Dec 2003 22:14:31 -0000      1.1.24.1.68.1
> >+++ salinfo.c   23 Dec 2003 00:51:27 -0000
> >@@ -448,7 +450,10 @@
> >                data->saved_num = 0;
> >                spin_unlock_irqrestore(&data_saved_lock, flags);
> >        }
> >-       call_on_cpu(cpu, salinfo_log_clear_cpu, data);
> >+
> >+       if (!data->type = SAL_INFO_TYPE_CPE && !data->type = SAL_INFO_TYPE_CMC)
> >+         /*  ia64_mca_log_sal_error_record already cleared CPE and CMC errors */
> >+         call_on_cpu(cpu, salinfo_log_clear_cpu, data);
> >  
> >        /* clearing a record may make a new record visible */
> >        salinfo_log_new_read(cpu, data);
> 
> It was originally that way, but broke when a machine generated a
> CMC/CPE record but mca.c did not have a chance to clear it due to other
> problems.  On reboot, there was already a CMC/CPE record in prom, SAL
> does not generate an interrupt for that case so mca.c was not called
> and did not clear the CMC/CPE record.  Without the second clear in
> salinfo, it could not read any more records.

Wouldn't a better way to fix this potential problem to add:

diff -u -r1.1.24.1.68.1 salinfo.c
--- salinfo.c   16 Dec 2003 22:14:31 -0000      1.1.24.1.68.1
+++ salinfo.c   24 Dec 2003 01:19:15 -0000
@@ -356,6 +358,8 @@
 {
        struct salinfo_data *data = context;
        data->log_size = ia64_sal_get_state_info(data->type, (u64 *) data->log_buffer);
+       if (data->type = SAL_INFO_TYPE_CPE || data->type = SAL_INFO_TYPE_CMC)
+              ia64_sal_clear_state_info(data->type);
 }
  
 static void
@@ -448,7 +452,9 @@
                data->saved_num = 0;
                spin_unlock_irqrestore(&data_saved_lock, flags);
        }
-       call_on_cpu(cpu, salinfo_log_clear_cpu, data);
+       if (!data->type = SAL_INFO_TYPE_CPE && !data->type = SAL_INFO_TYPE_CMC)
+         /*  ia64_mca_log_sal_error_record already cleared CPE and CMC errors */
+         call_on_cpu(cpu, salinfo_log_clear_cpu, data);
  
        /* clearing a record may make a new record visible */
        salinfo_log_new_read(cpu, data);


That way each and every ia64_sal_get_state_info() has a corresponding
ia64_sal_clear_state_info().

The one in mca.c will clear all the CPEs and CMCs that come in through
the interrupt handlers and the one that I'm proposing adding will clear
all the ones fetched post-reboot. All the other types of events are
cleared by salinfo_log_clear().

That way there is no potential data loss.


> 
> Calling ia64_sal_clear_state_info() twice runs the small risk of
> clearing two records instead of one and losing data.  That will only
> occur if there are two CMC/CPE events very close together on the same
> cpu, in which case the first is more useful anyway.  I did the double
> call to ensure that records were always cleared and salinfo could
> proceed, accepting a small risk of lost and non-critical data.

The reason why I have been so focused on this piece of code is we
believe that we are seeing this data loss at least in our test
environment. So I would argue that this is not quite as unlikely as you
suspect.

> 
> >Another way to fix it would be to remove the clearing the SAL error in
> >ia64_mca_log_sal_error_record. Though, I have to say I am extremely
> >uncomfortable depending on and waiting for something in user space to
> >acknowledge system problems before other errors can be reported.
> 
> CMC/CPE records have to be copied and cleared before returning from the
> interrupt handler.  When the record is not cleared in the interrupt
> handler, I have seen the CMC/CPE interrupt immediately reasserted
> again, for the same record.

Ah yes. We have seen exactly this problem on our tiger 4 nodes. That is
why I tried to fix it over in salinfo.c rather than in the mca.c

Happy holidays.
-ben


> 
> >While I'm at it, something that I've been completely unable to
> >understand is why salinfo.c provides two distinct files, event and data.
> >It seems like the event file is virtually useless. Why not just have the
> >same blocking read on the data file?
> 
> Having separate files simplifies the state transitions.  Try coding
> salinfo.c to use a single file, and worry about when to return data,
> when to return EOF, and when to block.  Do not forget to cater for code
> that does lseek() on the data file after hitting EOF.  Separate event
> and data source follows the poll() model, which also separates events
> and data.


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH 2.4] salinfo patch
  2003-12-23  1:04 [PATCH 2.4] salinfo patch Ben Woodard
  2003-12-23  2:12 ` Keith Owens
  2003-12-24  1:37 ` Ben Woodard
@ 2003-12-24  6:26 ` Keith Owens
  2 siblings, 0 replies; 4+ messages in thread
From: Keith Owens @ 2003-12-24  6:26 UTC (permalink / raw)
  To: linux-ia64

On Tue, 23 Dec 2003 17:37:48 -0800, 
Ben Woodard <ben@zork.net> wrote:
>Wouldn't a better way to fix this potential problem to add:
>...
>That way each and every ia64_sal_get_state_info() has a corresponding
>ia64_sal_clear_state_info().

The idea is right, the patch needed a bit of cleaning up.
Bjorn/DavidM, please apply, the patch fits both 2.4 and 2.6 (with
offsets).  Credit to Ben Woodard <ben@zork.net>.

Avoid double clear of CMC/CPE records.

--- linux/arch/ia64/kernel/salinfo.c	Wed Dec 24 17:16:12 2003
+++ linux/arch/ia64/kernel/salinfo.c	Wed Dec 24 17:15:08 2003
@@ -356,6 +356,8 @@
 {
 	struct salinfo_data *data = context;
 	data->log_size = ia64_sal_get_state_info(data->type, (u64 *) data->log_buffer);
+	if (data->type = SAL_INFO_TYPE_CPE || data->type = SAL_INFO_TYPE_CMC)
+		ia64_sal_clear_state_info(data->type);
 }
 
 static void
@@ -448,8 +450,11 @@
 		data->saved_num = 0;
 		spin_unlock_irqrestore(&data_saved_lock, flags);
 	}
-	call_on_cpu(cpu, salinfo_log_clear_cpu, data);
-
+	/* ia64_mca_log_sal_error_record or salinfo_log_read_cpu already cleared
+	 * CPE and CMC errors
+	 */
+	if (data->type != SAL_INFO_TYPE_CPE && data->type != SAL_INFO_TYPE_CMC)
+		call_on_cpu(cpu, salinfo_log_clear_cpu, data);
 	/* clearing a record may make a new record visible */
 	salinfo_log_new_read(cpu, data);
 	if (data->state = STATE_LOG_RECORD &&


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2003-12-24  6:26 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2003-12-23  1:04 [PATCH 2.4] salinfo patch Ben Woodard
2003-12-23  2:12 ` Keith Owens
2003-12-24  1:37 ` Ben Woodard
2003-12-24  6:26 ` Keith Owens

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox