public inbox for linux-ia64@vger.kernel.org
From: Ben Woodard <woodard@redhat.com>
To: linux-ia64@vger.kernel.org
Subject: Re: [patch] 0/5 2.4.25-pre7 mca.c cleanup
Date: Wed, 04 Feb 2004 01:19:50 +0000
Message-ID: <1075857590.11543.89.camel@xenophanes>
In-Reply-To: <6503.1075705201@kao2.melbourne.sgi.com>

Keith,

I just tested your changes. Unfortunately, although they definitely
eliminate the possibility of a race condition in the printing of the
errors by the mca handlers, they do not fix our problem here.

There still seems to be some sort of race condition (or other error) in
the salinfo /proc file system handling code. I believe this because when
I trigger the memory errors with the salinfo_decode daemon running, the
machine hangs immediately. If I kill the salinfo_decode daemons first
and then trigger the errors, the computer survives. I can then start the
salinfo_decode daemons, everything works fine, and I get some errors
stored in the decoded area.

One thing I have noticed in the error output is that in some cases I
would see something like:

CPU 3: SAL log contains CPE error record
CPU 1: SAL log contains CPE error record

Then the computer would hang. In the cases where I wasn't running the
salinfo_decode daemon, what I see instead is:

CPU 3: SAL log contains CPE error record
CPU 3: SAL log contains CPE error record
CPU 3: SAL log contains CPE error record
CPU 3: SAL log contains CPE error record
CPU 1: SAL log contains CPE error record
...<for a long time>

This leads me to believe that there is a race condition either in the
way the salinfo /proc file system is implemented or in the interaction
between the mca handler and salinfo.

One last thing: we are running a slightly patched version of
salinfo_decode here. There are two distinct changes. One is a change you
are exceptionally familiar with, and the other is of my own contrivance.

--- salinfo-0.4/mca.c   2003-12-04 12:03:18.000000000 -0800
+++ salinfo-0.4-new/mca.c       2004-01-29 14:13:25.000000000 -0800
@@ -834,7 +834,7 @@
                iprintf("Invalid PCI Component Error Record format: length = %d, "
                       " Size PCI Data = %ld, Num Mem-Map/IO-Map Regs = %d/%d\n",
                       pcei->header.len, n_pci_data, n_mem_regs, n_io_regs);
-               return;
+               goto out;
        }
  
        if (n_mem_regs) {
@@ -857,6 +857,8 @@
        }
        if (pcei->valid.oem_data)
                platform_pci_comp_err_print(&pcei->header, p_oem_data);
+ out:
+       --indent;
 }
  
 /* Format and log the platform specifie error record section data */
diff -ru salinfo-0.4/salinfo_decode.c salinfo-0.4-new/salinfo_decode.c
--- salinfo-0.4/salinfo_decode.c        2003-11-24 14:37:28.000000000 -0800
+++ salinfo-0.4-new/salinfo_decode.c    2004-01-29 15:14:50.000000000 -0800
@@ -276,10 +276,15 @@
                        cpu,
                        type,
                        suffix);
-               if (!(freopen(filename, "w", stdout) && freopen(filename, "w", stderr))) {
-                       perror(filename);
+               if ((fd = open(filename, O_WRONLY|O_CREAT|O_EXCL, S_IRUSR|S_IWUSR)) < 0){
+                       perror(filename);
                        goto out;
                }
+               if (dup2(fd, 1) != 1 || dup2(fd, 2) != 2) {
+                       perror(filename);
+                       goto out;
+               }
+               close(fd);
  
                printf("BEGIN HARDWARE ERROR STATE from %s on cpu %d\n", type, cpu);
                platform_info_print(buffer, 1, fd_data, cpu, oemdata_fd);

The changes in salinfo_decode.c fix a problem where the salinfo_decode
daemon crashes when it receives the second error, because stdout and
stderr are already closed, which causes the freopen to fail.
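
For reference, here is a minimal standalone sketch of the open()/dup2()
redirection idea, outside of salinfo_decode. The helper name
redirect_output() is hypothetical and is not part of salinfo; the open()
flags and mode are taken from the patch above. Note that dup2() takes the
existing descriptor first and the descriptor to replace second, and that
the sketch makes no attempt to undo a partial redirect on failure.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from salinfo): point stdout and stderr at a
 * newly created file. Returns 0 on success, -1 on failure with errno set
 * by the call that failed.
 */
static int redirect_output(const char *filename)
{
	int fd, saved_errno;

	/* Push out anything still buffered for the old destination. */
	fflush(stdout);
	fflush(stderr);

	/* O_EXCL: refuse to reuse an existing file, as in the patch. */
	fd = open(filename, O_WRONLY | O_CREAT | O_EXCL, S_IRUSR | S_IWUSR);
	if (fd < 0)
		return -1;

	/* dup2(oldfd, newfd): make descriptors 1 and 2 refer to the file. */
	if (dup2(fd, STDOUT_FILENO) < 0 || dup2(fd, STDERR_FILENO) < 0) {
		saved_errno = errno;
		close(fd);
		errno = saved_errno;
		return -1;
	}

	/*
	 * If stdout or stderr was already closed, open() may have returned
	 * 1 or 2 itself; only close fd when it is a separate descriptor.
	 */
	if (fd > STDERR_FILENO)
		close(fd);
	return 0;
}
```

Because this works on the descriptors directly, it does not depend on the
stdio streams still being open, which is the situation freopen() could
not cope with here.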

I'm going to begin looking for the race condition. If you find it before
I do, let me know via private email, even if you do not have a fix for
it yet, so that I don't waste my time looking for something you've
already found. I'll do the same for you. When we have the fix, we'll
return to the linux-ia64 mailing list.

-ben

On Mon, 2004-02-02 at 17:27, Keith Owens wrote:
> On 02 Feb 2004 16:36:17 -0800, 
> Ben Woodard <woodard@redhat.com> wrote:
> >Jim Garlick and I have both been trying to make the smallest changes
> >possible to fix the problems we have been seeing.
> 
> My changes delete all of the unnecessary and racy use of printk() from
> mca.c, which should go a long way to fixing the problems that you and
> Jim are seeing.  If you still see hangs or problems after my clean up,
> please let me know ASAP.

