From: Ron Rechenmacher <ron@fnal.gov>
To: linux-acpi@vger.kernel.org
Cc: ron@fnal.gov
Subject: suspend/resume memory corruption on Dell Latitude D830 -- help please
Date: Mon, 14 Apr 2008 17:41:10 -0500
Message-ID: <4803DD86.9060105@fnal.gov>

Hi,

Kernel: 2.6.24.3 x86_64
Apologies in advance that I cannot file a bug report, as I do not know
how to reproduce this outside of nvidia/X :(

I've written a small test program which allocates memory, initializes it 
and prompts to "suspend/resume and press enter to verify".
When I suspend (to RAM, not disk), resume, and press enter, I sometimes see
memory corruption. If the test does not see corruption, it frees the memory,
reallocates, initializes, and prompts again. If the test sees corruption, it
does not free/realloc; it just reinitializes.  (Code included at the end.)

So, when I see corruption, it is then re-corrupted in the same exact way 
every time I suspend/resume.

I've worked with Dell and have replaced the memory, system board, and
processor.  So, I'm thinking there is a 99.99% probability that the
problem is software (including BIOS).

For memory, I have two 2 GB DIMMs, and if both are installed,
I can run the same test under x86_64 Linux and 32-bit Vista and see the
same exact corruption signature:
     Two 16-bit values get corrupted:
         one at offset 0x09a gets changed to 0x0047
         one at offset 0x0a2 gets changed to 0x1200
(Actually the same exact address until I exit the test and start over.)
It appears the same physical page of memory is getting written to all 
the time.
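
To relate the test program's output (one line per mismatching 32-bit word) to a signature like the one above, each mismatch can be split into its two 16-bit halves and reduced to an offset within its page. The following is a minimal sketch, not part of the original program. It assumes 4 KiB pages and a little-endian CPU, and the names page_offset, report_halves and PAGE_SIZE_ASSUMED are made up for illustration.

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE_ASSUMED 4096u   /* assumption: 4 KiB pages (x86 default) */
#define MEM_VAL 0xdeadbeefu       /* same fill pattern as the test program */

/* Offset of an address within its (assumed 4 KiB) page. */
static unsigned page_offset(uintptr_t addr)
{
    return (unsigned)(addr & (PAGE_SIZE_ASSUMED - 1));
}

/*
 * Print which 16-bit halves of a mismatching dword differ from MEM_VAL.
 * On little-endian x86 the low half sits at addr+0 and the high half
 * at addr+2.  Returns the number of differing halves (0, 1 or 2).
 */
static int report_halves(uintptr_t addr, uint32_t found)
{
    int n = 0;

    for (int half = 0; half < 2; half++) {
        uint16_t want = (uint16_t)(MEM_VAL >> (16 * half));
        uint16_t got  = (uint16_t)(found   >> (16 * half));

        if (got != want) {
            printf("page offset 0x%03x: 0x%04x -> 0x%04x\n",
                   page_offset(addr + 2u * (unsigned)half),
                   (unsigned)want, (unsigned)got);
            n++;
        }
    }
    return n;
}
```

Under those assumptions, a word at an address ending in 0x098 that reads back as 0x0047beef is reported as "page offset 0x09a: 0xdead -> 0x0047", which would match the first entry of the signature if the quoted offsets are offsets within a page.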

So, it appears, so far, that the problem is not with Linux per se.

Some other facts:
   o  If I run with 2 GB under 32bit Vista the test seems to always pass.
   o  If I run with 2 GB under x86_64 Linux, I still get failures.
   o  There are a couple of other failure signatures, at least one of 
which is again, exactly the same on Vista as on Linux. This one happens 
much less often, but involves at least one contiguous chunk of 152 bytes 
being corrupted. (I can supply the pattern if anyone is interested.)
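
A failure that spans a contiguous chunk, such as the 152-byte one mentioned above, is easier to recognize if adjacent mismatching words are reported as a single run rather than one line per word. A rough sketch of such a report follows, assuming the same 32-bit fill pattern as the test program. The function name report_runs is hypothetical and is not part of the original program.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Scan buf[0..n) for dwords that differ from pattern and print each
 * contiguous run of mismatches as one line.  Returns the number of runs.
 * Granularity is one dword, so run lengths are multiples of 4 bytes.
 */
static size_t report_runs(const uint32_t *buf, size_t n, uint32_t pattern)
{
    size_t runs = 0;
    size_t i = 0;

    while (i < n) {
        if (buf[i] == pattern) {
            i++;
            continue;
        }

        size_t start = i;               /* first mismatching dword of the run */
        while (i < n && buf[i] != pattern)
            i++;

        printf("corrupt run at %p: %zu bytes\n",
               (const void *)(buf + start), (i - start) * sizeof buf[0]);
        runs++;
    }
    return runs;
}
```

Because the comparison works on whole dwords, a run whose first or last dword is only partly overwritten is still counted in full, so the printed length is an upper bound on the bytes actually changed.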

Here's where I need help... (As I do not know much about the real details
of ACPI suspend/resume.)
Is the OS supposed to leave some memory alone for the ACPI BIOS to use?
Since the problem happens with 2 GB and x86_64 Linux but not
with 2 GB and Vista, could there be a problem with Linux?
Is this obviously a Dell BIOS problem?
I'm wondering if there is nvidia BIOS involved?
What other tests can I do?  If I suspend/resume at the text console, I 
run into the blank video problem. (Is vbetool post,... the best/only way 
to re-init the video? The executable I have seems to have some 
problems.)  Can I safely ignore the blanked video and do testing from a
serial console?  What else could it be?  Do both Linux and Vista have bugs?

What other info should I collect?

Apologies if the consensus is that this is not a linux-acpi devel issue 
:( -- but, I'm thinking that some people on this list have contacts with 
Dell BIOS people and if it is a Dell BIOS issue, this would (either way)
be the most efficient way to get to the real problem.

As a bit of background...
For the past 8 months, I've been getting intermittent crashes of the OS 
and/or applications.  I thought there was a problem with the HW so I did 
lots of tests, mainly Dell's diagnostics. But since the crashes seemed 
to most often happen shortly after resume (within a few seconds), I 
decided to try and suspend/resume during a memory test.  Dell's tests do 
not support this, so I had to come up with my own. I was first using
"lucifer" http://www.ibiblio.org/pub/linux/utils/lucifer-1.0.tar.gz
but have now written my own (included below).

Please, any discussion or comments would be helpful.

Thanks,
Ron

The program I'm using:
#include <stdio.h>              /* printf */
#include <stdlib.h>             /* strtoul, malloc, free, exit */
#include <unistd.h>             /* sleep */

#define USAGE "\
   usage: %s <mem_bytes>\n\
example: %s 11000000000\n\
", argv[0], argv[0]

#define MEM_VAL 0xdeadbeef

static void
bell( int bell_cnt )
{
     if (bell_cnt <= 0) return;
     printf("\007"); fflush(stdout);
     while (--bell_cnt)
     {   sleep(1);
         printf("\007"); fflush(stdout);
     }
     return;
}

int
main(  int      argc
      , char     *argv[] )
{
         long unsigned int       mem_bytes;
         long unsigned int       mem_dwords;
         long unsigned int       ii;
         unsigned int            *ptr = NULL;
         char                    buf[80];
         int                     error_found=0;

     if (argc <= 1) { printf( USAGE );exit(0); }

     mem_bytes = strtoul( argv[1], 0, 0 );
     printf( "testing %lu bytes\n", mem_bytes );
     mem_dwords = mem_bytes/4;
     printf( "testing %lu dwords (32bit)\n", mem_dwords );

     while (1)
     {
         if (!error_found) ptr = (unsigned int *)malloc( mem_dwords*4 );
         if (ptr == NULL)
         {   printf("malloc failed (too much mem?)\n"); exit(1);
         }

         printf( "initializing the mem..." ); fflush(stdout);
         for (ii=0; ii<mem_dwords; ii++) *(ptr+ii) = MEM_VAL;
         printf( " done\n" );

         printf( "susp/resume/press enter to test\n" ); bell(1);
         /* stop on EOF or read error instead of looping forever */
         if (fgets( buf, sizeof(buf), stdin ) == NULL) break;

         printf( "verifying the mem..." ); fflush(stdout);
         for (ii=0; ii<mem_dwords; ii++)
         {   if (*(ptr+ii) != MEM_VAL)
             {   printf( "error at %p: found 0x%08x (should be 0x%08x)\n"
                        , (void *)(ptr+ii), *(ptr+ii), MEM_VAL );
                 error_found = 1;
             }
         }
         printf( " done\n" );

         if (!error_found) { free( ptr ); ptr = 0; }
         else              bell(2);
     }
     return (0);
}   /* main */
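
The observation that the same physical page seems to be hit each time could in principle be checked from user space by translating the failing virtual address into a physical one. The sketch below is an illustration only and is not part of the program above. It relies on /proc/self/pagemap, which was added in Linux 2.6.25 and is therefore not available on the 2.6.24.3 kernel mentioned in this message. Newer kernels also report a frame number of zero to unprivileged processes, so it would have to run as root. The helper names pagemap_pfn and virt_to_phys are made up for this example.

```c
#define _XOPEN_SOURCE 700       /* for pread() */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Pagemap entry layout: bit 63 = page present in RAM, bits 0-54 = PFN. */
#define PM_PRESENT   (UINT64_C(1) << 63)
#define PM_PFN_MASK  ((UINT64_C(1) << 55) - 1)

/* Extract the page frame number; returns 0 if the page is not present. */
static uint64_t pagemap_pfn(uint64_t entry)
{
    return (entry & PM_PRESENT) ? (entry & PM_PFN_MASK) : 0;
}

/*
 * Look up the physical address backing a virtual address of this process.
 * Returns 0 on success, -1 if pagemap cannot be read, the page is not
 * present, or the kernel hides the frame number (PFN reported as 0).
 */
static int virt_to_phys(const void *vaddr, uint64_t *phys)
{
    long page_size = sysconf(_SC_PAGESIZE);
    uintptr_t va = (uintptr_t)vaddr;
    uint64_t entry = 0;
    ssize_t got;
    int fd;

    if (page_size <= 0)
        return -1;

    fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return -1;

    /* One 8-byte entry per virtual page, indexed by virtual page number. */
    got = pread(fd, &entry, sizeof entry,
                (off_t)(va / (uintptr_t)page_size) * (off_t)sizeof entry);
    close(fd);

    if (got != (ssize_t)sizeof entry || pagemap_pfn(entry) == 0)
        return -1;

    *phys = pagemap_pfn(entry) * (uint64_t)page_size
          + (uint64_t)(va % (uintptr_t)page_size);
    return 0;
}
```

Calling virt_to_phys on the address printed in an "error at" line, before the buffer is freed, would give a physical address that can be compared between runs. The result is only meaningful while the page stays resident, since the kernel may swap or move it.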



