* suspend/resume memory corruption on Dell Latitude D830 -- help please
@ 2008-04-14 22:41 Ron Rechenmacher
From: Ron Rechenmacher @ 2008-04-14 22:41 UTC
To: linux-acpi; +Cc: ron
Hi,
Kernel: 2.6.24.3 x86_64
Apologies in advance that I cannot start a bug report, as I do not know
how to reproduce this outside of nvidia/X :(
I've written a small test program which allocates memory, initializes it
and prompts to "suspend/resume and press enter to verify".
When I suspend (to RAM, not disk), resume, and press enter, I sometimes see
memory corruption. If the test does not see corruption, it frees the memory,
reallocates, initializes and prompts again. If the test sees corruption, it
does not free/realloc; it just reinitializes. (Code included at end.)
So, when I see corruption, it is then re-corrupted in the same exact way
every time I suspend/resume.
I've worked with Dell and have replaced the memory, system board and
processor. So, I'm thinking there is a 99.99% probability that the
problem is software (including BIOS).
For memory, I have two 2 GB DIMMs, and if both are installed,
I can run the same test under x86_64 Linux and 32-bit Vista and see the
same exact corruption signature:
Two 16-bit values get corrupted:
  one at offset 0x09a gets changed to 0x0047
  one at offset 0x0a2 gets changed to 0x1200
(Actually the same exact address until I exit the test and start over.)
It appears the same physical page of memory is getting written to all
the time.
So, it appears, so far, that the problem is not with Linux per se.
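A minimal sketch of how the signature above could be reported at 16-bit granularity rather than per 32-bit word. The function name report_corrupt_words16 is hypothetical and not part of the test program. The 4 KiB page size, and the reading of the offsets above as offsets within a page, are assumptions that the message does not confirm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MEM_VAL       0xdeadbeefu  /* same fill pattern as the test program */
#define ASSUMED_PAGE  4096u        /* assumption: 4 KiB pages */

/*
 * Compare the buffer against the fill pattern two bytes at a time and
 * print every 16-bit value that differs. The buffer is assumed to start
 * on a pattern boundary. Returns the number of mismatching 16-bit values.
 */
static size_t
report_corrupt_words16(const void *buf, size_t n_bytes)
{
    const unsigned char *p = buf;
    const uint32_t pattern = MEM_VAL;
    unsigned char pat[4];
    size_t off, bad = 0;

    assert(buf != NULL);
    memcpy(pat, &pattern, sizeof(pat));  /* pattern bytes in host byte order */

    for (off = 0; off + 2 <= n_bytes; off += 2) {
        uint16_t found, expected;

        memcpy(&found, p + off, sizeof(found));
        memcpy(&expected, pat + (off % 4), sizeof(expected));
        if (found == expected)
            continue;

        printf("buffer offset 0x%zx, page offset 0x%03lx: "
               "found 0x%04x (should be 0x%04x)\n",
               off,
               (unsigned long)((uintptr_t)(p + off) % ASSUMED_PAGE),
               (unsigned)found, (unsigned)expected);
        bad++;
    }
    return bad;
}
```

Calling this on the test buffer after a failed verify pass would list each damaged 16-bit value separately, which makes it easier to compare signatures between runs and between operating systems.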
Some other facts:
o If I run with 2 GB under 32bit Vista the test seems to always pass.
o If I run with 2 GB under x86_64 Linux, I still get failures.
o There are a couple of other failure signatures, at least one of
which is again, exactly the same on Vista as on Linux. This one happens
much less often, but involves at least one contiguous chunk of 152 bytes
being corrupted. (I can supply the pattern if anyone is interested.)
Here's where I need help... (As I do not know much about the real details
of ACPI suspend/resume.)
Is the OS supposed to leave some memory alone for the ACPI BIOS to use?
Since the problem happens with 2GB and x86_64 Linux but not
with 2GB and Vista, could there be a problem with Linux?
Is this obviously a Dell BIOS problem?
I'm wondering if there is nvidia BIOS involved?
What other tests can I do? If I suspend/resume at the text console, I
run into the blank video problem. (Is vbetool post,... the best/only way
to re-init the video? The executable I have seems to have some
problems.) Can I safely ignore the blanked video and do testing from a
serial console? What else could it be? Do both Linux and Vista have bugs?
What other info should I collect?
Apologies if the consensus is that this is not a linux-acpi devel issue
:( -- but, I'm thinking that some people on this list have contacts with
Dell BIOS people and if it is a Dell BIOS issue, this would (either way)
be the most efficient way to get to the real problem.
As a bit of background...
For the past 8 months, I've been getting intermittent crashes of the OS
and/or applications. I thought there was a problem with the HW so I did
lots of tests, mainly Dell's diagnostics. But since the crashes seemed
to most often happen shortly after resume (within a few seconds), I
decided to try and suspend/resume during a memory test. Dell's tests do
not support this, so I had to come up with my own. At first I used
"lucifer" http://www.ibiblio.org/pub/linux/utils/lucifer-1.0.tar.gz
but I have now written my own (included below).
Please, any discussion or comments would be helpful.
Thanks,
Ron
The program I'm using:
#include <stdio.h>   /* printf */
#include <stdlib.h>  /* strtoul */
#include <unistd.h>  /* sleep */

#define USAGE "\
usage: %s <mem_bytes>\n\
example: %s 11000000000\n\
", argv[0], argv[0]
#define MEM_VAL 0xdeadbeef

static void
bell( int bell_cnt )
{
    if (bell_cnt <= 0) return;
    printf("\007"); fflush(stdout);
    while (--bell_cnt)
    {   sleep(1);
        printf("\007"); fflush(stdout);
    }
    return;
}

int
main(  int   argc
     , char *argv[] )
{
    long unsigned int mem_bytes;
    long unsigned int mem_dwords;
    long unsigned int ii;
    unsigned int     *ptr;
    char              buf[80];
    int               error_found=0;

    if (argc <= 1) { printf( USAGE );exit(0); }
    mem_bytes = strtoul( argv[1], 0, 0 );
    printf( "testing %lu bytes\n", mem_bytes );
    mem_dwords = mem_bytes/4;
    printf( "testing %lu dwords (32bit)\n", mem_dwords );

    while (1)
    {
        if (!error_found) ptr = (unsigned int *)malloc( mem_dwords*4 );
        if (ptr == NULL)
        {   printf("malloc failed (too much mem?)\n"); exit(1);
        }
        printf( "initializing the mem..." ); fflush(stdout);
        for (ii=0; ii<mem_dwords; ii++) *(ptr+ii) = MEM_VAL;
        printf( " done\n" );

        printf( "susp/resume/press enter to test\n" ); bell(1);
        fgets( buf, sizeof(buf), stdin );

        printf( "verifying the mem..." ); fflush(stdout);
        for (ii=0; ii<mem_dwords; ii++)
        {   if (*(ptr+ii) != MEM_VAL)
            {   printf( "error at %p: found 0x%08x (should be 0x%08x)\n"
                       , ptr+ii, *(ptr+ii), MEM_VAL );
                error_found = 1;
            }
        }
        printf( " done\n" );
        if (!error_found) { free( ptr ); ptr = 0; }
        else              bell(2);
    }
    return (0);
}   /* main */
* Re: suspend/resume memory corruption on Dell Latitude D830
From: Ron Rechenmacher @ 2008-04-15 3:53 UTC
To: Ron Rechenmacher; +Cc: linux-acpi
More information is at
http://fnapcf.fnal.gov/~ron/dell_susp_3.5G_2blocks.txt
and http://fnapcf.fnal.gov/~ron/dell_susp_3.5G_2blocks.dmesg.txt
Is there a way to determine which ACPI mapping/allocation is not being
honored or is missing, and/or to use some memmap= kernel parameter
to get the kernel to stay away from the memory being changed during
suspend/resume? (Are there any other kernel parameters that might be useful?)
Thanks,
Ron
* Re: suspend/resume memory corruption on Dell Latitude D830
From: Ron Rechenmacher @ 2008-04-16 15:16 UTC
To: linux-acpi; +Cc: Ron Rechenmacher
I've made a significant partial/temporary fix.
I was able to use the memmap= kernel cmdline options to apparently keep
the kernel from using the memory that the suspend/resume process is
erroneously writing to:
kernel /vmlinuz-2.6.24.3-trc ro root=LABEL=/3 memmap=exactmap
memmap=636K@0 memmap=4K$636K memmap=3046M@1M memmap=1682K$3668334K
memmap=64M$3094M memmap=64K$4076M memmap=16K$4174944K
memmap=448K$4174976K memmap=24K$4175488K memmap=64K$4078M
memmap=2M$4094M memmap=511M@4097M x12345678
This results in the following output from the kernel:
...
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009f000 (usable)
BIOS-e820: 000000000009f000 - 00000000000a0000 (reserved)
BIOS-e820: 0000000000100000 - 00000000dfe5b800 (usable)
BIOS-e820: 00000000dfe5b800 - 00000000e0000000 (reserved)
BIOS-e820: 00000000f4000000 - 00000000f8000000 (reserved)
BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
BIOS-e820: 00000000fed18000 - 00000000fed1c000 (reserved)
BIOS-e820: 00000000fed20000 - 00000000fed90000 (reserved)
BIOS-e820: 00000000feda0000 - 00000000feda6000 (reserved)
BIOS-e820: 00000000fee00000 - 00000000fee10000 (reserved)
BIOS-e820: 00000000ffe00000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100002000 - 0000000120000000 (usable)
Entering add_active_range(0, 0, 159) 0 entries of 256 used
Entering add_active_range(0, 256, 917083) 1 entries of 256 used
Entering add_active_range(0, 1048578, 1179648) 2 entries of 256 used
end_pfn_map = 1179648
user-defined physical RAM map:
user: 0000000000000000 - 000000000009f000 (usable)
user: 000000000009f000 - 00000000000a0000 (reserved)
user: 0000000000100000 - 00000000be700000 (usable)
user: 00000000dfe5b800 - 00000000e0000000 (reserved)
user: 00000000c1600000 - 00000000c5600000 (reserved)
user: 00000000fec00000 - 00000000fec10000 (reserved)
user: 00000000fed18000 - 00000000fed1c000 (reserved)
user: 00000000fed20000 - 00000000fed90000 (reserved)
user: 00000000feda0000 - 00000000feda6000 (reserved)
user: 00000000fee00000 - 00000000fee10000 (reserved)
user: 00000000ffe00000 - 0000000100000000 (reserved)
user: 0000000100100000 - 0000000120000000 (usable)
Entering add_active_range(0, 0, 159) 0 entries of 256 used
Entering add_active_range(0, 256, 780032) 1 entries of 256 used
Entering add_active_range(0, 1048832, 1179648) 2 entries of 256 used
end_pfn_map = 1179648
...
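As a worked check of the command line against the user-defined map above, here is a minimal sketch that converts one memmap= value of the form size@start (usable) or size$start (reserved) into a byte range. The names parse_memmap, parse_size and struct mem_range are hypothetical, and the parser is a simplified stand-in written for illustration, not the kernel's own code. It handles only the two forms used in this message.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct mem_range {
    uint64_t start;   /* first byte of the region */
    uint64_t end;     /* one past the last byte of the region */
    int      usable;  /* 1 for '@' (usable RAM), 0 for '$' (reserved) */
};

/* Parse a number with an optional K, M or G suffix; *endp is set past it. */
static uint64_t
parse_size(const char *s, const char **endp)
{
    char *e;
    uint64_t v = strtoull(s, &e, 0);

    switch (*e) {
    case 'G': case 'g': v <<= 10; /* fall through */
    case 'M': case 'm': v <<= 10; /* fall through */
    case 'K': case 'k': v <<= 10; e++; break;
    default: break;
    }
    *endp = e;
    return v;
}

/*
 * Parse the value part of one memmap= option, e.g. "64M$3094M".
 * Returns 0 on success, -1 if the text is not size@start or size$start
 * (for example "exactmap").
 */
static int
parse_memmap(const char *arg, struct mem_range *out)
{
    const char *p, *q;
    uint64_t size, start;
    char sep;

    assert(arg != NULL && out != NULL);

    size = parse_size(arg, &p);
    sep = *p;
    if (p == arg || (sep != '@' && sep != '$'))
        return -1;

    q = p + 1;
    start = parse_size(q, &p);
    if (p == q || *p != '\0')
        return -1;

    out->start  = start;
    out->end    = start + size;
    out->usable = (sep == '@');
    return 0;
}
```

With this arithmetic, 3046M@1M covers 00100000 - be700000 and 64M$3094M covers c1600000 - c5600000, which agrees with the corresponding "user:" lines in the kernel output.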
With this, I lose nearly 500MB but my tests pass -- I do not see any
memory corruption after the resume. I could get a lot of the 500MB back
if I did not max out the kernel cmdline.
Apparently the problem area is just after be700000.
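The add_active_range() lines in the kernel output count in page frame numbers rather than bytes. A minimal sketch, assuming 4 KiB pages, of the conversion that ties those numbers back to the addresses in the RAM map. The helper names pfn_to_phys and phys_to_pfn are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define ASSUMED_PAGE_SHIFT 12  /* assumption: 4 KiB pages (2^12 bytes) */

/* Convert a page frame number to the physical address of its first byte. */
static uint64_t
pfn_to_phys(uint64_t pfn)
{
    /* Guard against shifting significant bits out of the 64-bit value. */
    assert((pfn >> (64 - ASSUMED_PAGE_SHIFT)) == 0);
    return pfn << ASSUMED_PAGE_SHIFT;
}

/* Convert a physical address to the number of the page frame containing it. */
static uint64_t
phys_to_pfn(uint64_t phys)
{
    return phys >> ASSUMED_PAGE_SHIFT;
}
```

Under that assumption, the end frame 780032 in the second add_active_range() call of the user-defined map corresponds to physical address be700000, the point where the usable region now stops.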
My 8 month battle seems to have turned into a linux/linux-acpi success
story. Thanks for the kernel cmdline options. I wonder if Vista could do
this?
Thanks,
Ron
Ron Rechenmacher wrote:
> More information on is at
> http://fnapcf.fnal.gov/~ron/dell_susp_3.5G_2blocks.txt
> and http://fnapcf.fnal.gov/~ron/dell_susp_3.5G_2blocks.dmesg.txt
>
> Is there a way to determine which acpi mapping/allocation is not being
> honored or is missing and/or using some memmap= kernel param
> to get the kernel to stay away from the memory being changed during
> suspend/resume? (any other kernel param that might be useful?)
>
> Thanks,
> Ron
>