* Memory Corruption
@ 2002-08-19 16:50 Dave Boutcher
  2002-08-19 20:01 ` Chris Mason
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Boutcher @ 2002-08-19 16:50 UTC
  To: reiserfs-list

Hi,

I'm chasing a weird memory corruption problem on a ppc64 system.  The
first byte of a slab_t structure keeps getting stepped on (zeroed,
actually.)  This happens during a testcase that copies a large file
called "junk" between file systems (a mix of ext2 and reiser) on a
2.4.13 kernel.  I know that's REALLY REALLY old, but it's what's in
SuSE's SLES-7 release that we have customers running...

In every case, the page immediately preceding the slab_t has exactly the
same data in it, and it looks like some kind of directory structure
(note the presence of the word "junk", along with ".." and "." towards
the end.)

C000000037008E00: FD8C0600 FE8C0600 FF8C0600 008D0600 <                >
C000000037008E10: 018D0600 028D0600 038D0600 048D0600 <                >
C000000037008E20: 058D0600 068D0600 078D0600 088D0600 <                >
C000000037008E30: 098D0600 0A8D0600 0B8D0600 0C8D0600 <                >
C000000037008E40: 0D8D0600 0E8D0600 0F8D0600 108D0600 <                >
C000000037008E50: 118D0600 128D0600 138D0600 148D0600 <                >
C000000037008E60: 158D0600 168D0600 178D0600 188D0600 <                >
C000000037008E70: 198D0600 1A8D0600 1B8D0600 1C8D0600 <                >
C000000037008E80: 1D8D0600 1E8D0600 1F8D0600 208D0600 <                >
C000000037008E90: 218D0600 228D0600 238D0600 248D0600 <!   "   #   $   >
C000000037008EA0: 258D0600 268D0600 278D0600 288D0600 <%   &   '   (   >
C000000037008EB0: 298D0600 2A8D0600 2B8D0600 2C8D0600 <)   *   +   ,   >
C000000037008EC0: 2D8D0600 2E8D0600 2F8D0600 308D0600 <-   .   /   0   >
C000000037008ED0: 318D0600 328D0600 338D0600 348D0600 <1   2   3   4   >
C000000037008EE0: 358D0600 368D0600 378D0600 388D0600 <5   6   7   8   >
C000000037008EF0: 398D0600 3A8D0600 3B8D0600 3C8D0600 <9   :   ;   <   >
C000000037008F00: 3D8D0600 3E8D0600 3F8D0600 408D0600 <=   >   ?   @   >
C000000037008F10: 418D0600 428D0600 438D0600 448D0600 <A   B   C   D   >
C000000037008F20: 458D0600 468D0600 478D0600 488D0600 <E   F   G   H   >
C000000037008F30: 498D0600 4A8D0600 4B8D0600 4C8D0600 <I   J   K   L   >
C000000037008F40: 4D8D0600 4E8D0600 4F8D0600 508D0600 <M   N   O   P   >
C000000037008F50: 518D0600 528D0600 538D0600 548D0600 <Q   R   S   T   >
C000000037008F60: 558D0600 A4810000 01000000 0020F906 <U               >
C000000037008F70: 00000000 00000000 00000000 B377493D <             wI=>
C000000037008F80: C377493D C377493D 907C0300 32000000 < wI= wI= |  2   >
C000000037008F90: 01000000 01000000 02000000 40000400 <            @   >
C000000037008FA0: 02000000 00000000 01000000 38000400 <            8   >
C000000037008FB0: 80F1A501 02000000 03000000 30000400 <            0   >
C000000037008FC0: 6A756E6B 00000000 2E2E0000 00000000 <junk    ..      >
C000000037008FD0: 2E000000 00000000 ED4174F0 03000000 <.        At     >
C000000037008FE0: 48000000 00000000 00000000 00000000 <H               >
C000000037008FF0: 91B2103D B377493D B377493D 01000000 <   = wI= wI=    >

The byte immediately following that gets zeroed.  It sure looks to me
like someone is going over the end of a buffer.

The question is, does anyone recognize that data structure?!?!?!

Thanks!!!

Dave B




* Memory corruption
@ 2008-04-24 15:31 Geert Uytterhoeven
  0 siblings, 0 replies; 24+ messages in thread
From: Geert Uytterhoeven @ 2008-04-24 15:31 UTC
  To: Linux/PPC Development


	Hi,

I saw some random lockups on my PS3, so I decided to give the current kernel a
try on the PS3 development tool.  It crashes when setting up the network:

| <5>Sending DHCP requests ., OK
| IP-Config: Got DHCP answer from 192.168.106.200, my address is 192.168.106.196
| IP-Config: Complete:
|      device=eth0, addr=192.168.106.196, mask=255.255.255.0, gw=192.168.106.254,
|      host=192.168.106.196, domain=sonytel.be, nis-domain=(none),
|      bootserver=192.168.106.200, rootserver=192.168.106.200, rootpath=/disk-02/ps3linux/debian-powerpc
| <5>Looking up port of RPC 100003/2 on 192.168.106.200
| <0>Unrecoverable FP Unavailable Exception 800 at c000000000305220
| Oops: Unrecoverable FP Unavailable Exception, sig: 6 [#1]
| SMP NR_CPUS=2 PS3
| Modules linked in:
| NIP: c000000000305220 LR: c000000000304d34 CTR: c0000000003051c0
| REGS: c00000000604aa70 TRAP: 0800   Not tainted  (2.6.25-03562-g3dc5063-dirty)
| MSR: 8000000000008032 <EE,IR,DR>  CR: 24004082  XER: 00000000
| TASK = c000000006046040[1] 'swapper' THREAD: c000000006048000 CPU: 0
| <6>GPR00: 0000000000000800 c00000000604acf0 c000000000603a88 c000000006262680 
| <6>GPR04: 0662160400000002 0000000000004000 c0000000064a4110 c00000000062eda8 
| <6>GPR08: c0000000061a6000 0000000000000001 0000000000000100 c0000000062bf880 
| <6>GPR12: 0000001100000000 c000000000548300 0000000000000000 0000000000000000 
| <6>GPR16: 0000000000000000 000000000000005c 0000000000000000 000000000000005c 
| <6>GPR20: c0000000063a9db8 00000000c0a86ac8 0000000000000000 c0000000063a9d08 
| <6>GPR24: 0000000000000040 0000000000004000 c0000000063a9b80 c000000006391e00 
| <6>GPR28: c0000000064a4020 c000000006262680 c0000000005ae478 c00000000604acf0 
| NIP [c000000000305220] .ip_output+0x60/0x8c
| LR [c000000000304d34] .ip_local_out+0x50/0x78
| Call Trace:
| [c00000000604acf0] [c00000000604ada0] 0xc00000000604ada0 (unreliable)
| [c00000000604ad70] [c000000000304d34] .ip_local_out+0x50/0x78
| [c00000000604ae00] [c0000000003050c0] .ip_push_pending_frames+0x364/0x410
| [c00000000604aeb0] [c000000000326a60] .udp_push_pending_frames+0x350/0x408
| [c00000000604af70] [c000000000328048] .udp_sendmsg+0x4c4/0x630
| [c00000000604b0d0] [c0000000003306e4] .inet_sendmsg+0x84/0xb0
| [c00000000604b170] [c0000000002cd430] .sock_sendmsg+0xc4/0x108
| [c00000000604b370] [c0000000002ceed8] .kernel_sendmsg+0x40/0x64
| [c00000000604b400] [c00000000038cc1c] .xs_send_kvec+0xc8/0x100
| [c00000000604b510] [c00000000038cd10] .xs_sendpages+0xbc/0x2f4
| [c00000000604b5e0] [c00000000038ed38] .xs_udp_send_request+0x60/0x148
| [c00000000604b680] [c00000000038b1b8] .xprt_transmit+0x144/0x27c
| [c00000000604b730] [c00000000038776c] .call_transmit+0x248/0x2b0
| [c00000000604b7d0] [c000000000390a68] .__rpc_execute+0xd8/0x314
| [c00000000604b870] [c000000000390d18] .rpc_execute+0x40/0x5c
| [c00000000604b900] [c000000000387fe8] .rpc_run_task+0x84/0xb0
| [c00000000604b9a0] [c00000000038814c] .rpc_call_sync+0x74/0xc0
| [c00000000604ba70] [c00000000039a568] .rpcb_getport_sync+0x110/0x178
| [c00000000604bb80] [c000000000511118] .root_nfs_getport+0x8c/0xbc
| [c00000000604bc30] [c0000000005112f0] .nfs_root_data+0x1a8/0x328
| [c00000000604bd70] [c0000000004f66a8] .mount_root+0x40/0x150
| [c00000000604be10] [c0000000004f695c] .prepare_namespace+0x1a4/0x1f4
| [c00000000604bea0] [c0000000004f5a48] .kernel_init+0x388/0x3c8
| [c00000000604bf90] [c0000000000229c8] .kernel_thread+0x4c/0x68
| Instruction dump:
| e9230028 e8fe8018 7c000026 54001ffe e9090018 78001f24 7d27002a 38000800 
| 7d2948f8 7d6b482a e92b0058 39290001 <c0000000> 00546e70 f9030020 4bfff775 
                                       ^^^^^^^^  ^^^^^^^^
			     should be f92b0058  b003007e

| <4>---[ end trace c7cf3d9b6c787395 ]---
| <0>Kernel panic - not syncing: Attempted to kill init!
| smp_call_function on cpu 0: other cpus not responding (0)
| 
|    System does not reboot automatically.
|    Please press POWER button.
| 
| <7>eth0: no IPv6 routers present

Findings:
  - Disabling CONFIG_INET fixed the problem.
  - I didn't manage to lock up my PS3 afterwards either.
    But... while typing this, I saw an oops accessing address
    0xf000f000f0007000 somewhere in the networking code, so it looks like some
    corruption is going on after all.
  - Upon closer look, 8 bytes in the instruction dump above are not correct
    and have been overwritten with 0xc000000000546e70, which is the address of
    init_task.

With kind regards,

Geert Uytterhoeven
Software Architect

Sony Network and Software Technology Center Europe
The Corporate Village · Da Vincilaan 7-D1 · B-1935 Zaventem · Belgium

Phone:    +32 (0)2 700 8453
Fax:      +32 (0)2 700 8622
E-mail:   Geert.Uytterhoeven@sonycom.com
Internet: http://www.sony-europe.com/


* Re: Memory Corruption
@ 2002-08-29 13:40 Chris Mason
  0 siblings, 0 replies; 24+ messages in thread
From: Chris Mason @ 2002-08-29 13:40 UTC
  To: David Boutcher; +Cc: reiserfs-list, linux-fsdevel

On Wed, 2002-08-28 at 17:00, David Boutcher wrote:
> 
> >On Mon, 2002-08-19 at 12:50, Dave Boutcher wrote:
> >> Hi,
> >>
> >> I'm chasing a wierd memory corruption problem on a ppc64 system.  The
> >> first byte of a slab_t structure keeps getting stepped on (zeroed,
> >> actually.)  This happens during a testcase that copies a large file
> >> called "junk" between file systems (a mix of ext2 and reiser) on a
> >> 2.4.13 kernel.  I know that's REALLY REALLY old, but it's whats in
> >> SuSE's SLES-7 release that we have customers running...
> >
> >Any chance the test case involves renames?
> >
> >-chris
> 
> So I posted my problem with memory corruption a few weeks ago....and the
> problem turned out to be a REALLY old/moldy set of userland reiser tools.
> I don't know exactly why that caused memory corruption in the kernel, but
> updating the tools fixed everything right up.

Well, that shouldn't fix it ;-)  Which version of reiserfsprogs were you
running before?

Are the filesystems getting checked during boot at all (you would see
reiserfsck messages during boot)?

-chris



* Memory corruption
@ 2002-08-15 20:26 Dave Boutcher
  2002-08-15 20:36 ` Andreas Dilger
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Boutcher @ 2002-08-15 20:26 UTC
  To: linux-fsdevel

Hi,

I'm chasing a weird memory corruption problem on a ppc64 system.  The
first byte of a slab_t structure keeps getting stepped on (zeroed,
actually.)  This happens during a testcase that copies a large file
called "junk" between file systems (a mix of ext2 and reiser) on a
2.4.13 kernel.

In every case, the page immediately preceding the slab_t has exactly the
same data in it, and it looks like some kind of directory structure
(note the presence of the word "junk", along with ".." and "." towards
the end.)

C000000037008E00: FD8C0600 FE8C0600 FF8C0600 008D0600 <                >
C000000037008E10: 018D0600 028D0600 038D0600 048D0600 <                >
C000000037008E20: 058D0600 068D0600 078D0600 088D0600 <                >
C000000037008E30: 098D0600 0A8D0600 0B8D0600 0C8D0600 <                >
C000000037008E40: 0D8D0600 0E8D0600 0F8D0600 108D0600 <                >
C000000037008E50: 118D0600 128D0600 138D0600 148D0600 <                >
C000000037008E60: 158D0600 168D0600 178D0600 188D0600 <                >
C000000037008E70: 198D0600 1A8D0600 1B8D0600 1C8D0600 <                >
C000000037008E80: 1D8D0600 1E8D0600 1F8D0600 208D0600 <                >
C000000037008E90: 218D0600 228D0600 238D0600 248D0600 <!   "   #   $   >
C000000037008EA0: 258D0600 268D0600 278D0600 288D0600 <%   &   '   (   >
C000000037008EB0: 298D0600 2A8D0600 2B8D0600 2C8D0600 <)   *   +   ,   >
C000000037008EC0: 2D8D0600 2E8D0600 2F8D0600 308D0600 <-   .   /   0   >
C000000037008ED0: 318D0600 328D0600 338D0600 348D0600 <1   2   3   4   >
C000000037008EE0: 358D0600 368D0600 378D0600 388D0600 <5   6   7   8   >
C000000037008EF0: 398D0600 3A8D0600 3B8D0600 3C8D0600 <9   :   ;   <   >
C000000037008F00: 3D8D0600 3E8D0600 3F8D0600 408D0600 <=   >   ?   @   >
C000000037008F10: 418D0600 428D0600 438D0600 448D0600 <A   B   C   D   >
C000000037008F20: 458D0600 468D0600 478D0600 488D0600 <E   F   G   H   >
C000000037008F30: 498D0600 4A8D0600 4B8D0600 4C8D0600 <I   J   K   L   >
C000000037008F40: 4D8D0600 4E8D0600 4F8D0600 508D0600 <M   N   O   P   >
C000000037008F50: 518D0600 528D0600 538D0600 548D0600 <Q   R   S   T   >
C000000037008F60: 558D0600 A4810000 01000000 0020F906 <U               >
C000000037008F70: 00000000 00000000 00000000 B377493D <             wI=>
C000000037008F80: C377493D C377493D 907C0300 32000000 < wI= wI= |  2   >
C000000037008F90: 01000000 01000000 02000000 40000400 <            @   >
C000000037008FA0: 02000000 00000000 01000000 38000400 <            8   >
C000000037008FB0: 80F1A501 02000000 03000000 30000400 <            0   >
C000000037008FC0: 6A756E6B 00000000 2E2E0000 00000000 <junk    ..      >
C000000037008FD0: 2E000000 00000000 ED4174F0 03000000 <.        At     >
C000000037008FE0: 48000000 00000000 00000000 00000000 <H               >
C000000037008FF0: 91B2103D B377493D B377493D 01000000 <   = wI= wI=    >

The byte immediately following that gets zeroed.  It sure looks to me
like someone is going over the end of a buffer.

The question is, does anyone recognize that data structure?!?!?!

Thanks!!!

Dave B




* Memory Corruption
@ 2001-01-05  8:33 Ryan Sizemore
  0 siblings, 0 replies; 24+ messages in thread
From: Ryan Sizemore @ 2001-01-05  8:33 UTC
  To: Linux-Kernel

This message has a couple of questions to it, so maybe a few people might
want to contribute to answering them all. My apologies in advance for the
long length of this post.

The Problem:
I have an Alpha PC164 with 512 Meg of memory. As a friend and I were setting
it up, we tried to compile mozilla. At some point during the install, a
repeating error would scroll by on the screen so fast that we could not read
it. From what we could pick out, we determined that the error was memory
related. We deduced that since compiling mozilla would fill the entire bank
of memory, gcc (or whatever writes directly to memory) would produce the
error once it tried to address the bad area of memory. Also, after trying to
recompile mozilla a number of times, the error would appear at a random point,
usually after about 15 or 20 minutes of compiling. From this information, we
would hazard a guess that one of the eight 64 Meg SIMMs is bad, or contains a
bad area. We therefore removed the last 4 of the 8 modules, and the error
never occurred again.

The suggested solution:
We plan to swap the 4 modules still installed with the 4 that we
removed earlier, one at a time, and compile mozilla after each swap, since it
will fill all of the memory. Then, hopefully, we can rotate the modules to find
the one that contains the bad area.

We are not quite sure what to do from there. Here are our ideas:
1. One suggestion I made was to create a ram drive over the last 64 Meg of
addressable memory, then simply not read or write to the drive. Is that even
possible? Can I tell the kernel to create a ram drive over a certain area of
memory?
2. Another idea I had was to tell the kernel to use only a certain amount of
memory, with a modification to lilo.conf: append="mem=448m", since 512 (the
total memory) - 64 (the size of the module) = 448 Meg. Will this work? Any
ideas?

Another question:
We are not sure if the memory is ECC or not, but we think that there is a
good chance of it. Are there any kernel optimizations that can be made so
that the kernel can map out the bad memory and mark it so that it can't be
used? The machine is booted from an SRM prompt, if that helps.

Please let me know if anyone has any ideas on these problems. Thanks in
advance to all those out there who took the time to read this.

--Ryan Sizemore

* Memory corruption
@ 1999-06-22  1:39 Ulf Carlsson
  1999-06-30  1:01 ` William J. Earl
  0 siblings, 1 reply; 24+ messages in thread
From: Ulf Carlsson @ 1999-06-22  1:39 UTC
  To: linux

Hi,

The compiler may stop working sometimes on certain files, giving bogus error
messages which I don't understand (the compiler is probably not the only
application affected).  Running this program I just wrote forces the corrupted
caches to be flushed or something and ``fixes'' the problems:

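#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	unsigned long tot = 0;
	unsigned long i = 1UL << 20;	/* start with 1 MB chunks */
	void *p;
	int failures = 0;

	/* Allocate and zero memory until even 1-byte requests fail. */
	while (i) {
		p = malloc(i);
		if (!p) {
			/* Retry up to 10 times, then halve the chunk size. */
			if (failures++ < 10)
				continue;
			i = i >> 1;
			failures = 0;
			continue;
		}
		memset(p, 0, i);	/* touch every byte; never freed on purpose */
		tot += i;
	}
	printf("Total memory set: %lu kb\n", tot >> 10);
	return 0;
}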

Maybe I should put this in my crontab along with sync :-)

Does anyone else notice these problems?

- Ulf


Thread overview: 24+ messages
2002-08-19 16:50 Memory Corruption Dave Boutcher
2002-08-19 20:01 ` Chris Mason
2002-08-28 21:00   ` David Boutcher
2002-08-28 21:00   ` [reiserfs-list] " David Boutcher
2002-08-29 13:40     ` Chris Mason
  -- strict thread matches above, loose matches on Subject: below --
2008-04-24 15:31 Memory corruption Geert Uytterhoeven
2002-08-29 13:40 Memory Corruption Chris Mason
2002-08-15 20:26 Memory corruption Dave Boutcher
2002-08-15 20:36 ` Andreas Dilger
2001-01-05  8:33 Memory Corruption Ryan Sizemore
1999-06-22  1:39 Memory corruption Ulf Carlsson
1999-06-30  1:01 ` William J. Earl
1999-06-30  2:47   ` Ulf Carlsson
1999-06-30 22:01     ` William J. Earl
1999-07-01  0:23       ` Ralf Baechle
1999-07-01  0:53         ` William J. Earl
1999-07-01 11:25           ` Harald Koerfgen
1999-07-02 22:41           ` Ralf Baechle
1999-07-06 13:05           ` Ralf Baechle
1999-07-07 21:08             ` Harald Koerfgen
1999-07-08  1:51               ` Warner Losh
1999-07-08  3:12                 ` William J. Earl
     [not found]                   ` <37846EE7.EADD9E32@niisi.msk.ru>
1999-07-08 17:56                     ` William J. Earl
1999-07-08 10:39               ` Ralf Baechle
