Memory corruption

* Memory corruption
@ 1999-06-22  1:39 Ulf Carlsson
  1999-06-30  1:01 ` William J. Earl
  0 siblings, 1 reply; 24+ messages in thread
From: Ulf Carlsson @ 1999-06-22  1:39 UTC (permalink / raw)
  To: linux

Hi,

The compiler may stop working sometimes on certain files, giving bogus error
messages which I don't understand (the compiler is probably not the only
application affected).  Running this program I just wrote forces the corrupted
caches to be flushed or something and ``fixes'' the problems:

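#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	unsigned long tot = 0;
	unsigned long i = 1UL << 20;	/* start with 1 MB chunks */
	void *p;
	int failures = 0;

	/* Allocate and touch memory until malloc() fails even for 1 byte. */
	while (i) {
		p = malloc(i);
		if (!p) {
			/* Retry a few times, then halve the chunk size. */
			if (failures++ < 10)
				continue;
			i = i >> 1;
			failures = 0;
			continue;
		}
		memset(p, 0, i);	/* touch every page; never freed on purpose */
		tot += i;
	}
	printf("Total memory set: %lu kb\n", tot >> 10);
	return 0;
}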

Maybe I should put this in my crontab along with sync :-)

Does anyone else notice these problems?

- Ulf

* Re: Memory corruption
  1999-06-22  1:39 Memory corruption Ulf Carlsson
@ 1999-06-30  1:01 ` William J. Earl
  1999-06-30  2:47   ` Ulf Carlsson
  0 siblings, 1 reply; 24+ messages in thread
From: William J. Earl @ 1999-06-30  1:01 UTC (permalink / raw)
  To: Ulf Carlsson; +Cc: linux

Ulf Carlsson writes:
 > Hi,
 > 
 > The compiler may stop working sometimes on certain files, giving bogus error
 > messages which I don't understand (the compiler is probably not the only
 > application affected).  Running this program I just wrote forces the corrupted
 > caches to be flushed or something and ``fixes'' the problems:
...

      This problem sounds like a cache flushing problem.  Do you also
get SIGILL, SIGBUS, and SIGSEGV failures?  One possibility is that the icache
is not being flushed on a page fault, when a page is read in from disk,
and the icache still has old data in it.  This could lead to a cache line
of bogus instructions being executed.

      What model of CPU do you have in your machine?

* Re: Memory corruption
  1999-06-30  1:01 ` William J. Earl
@ 1999-06-30  2:47   ` Ulf Carlsson
  1999-06-30 22:01     ` William J. Earl
  0 siblings, 1 reply; 24+ messages in thread
From: Ulf Carlsson @ 1999-06-30  2:47 UTC (permalink / raw)
  To: William J. Earl; +Cc: linux

>  > The compiler may stop working sometimes on certain files, giving bogus
>  > error messages which I don't understand (the compiler is probably not the
>  > only application affected).  Running this program I just wrote forces the
>  > corrupted caches to be flushed or something and ``fixes'' the problems:
> ...
> 
>       This problem sounds like a cache flushing problem.  Do you also get
>       SIGILL, SIGBUS, and SIGSEGV failures?  One possibility is that the
>       icache is not being flushed on a page fault, when a page is read in from
>       disk, and the icache still has old data in it.  This could lead to a
>       cache line of bogus instructions being executed.

Sometimes when this happens I think I only get a SIGSEGV or a SIGBUS, otherwise
I get internal compiler errors.  It's hard to say since these problems are very
hard to reproduce, and I forget what happens from time to time.  I have
unfortunately not written down the results.  It sounds like this may be the
cause of the type of file corruption I have when only a little part of the file
is damaged (sounds like the problem covers both icache and dcache).  That type
of file corruption goes away after reboot.  I haven't had a chance to try this
with my discard-disk-cache program since this happens very seldom.

>       What model of CPU do you have in your machine?

I have a 133 MHz R4600 with 512kb board cache, 16kb dcache and 16kb icache.

Regards,
Ulf

* Re: Memory corruption
  1999-06-30  2:47   ` Ulf Carlsson
@ 1999-06-30 22:01     ` William J. Earl
  1999-07-01  0:23       ` Ralf Baechle
  0 siblings, 1 reply; 24+ messages in thread
From: William J. Earl @ 1999-06-30 22:01 UTC (permalink / raw)
  To: Ulf Carlsson; +Cc: linux, ralf

Ulf Carlsson writes:
...
 > Sometimes when this happens I think I only get a SIGSEGV or a SIGBUS, otherwise
 > I get internal compiler errors.  It's hard to say since these problems are very
 > hard to reproduce, and I forget what happens from time to time.  I have
 > unfortunately not written down the results.  It sounds like this may be the
 > cause of the type of file corruption I have when only a little part of the file
 > is damaged (sounds like the problem covers both icache and dcache).  That type
 > of file corruption goes away after reboot.  I haven't had a chance to try this
 > with my discard-disk-cache program since this happens very seldom.
 > 
 > >       What model of CPU do you have in your machine?
 > 
 > I have a 133 MHz R4600 with 512kb board cache, 16kb dcache and 16kb icache.

     I have been looking at the fault handling and the cache flushing routines
for the R4600.  In do_no_page() in mm/memory.c, we have:

	flush_page_to_ram(page);

I don't see where any code invalidates the icache, which might have
cached lines from a previous incarnation of the page.
flush_page_to_ram(), for the R4600, essentially does a writeback of
the dcache, if I understand the code correctly.  I believe that an
icache invalidate is also needed, at least for executable pages
(including any page for which mprotect() with PROT_EXEC has been
called, not just for text pages from an executable file).  Also,
unless something has changed, my understanding is that conflicting
virtual aliases (in the dcache) are still possible, which will also
lead to data corruption when it happens.

     In particular, if process A mmaps a file page at virtual index
0 and process B happens to mmap the same file page at virtual index
1, they will in general corrupt each other's view of the data.
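
To make the "virtual index" notion concrete, here is a minimal sketch of how
a page color can be computed and compared.  The geometry is an assumption
for illustration only (4 KB pages, 8 KB of virtually indexed cache per way,
hence two colors); page_color() and mappings_conflict() are hypothetical
names, not functions from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed geometry (illustrative, not taken from the kernel source):
 * 4 KB pages and 8 KB of virtually indexed cache per way, i.e. 2 colors. */
#define PAGE_SHIFT      12
#define CACHE_WAY_SIZE  (8u * 1024u)
#define NUM_COLORS      (CACHE_WAY_SIZE >> PAGE_SHIFT)

/* Color = the virtual-index bits that lie above the page offset. */
unsigned int page_color(uintptr_t vaddr)
{
	/* The mask below only works for a power-of-two color count. */
	assert(NUM_COLORS > 0 && (NUM_COLORS & (NUM_COLORS - 1)) == 0);
	return (unsigned int)((vaddr >> PAGE_SHIFT) & (NUM_COLORS - 1));
}

/* Two mappings of the same physical page alias incompatibly when their
 * colors differ: the same data can then live in two different cache lines. */
int mappings_conflict(uintptr_t vaddr_a, uintptr_t vaddr_b)
{
	return page_color(vaddr_a) != page_color(vaddr_b);
}
```

With these assumed numbers, mappings of one page at 0x10000000 and
0x10001000 conflict, while mappings at 0x10000000 and 0x10002000 do not.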

     There is a comment in memory.c that a non-present page shouldn't
be cached, but it is not yet clear to me that this is guaranteed for
the icache.  Also, the flush_page_to_ram() slows down processing on
machines which have physical cache tags, for cases where the virtual
index used by the kernel and the virtual index used by the application
are the same.  It should have an extra argument of the intended user virtual
address, so that it can decide whether to flush or not on architectures
such as MIPS.

    Handling the virtual index conflicts requires dynamic ownership
switching (including cache flushing), which means that we have to record
those hardware-valid PTEs currently referencing the page, so that we can
invalidate the PTEs and flush the cache when a fault happens for a mapping
of a different color.  We could take a brute-force approach, and record
just one mapping, forcing a fault on each use of a different mapping,
which would allow us to keep the reverse map in an array parallel to mem_map,
or we could use some more complex structure to record mappings.  Also,
to reduce the frequency of conflicts, address assignment in do_mmap()
should take cache color into account on machines with virtually indexed
caches which lack hardware cache coherency (such as the R4000PC, R4600,
and R5000).
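
The brute-force variant could look roughly like the toy model below.  It is
only a sketch under simplifying assumptions: it records the owning color per
page rather than a real PTE pointer, the array stands in for something
parallel to mem_map, and a counter stands in for the actual PTE invalidation
and cache flush.  All names (colormap_init, colormap_fault, colormap_flushes,
NR_PAGES) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NR_PAGES  16      /* hypothetical size of the toy page array */
#define NO_OWNER  (-1)

/* One entry per physical page: the color of the single mapping that is
 * currently allowed to have hardware-valid PTEs. */
static int owner_color[NR_PAGES];
static unsigned long flushes;   /* stands in for PTE invalidate + cache flush */

void colormap_init(void)
{
	size_t i;

	for (i = 0; i < NR_PAGES; i++)
		owner_color[i] = NO_OWNER;
	flushes = 0;
}

/* Called on a fault for page `pfn` through a mapping of color `color`.
 * Returns 1 if ownership had to be switched away from another color. */
int colormap_fault(size_t pfn, int color)
{
	int switched;

	assert(pfn < NR_PAGES);
	if (owner_color[pfn] == color)
		return 0;               /* same color: nothing to do */
	switched = (owner_color[pfn] != NO_OWNER);
	if (switched)
		flushes++;              /* real code: invalidate old PTEs, flush */
	owner_color[pfn] = color;
	return switched;
}

unsigned long colormap_flushes(void)
{
	return flushes;
}
```

Two mappings of different colors that take turns touching the same page pay
one switch per turn in this model, which is why reducing conflicts in
do_mmap() matters.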

    

* Re: Memory corruption
  1999-06-30 22:01     ` William J. Earl
@ 1999-07-01  0:23       ` Ralf Baechle
  1999-07-01  0:53         ` William J. Earl
  0 siblings, 1 reply; 24+ messages in thread
From: Ralf Baechle @ 1999-07-01  0:23 UTC (permalink / raw)
  To: William J. Earl; +Cc: Ulf Carlsson, linux

On Wed, Jun 30, 1999 at 03:01:27PM -0700, William J. Earl wrote:

>      I have been looking at the fault handling and the cache flushing routines
> for the R4600.  In do_no_page() in mm/memory.c, we have:
> 
> 	flush_page_to_ram(page);
> 
> I don't see where any code invalidates the icache, which might have
> cached lines from a previous incarnation of the page.
> flush_page_to_ram(), for the R4600, essentially does a writeback of
> the dcache, if I understand the code correctly.  I believe that an
> icache invalidate is also needed, at least for executable pages
> (including any page for which mprotect() with PROT_EXEC has been
> called, not just for text pages from an executable file).  Also,
> unless something has changed, my understanding is that conflicting
> virtual aliases (in the dcache) are still possible, which will also
> lead to data corruption when it happens.

The particular flush_page_to_ram() call is only necessary because the
call to vma->vm_ops->nopage() may have brought the page into the
primary cache under its KSEG0 address.

>      In particular, if process A mmaps a file page at virtual index
> 0 and process B happens to mmap the same file page at virtual index
> 1, they will in general corrupt each other's view of the data.

Oh, the common case is either shared r/o mappings or SysV SHM which per
ABI is 64kb aligned, so the hairy case doesn't hit us.  Usually ...

Especially I don't see why anything should corrupt executable pages
which are r/o mapped.

>      There is a comment in memory.c that a non-present page shouldn't
> be cached, but it is not yet clear to me that this is guaranteed for
> the icache.

Flushing the caches for pages which are being unmapped is done by
flush_cache_page and takes care of the VM_EXEC flag.

On exec, fork or exit we flush the entire cache so that problems shouldn't
hit us either.

Actually we're pretty generous with our cache flushes, we flush more than we
should.

> Also, the flush_page_to_ram() slows down processing on
> machines which have physical cache tags, for cases where the virtual
> index used by the kernel and the virtual index used by the application
> are the same.  It should have an extra argument of the intended user virtual
> address, so that it can decide whether to flush or not on architectures
> such as MIPS.

For R3000 and R6000 flush_page_to_ram() is a no-op, see arch/mips/mm/r2300.c
and arch/mips/mm/r6000.c.

For virtual indexed CPUs something like change_page_colour(oldvaddr, newvaddr)
would usually do a more efficient job than always flushing the page to
memory especially when combined with an allocator which takes the vaddr where
the page will be mapped as a hint.

  Ralf

* Re: Memory corruption
  1999-07-01  0:23       ` Ralf Baechle
@ 1999-07-01  0:53         ` William J. Earl
  1999-07-01 11:25           ` Harald Koerfgen
                             ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: William J. Earl @ 1999-07-01  0:53 UTC (permalink / raw)
  To: Ralf Baechle; +Cc: William J. Earl, Ulf Carlsson, linux

Ralf Baechle writes:
...
 > >      In particular, if process A mmaps a file page at virtual index
 > > 0 and process B happens to mmap the same file page at virtual index
 > > 1, they will in general corrupt each other's view of the data.
 > 
 > Oh, the common case is either shared r/o mappings or SysV SHM which per
 > ABI is 64kb aligned, so the hairy case doesn't hit us.  Usually ...
 > 
 > Especially I don't see why anything should corrupt executable pages
 > which are r/o mapped.

     Suppose physical page X has been used as logical page 100 of executable
file ABC, and is then freed, but is still partially in the icache at 
virtual index 0.  Then suppose the page X is reused as logical page 200 of
executable DEF, at virtual index 0.  The writeback of the data cache is
good, but there are still cache lines from file ABC in the icache.  If
nothing flushes the icache (and there is no reason to flush the icache
when reusing a page for data), the icache will have stale data with respect
to the new identity of page X as logical page 200 of executable DEF.
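
The scenario can be reproduced with a toy model.  This is a sketch only: a
tiny direct-mapped, physically tagged instruction cache with made-up sizes,
not the R4600's real organization, and icache_fetch() and
icache_invalidate_all() are hypothetical names.

```c
#include <assert.h>
#include <string.h>

#define LINE_WORDS 4    /* made-up line size, in words */
#define NUM_LINES  8    /* made-up number of cache lines */

struct icache_line {
	int valid;
	unsigned long tag;              /* physical line address */
	unsigned int data[LINE_WORDS];
};

static struct icache_line icache[NUM_LINES];

/* Fetch the word at physical word index `paddr` of `mem` through the toy
 * icache.  On a hit, memory is not consulted at all. */
unsigned int icache_fetch(const unsigned int *mem, unsigned long paddr)
{
	unsigned long line_addr = paddr / LINE_WORDS;
	struct icache_line *l = &icache[line_addr % NUM_LINES];

	assert(mem != NULL);
	if (!l->valid || l->tag != line_addr) {     /* miss: refill the line */
		memcpy(l->data, &mem[line_addr * LINE_WORDS], sizeof(l->data));
		l->tag = line_addr;
		l->valid = 1;
	}
	return l->data[paddr % LINE_WORDS];
}

/* Drop every cached line, as an icache invalidate would. */
void icache_invalidate_all(void)
{
	memset(icache, 0, sizeof(icache));
}
```

If a word is fetched once, memory is then overwritten (page X gets its new
identity), and the word is fetched again, the model still returns the old
contents because the physical tag still matches; only after
icache_invalidate_all() does the fetch see the new contents.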

     Also, if there are incompatible aliases for a page, and there are
dirty lines left in the cache when the mapping for, say, virtual index
1 is released, and then the mapping for virtual index 0 is also released,
and the page, which has KSEG0 virtual index 0 is used for I/O, the normal
flushing will flush only virtual index 0.  A later victim writeback of
the dirty lines for virtual index 1 will overwrite the new data with
stale data, even if the new data is instructions.  This case can apply
even if the one alias is a kernel KSEG0 alias and the other is a 
user alias.  For regular file I/O, this is not a problem, but it is a problem
with mmap(), particularly since Linux mmap() makes no attempt to keep multiple
mappings of the same page of a file color-congruent.  (mmap() addresses
are essentially arbitrary.)  

     The icache issue applies to all processors.  The dcache issue applies only
to the R4000PC, R4600, and R5000.

 > >      There is a comment in memory.c that a non-present page shouldn't
 > > be cached, but it is not yet clear to me that this is guaranteed for
 > > the icache.
 > 
 > Flushing the caches for pages which are being unmapped is done by
 > flush_cache_page and takes care of the VM_EXEC flag.
 > 
 > On exec, fork or exit we flush the entire cache so that problems shouldn't
 > hit us either.

      It is not clear this works as expected if the page is stolen by
vmscan.

 > Actually we're pretty generous with our cache flushes, we flush more than we
 > should.

     Yes, but it is not clear that all paths are covered.

 > > Also, the flush_page_to_ram() slows down processing on
 > > machines which have physical cache tags, for cases where the virtual
 > > index used by the kernel and the virtual index used by the application
 > > are the same.  It should have an extra argument of the intended user virtual
 > > address, so that it can decide whether to flush or not on architectures
 > > such as MIPS.
 > 
 > For R3000 and R6000 flush_page_to_ram() is a no-op, see arch/mips/mm/r2300.c
 > and arch/mips/mm/r6000.c.

    Yes, since those have write-through caches.  The icache
invalidation is still an issue, if there are any paths, such as
try_to_swap_out(), which break a virtual-to-physical mapping without
flushing the icache.

 > For virtual indexed CPUs something like change_page_colour(oldvaddr, newvaddr)
 > would usually do a more efficient job than always flushing the page to
 > memory especially when combined with an allocator which takes the vaddr where
 > the page will be mapped as a hint.

      Right.  Also, for IRIX and RISCos, I had mmap prefer an mmap
address for which color(address) == color(file_offset), so that
applications not using MAP_FIXED would always map a given file page at
the same virtual color, and I had the kernel use page_mapin() to make
a page addressable, so that I could have page_mapin() create a KSEG2
mapping of the appropriate color if it were different from the KSEG0
color of the page (for cases where the allocator could not allocate a
page with KSEG0 color to match the desired virtual color).
page_mapin() would of course return the KSEG0 address if the KSEG0
color matched the virtual color.  The color changing code is still
needed to deal with MAP_FIXED and so on, but it is much less
performance-critical.
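
The address preference can be sketched as follows.  This is not the IRIX or
RISCos code; it is a minimal illustration assuming 4 KB pages and two page
colors, and color_of() and colored_mmap_addr() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT  12                          /* assumed 4 KB pages */
#define PAGE_BYTES  ((uintptr_t)1 << PAGE_SHIFT)
#define NUM_COLORS  2u                          /* assumed color count */

/* Color of a virtual address or of a byte offset into a file. */
unsigned int color_of(uint64_t addr_or_offset)
{
	return (unsigned int)((addr_or_offset >> PAGE_SHIFT) & (NUM_COLORS - 1));
}

/* Return the first page-aligned address at or above `hint` for which
 * color(address) == color(file_offset). */
uintptr_t colored_mmap_addr(uintptr_t hint, uint64_t file_offset)
{
	uintptr_t addr = (hint + PAGE_BYTES - 1) & ~(PAGE_BYTES - 1);
	unsigned int want = color_of(file_offset);
	unsigned int have = color_of(addr);
	unsigned int skip = (want - have) & (NUM_COLORS - 1);  /* pages to skip */

	/* mmap offsets are page-aligned, so the offset has a well-defined color. */
	assert((file_offset & (PAGE_BYTES - 1)) == 0);
	return addr + (uintptr_t)skip * PAGE_BYTES;
}
```

Because the chosen address depends only on the file offset's color, every
process that lets the kernel pick the address maps a given file page at the
same color, and only MAP_FIXED callers can create a conflicting alias.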

* Re: Memory corruption
  1999-07-01  0:53         ` William J. Earl
@ 1999-07-01 11:25           ` Harald Koerfgen
  1999-07-02 22:41           ` Ralf Baechle
  1999-07-06 13:05           ` Ralf Baechle
  2 siblings, 0 replies; 24+ messages in thread
From: Harald Koerfgen @ 1999-07-01 11:25 UTC (permalink / raw)
  To: William J. Earl; +Cc: linux, Ulf Carlsson, Ralf Baechle, linux-mips


On 01-Jul-99 William J. Earl wrote:
> Ralf Baechle writes:
[...]
>  > Actually we're pretty generous with our cache flushes, we flush more than we
>  > should.
> 
>      Yes, but it is not clear that all paths are covered.
> 
>  > > Also, the flush_page_to_ram() slows down processing on
>  > > machines which have physical cache tags, for cases where the virtual
>  > > index used by the kernel and the virtual index used by the application
>  > > are the same.  It should have an extra argument of the intended user virtual
>  > > address, so that it can decide whether to flush or not on architectures
>  > > such as MIPS.
>  > 
>  > For R3000 and R6000 flush_page_to_ram() is a no-op, see arch/mips/mm/r2300.c
>  > and arch/mips/mm/r6000.c.
> 
>     Yes, since those have write-through caches.  The icache
> invalidation is still an issue, if there are any paths, such as
> try_to_swap_out(), which break a virtual-to-physical mapping without
> flushing the icache.

A good point. That seems to be exactly the problem R3k DECstations have. Processes
are dying with SIGABRT, SIGBUS or SIGSEGV shortly after swapping occurs. Trying to
hunt that down I removed all optimisations from the cacheflushing routines and 
replaced them with flush_cache_all() but that didn't help.

---
Regards,
Harald

* Re: Memory corruption
  1999-07-01  0:53         ` William J. Earl
  1999-07-01 11:25           ` Harald Koerfgen
@ 1999-07-02 22:41           ` Ralf Baechle
  1999-07-06 13:05           ` Ralf Baechle
  2 siblings, 0 replies; 24+ messages in thread
From: Ralf Baechle @ 1999-07-02 22:41 UTC (permalink / raw)
  To: William J. Earl; +Cc: Ulf Carlsson, linux

On Wed, Jun 30, 1999 at 05:53:58PM -0700, William J. Earl wrote:

>      Suppose physical page X has been used as logical page 100 of executable
> file ABC, and is then freed, but is still partially in the icache at 
> virtual index 0.  Then suppose the page X is reused as logical page 200 of
> executable DEF, at virtual index 0.  The writeback of the data cache is
> good, but there are still cache lines from file ABC in the icache.  If
> nothing flushes the icache (and there is no reason to flush the icache
> when reusing a page for data), the icache will have stale data with respect
> to the new identity of page X as logical page 200 of executable DEF.

Ok, yes that can happen in theory if code has been executed in a page
which was not marked PROT_EXEC but was executed anyway.  Fixing that makes things
quite a bit slower, we'll have to flush the icache on every flush_cache_page.
flush_cache_range() already does this.

Hmm...  Maybe a my-software-behaves-properly-and-I-know-this-is-dangerous-
sysctl() which reestablishes the current i-cache flushing behaviour if
VM_EXEC is unset?

I herewith order an execution protection bit for the next generation MIPS
and while we're at it an integer add with carry for faster IP checksums.

> The icache issue applies to all processors.  The dcache issue applies only
> to the R4000PC, R4600, and R5000.

And R41xx, R42xx, R43xx, R4700, Nevada, Kronus, Sony Playstation II CPU ...

>  > Flushing the caches for pages which are being unmapped is done by
>  > flush_cache_page and takes care of the VM_EXEC flag.
>  > 
>  > On exec, fork or exit we flush the entire cache so that problems shouldn't
>  > hit us either.
>
> It is not clear this works as expected if the page is stolen by vmscan.

The thing is that as I already mentioned above a page might be in the
icache even though it isn't marked as VM_EXEC.

>  > Actually we're pretty generous with our cache flushes, we flush more
>  > than we should.
>
> Yes, but it is not clear that all paths are covered.
>
>  > > Also, the flush_page_to_ram() slows down processing on
>  > > machines which have physical cache tags, for cases where the virtual
>  > > index used by the kernel and the virtual index used by the application
>  > > are the same.  It should have an extra argument of the intended user
>  > > virtual address, so that it can decide whether to flush or not on
>  > > architectures such as MIPS.
>  > 
>  > For R3000 and R6000 flush_page_to_ram() is a no-op, see
>  >  arch/mips/mm/r2300.c and arch/mips/mm/r6000.c.
>
> Yes, since those have write-through caches.

The cache write policy doesn't matter in that case.

> The icache invalidation is still an issue, if there are any paths, such
> as try_to_swap_out(), which break a virtual-to-physical mapping without
> flushing the icache.

>  > For virtual indexed CPUs something like change_page_colour(oldvaddr,
>  > newvaddr) would usually do a more efficient job than always flushing the
>  > page to memory especially when combined with an allocator which takes the
>  > vaddr where the page will be mapped as a hint.
>
>       Right.  Also, for IRIX and RISCos, I had mmap prefer an mmap
> address for which color(address) == color(file_offset), so that
> applications not using MAP_FIXED would always map a given file page at
> the same virtual color, and I had the kernel use page_mapin() to make
> a page addressable, so that I could have page_mapin() create a KSEG2
> mapping of the appropriate color if it were different from the KSEG0
> color of the page (for cases where the allocator could not allocate a
> page with KSEG0 color to match the desired virtual color).
> page_mapin() would of course return the KSEG0 address if the KSEG0
> color matched the virtual color.  The color changing code is still
> needed to deal with MAP_FIXED and so on, but it is much less
> performance-critical.

That will also deal efficiently with the way ld.so loads ELF binaries.

  Ralf

* Re: Memory corruption
  1999-07-01  0:53         ` William J. Earl
  1999-07-01 11:25           ` Harald Koerfgen
  1999-07-02 22:41           ` Ralf Baechle
@ 1999-07-06 13:05           ` Ralf Baechle
  1999-07-07 21:08             ` Harald Koerfgen
  2 siblings, 1 reply; 24+ messages in thread
From: Ralf Baechle @ 1999-07-06 13:05 UTC (permalink / raw)
  To: William J. Earl; +Cc: Ulf Carlsson, linux, linux-mips, linux-mips

I've received a report from some person who is working on his own R3081
port.  He also observes data corruption and suspects reading of swapped
pages is causing that.

Sigh,

  Ralf

* Re: Memory corruption
  1999-07-06 13:05           ` Ralf Baechle
@ 1999-07-07 21:08             ` Harald Koerfgen
  1999-07-08  1:51               ` Warner Losh
  1999-07-08 10:39               ` Ralf Baechle
  0 siblings, 2 replies; 24+ messages in thread
From: Harald Koerfgen @ 1999-07-07 21:08 UTC (permalink / raw)
  To: Ralf Baechle; +Cc: linux-mips, linux-mips, linux, Ulf Carlsson, William J. Earl


On 06-Jul-99 Ralf Baechle wrote:
> I've received a report from some person who is working on his own R3081
> port.  He also observes data corruption and suspects reading of swapped
> pages is causing that.

That's definitely true for R3k DECstations, and no, flushing the icache in
flush_tlb_page() does not help. I have added cacheflushing to all tlb routines,
copy_page and even rw_swap_page_base() and swap_after_unlock_page() without
success.

Any ideas?
---
Regards,
Harald

P.S.: I'll be on vacation until July 18th so this has to wait a little bit :-)

* Re: Memory corruption
  1999-07-07 21:08             ` Harald Koerfgen
@ 1999-07-08  1:51               ` Warner Losh
  1999-07-08  3:12                 ` William J. Earl
  1999-07-08 10:39               ` Ralf Baechle
  1 sibling, 1 reply; 24+ messages in thread
From: Warner Losh @ 1999-07-08  1:51 UTC (permalink / raw)
  To: Harald Koerfgen
  Cc: Ralf Baechle, linux-mips, linux-mips, linux, Ulf Carlsson,
	William J. Earl

In message <XFMail.990707230857.Harald.Koerfgen@home.ivm.de> Harald Koerfgen writes:
: That's definitely true for R3k DECstations, and no, flushing the icache in
: flush_tlb_page() does not help. I have added cacheflushing to all tlb routines,
: copy_page and even rw_swap_page_base() and swap_after_unlock_page() without
: success.

Don't you want to flush the dcache as well?  I think that you can run
into problems when you have a dirty dcache and then dma into the pages
that are dirty.  Instant karma corruption, no?  Or am I thinking of
some other problem?

Warner

* Re: Memory corruption
  1999-07-08  1:51               ` Warner Losh
@ 1999-07-08  3:12                 ` William J. Earl
       [not found]                   ` <37846EE7.EADD9E32@niisi.msk.ru>
  0 siblings, 1 reply; 24+ messages in thread
From: William J. Earl @ 1999-07-08  3:12 UTC (permalink / raw)
  To: Warner Losh
  Cc: Harald Koerfgen, Ralf Baechle, linux-mips, linux-mips, linux,
	Ulf Carlsson, William J. Earl

Warner Losh writes:
 > In message <XFMail.990707230857.Harald.Koerfgen@home.ivm.de> Harald Koerfgen writes:
 > : That's definitely true for R3k DECstations, and no, flushing the icache in
 > : flush_tlb_page() does not help. I have added cacheflushing to all tlb routines,
 > : copy_page and even rw_swap_page_base() and swap_after_unlock_page() without
 > : success.
 > 
 > Don't you want to flush the dcache as well?  I think that you can run
 > into problems when you have a dirty dcache and then dma into the pages
 > that are dirty.  Instant karma corruption, no?  Or am I thinking of
 > some other problem?

      The R3000 has a write-through cache, so there cannot be dirty cache
lines, although you do have to flush the write buffers to be completely
correct (in the case of a DMA device writing to memory VERY quickly after
the register write which starts it up, on some hardware). 

* Re: Memory corruption
  1999-07-07 21:08             ` Harald Koerfgen
  1999-07-08  1:51               ` Warner Losh
@ 1999-07-08 10:39               ` Ralf Baechle
  1 sibling, 0 replies; 24+ messages in thread
From: Ralf Baechle @ 1999-07-08 10:39 UTC (permalink / raw)
  To: Harald Koerfgen
  Cc: linux-mips, linux-mips, linux, Ulf Carlsson, William J. Earl

On Wed, Jul 07, 1999 at 11:08:57PM +0200, Harald Koerfgen wrote:

> On 06-Jul-99 Ralf Baechle wrote:
> > I've received a report from some person who is working on his own R3081
> > port.  He also observes data corruption and suspects reading of swapped
> > pages is causing that.
> 
> That's definitely true for R3k DECstations, and no, flushing the icache in
> flush_tlb_page() does not help. I have added cacheflushing to all tlb routines,
> copy_page and even rw_swap_page_base() and swap_after_unlock_page() without
> success.

Note that on R3000 with its physically indexed caches there is no way that
cache problems should be able to crash the whole system.  At least under the
provision that DMA drivers get their cacheflushing right.

I recently tried to put our memcpy / memmove from the kernel into libc
and as a result ended up with a libc which was almost unusable.  Also, a
part of memmove is disabled by #if 0, it was demonstrated to cause data
corruption.  Time to fix that bastard.  The whole file is a big mess, btw.
because the code tries to share as much code as possible between memcpy,
memmove and __copy_{to,from}_user.  So put on your peril sensitive
glasses ;-)

> P.S.: I'll be on vacation until July 18th so this has to wait a little bit :-)

s/.*/P.S.: I have plenty of time for hacking during my vacation :-)/p ;-)

  Ralf

* Re: Memory corruption
       [not found]                   ` <37846EE7.EADD9E32@niisi.msk.ru>
@ 1999-07-08 17:56                     ` William J. Earl
  0 siblings, 0 replies; 24+ messages in thread
From: William J. Earl @ 1999-07-08 17:56 UTC (permalink / raw)
  To: Gleb O. Raiko
  Cc: William J. Earl, Warner Losh, Harald Koerfgen, Ralf Baechle,
	linux-mips, linux-mips, linux, Ulf Carlsson

Gleb O. Raiko writes:
 > "William J. Earl" wrote:
...
 > >       The R3000 has a write-through cache, so there cannot be dirty cache
 > > lines, although you do have to flush the write buffers to be completely
 > > correct (in the case of a DMA device writing to memory VERY quickly after
 > > the register write which starts it up, on some hardware).
 > 
 > You must flush d-cache after dma. While some cache controllers are able
 > to watch the bus and flush the data that are invalidated due to DMA
 > transfers, I think, most r3k boxes don't have such beasts. Flushing
 > d-cache wasn't implemented at the same time as the cache stuff because
 > we hadn't boxes with DMA devices.

     Most R3000 (and many R4000/R4600/R5000) boxes do not have
cache-coherent I/O, and Linux/MIPS does do cache flushing.  If
everything is well-organized, one can flush the d-cache only before an
I/O.  On an R3000, it does not much matter which approach you take,
since the caches are write-through (aside from the need to flush the
write-buffer before initiating a DMA).  For later processors, you must
flush the d-cache BEFORE a DMA, since victim writebacks of dirty lines
after a DMA into memory has updated memory will lead to I/O data
corruption, and failure to flush dirty lines before a DMA from memory
will lead to stale data being written to disk.  If it is possible for
the CPU to access the buffer during the DMA, then you must invalidate
the cache for the buffer after a DMA into memory as well, but a
well-constructed system should never do that.  

    If you have a buffer which is not cache-line-aligned (which is
possible with the general case of raw or direct I/O, although not in
unmodified Linux at the moment), then, for DMA into memory, you must
use temporary buffers for any portion of the buffer which occupies
just part of a cache line, and copy the data from the temporary buffer
to the real buffer after the DMA completes, to account for the
possibility of a separate thread modifying data outside the buffer in
the shared cache line, leading to a victim writeback (or a
writethrough on the R3000).  This could apply even to the R3000, depending
on how the compiler generates code for a partial-word update, although
it is unlikely.
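
Working out which parts of a buffer need a temporary buffer is simple
arithmetic.  The sketch below assumes a 32-byte cache line (the real line
size depends on the CPU); struct dma_split and split_for_dma() are
hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32u   /* assumed line size in bytes; varies by CPU */

struct dma_split {
	size_t head;    /* leading bytes that share a cache line with other data */
	size_t middle;  /* whole cache lines, safe to DMA into directly */
	size_t tail;    /* trailing bytes that share a cache line with other data */
};

/* Split the buffer [addr, addr + len) at cache-line boundaries. */
struct dma_split split_for_dma(uintptr_t addr, size_t len)
{
	struct dma_split s = {0, 0, 0};
	size_t misalign = (size_t)(addr & (CACHE_LINE - 1));

	assert((CACHE_LINE & (CACHE_LINE - 1)) == 0);   /* power of two */
	if (misalign != 0) {
		s.head = CACHE_LINE - misalign;
		if (s.head > len)
			s.head = len;   /* buffer ends inside its first line */
	}
	s.tail = (len - s.head) & (CACHE_LINE - 1);
	s.middle = len - s.head - s.tail;
	return s;
}
```

The head and tail pieces are the ones that would go through temporary
buffers and be copied into place after the DMA completes; the middle piece
can be transferred directly.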

* Memory Corruption
@ 2001-01-05  8:33 Ryan Sizemore
  0 siblings, 0 replies; 24+ messages in thread
From: Ryan Sizemore @ 2001-01-05  8:33 UTC (permalink / raw)
  To: Linux-Kernel

This message has a couple of questions to it, so maybe a few people might
want to contribute to answering them all. My apologies in advance for the
long length of this post.

The Problem:
I have an Alpha PC164 with 512 Meg of memory. As a friend and I were setting
it up, we tried to compile mozilla. At some point during the install, a
repeating error would scroll by the screen so fast that we could not read
it. From what we could pick out, we determined that the error was memory
related. We deduced that since compiling mozilla would fill the entire bank
of memory, once gcc (or whatever directly writes to memory) tried to address
the bad area of memory, gcc would produce the error. Also, after trying to
recompile mozilla a number of times, the error would be at a random point,
usually after about 15 or 20 minutes of compiling. From this information, we
hazard to guess that one of the eight 64 Meg SIMMS was bad, or contained a
bad area. Therefore, we removed the last 4 of the 8 modules, and the error
never occurred.

The suggested solution:
We plan to swap each of the 4 remaining modules with one of the 4 that we
removed earlier, one at a time, and try to compile mozilla, since it will
fill all of the memory. Then, hopefully, we can rotate the modules to find
the one that contains the bad area.

We are not quite sure what to do from there. Here are our ideas:
1. One suggestion I made was to create a ram drive over the last 64 Meg of
addressable memory, then simply not read or write to the drive. Is that even
possible? Can I tell the kernel to create a ram drive over a certain area of
memory?
2. Another idea I had was to tell the kernel to only use a certain size of
memory, with a modification to lilo.conf: append="mem=448m" since 512(the
total memory) - 64(the size of the module) = 448Meg. Will this work? Any
ideas?

Another question:
We are not sure if the memory is ECC or not, but we think that there is a
good chance of it. Are there any kernel optimizations that can be made so
that the kernel can map out the bad memory and mark it so that it can't be
used? The machine is booted from an SRM prompt, if that helps.

Please let me know if anyone has any ideas on these problems. Thanks in
advance to all those out there who took the time to read this.

--Ryan Sizemore


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Memory corruption
@ 2002-08-15 20:26 Dave Boutcher
  2002-08-15 20:36 ` Andreas Dilger
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Boutcher @ 2002-08-15 20:26 UTC (permalink / raw)
  To: linux-fsdevel

Hi,

I'm chasing a weird memory corruption problem on a ppc64 system.  The
first byte of a slab_t structure keeps getting stepped on (zeroed,
actually.)  This happens during a testcase that copies a large file
called "junk" between file systems (a mix of ext2 and reiser) on a
2.4.13 kernel.

In every case, the page immediately preceding the slab_t has exactly the
same data in it, and it looks like some kind of directory structure
(note the presence of the word "junk", along with ".." and "." towards
the end.)

C000000037008E00: FD8C0600 FE8C0600 FF8C0600 008D0600 <                >
C000000037008E10: 018D0600 028D0600 038D0600 048D0600 <                >
C000000037008E20: 058D0600 068D0600 078D0600 088D0600 <                >
C000000037008E30: 098D0600 0A8D0600 0B8D0600 0C8D0600 <                >
C000000037008E40: 0D8D0600 0E8D0600 0F8D0600 108D0600 <                >
C000000037008E50: 118D0600 128D0600 138D0600 148D0600 <                >
C000000037008E60: 158D0600 168D0600 178D0600 188D0600 <                >
C000000037008E70: 198D0600 1A8D0600 1B8D0600 1C8D0600 <                >
C000000037008E80: 1D8D0600 1E8D0600 1F8D0600 208D0600 <                >
C000000037008E90: 218D0600 228D0600 238D0600 248D0600 <!   "   #   $   >
C000000037008EA0: 258D0600 268D0600 278D0600 288D0600 <%   &   '   (   >
C000000037008EB0: 298D0600 2A8D0600 2B8D0600 2C8D0600 <)   *   +   ,   >
C000000037008EC0: 2D8D0600 2E8D0600 2F8D0600 308D0600 <-   .   /   0   >
C000000037008ED0: 318D0600 328D0600 338D0600 348D0600 <1   2   3   4   >
C000000037008EE0: 358D0600 368D0600 378D0600 388D0600 <5   6   7   8   >
C000000037008EF0: 398D0600 3A8D0600 3B8D0600 3C8D0600 <9   :   ;   <   >
C000000037008F00: 3D8D0600 3E8D0600 3F8D0600 408D0600 <=   >   ?   @   >
C000000037008F10: 418D0600 428D0600 438D0600 448D0600 <A   B   C   D   >
C000000037008F20: 458D0600 468D0600 478D0600 488D0600 <E   F   G   H   >
C000000037008F30: 498D0600 4A8D0600 4B8D0600 4C8D0600 <I   J   K   L   >
C000000037008F40: 4D8D0600 4E8D0600 4F8D0600 508D0600 <M   N   O   P   >
C000000037008F50: 518D0600 528D0600 538D0600 548D0600 <Q   R   S   T   >
C000000037008F60: 558D0600 A4810000 01000000 0020F906 <U               >
C000000037008F70: 00000000 00000000 00000000 B377493D <             wI=>
C000000037008F80: C377493D C377493D 907C0300 32000000 < wI= wI= |  2   >
C000000037008F90: 01000000 01000000 02000000 40000400 <            @   >
C000000037008FA0: 02000000 00000000 01000000 38000400 <            8   >
C000000037008FB0: 80F1A501 02000000 03000000 30000400 <            0   >
C000000037008FC0: 6A756E6B 00000000 2E2E0000 00000000 <junk    ..      >
C000000037008FD0: 2E000000 00000000 ED4174F0 03000000 <.        At     >
C000000037008FE0: 48000000 00000000 00000000 00000000 <H               >
C000000037008FF0: 91B2103D B377493D B377493D 01000000 <   = wI= wI=    >

The byte immediately following that gets zeroed.  It sure looks to me
like someone is going over the end of a buffer.

The question is, does anyone recognize that data structure?!?!?!

Thanks!!!

Dave B




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Memory corruption
  2002-08-15 20:26 Memory corruption Dave Boutcher
@ 2002-08-15 20:36 ` Andreas Dilger
  0 siblings, 0 replies; 24+ messages in thread
From: Andreas Dilger @ 2002-08-15 20:36 UTC (permalink / raw)
  To: Dave Boutcher; +Cc: linux-fsdevel

On Aug 15, 2002  15:26 -0500, Dave Boutcher wrote:
> I'm chasing a weird memory corruption problem on a ppc64 system.  The
> first byte of a slab_t structure keeps getting stepped on (zeroed,
> actually.)  This happens during a testcase that copies a large file
> called "junk" between file systems (a mix of ext2 and reiser) on a
> 2.4.13 kernel.

Well, I hate to say it, but 2.4.13 is a very old kernel.  Also, since
reiserfs was fairly new to non-x86 architectures there may have been
significant fixes since then.

> C000000037008E00: FD8C0600 FE8C0600 FF8C0600 008D0600 <                >
> C000000037008E10: 018D0600 028D0600 038D0600 048D0600 <                >
> C000000037008E20: 058D0600 068D0600 078D0600 088D0600 <                >
> C000000037008E30: 098D0600 0A8D0600 0B8D0600 0C8D0600 <                >
> C000000037008E40: 0D8D0600 0E8D0600 0F8D0600 108D0600 <                >
> C000000037008E50: 118D0600 128D0600 138D0600 148D0600 <                >
> C000000037008E60: 158D0600 168D0600 178D0600 188D0600 <                >
> C000000037008E70: 198D0600 1A8D0600 1B8D0600 1C8D0600 <                >
> C000000037008E80: 1D8D0600 1E8D0600 1F8D0600 208D0600 <                >
> C000000037008E90: 218D0600 228D0600 238D0600 248D0600 <!   "   #   $   >
> C000000037008EA0: 258D0600 268D0600 278D0600 288D0600 <%   &   '   (   >
> C000000037008EB0: 298D0600 2A8D0600 2B8D0600 2C8D0600 <)   *   +   ,   >
> C000000037008EC0: 2D8D0600 2E8D0600 2F8D0600 308D0600 <-   .   /   0   >
> C000000037008ED0: 318D0600 328D0600 338D0600 348D0600 <1   2   3   4   >
> C000000037008EE0: 358D0600 368D0600 378D0600 388D0600 <5   6   7   8   >
> C000000037008EF0: 398D0600 3A8D0600 3B8D0600 3C8D0600 <9   :   ;   <   >
> C000000037008F00: 3D8D0600 3E8D0600 3F8D0600 408D0600 <=   >   ?   @   >
> C000000037008F10: 418D0600 428D0600 438D0600 448D0600 <A   B   C   D   >
> C000000037008F20: 458D0600 468D0600 478D0600 488D0600 <E   F   G   H   >
> C000000037008F30: 498D0600 4A8D0600 4B8D0600 4C8D0600 <I   J   K   L   >
> C000000037008F40: 4D8D0600 4E8D0600 4F8D0600 508D0600 <M   N   O   P   >
> C000000037008F50: 518D0600 528D0600 538D0600 548D0600 <Q   R   S   T   >
> C000000037008F60: 558D0600 A4810000 01000000 0020F906 <U               >
> C000000037008F70: 00000000 00000000 00000000 B377493D <             wI=>
> C000000037008F80: C377493D C377493D 907C0300 32000000 < wI= wI= |  2   >
> C000000037008F90: 01000000 01000000 02000000 40000400 <            @   >
> C000000037008FA0: 02000000 00000000 01000000 38000400 <            8   >
> C000000037008FB0: 80F1A501 02000000 03000000 30000400 <            0   >
> C000000037008FC0: 6A756E6B 00000000 2E2E0000 00000000 <junk    ..      >
> C000000037008FD0: 2E000000 00000000 ED4174F0 03000000 <.        At     >
> C000000037008FE0: 48000000 00000000 00000000 00000000 <H               >
> C000000037008FF0: 91B2103D B377493D B377493D 01000000 <   = wI= wI=    >
> 
> The byte immediately following that gets zeroed.  It sure looks to me
> like someone is going over the end of a buffer.
> 
> The question is, does anyone recognize that data structure?!?!?!

It doesn't look ext2-ish.  The ext2 on-disk directory entries would have
a few bytes between "junk", "..", and "." (reclen, namelen, inode number),
and would be in the order ".", "..", and "junk" instead.  The items are
also too small to be dentries or dirents from a readdir.  I don't know
enough about reiserfs to say either way, but I would suggest posting to
their list also (be prepared again for the "your kernel is too old" from
them as well).

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Memory Corruption
@ 2002-08-19 16:50 Dave Boutcher
  2002-08-19 20:01 ` Chris Mason
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Boutcher @ 2002-08-19 16:50 UTC (permalink / raw)
  To: reiserfs-list

Hi,

I'm chasing a weird memory corruption problem on a ppc64 system.  The
first byte of a slab_t structure keeps getting stepped on (zeroed,
actually.)  This happens during a testcase that copies a large file
called "junk" between file systems (a mix of ext2 and reiser) on a
2.4.13 kernel.  I know that's REALLY REALLY old, but it's what's in
SuSE's SLES-7 release that we have customers running...

In every case, the page immediately preceding the slab_t has exactly the
same data in it, and it looks like some kind of directory structure
(note the presence of the word "junk", along with ".." and "." towards
the end.)

C000000037008E00: FD8C0600 FE8C0600 FF8C0600 008D0600 <                >
C000000037008E10: 018D0600 028D0600 038D0600 048D0600 <                >
C000000037008E20: 058D0600 068D0600 078D0600 088D0600 <                >
C000000037008E30: 098D0600 0A8D0600 0B8D0600 0C8D0600 <                >
C000000037008E40: 0D8D0600 0E8D0600 0F8D0600 108D0600 <                >
C000000037008E50: 118D0600 128D0600 138D0600 148D0600 <                >
C000000037008E60: 158D0600 168D0600 178D0600 188D0600 <                >
C000000037008E70: 198D0600 1A8D0600 1B8D0600 1C8D0600 <                >
C000000037008E80: 1D8D0600 1E8D0600 1F8D0600 208D0600 <                >
C000000037008E90: 218D0600 228D0600 238D0600 248D0600 <!   "   #   $   >
C000000037008EA0: 258D0600 268D0600 278D0600 288D0600 <%   &   '   (   >
C000000037008EB0: 298D0600 2A8D0600 2B8D0600 2C8D0600 <)   *   +   ,   >
C000000037008EC0: 2D8D0600 2E8D0600 2F8D0600 308D0600 <-   .   /   0   >
C000000037008ED0: 318D0600 328D0600 338D0600 348D0600 <1   2   3   4   >
C000000037008EE0: 358D0600 368D0600 378D0600 388D0600 <5   6   7   8   >
C000000037008EF0: 398D0600 3A8D0600 3B8D0600 3C8D0600 <9   :   ;   <   >
C000000037008F00: 3D8D0600 3E8D0600 3F8D0600 408D0600 <=   >   ?   @   >
C000000037008F10: 418D0600 428D0600 438D0600 448D0600 <A   B   C   D   >
C000000037008F20: 458D0600 468D0600 478D0600 488D0600 <E   F   G   H   >
C000000037008F30: 498D0600 4A8D0600 4B8D0600 4C8D0600 <I   J   K   L   >
C000000037008F40: 4D8D0600 4E8D0600 4F8D0600 508D0600 <M   N   O   P   >
C000000037008F50: 518D0600 528D0600 538D0600 548D0600 <Q   R   S   T   >
C000000037008F60: 558D0600 A4810000 01000000 0020F906 <U               >
C000000037008F70: 00000000 00000000 00000000 B377493D <             wI=>
C000000037008F80: C377493D C377493D 907C0300 32000000 < wI= wI= |  2   >
C000000037008F90: 01000000 01000000 02000000 40000400 <            @   >
C000000037008FA0: 02000000 00000000 01000000 38000400 <            8   >
C000000037008FB0: 80F1A501 02000000 03000000 30000400 <            0   >
C000000037008FC0: 6A756E6B 00000000 2E2E0000 00000000 <junk    ..      >
C000000037008FD0: 2E000000 00000000 ED4174F0 03000000 <.        At     >
C000000037008FE0: 48000000 00000000 00000000 00000000 <H               >
C000000037008FF0: 91B2103D B377493D B377493D 01000000 <   = wI= wI=    >

The byte immediately following that gets zeroed.  It sure looks to me
like someone is going over the end of a buffer.

The question is, does anyone recognize that data structure?!?!?!

Thanks!!!

Dave B




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Memory Corruption
  2002-08-19 16:50 Memory Corruption Dave Boutcher
@ 2002-08-19 20:01 ` Chris Mason
  2002-08-28 21:00   ` [reiserfs-list] " David Boutcher
  2002-08-28 21:00   ` David Boutcher
  0 siblings, 2 replies; 24+ messages in thread
From: Chris Mason @ 2002-08-19 20:01 UTC (permalink / raw)
  To: Dave Boutcher; +Cc: reiserfs-list

On Mon, 2002-08-19 at 12:50, Dave Boutcher wrote:
> Hi,
> 
> I'm chasing a weird memory corruption problem on a ppc64 system.  The
> first byte of a slab_t structure keeps getting stepped on (zeroed,
> actually.)  This happens during a testcase that copies a large file
> called "junk" between file systems (a mix of ext2 and reiser) on a
> 2.4.13 kernel.  I know that's REALLY REALLY old, but it's what's in
> SuSE's SLES-7 release that we have customers running...
> 
> In every case, the page immediately preceding the slab_t has exactly the
> same data in it, and it looks like some kind of directory structure
> (note the presence of the word "junk", along with ".." and "." towards
> the end.)

Any chance the test case involves renames?

-chris



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [reiserfs-list] Memory Corruption
  2002-08-19 20:01 ` Chris Mason
@ 2002-08-28 21:00   ` David Boutcher
  2002-08-29 13:40     ` Chris Mason
  2002-08-28 21:00   ` David Boutcher
  1 sibling, 1 reply; 24+ messages in thread
From: David Boutcher @ 2002-08-28 21:00 UTC (permalink / raw)
  To: Chris Mason; +Cc: reiserfs-list, linux-fsdevel


>On Mon, 2002-08-19 at 12:50, Dave Boutcher wrote:
>> Hi,
>>
>> I'm chasing a weird memory corruption problem on a ppc64 system.  The
>> first byte of a slab_t structure keeps getting stepped on (zeroed,
>> actually.)  This happens during a testcase that copies a large file
>> called "junk" between file systems (a mix of ext2 and reiser) on a
>> 2.4.13 kernel.  I know that's REALLY REALLY old, but it's what's in
>> SuSE's SLES-7 release that we have customers running...
>
>Any chance the test case involves renames?
>
>-chris

So I posted my problem with memory corruption a few weeks ago....and the
problem turned out to be a REALLY old/moldy set of userland reiser tools.
I don't know exactly why that caused memory corruption in the kernel, but
updating the tools fixed everything right up.

Dave B



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [reiserfs-list] Memory Corruption
  2002-08-28 21:00   ` [reiserfs-list] " David Boutcher
@ 2002-08-29 13:40     ` Chris Mason
  0 siblings, 0 replies; 24+ messages in thread
From: Chris Mason @ 2002-08-29 13:40 UTC (permalink / raw)
  To: David Boutcher; +Cc: reiserfs-list, linux-fsdevel

On Wed, 2002-08-28 at 17:00, David Boutcher wrote:
> 
> >On Mon, 2002-08-19 at 12:50, Dave Boutcher wrote:
> >> Hi,
> >>
> >> I'm chasing a weird memory corruption problem on a ppc64 system.  The
> >> first byte of a slab_t structure keeps getting stepped on (zeroed,
> >> actually.)  This happens during a testcase that copies a large file
> >> called "junk" between file systems (a mix of ext2 and reiser) on a
> >> 2.4.13 kernel.  I know that's REALLY REALLY old, but it's what's in
> >> SuSE's SLES-7 release that we have customers running...
> >
> >Any chance the test case involves renames?
> >
> >-chris
> 
> So I posted my problem with memory corruption a few weeks ago....and the
> problem turned out to be a REALLY old/moldy set of userland reiser tools.
> I don't know exactly why that caused memory corruption in the kernel, but
> updating the tools fixed everything right up.

Well, that shouldn't fix it ;-)  Which version of reiserfsprogs were you
running before?

Are the filesystems getting checked during boot at all (you would see
reiserfsck messages during boot)?

-chris



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Memory corruption
@ 2008-04-24 15:31 Geert Uytterhoeven
  0 siblings, 0 replies; 24+ messages in thread
From: Geert Uytterhoeven @ 2008-04-24 15:31 UTC (permalink / raw)
  To: Linux/PPC Development


	Hi,

I saw some random lockups on my PS3, so I decided to give the current kernel a
try on the PS3 development tool.  It crashes when setting up the network:

| <5>Sending DHCP requests ., OK
| IP-Config: Got DHCP answer from 192.168.106.200, my address is 192.168.106.196
| IP-Config: Complete:
|      device=eth0, addr=192.168.106.196, mask=255.255.255.0, gw=192.168.106.254,
|      host=192.168.106.196, domain=sonytel.be, nis-domain=(none),
|      bootserver=192.168.106.200, rootserver=192.168.106.200, rootpath=/disk-02/ps3linux/debian-powerpc
| <5>Looking up port of RPC 100003/2 on 192.168.106.200
| <0>Unrecoverable FP Unavailable Exception 800 at c000000000305220
| Oops: Unrecoverable FP Unavailable Exception, sig: 6 [#1]
| SMP NR_CPUS=2 PS3
| Modules linked in:
| NIP: c000000000305220 LR: c000000000304d34 CTR: c0000000003051c0
| REGS: c00000000604aa70 TRAP: 0800   Not tainted  (2.6.25-03562-g3dc5063-dirty)
| MSR: 8000000000008032 <EE,IR,DR>  CR: 24004082  XER: 00000000
| TASK = c000000006046040[1] 'swapper' THREAD: c000000006048000 CPU: 0
| <6>GPR00: 0000000000000800 c00000000604acf0 c000000000603a88 c000000006262680 
| <6>GPR04: 0662160400000002 0000000000004000 c0000000064a4110 c00000000062eda8 
| <6>GPR08: c0000000061a6000 0000000000000001 0000000000000100 c0000000062bf880 
| <6>GPR12: 0000001100000000 c000000000548300 0000000000000000 0000000000000000 
| <6>GPR16: 0000000000000000 000000000000005c 0000000000000000 000000000000005c 
| <6>GPR20: c0000000063a9db8 00000000c0a86ac8 0000000000000000 c0000000063a9d08 
| <6>GPR24: 0000000000000040 0000000000004000 c0000000063a9b80 c000000006391e00 
| <6>GPR28: c0000000064a4020 c000000006262680 c0000000005ae478 c00000000604acf0 
| NIP [c000000000305220] .ip_output+0x60/0x8c
| LR [c000000000304d34] .ip_local_out+0x50/0x78
| Call Trace:
| [c00000000604acf0] [c00000000604ada0] 0xc00000000604ada0 (unreliable)
| [c00000000604ad70] [c000000000304d34] .ip_local_out+0x50/0x78
| [c00000000604ae00] [c0000000003050c0] .ip_push_pending_frames+0x364/0x410
| [c00000000604aeb0] [c000000000326a60] .udp_push_pending_frames+0x350/0x408
| [c00000000604af70] [c000000000328048] .udp_sendmsg+0x4c4/0x630
| [c00000000604b0d0] [c0000000003306e4] .inet_sendmsg+0x84/0xb0
| [c00000000604b170] [c0000000002cd430] .sock_sendmsg+0xc4/0x108
| [c00000000604b370] [c0000000002ceed8] .kernel_sendmsg+0x40/0x64
| [c00000000604b400] [c00000000038cc1c] .xs_send_kvec+0xc8/0x100
| [c00000000604b510] [c00000000038cd10] .xs_sendpages+0xbc/0x2f4
| [c00000000604b5e0] [c00000000038ed38] .xs_udp_send_request+0x60/0x148
| [c00000000604b680] [c00000000038b1b8] .xprt_transmit+0x144/0x27c
| [c00000000604b730] [c00000000038776c] .call_transmit+0x248/0x2b0
| [c00000000604b7d0] [c000000000390a68] .__rpc_execute+0xd8/0x314
| [c00000000604b870] [c000000000390d18] .rpc_execute+0x40/0x5c
| [c00000000604b900] [c000000000387fe8] .rpc_run_task+0x84/0xb0
| [c00000000604b9a0] [c00000000038814c] .rpc_call_sync+0x74/0xc0
| [c00000000604ba70] [c00000000039a568] .rpcb_getport_sync+0x110/0x178
| [c00000000604bb80] [c000000000511118] .root_nfs_getport+0x8c/0xbc
| [c00000000604bc30] [c0000000005112f0] .nfs_root_data+0x1a8/0x328
| [c00000000604bd70] [c0000000004f66a8] .mount_root+0x40/0x150
| [c00000000604be10] [c0000000004f695c] .prepare_namespace+0x1a4/0x1f4
| [c00000000604bea0] [c0000000004f5a48] .kernel_init+0x388/0x3c8
| [c00000000604bf90] [c0000000000229c8] .kernel_thread+0x4c/0x68
| Instruction dump:
| e9230028 e8fe8018 7c000026 54001ffe e9090018 78001f24 7d27002a 38000800 
| 7d2948f8 7d6b482a e92b0058 39290001 <c0000000> 00546e70 f9030020 4bfff775 
                                       ^^^^^^^^  ^^^^^^^^
			     should be f92b0058  b003007e

| <4>---[ end trace c7cf3d9b6c787395 ]---
| <0>Kernel panic - not syncing: Attempted to kill init!
| smp_call_function on cpu 0: other cpus not responding (0)
| 
|    System does not reboot automatically.
|    Please press POWER button.
| 
| <7>eth0: no IPv6 routers present

Findings:
  - Disabling CONFIG_INET fixed the problem.
  - I didn't manage to lock up my PS3 afterwards either.
    But... while typing this, I saw an oops accessing address
    0xf000f000f0007000 somewhere in the networking code, so it looks like some
    corruption is going on after all.
  - Upon closer look, 8 bytes in the instruction dump above are not correct
    and have been overwritten with 0xc000000000546e70, which is the address of
    init_task.

With kind regards,

Geert Uytterhoeven
Software Architect

Sony Network and Software Technology Center Europe
The Corporate Village · Da Vincilaan 7-D1 · B-1935 Zaventem · Belgium

Phone:    +32 (0)2 700 8453
Fax:      +32 (0)2 700 8622
E-mail:   Geert.Uytterhoeven@sonycom.com
Internet: http://www.sony-europe.com/

Sony Network and Software Technology Center Europe
A division of Sony Service Centre (Europe) N.V.
Registered office: Technologielaan 7 · B-1840 Londerzeel · Belgium
VAT BE 0413.825.160 · RPR Brussels
Fortis Bank Zaventem · BIC GEBABEBB08A · IBAN BE39001382358619

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2008-04-24 15:31 UTC | newest]

Thread overview: 24+ messages
2002-08-19 16:50 Memory Corruption Dave Boutcher
2002-08-19 20:01 ` Chris Mason
2002-08-28 21:00   ` [reiserfs-list] " David Boutcher
2002-08-29 13:40     ` Chris Mason
2002-08-28 21:00   ` David Boutcher
  -- strict thread matches above, loose matches on Subject: below --
2008-04-24 15:31 Memory corruption Geert Uytterhoeven
2002-08-29 13:40 Memory Corruption Chris Mason
2002-08-15 20:26 Memory corruption Dave Boutcher
2002-08-15 20:36 ` Andreas Dilger
2001-01-05  8:33 Memory Corruption Ryan Sizemore
1999-06-22  1:39 Memory corruption Ulf Carlsson
1999-06-30  1:01 ` William J. Earl
1999-06-30  2:47   ` Ulf Carlsson
1999-06-30 22:01     ` William J. Earl
1999-07-01  0:23       ` Ralf Baechle
1999-07-01  0:53         ` William J. Earl
1999-07-01 11:25           ` Harald Koerfgen
1999-07-02 22:41           ` Ralf Baechle
1999-07-06 13:05           ` Ralf Baechle
1999-07-07 21:08             ` Harald Koerfgen
1999-07-08  1:51               ` Warner Losh
1999-07-08  3:12                 ` William J. Earl
     [not found]                   ` <37846EE7.EADD9E32@niisi.msk.ru>
1999-07-08 17:56                     ` William J. Earl
1999-07-08 10:39               ` Ralf Baechle
