ext3 / reiserfs data corruption, 2.5-bk

public inbox for linux-kernel@vger.kernel.org

* ext3 / reiserfs data corruption, 2.5-bk
@ 2003-06-09 19:35 Dave Jones
  2003-06-10  8:43 ` Oleg Drokin
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Jones @ 2003-06-09 19:35 UTC (permalink / raw)
  To: Linux Kernel

2.5 Bitkeeper tree as of last 24 hrs. Running a lot
of disk IO stress (multiple fsstress, over 100 fsx instances,
and random sync calling) produced failures on both reiserfs
and ext3.

Tests were done on separate disks, but concurrently.

fsx logs at
http://www.codemonkey.org.uk/cruft/reiserfs.fsxlog
http://www.codemonkey.org.uk/cruft/ext3.fsxlog

		Dave

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: ext3 / reiserfs data corruption, 2.5-bk
  2003-06-09 19:35 ext3 / reiserfs data corruption, 2.5-bk Dave Jones
@ 2003-06-10  8:43 ` Oleg Drokin
  2003-06-10  9:20   ` Dave Jones
  2003-06-10 21:44   ` Nathan Conrad
  0 siblings, 2 replies; 9+ messages in thread
From: Oleg Drokin @ 2003-06-10  8:43 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel

Hello!

On Mon, Jun 09, 2003 at 08:35:55PM +0100, Dave Jones wrote:

> 2.5 Bitkeeper tree as of last 24 hrs. Running a lot
> of disk IO stress (multiple fsstress, over 100 fsx instances,
> and random sync calling) produced failures on both reiserfs
> and ext3.
> Tests were done on separate disks, but concurrently.

Do you have smp or preempt enabled?

Bye,
    Oleg


* Re: ext3 / reiserfs data corruption, 2.5-bk
  2003-06-10  8:43 ` Oleg Drokin
@ 2003-06-10  9:20   ` Dave Jones
  2003-06-10 21:44   ` Nathan Conrad
  1 sibling, 0 replies; 9+ messages in thread
From: Dave Jones @ 2003-06-10  9:20 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: Linux Kernel

On Tue, Jun 10, 2003 at 12:43:23PM +0400, Oleg Drokin wrote:

 > > 2.5 Bitkeeper tree as of last 24 hrs. Running a lot
 > > of disk IO stress (multiple fsstress, over 100 fsx instances,
 > > and random sync calling) produced failures on both reiserfs
 > > and ext3.
 > > Tests were done on separate disks, but concurrently.
 > 
 > Do you have smp or preempt enabled?

# CONFIG_SMP is not set
CONFIG_PREEMPT=y

		Dave
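
For readers unfamiliar with the two config lines above: in a kernel
.config, a commented "# CONFIG_X is not set" line means the option is
disabled, while "CONFIG_X=y" means it is built in. A minimal Python
sketch of reading such a fragment follows. The helper name
parse_kconfig is hypothetical and not part of any kernel tooling.

```python
def parse_kconfig(text):
    """Map CONFIG_* names to their values; disabled options map to None."""
    suffix = " is not set"
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            # "# CONFIG_SMP is not set" -> option explicitly disabled
            options[line[2:-len(suffix)]] = None
        elif line.startswith("CONFIG_") and "=" in line:
            # "CONFIG_PREEMPT=y" -> option enabled with the given value
            name, value = line.split("=", 1)
            options[name] = value
    return options


# The fragment quoted in the message above.
fragment = "# CONFIG_SMP is not set\nCONFIG_PREEMPT=y\n"
opts = parse_kconfig(fragment)
print(opts)  # {'CONFIG_SMP': None, 'CONFIG_PREEMPT': 'y'}
```

So this report comes from a uniprocessor kernel with preemption
enabled.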



* Re: ext3 / reiserfs data corruption, 2.5-bk
  2003-06-10  8:43 ` Oleg Drokin
  2003-06-10  9:20   ` Dave Jones
@ 2003-06-10 21:44   ` Nathan Conrad
  2003-06-10 18:11     ` Bartlomiej Zolnierkiewicz
                       ` (2 more replies)
  1 sibling, 3 replies; 9+ messages in thread
From: Nathan Conrad @ 2003-06-10 21:44 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: Dave Jones, Linux Kernel

I've been noticing a similar problem on my laptop. This may, or may
not be related, but it did start somewhere within the past week (maybe
the IDE taskfile conversion???, to throw out a guess). I wonder if
Dave Jones is using IDE or SCSI. CONFIG_SMP and CONFIG_PREEMPT are
disabled on my machine (Sony Vaio PCG-FXA49 laptop, Athlon4). I'm
compiling the kernel with gcc 3.3 (Debian version).

Anyway, certain directories get locked up on occasion and when I try
to execute 'ls' or read from the directory, the process gets into a
locked up state; ^C does not work to kill the process. The only way to
make a directory "readable" is to restart the machine. I have not
noticed any FS corruption, just the lack of being able to enter the
directory.

 At the same time, a kernel bug will be displayed:


Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
c016781a
*pde = 00000000
Oops: 0000 [#1]
CPU:    0
EIP:    0060:[find_inode_fast+26/96]    Not tainted
EFLAGS: 00010286
EIP is at find_inode_fast+0x1a/0x60
eax: db0355c4   ebx: 0001859f   ecx: c3a69844   edx: 00000000
esi: dfd60c00   edi: dff99340   ebp: dff99340   esp: cc6dde50
ds: 007b   es: 007b   ss: 0068
Process emacs20 (pid: 16508, threadinfo=cc6dc000 task=c6d0adc0)
Stack: c4bca5b8 0001859f 0001859f dfd60c00 c0167d2e dfd60c00 dff99340 0001859f 
       0001859f da191d40 dfd60c00 da191d40 c018e45b dfd60c00 0001859f db666130 
       fffffff4 dca22aac dca22a44 c015cd60 dca22a44 da191d40 00000000 cc6ddf48 
Call Trace:
 [iget_locked+78/160] iget_locked+0x4e/0xa0
 [ext3_lookup+107/208] ext3_lookup+0x6b/0xd0
 [real_lookup+192/240] real_lookup+0xc0/0xf0
 [do_lookup+158/176] do_lookup+0x9e/0xb0
 [link_path_walk+1066/2000] link_path_walk+0x42a/0x7d0
 [__user_walk+73/96] __user_walk+0x49/0x60
 [vfs_stat+31/96] vfs_stat+0x1f/0x60
 [sys_stat64+27/64] sys_stat64+0x1b/0x40
 [syscall_call+7/11] syscall_call+0x7/0xb

Code: 0f 18 02 90 39 59 18 89 c8 74 0f 85 d2 89 d1 75 ed 31 c0 83 
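
A note on reading the dump above: each symbol is printed twice, once
in brackets as function+offset/size in decimal and once in hex, so
[find_inode_fast+26/96] and find_inode_fast+0x1a/0x60 name the same
location (26 bytes into a 96-byte function). A small Python sketch
that converts one form to the other is below. The names SYMBOL_RE and
to_hex_form are hypothetical helpers written for this illustration.

```python
import re

# Matches the bracketed decimal form, e.g. "[find_inode_fast+26/96]".
SYMBOL_RE = re.compile(r"\[(\w+)\+(\d+)/(\d+)\]")


def to_hex_form(bracketed):
    """Convert '[name+offset/size]' (decimal) to 'name+0xoff/0xsize'."""
    m = SYMBOL_RE.search(bracketed)
    if m is None:
        raise ValueError("no bracketed symbol found: %r" % bracketed)
    name = m.group(1)
    offset = int(m.group(2))
    size = int(m.group(3))
    return f"{name}+{offset:#x}/{size:#x}"


print(to_hex_form("[find_inode_fast+26/96]"))  # find_inode_fast+0x1a/0x60
print(to_hex_form("[iget_locked+78/160]"))     # iget_locked+0x4e/0xa0
```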


On Tue, Jun 10, 2003 at 12:43:23PM +0400, Oleg Drokin wrote:
> Hello!
> 
> On Mon, Jun 09, 2003 at 08:35:55PM +0100, Dave Jones wrote:
> 
> > 2.5 Bitkeeper tree as of last 24 hrs. Running a lot
> > of disk IO stress (multiple fsstress, over 100 fsx instances,
> > and random sync calling) produced failures on both reiserfs
> > and ext3.
> > Tests were done on separate disks, but concurrently.
> 
> Do you have smp or preempt enabled?
> 
> Bye,
>     Oleg

-Nathan Conrad

-- 
Nathan J. Conrad
GPG: F4FC 7E25 9308 ECE1 735C  0798 CE86 DA45 9170 3112


* Re: ext3 / reiserfs data corruption, 2.5-bk
  2003-06-10 21:44   ` Nathan Conrad
@ 2003-06-10 18:11     ` Bartlomiej Zolnierkiewicz
  2003-06-10 22:18       ` Nathan Conrad
  2003-06-10 20:59     ` Andrew Morton
  2003-06-10 22:49     ` ext3 / reiserfs data corruption, 2.5-bk Dave Jones
  2 siblings, 1 reply; 9+ messages in thread
From: Bartlomiej Zolnierkiewicz @ 2003-06-10 18:11 UTC (permalink / raw)
  To: Nathan Conrad; +Cc: Oleg Drokin, Dave Jones, Linux Kernel


On Tue, 10 Jun 2003, Nathan Conrad wrote:

> I've been noticing a similar problem on my laptop. This may, or may
> not be related, but it did start somewhere within the past week (maybe
> the IDE taskfile conversion???, to throw out a guess). I wonder if

wrt taskfile conversion, if you are using DMA on your IDE disks,
there shouldn't be any change in behaviour.

I will prepare a patch adding old crap and making it selectable
(default will be taskfile; if you run into problems you can check
with the old code) to ease spotting possible taskfile problems
and allow quick judging - taskfile guilty/not guilty.

--
Bartlomiej

> Dave Jones is using IDE or SCSI. CONFIG_SMP and CONFIG_PREEMPT are
> disabled on my machine (Sony Vaio PCG-FXA49 laptop, Athlon4). I'm
> compiling the kernel with gcc 3.3 (Debian version).
>
> Anyway, certain directories get locked up on occasion and when I try
> to execute 'ls' or read from the directory, the process gets into a
> locked up state; ^C does not work to kill the process. The only way to
> make a directory "readable" is to restart the machine. I have not
> noticed any FS corruption, just the lack of being able to enter the
> directory.
>
>  At the same time, a kernel bug will be displayed:

<...>



* Re: ext3 / reiserfs data corruption, 2.5-bk
  2003-06-10 18:11     ` Bartlomiej Zolnierkiewicz
@ 2003-06-10 22:18       ` Nathan Conrad
  0 siblings, 0 replies; 9+ messages in thread
From: Nathan Conrad @ 2003-06-10 22:18 UTC (permalink / raw)
  To: Bartlomiej Zolnierkiewicz; +Cc: Oleg Drokin, Dave Jones, Linux Kernel

Oh, ok. I am using DMA on my drives. The problem with this bug is that
it is fairly hard to observe; I've only seen it about once every other
day. I should have also pointed out that I am using ext3.

I thought that it might be taskfile stuff because that was the major
change in the kernel right before I started to notice these
problems. There is likely some other source of the problems, since you
say that there should be no change in behaviour.

-Nathan

On Tue, Jun 10, 2003 at 08:11:22PM +0200, Bartlomiej Zolnierkiewicz wrote:
> 
> On Tue, 10 Jun 2003, Nathan Conrad wrote:
> 
> > I've been noticing a similar problem on my laptop. This may, or may
> > not be related, but it did start somewhere within the past week (maybe
> > the IDE taskfile conversion???, to throw out a guess). I wonder if
> 
> wrt taskfile conversion, if you are using DMA on your IDE disks,
> there shouldn't be any change in behaviour.
> 
> I will prepare a patch adding old crap and making it selectable
> (default will be taskfile; if you run into problems you can check
> with the old code) to ease spotting possible taskfile problems
> and allow quick judging - taskfile guilty/not guilty.
> 
> --
> Bartlomiej
> 
> > Dave Jones is using IDE or SCSI. CONFIG_SMP and CONFIG_PREEMPT are
> > disabled on my machine (Sony Vaio PCG-FXA49 laptop, Athlon4). I'm
> > compiling the kernel with gcc 3.3 (Debian version).
> >
> > Anyway, certain directories get locked up on occasion and when I try
> > to execute 'ls' or read from the directory, the process gets into a
> > locked up state; ^C does not work to kill the process. The only way to
> > make a directory "readable" is to restart the machine. I have not
> > noticed any FS corruption, just the lack of being able to enter the
> > directory.
> >
> >  At the same time, a kernel bug will be displayed:
> 
> <...>



* Re: ext3 / reiserfs data corruption, 2.5-bk
  2003-06-10 21:44   ` Nathan Conrad
  2003-06-10 18:11     ` Bartlomiej Zolnierkiewicz
@ 2003-06-10 20:59     ` Andrew Morton
  2003-06-12  5:20       ` ext3 / reiserfs data corruption, 2.5-bk; NULL pointer dereference bug Nathan Conrad
  2003-06-10 22:49     ` ext3 / reiserfs data corruption, 2.5-bk Dave Jones
  2 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2003-06-10 20:59 UTC (permalink / raw)
  To: Nathan Conrad; +Cc: green, davej, linux-kernel

Nathan Conrad <conrad@bungled.net> wrote:
>
> Unable to handle kernel NULL pointer dereference at virtual address 00000000
>  printing eip:
> c016781a
> *pde = 00000000
> Oops: 0000 [#1]
> CPU:    0
> EIP:    0060:[find_inode_fast+26/96]    Not tainted

Something scribbled on your inode hash chains.  Please make sure that
you're building the kernel with all the memory debug options enabled, and
run memtest86 on that machine for 12 hours or so.



* Re: ext3 / reiserfs data corruption, 2.5-bk; NULL pointer dereference bug
  2003-06-10 20:59     ` Andrew Morton
@ 2003-06-12  5:20       ` Nathan Conrad
  0 siblings, 0 replies; 9+ messages in thread
From: Nathan Conrad @ 2003-06-12  5:20 UTC (permalink / raw)
  To: Andrew Morton; +Cc: green, davej, linux-kernel

I just saw another one of these NULL pointer dereference oopses on my
laptop:

Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
c01665f3
*pde = 00000000
Oops: 0000 [#1]
CPU:    0
EIP:    0060:[__d_lookup+99/256]    Not tainted
EFLAGS: 00210282
EIP is at __d_lookup+0x63/0x100
eax: 00000000   ebx: c06ef980   ecx: 00000010   edx: dfe80000
esi: dfe8da40   edi: 00000000   ebp: df85be70   esp: db047ec8
ds: 007b   es: 007b   ss: 0068
Process gcc (pid: 4738, threadinfo=db046000 task=c22198c0)
Stack: dcfcc014 c012a225 00000000 00000000 dfe8da40 db047f48 00000000 dcfcc001 
       0029e101 00000003 dcfcc001 db047f90 dfff4fc0 db047f3c c015cf80 dfd50e00 
       db047f44 c015cb64 dcfcc001 dcfcc005 db047f3c db047f44 c015d129 db047f90 
Call Trace:
[in_group_p+37/48] in_group_p+0x25/0x30
[do_lookup+48/176] do_lookup+0x30/0xb0
[permission+84/112] permission+0x54/0x70
[link_path_walk+297/2000] link_path_walk+0x129/0x7d0
[__user_walk+73/96] __user_walk+0x49/0x60
[sys_access+129/320] sys_access+0x81/0x140
[syscall_call+7/11] syscall_call+0x7/0xb

Code: 0f 18 00 90 8b 74 24 10 8d 5d 90 39 73 78 75 17 8b 7b 58 89 

I ran memtest86 for about 14 hours and it passed all of its tests. I
enabled the memory debugging options (under the kernel hacking
section) and I did not notice any errors displayed by it in my syslog.

I'm not sure what else to try... The backtrace is significantly
different than the last one...
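
One way to compare the two backtraces in this thread is to list the
function names that appear in both. The Python sketch below does only
that; the trace text is copied from the two oopses posted here, and
FRAME_RE and frame_names are hypothetical helper names. It compares
names only and does not by itself point at a cause.

```python
import re

# Captures the function name from the hex form of each frame,
# e.g. "] do_lookup+0x9e/0xb0" -> "do_lookup".
FRAME_RE = re.compile(r"\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def frame_names(trace_text):
    """Return the function names of a Call Trace, in printed order."""
    return [m.group(1) for m in FRAME_RE.finditer(trace_text)]


FIRST = """
 [iget_locked+78/160] iget_locked+0x4e/0xa0
 [ext3_lookup+107/208] ext3_lookup+0x6b/0xd0
 [real_lookup+192/240] real_lookup+0xc0/0xf0
 [do_lookup+158/176] do_lookup+0x9e/0xb0
 [link_path_walk+1066/2000] link_path_walk+0x42a/0x7d0
 [__user_walk+73/96] __user_walk+0x49/0x60
 [vfs_stat+31/96] vfs_stat+0x1f/0x60
 [sys_stat64+27/64] sys_stat64+0x1b/0x40
 [syscall_call+7/11] syscall_call+0x7/0xb
"""

SECOND = """
[in_group_p+37/48] in_group_p+0x25/0x30
[do_lookup+48/176] do_lookup+0x30/0xb0
[permission+84/112] permission+0x54/0x70
[link_path_walk+297/2000] link_path_walk+0x129/0x7d0
[__user_walk+73/96] __user_walk+0x49/0x60
[sys_access+129/320] sys_access+0x81/0x140
[syscall_call+7/11] syscall_call+0x7/0xb
"""

second_names = set(frame_names(SECOND))
shared = [name for name in frame_names(FIRST) if name in second_names]
print(shared)
# ['do_lookup', 'link_path_walk', '__user_walk', 'syscall_call']
```

Both traces pass through do_lookup, link_path_walk and __user_walk,
although the faulting functions (find_inode_fast and __d_lookup) and
the system calls (stat64 and access) differ.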

On Tue, Jun 10, 2003 at 01:59:35PM -0700, Andrew Morton wrote:
> Nathan Conrad <conrad@bungled.net> wrote:
> >
> > Unable to handle kernel NULL pointer dereference at virtual address 00000000
> >  printing eip:
> > c016781a
> > *pde = 00000000
> > Oops: 0000 [#1]
> > CPU:    0
> > EIP:    0060:[find_inode_fast+26/96]    Not tainted
> 
> Something scribbled on your inode hash chains.  Please make sure that
> you're building the kernel with all the memory debug options enabled, and
> run memtest86 on that machine for 12 hours or so.



* Re: ext3 / reiserfs data corruption, 2.5-bk
  2003-06-10 21:44   ` Nathan Conrad
  2003-06-10 18:11     ` Bartlomiej Zolnierkiewicz
  2003-06-10 20:59     ` Andrew Morton
@ 2003-06-10 22:49     ` Dave Jones
  2 siblings, 0 replies; 9+ messages in thread
From: Dave Jones @ 2003-06-10 22:49 UTC (permalink / raw)
  To: Nathan Conrad; +Cc: Oleg Drokin, Linux Kernel

On Tue, Jun 10, 2003 at 05:44:36PM -0400, Nathan Conrad wrote:
 > I've been noticing a similar problem on my laptop. This may, or may
 > not be related, but it did start somewhere within the past week (maybe
 > the IDE taskfile conversion???, to throw out a guess). I wonder if
 > Dave Jones is using IDE or SCSI.

IDE. I'm too cheap to buy SCSI.

		Dave



Thread overview: 9+ messages
2003-06-09 19:35 ext3 / reiserfs data corruption, 2.5-bk Dave Jones
2003-06-10  8:43 ` Oleg Drokin
2003-06-10  9:20   ` Dave Jones
2003-06-10 21:44   ` Nathan Conrad
2003-06-10 18:11     ` Bartlomiej Zolnierkiewicz
2003-06-10 22:18       ` Nathan Conrad
2003-06-10 20:59     ` Andrew Morton
2003-06-12  5:20       ` ext3 / reiserfs data corruption, 2.5-bk; NULL pointer dereference bug Nathan Conrad
2003-06-10 22:49     ` ext3 / reiserfs data corruption, 2.5-bk Dave Jones
