* Severe, huge data corruption with softraid
From: Michael Tokarev @ 2005-03-02 23:23 UTC
To: linux-raid
Too bad I can't diagnose the problem precisely, but it is here
somewhere, and it is (barely) reproducible.
I'm doing a lot of experiments right now with various raid options
and read/write speeds. And three times now, the whole system has gone
boom during the experiments: something writes into random places on
all the disks, including boot sectors, partition tables and whatnot,
so obviously every filesystem out there becomes hopelessly corrupt.
It seems the problem is due to an integer overflow somewhere in the
raid (very probably raid5) or ext3fs code, as writes start going to
the beginning of all the disks instead of to the raid partitions being
tested. It *may* be related to direct I/O (O_DIRECT) into a file on
an ext3 filesystem sitting on top of a softraid5 array. It may also
be related to the raid10 code, but that is less likely.
Here's the scenario.
I have 7 SCSI disks, sda..sdg, 36GB each.
On each drive there's a 3GB partition at the end (sdX10)
where I'm testing stuff.
I tried to create various raid arrays out of those sdX10 partitions,
including raid5 (various chunk sizes), raid1+raid0 and raid10.
On top of the raid array I also tried creating an ext3 fs, and did
various read/write tests against both the bare md device (without the
filesystem) and a file on the filesystem.
The tests are just sequential reads and writes with various I/O sizes
(8k, 16k, 32k, ..., 1m) and various O_DIRECT/O_SYNC/fsync() combinations.
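(A simplified sketch of the kind of sequential direct-write test
involved - this is not the actual test program; the device path,
block size and write count below are just placeholders:)

/* Simplified sketch of a sequential O_DIRECT write test - not the
 * actual test program; device path, block size and write count are
 * placeholders.
 */
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *dev = (argc > 1) ? argv[1] : "/dev/md0";
    size_t bs = 64 * 1024;      /* one of the tested sizes: 8k .. 1m */
    long count = 1024;          /* number of blocks to write */
    void *buf;

    /* O_DIRECT needs a suitably aligned buffer; 4096 is safe here. */
    if (posix_memalign(&buf, 4096, bs) != 0) {
        perror("posix_memalign");
        return 1;
    }
    memset(buf, 0x55, bs);

    int fd = open(dev, O_WRONLY | O_DIRECT);   /* some runs add O_SYNC */
    if (fd < 0) {
        perror(dev);
        return 1;
    }

    for (long i = 0; i < count; i++) {
        ssize_t n = write(fd, buf, bs);
        if (n != (ssize_t)bs) {
            perror("write");
            break;
        }
    }

    fsync(fd);                  /* other runs rely on fsync() instead */
    close(fd);
    free(buf);
    return 0;
}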
Of course I created/stopped raid arrays (all on the same sdX10
partitions), created, mounted and unmounted filesystems on those
arrays, and did a lot of reading and writing. I'm sure I didn't
access other devices during all this testing (like writing to
/dev/sdX instead of /dev/sdX10), and I did not write to a device
while a filesystem was mounted on it. And yes, my /dev/ major/minor
numbers are correct (just verified to be sure).
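(A trivial sketch of one way to check that - the device path is just
a placeholder: print a node's major/minor and compare them against
/proc/partitions:)

/* Print a device node's major/minor for comparison against
 * /proc/partitions.  Device path is a placeholder. */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>      /* major(), minor() */
#include <sys/types.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "/dev/sda10";
    struct stat st;

    if (stat(path, &st) != 0) {
        perror(path);
        return 1;
    }
    printf("%s: major %u minor %u\n", path,
           (unsigned)major(st.st_rdev), (unsigned)minor(st.st_rdev));
    return 0;
}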
The symptom is simple: at some point the partition table on /dev/sdX
becomes corrupt (either the primary one or the extended one, which
sits about 1.2GB from the start of each disk), together with a lot of
other data, mostly at the beginning of the disks -- on all but one or
two of the disks involved in the testing.
We lost the system this way after the first series of tests, and
during the re-install (as there was no data left anyway) I decided to
do some more testing, and hit the same problem again, and (after
restoring the partition tables) yet again.
All my deliberate attempts to reproduce it have failed so far, but
when I didn't check the partition tables after each operation, it
happened again after yet another series of tests.
One note: every time before it "crashed", I had tried to create/use
a raid5 array out of 3, 4 or 5 drives with a chunk size of 4KB (each
partition is 3GB), and -- if I recall correctly -- experimented with
direct writes to the filesystem created on top of the array. Maybe
it dislikes a chunk size this small...
It's now 02:18 here, the middle of the night, and I'm still in the
office -- I have to reinstall the server by morning so our users will
have something to work with, so I have very limited time for more
testing. Any quick suggestions about what/where to look at right now
are welcome...
BTW, the hardware is good -- drives, memory, mobo and CPUs.
This happened on either 2.6.10 or 2.6.9 the first time; now it is
running 2.6.9.
/mjt
* Re: Severe, huge data corruption with softraid
From: Michael Tokarev @ 2005-03-02 23:57 UTC
To: linux-raid
Michael Tokarev wrote:
[data corruption.. snip]
And finally I managed to get an oops.
Created a fresh raid5 array out of 4 partitions, chunk size = 4KB.
Created an ext3fs on it.
Tested write speed (direct-io) - it was terrible, about 6MB/sec for
64KB blocks - very unusual.
Unmounted the fs.
Did a direct-write test against the md device, and at the same time
did an `rmmod raid0' - raid0 was unused in my config at that point --
not sure if that's relevant or not.
And I got a SIGSEGV in my test program, and the following oops:
md: raid0 personality unregistered
Unable to handle kernel paging request at virtual address f8924690
printing eip:
f8924690
*pde = 02127067
*pte = 00000000
Oops: 0000 [#1]
SMP
Modules linked in: raid10 nfsd exportfs raid5 xor nfs lockd sunrpc 8250 serial_core w83627hf i2c_sensor
i2c_isa i2c_core e1000 genrtc ext3 jbd mbcache raid1 sd_mod md aic79xx scsi_mod
CPU: 1
EIP: 0060:[<f8924690>] Not tainted VLI
EFLAGS: 00010286 (2.6.9-i686smp-0)
EIP is at 0xf8924690
eax: ecd04028 ebx: c99ead40 ecx: c21dc380 edx: c99ead40
esi: ecd04028 edi: f8924690 ebp: c21dc380 esp: f1d39cac
ds: 007b es: 007b ss: 0068
Process dio (pid: 21941, threadinfo=f1d39000 task=f7d40890)
Stack: c015b5dd c99ead40 c10063a0 00001000 00000000 c015b64c 00001000 00000000
f7d23800 00000000 c01778f2 00000000 f7d23800 c017798d f7d23800 c10063a0
c0177a4e 00000000 00000001 00000000 f7d2384c f7d23800 c0177e78 00001000
Call Trace:
[<c015b5dd>] __bio_add_page+0x13d/0x180
[<c015b64c>] bio_add_page+0x2c/0x40
[<c01778f2>] dio_bio_add_page+0x22/0x70
[<c017798d>] dio_send_cur_page+0x4d/0xa0
[<c0177a4e>] submit_page_section+0x6e/0x140
[<c0177e78>] do_direct_IO+0x288/0x380
[<c0178164>] direct_io_worker+0x1f4/0x520
[<c017869d>] __blockdev_direct_IO+0x20d/0x308
[<c015d770>] blkdev_get_blocks+0x0/0x70
[<c015d83f>] blkdev_direct_IO+0x5f/0x80
[<c015d770>] blkdev_get_blocks+0x0/0x70
[<c013c304>] generic_file_direct_IO+0x74/0x90
[<c013b352>] generic_file_direct_write+0x62/0x170
[<c016f7cb>] inode_update_time+0xbb/0xc0
[<c013bcfe>] generic_file_aio_write_nolock+0x2ce/0x490
[<c013bf51>] generic_file_write_nolock+0x91/0xc0
[<c011ae9e>] scheduler_tick+0x16e/0x470
[<c0115135>] smp_apic_timer_interrupt+0x85/0xf0
[<c011c850>] autoremove_wake_function+0x0/0x50
[<c015e6c0>] blkdev_file_write+0x0/0x30
[<c015e6e0>] blkdev_file_write+0x20/0x30
[<c0156770>] vfs_write+0xb0/0x110
[<c0156897>] sys_write+0x47/0x80
[<c010603f>] syscall_call+0x7/0xb
Code: Bad EIP value.
/mjt
* Re: Severe, huge data corruption with softraid
From: berk walker @ 2005-03-03 0:10 UTC
To: Michael Tokarev; +Cc: linux-raid
Just a thought: have you tried swapping power supplies and
checking/improving the system's earth ground?
b-
Michael Tokarev wrote:
> [original message quoted in full - snipped]
* Re: Severe, huge data corruption with softraid
From: Peter T. Breuer @ 2005-03-03 0:46 UTC
To: linux-raid
Michael Tokarev <mjt@tls.msk.ru> wrote:
> And finally I managed to get an OOPs.
What CPU? SMP? How many?
Which kernel? Is it preemptive?
> Created fresh raid5 array out of 4 partitions,
> chunk size = 4kb.
> Created ext3fs on it.
> Tested write speed (direct-io) - it was terrible,
> about 6MB/sec for 64KB blocks - it's very unusual.
> Umounted the fs.
> Did a direct-write test against the md device.
> And at the same time, did an `rmmod raid0' -
That may be a race - I get horrible oopses from time to time on module
removal in the 2.6 series, with quite a few different modules, and it
looks even more like a race as soon as you say "and at the same time".
> unused in my config at that time. -- not sure
> if it's relevant or not.
> And got a "sigsegv" in my program, and the
> following oops:
>
> md: raid0 personality unregistered
Well, the rmmod worked, and then the other (write) process oopsed.
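(As a purely illustrative user-space analogy of that kind of unload
race -- the library name "libpersonality.so" and symbol "do_work" are
made up; the kernel equivalent would be module text going away while
an I/O path still holds a pointer into it:)

/* Illustration only: call through a function pointer after the code
 * behind it has been unloaded.  Build with: cc race.c -ldl */
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    /* Load some shared object and grab a function pointer out of it. */
    void *handle = dlopen("./libpersonality.so", RTLD_NOW);
    if (!handle) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return 1;
    }

    void (*fn)(void) = (void (*)(void))dlsym(handle, "do_work");
    if (!fn) {
        fprintf(stderr, "dlsym: %s\n", dlerror());
        return 1;
    }

    /* "rmmod": unload the object while we still hold a pointer into it. */
    dlclose(handle);

    /* Calling through the stale pointer now faults at an address that
     * used to belong to the unloaded object. */
    fn();

    return 0;
}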
> Unable to handle kernel paging request at virtual address f8924690
That address is bogus. Looks more like a negative integer. I suppose
RAM corruption is a possibility too.
> printing eip:
> f8924690
> *pde = 02127067
> *pte = 00000000
> Oops: 0000 [#1]
> SMP
> Modules linked in: raid10 nfsd exportfs raid5 xor nfs lockd sunrpc 8250 serial_core w83627hf i2c_sensor
> i2c_isa i2c_core e1000 genrtc ext3 jbd mbcache raid1 sd_mod md aic79xx scsi_mod
> CPU: 1
> EIP: 0060:[<f8924690>] Not tainted VLI
> EFLAGS: 00010286 (2.6.9-i686smp-0)
> EIP is at 0xf8924690
> eax: ecd04028 ebx: c99ead40 ecx: c21dc380 edx: c99ead40
> esi: ecd04028 edi: f8924690 ebp: c21dc380 esp: f1d39cac
> ds: 007b es: 007b ss: 0068
> Process dio (pid: 21941, threadinfo=f1d39000 task=f7d40890)
> Stack: c015b5dd c99ead40 c10063a0 00001000 00000000 c015b64c 00001000 00000000
> f7d23800 00000000 c01778f2 00000000 f7d23800 c017798d f7d23800 c10063a0
> c0177a4e 00000000 00000001 00000000 f7d2384c f7d23800 c0177e78 00001000
Code?
> Call Trace:
> [<c015b5dd>] __bio_add_page+0x13d/0x180
3/4 of the way through.
> [<c015b64c>] bio_add_page+0x2c/0x40
> [<c01778f2>] dio_bio_add_page+0x22/0x70
> [<c017798d>] dio_send_cur_page+0x4d/0xa0
> [<c0177a4e>] submit_page_section+0x6e/0x140
> [<c0177e78>] do_direct_IO+0x288/0x380
That looks like the relevant entry.
> [<c0178164>] direct_io_worker+0x1f4/0x520
> [<c017869d>] __blockdev_direct_IO+0x20d/0x308
> [<c015d770>] blkdev_get_blocks+0x0/0x70
> [<c015d83f>] blkdev_direct_IO+0x5f/0x80
> [<c015d770>] blkdev_get_blocks+0x0/0x70
> [<c013c304>] generic_file_direct_IO+0x74/0x90
> [<c013b352>] generic_file_direct_write+0x62/0x170
> [<c016f7cb>] inode_update_time+0xbb/0xc0
> [<c013bcfe>] generic_file_aio_write_nolock+0x2ce/0x490
> [<c013bf51>] generic_file_write_nolock+0x91/0xc0
> [<c011ae9e>] scheduler_tick+0x16e/0x470
> [<c0115135>] smp_apic_timer_interrupt+0x85/0xf0
> [<c011c850>] autoremove_wake_function+0x0/0x50
> [<c015e6c0>] blkdev_file_write+0x0/0x30
> [<c015e6e0>] blkdev_file_write+0x20/0x30
> [<c0156770>] vfs_write+0xb0/0x110
> [<c0156897>] sys_write+0x47/0x80
> [<c010603f>] syscall_call+0x7/0xb
> Code: Bad EIP value.
More info needed.
Peter
* Re: Severe, huge data corruption with softraid
From: Michael Tokarev @ 2005-03-03 1:24 UTC
To: linux-raid
Peter T. Breuer wrote:
> Michael Tokarev <mjt@tls.msk.ru> wrote:
>
>>And finally I managed to get an OOPs.
>
> What CPU? SMP? How many?
>
> Which kernel? Is it preemptive?
The CPU is 2x Xeon 2.4GHz with HT enabled (so 4 logical CPUs).
The kernel - the one which oopsed - is 2.6.9, patched for various
trivial problems (like the raid10 fixes that went into 2.6.10).
It is NOT preemptive (I've had enough games with preempt kernels
on servers).
I'm trying 2.6.11 (just released and built) now. No corruption
so far, but it has only been running for about an hour.
>>Created fresh raid5 array out of 4 partitions,
>>chunk size = 4kb.
>>Created ext3fs on it.
>>Tested write speed (direct-io) - it was terrible,
>> about 6MB/sec for 64KB blocks - it's very unusual.
>>Umounted the fs.
>>Did a direct-write test against the md device.
>>And at the same time, did an `rmmod raid0' -
>
> That may be a race - I get horrible oopses from time to time on module
> removal in the 2.6 series, with quite a few different modules, and it
> looks even more like a race as soon as you say "and at the same time".
>
>>unused in my config at that time. -- not sure
>>if it's relevant or not.
>>And got a "sigsegv" in my program, and the
>>following oops:
>>
>>md: raid0 personality unregistered
>
> Well, the rmmod worked, and then the other (write) process oopsed.
Yes, rmmod went ok.
>>Unable to handle kernel paging request at virtual address f8924690
>
> That address is bogus. Looks more like a negative integer. I suppose
> RAM corruption is a possibility too.
Ram corruption in what sense? Faulty DIMM?
Well, it is indeed possible - everything is possible. This is 2GB
of ECC memory (2x512MB and 4x256MB modules in 6 banks) from Kingston,
ValueRAM I think (the expensive kind, that is ;)
The machine is on a UPS, and the power is very stable here too.
>> printing eip:
>>f8924690
>>*pde = 02127067
>>*pte = 00000000
>>Oops: 0000 [#1]
>>SMP
>>Modules linked in: raid10 nfsd exportfs raid5 xor nfs lockd sunrpc 8250 serial_core w83627hf i2c_sensor
>>i2c_isa i2c_core e1000 genrtc ext3 jbd mbcache raid1 sd_mod md aic79xx scsi_mod
>>CPU: 1
>>EIP: 0060:[<f8924690>] Not tainted VLI
>>EFLAGS: 00010286 (2.6.9-i686smp-0)
>>EIP is at 0xf8924690
>>eax: ecd04028 ebx: c99ead40 ecx: c21dc380 edx: c99ead40
>>esi: ecd04028 edi: f8924690 ebp: c21dc380 esp: f1d39cac
>>ds: 007b es: 007b ss: 0068
>>Process dio (pid: 21941, threadinfo=f1d39000 task=f7d40890)
>>Stack: c015b5dd c99ead40 c10063a0 00001000 00000000 c015b64c 00001000 00000000
>> f7d23800 00000000 c01778f2 00000000 f7d23800 c017798d f7d23800 c10063a0
>> c0177a4e 00000000 00000001 00000000 f7d2384c f7d23800 c0177e78 00001000
>
> Code?
Hmm?
I'm terribly sorry, but I've never gone that deep. I just don't know
what you mean here. Well, maybe I know what you meant, but I don't
know how to convert that series of hex numbers into something
sensible... ;)
>>Call Trace:
>> [<c015b5dd>] __bio_add_page+0x13d/0x180
>
> 3/4 of the way through.
>
>> [<c015b64c>] bio_add_page+0x2c/0x40
>> [<c01778f2>] dio_bio_add_page+0x22/0x70
>> [<c017798d>] dio_send_cur_page+0x4d/0xa0
>> [<c0177a4e>] submit_page_section+0x6e/0x140
>> [<c0177e78>] do_direct_IO+0x288/0x380
>
> That looks the relevant entry.
And what do I do with it?
>> [<c0178164>] direct_io_worker+0x1f4/0x520
>> [<c017869d>] __blockdev_direct_IO+0x20d/0x308
>> [<c015d770>] blkdev_get_blocks+0x0/0x70
>> [<c015d83f>] blkdev_direct_IO+0x5f/0x80
>> [<c015d770>] blkdev_get_blocks+0x0/0x70
>> [<c013c304>] generic_file_direct_IO+0x74/0x90
>> [<c013b352>] generic_file_direct_write+0x62/0x170
>> [<c016f7cb>] inode_update_time+0xbb/0xc0
>> [<c013bcfe>] generic_file_aio_write_nolock+0x2ce/0x490
>> [<c013bf51>] generic_file_write_nolock+0x91/0xc0
>> [<c011ae9e>] scheduler_tick+0x16e/0x470
>> [<c0115135>] smp_apic_timer_interrupt+0x85/0xf0
>> [<c011c850>] autoremove_wake_function+0x0/0x50
>> [<c015e6c0>] blkdev_file_write+0x0/0x30
>> [<c015e6e0>] blkdev_file_write+0x20/0x30
>> [<c0156770>] vfs_write+0xb0/0x110
>> [<c0156897>] sys_write+0x47/0x80
>> [<c010603f>] syscall_call+0x7/0xb
>>Code: Bad EIP value.
>
> More info needed.
The question is: how? ;)
Thanks.
/mjt
* Re: Severe, huge data corruption with softraid
From: Peter T. Breuer @ 2005-03-03 3:01 UTC
To: linux-raid
Michael Tokarev <mjt@tls.msk.ru> wrote:
> >>Unable to handle kernel paging request at virtual address f8924690
> >
> > That address is bogus. Looks more like a negative integer. I suppose
> > RAM corruption is a possibility too.
>
> Ram corruption in what sense? Faulty DIMM?
Anything.
> Well, it is indeed possible - everything is possible. This is 2GB
> of ECC memory (2x512MB and 4x256MB modules in 6 banks) from Kingston,
> ValueRAM I think (the expensive kind, that is ;)
In that case the corruption, if that's what it is, will originate in
an overheated CPU, bus, or bridge, rather than in the RAM itself. Or
the disk, disk controller, etc.
> The machine is on UPS, and power is very stable here too.
>
> >> printing eip:
> >>f8924690
> >>*pde = 02127067
> >>*pte = 00000000
> >>Oops: 0000 [#1]
> >>SMP
> >>Modules linked in: raid10 nfsd exportfs raid5 xor nfs lockd sunrpc 8250 serial_core w83627hf i2c_sensor
> >>i2c_isa i2c_core e1000 genrtc ext3 jbd mbcache raid1 sd_mod md aic79xx scsi_mod
> >>CPU: 1
> >>EIP: 0060:[<f8924690>] Not tainted VLI
> >>EFLAGS: 00010286 (2.6.9-i686smp-0)
> >>EIP is at 0xf8924690
> >>eax: ecd04028 ebx: c99ead40 ecx: c21dc380 edx: c99ead40
> >>esi: ecd04028 edi: f8924690 ebp: c21dc380 esp: f1d39cac
> >>ds: 007b es: 007b ss: 0068
> >>Process dio (pid: 21941, threadinfo=f1d39000 task=f7d40890)
> >>Stack: c015b5dd c99ead40 c10063a0 00001000 00000000 c015b64c 00001000 00000000
> >> f7d23800 00000000 c01778f2 00000000 f7d23800 c017798d f7d23800 c10063a0
> >> c0177a4e 00000000 00000001 00000000 f7d2384c f7d23800 c0177e78 00001000
> >
> > Code?
>
> Hmm?
You didn't quote the code listing from the oops printout.
> I'm terribly sorry, but I've never gone that deep. I just don't know
> what you mean here. Well, maybe I know what you meant, but I don't
> know how to convert that series of hex numbers into something
> sensible... ;)
It's not THOSE I was referring to, but the others - the code bytes at
the end of the oops.
> >>Call Trace:
> >> [<c015b5dd>] __bio_add_page+0x13d/0x180
> >
> > 3/4 of the way through.
> >
> >> [<c015b64c>] bio_add_page+0x2c/0x40
> >> [<c01778f2>] dio_bio_add_page+0x22/0x70
> >> [<c017798d>] dio_send_cur_page+0x4d/0xa0
> >> [<c0177a4e>] submit_page_section+0x6e/0x140
> >> [<c0177e78>] do_direct_IO+0x288/0x380
> >
> > That looks the relevant entry.
>
> And what do I do with it?
Nothing. Look at the code for a clue, maybe. Anyway, it has nothing
to do with RAID.
> >> [<c0178164>] direct_io_worker+0x1f4/0x520
> >> [<c017869d>] __blockdev_direct_IO+0x20d/0x308
> >> [<c015d770>] blkdev_get_blocks+0x0/0x70
> >> [<c015d83f>] blkdev_direct_IO+0x5f/0x80
> >> [<c015d770>] blkdev_get_blocks+0x0/0x70
> >> [<c013c304>] generic_file_direct_IO+0x74/0x90
> >> [<c013b352>] generic_file_direct_write+0x62/0x170
> >> [<c016f7cb>] inode_update_time+0xbb/0xc0
> >> [<c013bcfe>] generic_file_aio_write_nolock+0x2ce/0x490
> >> [<c013bf51>] generic_file_write_nolock+0x91/0xc0
> >> [<c011ae9e>] scheduler_tick+0x16e/0x470
> >> [<c0115135>] smp_apic_timer_interrupt+0x85/0xf0
> >> [<c011c850>] autoremove_wake_function+0x0/0x50
> >> [<c015e6c0>] blkdev_file_write+0x0/0x30
> >> [<c015e6e0>] blkdev_file_write+0x20/0x30
> >> [<c0156770>] vfs_write+0xb0/0x110
> >> [<c0156897>] sys_write+0x47/0x80
> >> [<c010603f>] syscall_call+0x7/0xb
> >>Code: Bad EIP value.
> >
> > More info needed.
>
> The question is: how? ;)
There should be a bit of the oops where it shows you the code fragment.
But it doesn't look very informative. It's a straight DIO write that
oopsed.
Peter
* Re: Severe, huge data corruption with softraid
From: Gordon Henderson @ 2005-03-03 9:00 UTC
To: linux-raid
When I was building a server recently, I ran into a filesystem
corruption issue with ext3, although I didn't have the time to do
anything about it (or even verify that it wasn't me doing something
stupid).
However, I only saw it when I used the stride=8 parameter to mke2fs
together with the -j (ext3) option. Without those, everything seemed
fine.
The symptom I saw was the filesystem reporting a size of -64Z after
I'd closed and deleted all the files on it. An fsck of the filesystem
would fix it (after correcting many errors). I didn't notice any
actual file corruption, but the filesystem seemed croaked.
I was able to verify that this happened without any RAID code (just a
plain old partition), so I'm not sure it has any bearing on the issues
in this thread, however...
Alas, I've not had the time to go back and do some real tests (and
it's quite possible it's something to do with the old e2fsprogs in
Debian woody).
Gordon