Re: Severe, huge data corruption with softraid

All of lore.kernel.org
 help / color / mirror / Atom feed

From: berk walker <berk@panix.com>
To: Michael Tokarev <mjt@tls.msk.ru>
Cc: linux-raid@vger.kernel.org
Subject: Re: Severe, huge data corruption with softraid
Date: Wed, 02 Mar 2005 19:10:57 -0500	[thread overview]
Message-ID: <42265611.4020307@panix.com> (raw)
In-Reply-To: <42264AF4.4000600@tls.msk.ru>

Just a thought, have you tried swapping power supplies, and 
checked/improved the system's Earth ground?
b-

Michael Tokarev wrote:

> Too bad I can't diagnose the problem correctly, but it is
> here somewhere, and is (hardly) reproduceable.
>
> I'm doing alot of experiments right now with various raid options
> and read/write speed.  And 3 times now, the whole system went boom
> during the experiments.  It is writing into random places on all
> disks, including boot sectors, partition tables and whatnot, so
> obviously every filesystem out there becomes corrupt to hell.
>
> It seems the problem is due to integer overflow somewhere in raid
> (very probably raid5) or ext3fs code, as it is starting to write
> to the beginning of all disks instead of the raid partitions being
> tested.  It *may* be related to direct-io (O_DIRECT) into a file
> in ext3 filesystem which is on top of softraid5 array.  It may also
> be related to raid10 code, but it is less likely.
>
> Here's the scenario.
>
> I have 7 scsi disks, sda..sdg, 36GB each.
> On each drive there's a 3GB partition at the end (sdX10)
> where I'm testing stuff.
> I tried to create various raid arrays out of those sdX10 partitions,
> including raid5 (various chunk sizes), raid1+raid0 and raid10.
> On top of the raid array, I also tried to create ext3 fs.
> And did various read/write tests on both the md device (without the
> filesystem) and a file on the filesystem.
> The tests - just sequential read and write with various I/O size
> (8k, 16k, 32k, ..., 1m) and various O_DIRECT/O_SYNC/fsync() combinations.
>
> Ofcourse I created/stopped raid arrays (all on the same sdX11), created,
> mounted and umounted filesystem on that arrays and did alot of reading
> and writing.  I'm sure I didn't access other devices during all this
> testing (like trying to write to /dev/sdX instead of /dev/sdX11), and
> did not write to the device while there was filesystem mounted.  And
> yes, my /dev/ major/minor numbers are correct (just verified to be sure).
>
> The symthom is simple: at some time, partition table on /dev/sdX becomes
> corrupt (either primary or extended which is at about 1.2Gb of the start
> of each disk), just like alot of other stuff, mostly at the beginning of
> all disks -- on all but one or two disks involved in testing.
>
> We lost the system this way after first series of testing, and during
> re-install (as there's no data anymore anyway), I descided to perform
> some more testing, and hit the same prob again and (after restoring
> partition tables) yet again.
>
> All my attempts to reproduce it failed so far, but when I din't watch
> partition tables after each operation, it happened again after yet more
> series of tests.
>
> One note: every time before it "crashed", I tried to create/use a raid5
> array out of 3, 4 or 5 drives with chunk size = 4Kb (each partition is
> 3GB large), and -- if i recall correctly -- experimented with direct
> write on the filesystem created on top of the array.  Maybe it dislikes
> chunk size this small...
>
> Now it's 02:18 here, deep night and I'm still in office -- I have to re-
> install the server by morning so our users will have something to do,
> so I have very limited time for more testing.  Any quick suggestions
> about what/where to look at right now welcome...
>
> BTW, the hardware is good, drives, memory, mobo and CPUs.
> This happens on either 2.6.10 or 2.6.9 the first time, now it is
> running 2.6.9.
>
> /mjt
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
> .
>

next prev parent reply	other threads:[~2005-03-03  0:10 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2005-03-02 23:23 Severe, huge data corruption with softraid Michael Tokarev
2005-03-02 23:57 ` Michael Tokarev
2005-03-03  0:46   ` Peter T. Breuer
2005-03-03  1:24     ` Michael Tokarev
2005-03-03  3:01       ` Peter T. Breuer
2005-03-03  0:10 ` berk walker [this message]
2005-03-03  9:00 ` Gordon Henderson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=42265611.4020307@panix.com \
    --to=berk@panix.com \
    --cc=linux-raid@vger.kernel.org \
    --cc=mjt@tls.msk.ru \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.