* Md corruption using RAID10 on linux-2.6.21
@ 2007-05-16 18:38 Don Dupuis
2007-05-17 1:54 ` Neil Brown
0 siblings, 1 reply; 12+ messages in thread
From: Don Dupuis @ 2007-05-16 18:38 UTC (permalink / raw)
To: linux-raid
I have a system with 4 disks in a raid10 configuration. Here is the
output of mdadm:
bash-3.1# mdadm -D /dev/md_d0
/dev/md_d0:
Version : 00.90.03
Creation Time : Wed May 16 10:28:44 2007
Raid Level : raid10
Array Size : 3646464 (3.48 GiB 3.73 GB)
Used Dev Size : 2734848 (2.61 GiB 2.80 GB)
Raid Devices : 4
Total Devices : 4
Preferred Minor : 0
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Wed May 16 12:20:29 2007
State : active, resyncing
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
Layout : near=3, far=1
Chunk Size : 256K
Rebuild Status : 37% complete
UUID : fe3cad98:406511ae:3df46086:0a218818
Events : 0.1066
Number Major Minor RaidDevice State
0 8 2 0 active sync /dev/sda2
1 8 18 1 active sync /dev/sdb2
2 8 34 2 active sync /dev/sdc2
3 8 50 3 active sync /dev/sdd2
The problem arises when I remove a drive, such as sda, and then
remove power from the system. Most of the time I will have a corrupted
partition on the md device. Other corruption will be on my root
partition, which is an ext3 filesystem. I seem to have a better chance
of booting at least once with no errors with the bitmap turned on, but
if I repeat the process I will get corruption as well. Also, with the
bitmap turned on, adding the new drive back into the md device takes
way too long: I only get about 3MB per second on the resync. With the
bitmap turned off, I get a 10MB to 15MB resync rate. Has anyone else
seen this behavior, or is this situation not tested very often? I
would think that I shouldn't get corruption with this RAID setup and
journaling of my filesystems. Any help would be appreciated.
Don
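(For reference, an array with the geometry shown above would typically be
created with something along these lines. This command is not from the
thread; it is only a plausible reconstruction from the mdadm -D fields:
partitionable device, 4 disks, near=3 layout, 256K chunks, internal bitmap.)

  mdadm --create /dev/md_d0 --auto=part --level=raid10 --layout=n3 \
        --raid-devices=4 --chunk=256 --bitmap=internal /dev/sd[abcd]2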
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Md corruption using RAID10 on linux-2.6.21
2007-05-16 18:38 Md corruption using RAID10 on linux-2.6.21 Don Dupuis
@ 2007-05-17 1:54 ` Neil Brown
2007-05-17 2:57 ` Don Dupuis
0 siblings, 1 reply; 12+ messages in thread
From: Neil Brown @ 2007-05-17 1:54 UTC (permalink / raw)
To: Don Dupuis; +Cc: linux-raid
On Wednesday May 16, dondster@gmail.com wrote:
...
>
> The problem arises when I remove a drive, such as sda, and then
> remove power from the system. Most of the time I will have a corrupted
> partition on the md device. Other corruption will be on my root
> partition, which is an ext3 filesystem. I seem to have a better chance
> of booting at least once with no errors with the bitmap turned on, but
> if I repeat the process I will get corruption as well. Also, with the
> bitmap turned on, adding the new drive back into the md device takes
> way too long: I only get about 3MB per second on the resync. With the
> bitmap turned off, I get a 10MB to 15MB resync rate. Has anyone else
> seen this behavior, or is this situation not tested very often? I
> would think that I shouldn't get corruption with this RAID setup and
> journaling of my filesystems. Any help would be appreciated.
The resync rate should be the same whether you have a bitmap or not,
so that observation is very strange. Can you double-check and report
the contents of "/proc/mdstat" in the two situations?
You say you have corruption on your root filesystem. Presumably that
is not on the raid? Maybe the drive doesn't get a chance to flush
its cache when you power off. Do you get the same corruption if you
simulate a crash without turning off the power? e.g.
echo b > /proc/sysrq-trigger
Do you get the same corruption in the raid10 if you turn it off
*without* removing a drive first?
NeilBrown
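(A quick way to capture what Neil asks for above, as a sketch only: md_d0 and
the internal bitmap are as described earlier in the thread, and the
speed_limit files are the standard md throttle knobs, mentioned here because
they also bound the observed resync rate.)

  # with the internal bitmap in place, save the resync progress
  cat /proc/mdstat > /tmp/mdstat-with-bitmap

  # remove the internal bitmap, repeat the same drive-removal test,
  # and save the progress again for comparison
  mdadm --grow /dev/md_d0 --bitmap=none
  cat /proc/mdstat > /tmp/mdstat-without-bitmap

  # the kernel-wide resync throttle (KB/s) also caps the observed rate
  cat /proc/sys/dev/raid/speed_limit_min
  cat /proc/sys/dev/raid/speed_limit_max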
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Md corruption using RAID10 on linux-2.6.21
2007-05-17 1:54 ` Neil Brown
@ 2007-05-17 2:57 ` Don Dupuis
2007-05-17 2:58 ` Don Dupuis
0 siblings, 1 reply; 12+ messages in thread
From: Don Dupuis @ 2007-05-17 2:57 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
On 5/16/07, Neil Brown <neilb@suse.de> wrote:
> On Wednesday May 16, dondster@gmail.com wrote:
> ...
> >
> > The problem arises when I remove a drive, such as sda, and then
> > remove power from the system. Most of the time I will have a corrupted
> > partition on the md device. Other corruption will be on my root
> > partition, which is an ext3 filesystem. I seem to have a better chance
> > of booting at least once with no errors with the bitmap turned on, but
> > if I repeat the process I will get corruption as well. Also, with the
> > bitmap turned on, adding the new drive back into the md device takes
> > way too long: I only get about 3MB per second on the resync. With the
> > bitmap turned off, I get a 10MB to 15MB resync rate. Has anyone else
> > seen this behavior, or is this situation not tested very often? I
> > would think that I shouldn't get corruption with this RAID setup and
> > journaling of my filesystems. Any help would be appreciated.
>
>
> The resync rate should be the same whether you have a bitmap or not,
> so that observation is very strange. Can you double check, and report
> the contents of "/proc/mdstat" in the two situations.
>
> You say you have corruption on your root filesystem. Presumably that
> is not on the raid? Maybe the drive doesn't get a chance to flush
> its cache when you power off. Do you get the same corruption if you
> simulate a crash without turning off the power? e.g.
> echo b > /proc/sysrq-trigger
>
> Do you get the same corruption in the raid10 if you turn it off
> *without* removing a drive first?
>
> NeilBrown
>
Powering off with all drives present does not cause corruption. When I have a
drive missing and the md device does a full resync, I will get the
corruption. Usually the md partition table is corrupt or gone, and
with the first drive gone it happens more frequently. If the partition
table is not corrupt, then the root filesystem or one of the other
filesystems on the md device will be corrupted. Yes, my root filesystem
is on the raid device. I will update with the bitmap resync rate stuff
later.
Don
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Md corruption using RAID10 on linux-2.6.21
2007-05-17 2:57 ` Don Dupuis
@ 2007-05-17 2:58 ` Don Dupuis
2007-05-17 3:50 ` Don Dupuis
0 siblings, 1 reply; 12+ messages in thread
From: Don Dupuis @ 2007-05-17 2:58 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
On 5/16/07, Don Dupuis <dondster@gmail.com> wrote:
> On 5/16/07, Neil Brown <neilb@suse.de> wrote:
> > On Wednesday May 16, dondster@gmail.com wrote:
> > ...
> > >
> > > The problem arises when I remove a drive, such as sda, and then
> > > remove power from the system. Most of the time I will have a corrupted
> > > partition on the md device. Other corruption will be on my root
> > > partition, which is an ext3 filesystem. I seem to have a better chance
> > > of booting at least once with no errors with the bitmap turned on, but
> > > if I repeat the process I will get corruption as well. Also, with the
> > > bitmap turned on, adding the new drive back into the md device takes
> > > way too long: I only get about 3MB per second on the resync. With the
> > > bitmap turned off, I get a 10MB to 15MB resync rate. Has anyone else
> > > seen this behavior, or is this situation not tested very often? I
> > > would think that I shouldn't get corruption with this RAID setup and
> > > journaling of my filesystems. Any help would be appreciated.
> >
> >
> > The resync rate should be the same whether you have a bitmap or not,
> > so that observation is very strange. Can you double check, and report
> > the contents of "/proc/mdstat" in the two situations.
> >
> > You say you have corruption on your root filesystem. Presumably that
> > is not on the raid? Maybe the drive doesn't get a chance to flush
> > its cache when you power off. Do you get the same corruption if you
> > simulate a crash without turning off the power? e.g.
> > echo b > /proc/sysrq-trigger
> >
> > Do you get the same corruption in the raid10 if you turn it off
> > *without* removing a drive first?
> >
> > NeilBrown
> >
> Powering off with all drives will not have corruption. When I have a
> drive missing and the md device does a full resync, I will get the
> corruption. Usually the md partition table is corrupt or gone, and
> with the first drive gone it happens more frequently. If the partition
> table is not corrupt, then the root filesystem or one of the other
> filesystems on the md device will be corrupted. Yes, my root filesystem
> is on the raid device. I will update with the bitmap resync rate stuff
> later.
>
> Don
>
Forgot to tell you that I have the drive write cache disabled on all my drives.
Don
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Md corruption using RAID10 on linux-2.6.21
2007-05-17 2:58 ` Don Dupuis
@ 2007-05-17 3:50 ` Don Dupuis
2007-05-21 19:32 ` Don Dupuis
0 siblings, 1 reply; 12+ messages in thread
From: Don Dupuis @ 2007-05-17 3:50 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
On 5/16/07, Don Dupuis <dondster@gmail.com> wrote:
> On 5/16/07, Don Dupuis <dondster@gmail.com> wrote:
> > On 5/16/07, Neil Brown <neilb@suse.de> wrote:
> > > On Wednesday May 16, dondster@gmail.com wrote:
> > > ...
> > > >
> > > > The problem arises when I remove a drive, such as sda, and then
> > > > remove power from the system. Most of the time I will have a corrupted
> > > > partition on the md device. Other corruption will be on my root
> > > > partition, which is an ext3 filesystem. I seem to have a better chance
> > > > of booting at least once with no errors with the bitmap turned on, but
> > > > if I repeat the process I will get corruption as well. Also, with the
> > > > bitmap turned on, adding the new drive back into the md device takes
> > > > way too long: I only get about 3MB per second on the resync. With the
> > > > bitmap turned off, I get a 10MB to 15MB resync rate. Has anyone else
> > > > seen this behavior, or is this situation not tested very often? I
> > > > would think that I shouldn't get corruption with this RAID setup and
> > > > journaling of my filesystems. Any help would be appreciated.
> > >
> > >
> > > The resync rate should be the same whether you have a bitmap or not,
> > > so that observation is very strange. Can you double check, and report
> > > the contents of "/proc/mdstat" in the two situations.
> > >
> > > You say you have corruption on your root filesystem. Presumably that
> > > is not on the raid? Maybe the drive doesn't get a chance to flush
> > > its cache when you power off. Do you get the same corruption if you
> > > simulate a crash without turning off the power? e.g.
> > > echo b > /proc/sysrq-trigger
> > >
> > > Do you get the same corruption in the raid10 if you turn it off
> > > *without* removing a drive first?
> > >
> > > NeilBrown
> > >
> > Powering off with all drives will not have corruption. When I have a
> > drive missing and the md device does a full resync, I will get the
> > corruption. Usually the md partition table is corrupt or gone, and
> > with the first drive gone it happens more frequently. If the partition
> > table is not corrupt, then the root filesystem or one of the other
> > filesystems on the md device will be corrupted. Yes, my root filesystem
> > is on the raid device. I will update with the bitmap resync rate stuff
> > later.
> >
> > Don
> >
> Forgot to tell you that I have the drive write cache disabled on all my drives.
>
> Don
>
Here is the /proc/mdstat output during a recovery after adding a drive
to the md device:
unused devices: <none>
-bash-3.1$ cat /proc/mdstat
Personalities : [raid10]
md_d0 : active raid10 sda2[4] sdd2[3] sdc2[2] sdb2[1]
3646464 blocks 256K chunks 3 near-copies [4/3] [_UUU]
[>....................] recovery = 2.6% (73216/2734848)
finish=4.8min speed=9152K/sec
unused devices: <none>
-bash-3.1$ cat /proc/mdstat
Personalities : [raid10]
md_d0 : active raid10 sda2[4] sdd2[3] sdc2[2] sdb2[1]
3646464 blocks 256K chunks 3 near-copies [4/3] [_UUU]
[>....................] recovery = 3.4% (93696/2734848)
finish=4.6min speed=9369K/sec
I am still trying to get to where I had the low recovery rate with the
bitmap turned on. I will get back with you.
Don
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Md corruption using RAID10 on linux-2.6.21
2007-05-17 3:50 ` Don Dupuis
@ 2007-05-21 19:32 ` Don Dupuis
2007-05-22 0:50 ` Neil Brown
0 siblings, 1 reply; 12+ messages in thread
From: Don Dupuis @ 2007-05-21 19:32 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
On 5/16/07, Don Dupuis <dondster@gmail.com> wrote:
> On 5/16/07, Don Dupuis <dondster@gmail.com> wrote:
> > On 5/16/07, Don Dupuis <dondster@gmail.com> wrote:
> > > On 5/16/07, Neil Brown <neilb@suse.de> wrote:
> > > > On Wednesday May 16, dondster@gmail.com wrote:
> > > > ...
> > > > >
> > > > > The problem arises when I remove a drive, such as sda, and then
> > > > > remove power from the system. Most of the time I will have a corrupted
> > > > > partition on the md device. Other corruption will be on my root
> > > > > partition, which is an ext3 filesystem. I seem to have a better chance
> > > > > of booting at least once with no errors with the bitmap turned on, but
> > > > > if I repeat the process I will get corruption as well. Also, with the
> > > > > bitmap turned on, adding the new drive back into the md device takes
> > > > > way too long: I only get about 3MB per second on the resync. With the
> > > > > bitmap turned off, I get a 10MB to 15MB resync rate. Has anyone else
> > > > > seen this behavior, or is this situation not tested very often? I
> > > > > would think that I shouldn't get corruption with this RAID setup and
> > > > > journaling of my filesystems. Any help would be appreciated.
> > > >
> > > >
> > > > The resync rate should be the same whether you have a bitmap or not,
> > > > so that observation is very strange. Can you double check, and report
> > > > the contents of "/proc/mdstat" in the two situations.
> > > >
> > > > You say you have corruption on your root filesystem. Presumably that
> > > > is not on the raid? Maybe the drive doesn't get a chance to flush
> > > > its cache when you power off. Do you get the same corruption if you
> > > > simulate a crash without turning off the power? e.g.
> > > > echo b > /proc/sysrq-trigger
> > > >
> > > > Do you get the same corruption in the raid10 if you turn it off
> > > > *without* removing a drive first?
> > > >
> > > > NeilBrown
> > > >
> > > Powering off with all drives will not have corruption. When I have a
> > > drive missing and the md device does a full resync, I will get the
> > > corruption. Usually the md partition table is corrupt or gone, and
> > > with the first drive gone it happens more frequently. If the partition
> > > table is not corrupt, then the root filesystem or one of the other
> > > filesystems on the md device will be corrupted. Yes, my root filesystem
> > > is on the raid device. I will update with the bitmap resync rate stuff
> > > later.
> > >
> > > Don
> > >
> > Forgot to tell you that I have the drive write cache disabled on all my drives.
> >
> > Don
> >
> Here is the /proc/mdstat output doing a recover after adding a drive
> to the md device:
> unused devices: <none>
> -bash-3.1$ cat /proc/mdstat
> Personalities : [raid10]
> md_d0 : active raid10 sda2[4] sdd2[3] sdc2[2] sdb2[1]
> 3646464 blocks 256K chunks 3 near-copies [4/3] [_UUU]
> [>....................] recovery = 2.6% (73216/2734848)
> finish=4.8min speed=9152K/sec
>
> unused devices: <none>
> -bash-3.1$ cat /proc/mdstat
> Personalities : [raid10]
> md_d0 : active raid10 sda2[4] sdd2[3] sdc2[2] sdb2[1]
> 3646464 blocks 256K chunks 3 near-copies [4/3] [_UUU]
> [>....................] recovery = 3.4% (93696/2734848)
> finish=4.6min speed=9369K/sec
>
> I am still trying to get where I had the low recover rate with the
> bitmap turned on. I will get back with you
> Don
>
Any new updates, Neil?
Any new things to try to get you additional info?
Thanks
Don
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Md corruption using RAID10 on linux-2.6.21
2007-05-21 19:32 ` Don Dupuis
@ 2007-05-22 0:50 ` Neil Brown
2007-05-22 2:47 ` Don Dupuis
0 siblings, 1 reply; 12+ messages in thread
From: Neil Brown @ 2007-05-22 0:50 UTC (permalink / raw)
To: Don Dupuis; +Cc: linux-raid
On Monday May 21, dondster@gmail.com wrote:
> On 5/16/07, Don Dupuis <dondster@gmail.com> wrote:
> >
> > I am still trying to get where I had the low recover rate with the
> > bitmap turned on. I will get back with you
> > Don
> >
> Any new updates, Neil?
> Any new things to try to get you additional info?
> Thanks
>
> Don
You said "I will get back with you" and I was waiting for that... I
hoped that your further testing might reveal some details that would
shine a light on the situation.
One question: Your description seems to say that you get corruption
after the resync has finished. Is the corruption there before the
resync starts?
I guess what I would like is:
Start with fully active array. Check for corruption.
Remove one drive. Check for corruption.
Turn off system. Turn it on again, array assembles with one
missing device. Check for corruption.
Add device, resync starts. Check for corruption.
Wait for resync to finish. Check for corruption.
NeilBrown
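(In shell terms, the sequence above might look roughly like this. It is only
a sketch: it assumes the /dev/md_d0 array from earlier in the thread, a
hypothetical /dev/md_d0p1 partition on it, and uses fsck -fn purely as the
corruption check at each step.)

  fsck -fn /dev/md_d0p1                           # 1. fully active array
  mdadm /dev/md_d0 --fail /dev/sda2 --remove /dev/sda2
  fsck -fn /dev/md_d0p1                           # 2. one drive removed
  # 3. power off and back on (or: echo b > /proc/sysrq-trigger);
  #    the array assembles with one missing device
  fsck -fn /dev/md_d0p1
  mdadm /dev/md_d0 --add /dev/sda2                # 4. re-add, resync starts
  fsck -fn /dev/md_d0p1
  while grep -Eq 'resync|recovery' /proc/mdstat; do sleep 5; done
  fsck -fn /dev/md_d0p1                           # 5. after resync completes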
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Md corruption using RAID10 on linux-2.6.21
2007-05-22 0:50 ` Neil Brown
@ 2007-05-22 2:47 ` Don Dupuis
2007-05-22 3:59 ` Neil Brown
0 siblings, 1 reply; 12+ messages in thread
From: Don Dupuis @ 2007-05-22 2:47 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
On 5/21/07, Neil Brown <neilb@suse.de> wrote:
> On Monday May 21, dondster@gmail.com wrote:
> > On 5/16/07, Don Dupuis <dondster@gmail.com> wrote:
> > >
> > > I am still trying to get where I had the low recover rate with the
> > > bitmap turned on. I will get back with you
> > > Don
> > >
> > Any new updates, Neil?
> > Any new things to try to get you additional info?
> > Thanks
> >
> > Don
>
> You said "I will get back with you" and I was waiting for that... I
> hoped that your further testing might reveal some details that would
> shine a light on the situation.
>
> One question: Your description seems to say that you get corruption
> after the resync has finished. Is the corruption there before the
> resync starts?
> I guess what I would like is:
> Start with fully active array. Check for corruption.
> Remove one drive. Check for corruption.
> Turn off system. Turn it on again, array assembles with one
> missing device. Check for corruption.
> Add device, resync starts. Check for corruption.
> Wait for resync to finish. Check for corruption.
>
> NeilBrown
>
I was going to get back with you concerning the low resync rates. The
data corruption happens like this.
1. Start with fully active array. Everything is fine.
2. I remove a drive. Everything is fine. I then power off the machine.
3. I power up and load an initramfs which has my initial root
filesystem and scripts for handling the assembly of the array. My init
script will determine which drive was removed and assemble the
remaining 3. If a resync happens, I wait for it to complete, then run
fdisk -l /dev/md_d0 to make sure the partition table is intact. Most
of the time it reports "unknown partition table". At this point I am
dead in the water because I can't pivot_root to my real root
filesystem. If I get through the resync and the partition table is
correct, the other corruption will be in filesystems on the md device.
All filesystems are ext3 with full data journaling enabled. I could
have one corrupted or several, and fsck is not able to clean them up.
Under normal circumstances, once I am up and running on the real root
filesystem, I would add the removed disk back into the md device with
the recovery running in the background. Sorry for the confusion over
"get back with you"; I basically have 2 issues, and the corruption
issue is my main priority at this point.
Thanks
Don
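(The checks described above amount to roughly the following. This is only a
sketch: the member list and the md_d0p1 partition name are hypothetical, and
the real init script presumably detects which drives are present.)

  # inside the initramfs, assemble whichever members survived (here sda was pulled)
  mdadm --assemble /dev/md_d0 --run /dev/sdb2 /dev/sdc2 /dev/sdd2

  # wait out any resync before checking anything
  while grep -Eq 'resync|recovery' /proc/mdstat; do sleep 5; done

  fdisk -l /dev/md_d0           # "unknown partition table" here means dead in the water
  fsck.ext3 -fn /dev/md_d0p1    # root filesystem on the md device
  # only if these checks pass does the script go on to pivot_root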
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Md corruption using RAID10 on linux-2.6.21
2007-05-22 2:47 ` Don Dupuis
@ 2007-05-22 3:59 ` Neil Brown
2007-05-31 1:52 ` Don Dupuis
0 siblings, 1 reply; 12+ messages in thread
From: Neil Brown @ 2007-05-22 3:59 UTC (permalink / raw)
To: Don Dupuis; +Cc: linux-raid
On Monday May 21, dondster@gmail.com wrote:
> I was going to get back with you concerning the low resync rates. The
> data corruption happens like this.
> 1. Start with fully active array. Everything is fine.
> 2. I remove a drive. Everything is fine. I then power off the machine.
> 3. I power up and load an initramfs which has my initial root
> filesystem and scripts for handling the assembly of the array. My init
> script will determine which drive was removed and assemble the
> remaining 3.
Could I see this script please?
NeilBrown
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Md corruption using RAID10 on linux-2.6.21
2007-05-22 3:59 ` Neil Brown
@ 2007-05-31 1:52 ` Don Dupuis
2007-05-31 5:16 ` Neil Brown
0 siblings, 1 reply; 12+ messages in thread
From: Don Dupuis @ 2007-05-31 1:52 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
Neil, I sent the scripts to you. Any update on this issue?
Thanks
Don
On 5/21/07, Neil Brown <neilb@suse.de> wrote:
> On Monday May 21, dondster@gmail.com wrote:
> > I was going to get back with you concerning the low resync rates. The
> > data corruption happens like this.
> > 1. Start with fully active array. Everything is fine.
> > 2. I remove a drive. Everything is fine. I then power off the machine.
> > 3. I power up and load an initramfs which has my initial root
> > filesystem and scripts for handling the assembly of the array. My init
> > script will determine which drive was removed and assemble the
> > remaining 3.
>
> Could I see this script please?
>
> NeilBrown
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Md corruption using RAID10 on linux-2.6.21
2007-05-31 1:52 ` Don Dupuis
@ 2007-05-31 5:16 ` Neil Brown
2007-06-01 15:58 ` Don Dupuis
0 siblings, 1 reply; 12+ messages in thread
From: Neil Brown @ 2007-05-31 5:16 UTC (permalink / raw)
To: Don Dupuis; +Cc: linux-raid
On Wednesday May 30, dondster@gmail.com wrote:
> Neil, I sent the scripts to you. Any update on this issue?
Sorry, I got distracted.
Your scripts are way more complicated than needed. Most of the logic
in there is already in mdadm.
mdadm --assemble /dev/md_d0 --run --uuid=$BOOTUUID /dev/sd[abcd]2
can replace most of it. And you don't need to wait for resync to
complete before mounting filesystems.
That said: I cannot see anything in your script that would actually do
the wrong thing.
Hmmm... I see now I wasn't quite testing the right thing. I need to
trigger a resync with one device missing.
i.e.:
mdadm -C /dev/md0 -l10 -n4 -p n3 /dev/sd[abcd]1
mkfs /dev/md0
mdadm /dev/md0 -f /dev/sda1
mdadm -S /dev/md0
mdadm -A /dev/md0 -R --update=resync /dev/sd[bcd]1
fsck -f /dev/md0
This fails just as you say.
The following patch fixes it, as well as another problem I found while
doing this testing.
Thanks for pursuing this.
NeilBrown
diff .prev/drivers/md/raid10.c ./drivers/md/raid10.c
--- .prev/drivers/md/raid10.c 2007-05-21 11:18:23.000000000 +1000
+++ ./drivers/md/raid10.c 2007-05-31 15:11:42.000000000 +1000
@@ -1866,6 +1866,7 @@ static sector_t sync_request(mddev_t *md
int d = r10_bio->devs[i].devnum;
bio = r10_bio->devs[i].bio;
bio->bi_end_io = NULL;
+ clear_bit(BIO_UPTODATE, &bio->bi_flags);
if (conf->mirrors[d].rdev == NULL ||
test_bit(Faulty, &conf->mirrors[d].rdev->flags))
continue;
@@ -2036,6 +2037,11 @@ static int run(mddev_t *mddev)
/* 'size' is now the number of chunks in the array */
/* calculate "used chunks per device" in 'stride' */
stride = size * conf->copies;
+
+ /* We need to round up when dividing by raid_disks to
+ * get the stride size.
+ */
+ stride += conf->raid_disks - 1;
sector_div(stride, conf->raid_disks);
mddev->size = stride << (conf->chunk_shift-1);
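(For anyone wanting to reproduce this without touching real disks, the recipe
above can be wrapped in loop devices along these lines. A sketch only: file
sizes and loop-device numbers are arbitrary and assumed to be free.)

  for i in 0 1 2 3; do
      dd if=/dev/zero of=/tmp/d$i bs=1M count=200
      losetup /dev/loop$i /tmp/d$i
  done

  mdadm -C /dev/md0 -l10 -n4 -p n3 /dev/loop[0-3]       # 4 devices, 3 near-copies
  mkfs.ext3 /dev/md0
  mdadm /dev/md0 -f /dev/loop0                          # fail one member
  mdadm -S /dev/md0                                     # stop the array
  mdadm -A /dev/md0 -R --update=resync /dev/loop[1-3]   # degraded assemble, forced resync
  while grep -Eq 'resync|recovery' /proc/mdstat; do sleep 1; done
  fsck -fn /dev/md0                                     # reports corruption without the patch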
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Md corruption using RAID10 on linux-2.6.21
2007-05-31 5:16 ` Neil Brown
@ 2007-06-01 15:58 ` Don Dupuis
0 siblings, 0 replies; 12+ messages in thread
From: Don Dupuis @ 2007-06-01 15:58 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
Thanks, Neil. This took care of my issue. I was doing a full set of
tests to make sure before I replied. Thanks for all your hard work.
Don
On 5/31/07, Neil Brown <neilb@suse.de> wrote:
> On Wednesday May 30, dondster@gmail.com wrote:
> > Neil, I sent the scripts to you. Any update on this issue?
>
> Sorry, I got distracted.
>
> Your scripts are way more complicated than needed. Most of the logic
> in there is already in mdadm.
>
> mdadm --assemble /dev/md_d0 --run --uuid=$BOOTUUID /dev/sd[abcd]2
>
> can replace most of it. And you don't need to wait for resync to
> complete before mounting filesystems.
>
> That said: I cannot see anything in your script that would actually do
> the wrong thing.
>
> Hmmm... I see now I wasn't quite testing the right thing. I need to
> trigger a resync with one device missing.
> i.e.:
> mdadm -C /dev/md0 -l10 -n4 -p n3 /dev/sd[abcd]1
> mkfs /dev/md0
> mdadm /dev/md0 -f /dev/sda1
> mdadm -S /dev/md0
> mdadm -A /dev/md0 -R --update=resync /dev/sd[bcd]1
> fsck -f /dev/md0
>
> This fails just as you say.
> The following patch fixes it, as well as another problem I found while
> doing this testing.
>
> Thanks for pursuing this.
>
> NeilBrown
>
> diff .prev/drivers/md/raid10.c ./drivers/md/raid10.c
> --- .prev/drivers/md/raid10.c 2007-05-21 11:18:23.000000000 +1000
> +++ ./drivers/md/raid10.c 2007-05-31 15:11:42.000000000 +1000
> @@ -1866,6 +1866,7 @@ static sector_t sync_request(mddev_t *md
> int d = r10_bio->devs[i].devnum;
> bio = r10_bio->devs[i].bio;
> bio->bi_end_io = NULL;
> + clear_bit(BIO_UPTODATE, &bio->bi_flags);
> if (conf->mirrors[d].rdev == NULL ||
> test_bit(Faulty, &conf->mirrors[d].rdev->flags))
> continue;
> @@ -2036,6 +2037,11 @@ static int run(mddev_t *mddev)
> /* 'size' is now the number of chunks in the array */
> /* calculate "used chunks per device" in 'stride' */
> stride = size * conf->copies;
> +
> + /* We need to round up when dividing by raid_disks to
> + * get the stride size.
> + */
> + stride += conf->raid_disks - 1;
> sector_div(stride, conf->raid_disks);
> mddev->size = stride << (conf->chunk_shift-1);
>
>
>
^ permalink raw reply [flat|nested] 12+ messages in thread
Thread overview: 12+ messages (thread ended 2007-06-01 15:58 UTC):
2007-05-16 18:38 Md corruption using RAID10 on linux-2.6.21 Don Dupuis
2007-05-17 1:54 ` Neil Brown
2007-05-17 2:57 ` Don Dupuis
2007-05-17 2:58 ` Don Dupuis
2007-05-17 3:50 ` Don Dupuis
2007-05-21 19:32 ` Don Dupuis
2007-05-22 0:50 ` Neil Brown
2007-05-22 2:47 ` Don Dupuis
2007-05-22 3:59 ` Neil Brown
2007-05-31 1:52 ` Don Dupuis
2007-05-31 5:16 ` Neil Brown
2007-06-01 15:58 ` Don Dupuis