linux-raid.vger.kernel.org archive mirror
* MD RAID 1 fail/remove/add corruption in 3.10
@ 2013-07-16 18:49 Joe Lawrence
From: Joe Lawrence @ 2013-07-16 18:49 UTC (permalink / raw)
  To: linux-raid; +Cc: NeilBrown, Martin Wilck

Hi Neil, Martin,

While testing patches to fix RAID1 repair GPF crash w/3.10-rc7
( http://thread.gmane.org/gmane.linux.raid/43351 ), I encountered disk
corruption when repeatedly failing, removing, and adding MD RAID1
component disks to their array.  The RAID1 was created with an internal
write bitmap and the test was run against alternating disks in the
set.  I bisected this behavior back to commit 7ceb17e8 "md: Allow
devices to be re-added to a read-only array", specifically these lines
of code:

remove_and_add_spares:

+		if (rdev->saved_raid_disk >= 0 && mddev->in_sync) {
+			spin_lock_irq(&mddev->write_lock);
+			if (mddev->in_sync)
+				/* OK, this device, which is in_sync,
+				 * will definitely be noticed before
+				 * the next write, so recovery isn't
+				 * needed.
+				 */
+				rdev->recovery_offset = mddev->recovery_cp;
+			spin_unlock_irq(&mddev->write_lock);
+		}
+		if (mddev->ro && rdev->recovery_offset != MaxSector)
+			/* not safe to add this disk now */
+			continue;

When I #if 0 these lines out, leaving rdev->recovery_offset = 0, my
tests run without incident.

If there is any instrumentation I can apply to remove_and_add_spares
I'll be happy to gather more data.  I'll send an attached copy of
my test programs in a reply so this mail doesn't get bounced by
any spam filters.
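
In the meantime, each iteration of the test boils down to roughly the
following (md0 and sdb/sdc are illustrative names, not the exact
commands from the scripts):

  # Alternate between the two legs of the mirror, failing, removing
  # and re-adding each in turn.
  for disk in /dev/sdb /dev/sdc ; do
      mdadm /dev/md0 --fail "$disk"
      mdadm /dev/md0 --remove "$disk"
      sleep 5
      mdadm /dev/md0 --re-add "$disk"
      # wait for any bitmap-based recovery to finish before continuing
      while grep -q recovery /proc/mdstat ; do sleep 5 ; done
  done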

Thanks,

-- Joe


* Re: MD RAID 1 fail/remove/add corruption in 3.10
From: Joe Lawrence @ 2013-07-16 19:05 UTC (permalink / raw)
  To: linux-raid; +Cc: NeilBrown, Martin Wilck

[-- Attachment #1: Type: text/plain, Size: 775 bytes --]

Attached are the test scripts:

  mdcreate - creates three MD RAID1 pairs with internal bitmaps and
             makes an {ext4,xfs,btrfs} filesystem on each

  break_md - loops between the two component disks, failing and then
             re-adding them.  Calls the mdtest script in between test
             runs.

  mdtest - stops the fio tests running on the MDs, unmounts them,
           issues a RAID "check", and runs fsck on each MD.  If fsck
           fails, it returns failure; if fsck is clean, it remounts
           and restarts the fio tests.

Usually I would see a non-zero RAID mismatch_cnt after the first or
second disk break.  Then, within a few iterations, one of the fsck
programs (usually xfs or btrfs) would complain.
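
In outline, each mdtest pass boils down to something like the following
(md0 and /mnt/md0 are illustrative names; the attached script is the
real thing):

  pkill fio                                   # stop the fio jobs
  umount /mnt/md0
  echo check > /sys/block/md0/md/sync_action  # kick off a RAID "check"
  while grep -q check /proc/mdstat ; do sleep 5 ; done
  cat /sys/block/md0/md/mismatch_cnt          # non-zero here means trouble
  fsck -n /dev/md0 || exit 1                  # fail the run if fsck complains
  mount /dev/md0 /mnt/md0                     # otherwise remount, restart fio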

These scripts were cobbled together in the last day or two, so standard
disclaimers apply :)

Regards,

-- Joe

[-- Attachment #2: break_md.sh --]
[-- Type: application/x-shellscript, Size: 1192 bytes --]

[-- Attachment #3: mdcreate.sh --]
[-- Type: application/x-shellscript, Size: 470 bytes --]

[-- Attachment #4: mdtest.sh --]
[-- Type: application/x-shellscript, Size: 1686 bytes --]


* Re: MD RAID 1 fail/remove/add corruption in 3.10
From: Brad Campbell @ 2013-07-17  2:52 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: linux-raid, NeilBrown, Martin Wilck

On 17/07/13 02:49, Joe Lawrence wrote:
> Hi Neil, Martin,
>
> While testing patches to fix RAID1 repair GPF crash w/3.10-rc7
> ( http://thread.gmane.org/gmane.linux.raid/43351 ), I encountered disk
> corruption when repeatedly failing, removing, and adding MD RAID1
> component disks to their array.  The RAID1 was created with an internal
> write bitmap and the test was run against alternating disks in the
> set.  I bisected this behavior back to commit 7ceb17e8 "md: Allow
> devices to be re-added to a read-only array", specifically these lines
> of code:

This sounds like an issue I just bumped up against in RAID-5.
I have a test box with a RAID-5 comprised of 2 x 2TB drives, and 6 
RAID-0's of 2 x 1TB drives.


root@test:/root# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md3 : active raid5 md20[0] md25[8] md24[7] md22[6] sdl[4] sdn[3] md23[2] md21[1]
       13673683968 blocks super 1.2 level 5, 512k chunk, algorithm 2 [8/8] [UUUUUUUU]
       bitmap: 0/15 pages [0KB], 65536KB chunk

md22 : active raid0 sdk[0] sdm[1]
       1953524736 blocks super 1.2 512k chunks

md20 : active raid0 sdj[0] sdo[1]
       1953522688 blocks super 1.2 512k chunks

md21 : active raid0 sdh[0] sdi[1]
       1953524736 blocks super 1.2 512k chunks

md25 : active raid0 sda[0] sdb[1]
       2441900544 blocks super 1.2 512k chunks

md23 : active raid0 sdd[0] sde[1]
       1953522688 blocks super 1.2 512k chunks

md24 : active raid0 sdf[0] sdg[1]
       1953524736 blocks super 1.2 512k chunks

I was running a check over md3 whilst rsyncing a load of data onto it.
md20 was ejected some time during this process (a SMART query caused a
timeout on one of the drives).  I removed md20, stopped md20, started
md20 and re-added md20.

This should have caused a rebuild, as the bitmap would have been way out
of sync; however, it immediately reported the rebuild complete and left
the array mostly trashed (about 500,000 mismatch counts).
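
From memory, the sequence was roughly the following (the exact
assemble arguments may have differed):

  mdadm /dev/md3 --remove /dev/md20     # md20 had already been kicked out
  mdadm --stop /dev/md20
  mdadm --assemble /dev/md20 /dev/sdj /dev/sdo
  mdadm /dev/md3 --re-add /dev/md20
  cat /sys/block/md3/md/mismatch_cnt    # ~500,000 once the array was checked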

The kernel at the time was from late in the 3.11-rc1 merge window:
3.10.0-09289-g9903883.

I've been meaning to try and reproduce it, but as each operation takes 
about 5 hours it's slow going.

This is a test array, so it has no data value. I'm happy to try to 
reproduce this fault if it would help any.

Regards,
Brad


* Re: MD RAID 1 fail/remove/add corruption in 3.10
From: NeilBrown @ 2013-07-17  4:52 UTC (permalink / raw)
  To: Joe Lawrence; +Cc: linux-raid, Martin Wilck

[-- Attachment #1: Type: text/plain, Size: 2142 bytes --]

On Tue, 16 Jul 2013 14:49:20 -0400 Joe Lawrence <joe.lawrence@stratus.com>
wrote:

> Hi Neil, Martin,
> 
> While testing patches to fix RAID1 repair GPF crash w/3.10-rc7
> ( http://thread.gmane.org/gmane.linux.raid/43351 ), I encountered disk
> corruption when repeatedly failing, removing, and adding MD RAID1
> component disks to their array.  The RAID1 was created with an internal
> write bitmap and the test was run against alternating disks in the
> set.  I bisected this behavior back to commit 7ceb17e8 "md: Allow
> devices to be re-added to a read-only array", specifically these lines
> of code:
> 
> remove_and_add_spares:
> 
> +		if (rdev->saved_raid_disk >= 0 && mddev->in_sync) {
> +			spin_lock_irq(&mddev->write_lock);
> +			if (mddev->in_sync)
> +				/* OK, this device, which is in_sync,
> +				 * will definitely be noticed before
> +				 * the next write, so recovery isn't
> +				 * needed.
> +				 */
> +				rdev->recovery_offset = mddev->recovery_cp;
> +			spin_unlock_irq(&mddev->write_lock);
> +		}
> +		if (mddev->ro && rdev->recovery_offset != MaxSector)
> +			/* not safe to add this disk now */
> +			continue;
> 
> When I #if 0 these lines out, leaving rdev->recovery_offset = 0, my
> tests run without incident.
> 
> If there is any instrumentation I can apply to remove_and_add_spares
> I'll be happy to gather more data.  I'll send an attached copy of
> my test programs in a reply so this mail doesn't get bounced by
> any spam filters.
> 
>

Thanks for the report Joe.

That code has problems.

If the array has a bitmap, then 'saved_raid_disk >= 0' means that the device
is fairly close to in-sync, but a bitmap-based resync is still required first.
This code skips that.

If the array does not have a bitmap, then 'saved_raid_disk >= 0' means that
this device was exactly right for this slot before, but there is no locking
to prevent updates going to the array between when super_1_validate checked
the event count and when remove_and_add_spares tried to add it.

I suspect I should  just rip that code out and go back to the drawing board.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]


* Re: MD RAID 1 fail/remove/add corruption in 3.10
From: NeilBrown @ 2013-07-17  4:53 UTC (permalink / raw)
  To: Brad Campbell; +Cc: Joe Lawrence, linux-raid, Martin Wilck

[-- Attachment #1: Type: text/plain, Size: 2724 bytes --]

On Wed, 17 Jul 2013 10:52:31 +0800 Brad Campbell <lists2009@fnarfbargle.com>
wrote:

> On 17/07/13 02:49, Joe Lawrence wrote:
> > Hi Neil, Martin,
> >
> > While testing patches to fix RAID1 repair GPF crash w/3.10-rc7
> > ( http://thread.gmane.org/gmane.linux.raid/43351 ), I encountered disk
> > corruption when repeatedly failing, removing, and adding MD RAID1
> > component disks to their array.  The RAID1 was created with an internal
> > write bitmap and the test was run against alternating disks in the
> > set.  I bisected this behavior back to commit 7ceb17e8 "md: Allow
> > devices to be re-added to a read-only array", specifically these lines
> > of code:
> 
> This sounds like an issue I just bumped up against in RAID-5.
> I have a test box with a RAID-5 comprised of 2 x 2TB drives, and 6 
> RAID-0's of 2 x 1TB drives.
> 
> 
> root@test:/root# cat /proc/mdstat
> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
> md3 : active raid5 md20[0] md25[8] md24[7] md22[6] sdl[4] sdn[3] md23[2] md21[1]
>        13673683968 blocks super 1.2 level 5, 512k chunk, algorithm 2 [8/8] [UUUUUUUU]
>        bitmap: 0/15 pages [0KB], 65536KB chunk
> 
> md22 : active raid0 sdk[0] sdm[1]
>        1953524736 blocks super 1.2 512k chunks
> 
> md20 : active raid0 sdj[0] sdo[1]
>        1953522688 blocks super 1.2 512k chunks
> 
> md21 : active raid0 sdh[0] sdi[1]
>        1953524736 blocks super 1.2 512k chunks
> 
> md25 : active raid0 sda[0] sdb[1]
>        2441900544 blocks super 1.2 512k chunks
> 
> md23 : active raid0 sdd[0] sde[1]
>        1953522688 blocks super 1.2 512k chunks
> 
> md24 : active raid0 sdf[0] sdg[1]
>        1953524736 blocks super 1.2 512k chunks
> 
> I was running a check over md3 whilst rsyncing a load of data onto it.
> md20 was ejected some time during this process (a SMART query caused a
> timeout on one of the drives).  I removed md20, stopped md20, started
> md20 and re-added md20.
> 
> This should have caused a rebuild, as the bitmap would have been way out
> of sync; however, it immediately reported the rebuild complete and left
> the array mostly trashed (about 500,000 mismatch counts).
> 
> The kernel at the time was from late in the 3.11-rc1 merge window:
> 3.10.0-09289-g9903883.
> 
> I've been meaning to try and reproduce it, but as each operation takes 
> about 5 hours it's slow going.
> 
> This is a test array, so it has no data value. I'm happy to try to 
> reproduce this fault if it would help any.
> 
> Regards,
> Brad


Hi Brad,
 Yes, it sounds like the same problem, with the same solution for now: remove
 the code that Joe highlighted.

Thanks.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

