* RE: Mechanism to safely force repair of single md stripe w/o hurting data integrity of file system
@ 2008-05-17 21:30 David Lethe
2008-05-17 23:16 ` Roger Heflin
0 siblings, 1 reply; 7+ messages in thread
From: David Lethe @ 2008-05-17 21:30 UTC (permalink / raw)
To: Guy Watkins, 'LinuxRaid', linux-kernel

It will. But that defeats the purpose. I want to limit the repair to only
the RAID stripe that uses the specific disk with the block that I know has
an unrecoverable read error.

-----Original Message-----
From: "Guy Watkins" <linux-raid@watkins-home.com>
Subj: RE: Mechanism to safely force repair of single md stripe w/o hurting data integrity of file system
Date: Sat May 17, 2008 3:28 pm
Size: 2K
To: "'David Lethe'" <david@santools.com>; "'LinuxRaid'" <linux-raid@vger.kernel.org>; "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>

} -----Original Message-----
} From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
} owner@vger.kernel.org] On Behalf Of David Lethe
} Sent: Saturday, May 17, 2008 3:10 PM
} To: LinuxRaid; linux-kernel@vger.kernel.org
} Subject: Mechanism to safely force repair of single md stripe w/o hurting
} data integrity of file system
}
} I'm trying to figure out a mechanism to safely repair a stripe of data
} when I know a particular disk has an unrecoverable read error at a
} certain physical block (for 2.6 kernels).
}
} My original plan was to figure out the range of blocks in the md device
} that uses the known bad block, force a raw read on the physical device
} covering the entire chunk, and let the md driver do all of the work.
}
} Well, this didn't pan out. Problems include: if the bad block maps to
} the parity block in a stripe, md won't necessarily read/verify parity;
} and if you are running RAID1, load balancing might result in the kernel
} reading the block from the good disk instead.
}
} So the degree of difficulty is much higher than I expected. I prefer not
} to patch kernels, both for maintenance reasons and because I want the
} technique to work across numerous kernels and patch revisions; and
} frankly, the odds are I would screw it up. An application-level program
} that can be invoked as necessary would be ideal.
}
} As such, is anybody up to the challenge of writing the code? I want it
} enough to PayPal somebody $500 who can write it, and will gladly open
} source the solution.
}
} (And to clarify why: I know physical block x on disk y is bad before the
} O/S reads the block, and just want to rebuild the stripe, not the entire
} md device, when this happens. I must not compromise any file system
} data, cached or non-cached, that is built on the md device. I have a
} system with >100TB, and if I did a rebuild every time I discovered a bad
} block somewhere, a full parity repair would never complete before
} another physical bad block was discovered.)
}
} Contact me offline for the financial details, but I would certainly
} appreciate some thread discussion on an appropriate architecture. At
} least in my opinion, such a capability should eventually be native to
} Linux, but as long as there is a program that can be run on demand and
} doesn't require rebuilding or patching kernels, that is all I need.
}
} David @ santools.com

I thought this would cause md to read all blocks in an array:
echo repair > /sys/block/md0/md/sync_action

And rewrite any blocks that can't be read.

In the old days, md would kick out a disk on a read error. When you added
it back, md would rewrite everything on that disk, which corrected read
errors.

Guy

^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Mechanism to safely force repair of single md stripe w/o hurting data integrity of file system
2008-05-17 21:30 Mechanism to safely force repair of single md stripe w/o hurting data integrity of file system David Lethe
@ 2008-05-17 23:16 ` Roger Heflin
0 siblings, 0 replies; 7+ messages in thread
From: Roger Heflin @ 2008-05-17 23:16 UTC (permalink / raw)
To: David; +Cc: Guy Watkins, 'LinuxRaid', linux-kernel

David Lethe wrote:
> It will. But that defeats the purpose. I want to limit the repair to only
> the RAID stripe that uses the specific disk with the block that I know has
> an unrecoverable read error.
>
> -----Original Message-----
>
> From: "Guy Watkins" <linux-raid@watkins-home.com>
> Subj: RE: Mechanism to safely force repair of single md stripe w/o hurting data integrity of file system
> Date: Sat May 17, 2008 3:28 pm
> Size: 2K
> To: "'David Lethe'" <david@santools.com>; "'LinuxRaid'" <linux-raid@vger.kernel.org>; "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
>
> } -----Original Message-----
> } From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> } owner@vger.kernel.org] On Behalf Of David Lethe
> } Sent: Saturday, May 17, 2008 3:10 PM
> } To: LinuxRaid; linux-kernel@vger.kernel.org
> } Subject: Mechanism to safely force repair of single md stripe w/o hurting
> } data integrity of file system
> }
> } I'm trying to figure out a mechanism to safely repair a stripe of data
> } when I know a particular disk has an unrecoverable read error at a
> } certain physical block (for 2.6 kernels).
> }
> } My original plan was to figure out the range of blocks in the md device
> } that uses the known bad block, force a raw read on the physical device
> } covering the entire chunk, and let the md driver do all of the work.
> }
> } Well, this didn't pan out. Problems include: if the bad block maps to
> } the parity block in a stripe, md won't necessarily read/verify parity;
> } and if you are running RAID1, load balancing might result in the kernel
> } reading the block from the good disk instead.
> }
> } So the degree of difficulty is much higher than I expected. I prefer not
> } to patch kernels, both for maintenance reasons and because I want the
> } technique to work across numerous kernels and patch revisions; and
> } frankly, the odds are I would screw it up. An application-level program
> } that can be invoked as necessary would be ideal.
> }
> } As such, is anybody up to the challenge of writing the code? I want it
> } enough to PayPal somebody $500 who can write it, and will gladly open
> } source the solution.
> }
> } (And to clarify why: I know physical block x on disk y is bad before the
> } O/S reads the block, and just want to rebuild the stripe, not the entire
> } md device, when this happens. I must not compromise any file system
> } data, cached or non-cached, that is built on the md device. I have a
> } system with >100TB, and if I did a rebuild every time I discovered a bad
> } block somewhere, a full parity repair would never complete before
> } another physical bad block was discovered.)
> }
> } Contact me offline for the financial details, but I would certainly
> } appreciate some thread discussion on an appropriate architecture. At
> } least in my opinion, such a capability should eventually be native to
> } Linux, but as long as there is a program that can be run on demand and
> } doesn't require rebuilding or patching kernels, that is all I need.
> }
> } David @ santools.com
>
> I thought this would cause md to read all blocks in an array:
> echo repair > /sys/block/md0/md/sync_action
>
> And rewrite any blocks that can't be read.
>
> In the old days, md would kick out a disk on a read error. When you added
> it back, md would rewrite everything on that disk, which corrected read
> errors.
>
> Guy
>

I bet $500 is well below minimum wage in the US for the number of hours it
would take someone to do this.

And if you have >100TB in a single raid5/6, you must have at least 100
disks in that array. Most people get nervous at more than 8-16 disks in a
raid5 or raid6 array: given the statistics of disks going bad, the chance
of a rebuild succeeding before another disk/block goes bad gets smaller
and smaller as the number of disks increases. As you have noted, you are
at the point where it becomes unlikely that a rebuild will ever complete
even with good disks in the array. Most people build a number of smaller
raid5/raid6 arrays and then LVM them together to get around this issue.
On top of that, the more disks there are, the more IO a rebuild requires,
so the slower the rebuild potentially is. And that assumes you don't have
a bad batch of disks with an abnormally high failure rate.

I know of hardware disk arrays that handle the bad-block issue by
allocating (at initial array construction) a set of spare blocks on each
disk. On finding a bad block, they relocate and rebuild just that block
from the rest of the stripe/parity, and note that the block on the bad
disk has been relocated. After some number of bad blocks on a given disk,
they flag that disk as having too many bad blocks, "clone" it, and fail
the original disk over to the cloned disk once the clone is finished.
This sort of thing would seem to be rather non-trivial, but if someone
set up a clone of the bad disk and rebuilt only the bad sector, it would
probably cut down the time/IO required to complete a rebuild. It would
still take several hours, though, and things would get more complicated
if you had another failure during that process.

Roger

^ permalink raw reply [flat|nested] 7+ messages in thread
* Regression- XFS won't mount on partitioned md array
@ 2008-05-16 17:11 David Greaves
2008-05-16 18:59 ` Eric Sandeen
0 siblings, 1 reply; 7+ messages in thread
From: David Greaves @ 2008-05-16 17:11 UTC (permalink / raw)
To: David Chinner; +Cc: LinuxRaid, xfs, 'linux-kernel@vger.kernel.org'
I just attempted a kernel upgrade from 2.6.20.7 to 2.6.25.3 and it no longer
mounts my xfs filesystem.
I bisected it to around
a67d7c5f5d25d0b13a4dfb182697135b014fa478
[XFS] Move platform specific mount option parse out of core XFS code
I have a RAID5 array with partitions:
Partition Table for /dev/md_d0
                 First        Last
 # Type         Sector      Sector Offset      Length Filesystem Type (ID) Flag
-- ------- ----------- ----------- ------ ----------- -------------------- ----
 1 Primary           0  2500288279      4  2500288280 Linux (83)           None
 2 Primary  2500288280  2500483583      0      195304 Non-FS data (DA)     None
when I attempt to mount /media:
/dev/md_d0p1 /media xfs rw,nobarrier,noatime,logdev=/dev/md_d0p2,allocsize=512m 0 0
I get:
md_d0: p1 p2
XFS mounting filesystem md_d0p1
attempt to access beyond end of device
md_d0p2: rw=0, want=195311, limit=195304
I/O error in filesystem ("md_d0p1") meta-data dev md_d0p2 block 0x2fae7
("xlog_bread") error 5 buf count 512
XFS: empty log check failed
XFS: log mount/recovery failed: error 5
XFS: log mount failed
A repair:
xfs_repair /dev/md_d0p1 -l /dev/md_d0p2
gives no errors.
Phase 1 - find and verify superblock...
Phase 2 - using external log on /dev/md_d0p2
- zero log...
- scan filesystem freespace and inode maps...
- found root inode chunk
...
David
^ permalink raw reply [flat|nested] 7+ messages in thread

* Re: Regression- XFS won't mount on partitioned md array
2008-05-16 17:11 Regression- XFS won't mount on partitioned md array David Greaves
@ 2008-05-16 18:59 ` Eric Sandeen
2008-05-17 14:46 ` David Greaves
0 siblings, 1 reply; 7+ messages in thread
From: Eric Sandeen @ 2008-05-16 18:59 UTC (permalink / raw)
To: David Greaves
Cc: David Chinner, LinuxRaid, xfs, 'linux-kernel@vger.kernel.org'

David Greaves wrote:
> I just attempted a kernel upgrade from 2.6.20.7 to 2.6.25.3 and it no longer
> mounts my xfs filesystem.
>
> I bisected it to around
> a67d7c5f5d25d0b13a4dfb182697135b014fa478
> [XFS] Move platform specific mount option parse out of core XFS code

around that... not exactly? That commit should have been largely a code
move, which is not to say that it can't contain a bug... :)

> I have a RAID5 array with partitions:
>
> Partition Table for /dev/md_d0
>
>                  First        Last
>  # Type         Sector      Sector Offset      Length Filesystem Type (ID) Flag
> -- ------- ----------- ----------- ------ ----------- -------------------- ----
>  1 Primary           0  2500288279      4  2500288280 Linux (83)           None
>  2 Primary  2500288280  2500483583      0      195304 Non-FS data (DA)     None
>
> when I attempt to mount /media:
> /dev/md_d0p1 /media xfs rw,nobarrier,noatime,logdev=/dev/md_d0p2,allocsize=512m 0 0

mythbox? :)

Hm, so it's the external log size that it doesn't much like...

> I get:
> md_d0: p1 p2
> XFS mounting filesystem md_d0p1
> attempt to access beyond end of device
> md_d0p2: rw=0, want=195311, limit=195304

what does /proc/partitions say about md_d0p1 and p2? Is it different
between the older & newer kernel?

What does xfs_info /mount/point say about the filesystem when you mount
it under the older kernel? Or... if you can't mount it,

xfs_db -r -c "sb 0" -c p /dev/md_d0p1

-Eric

^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Regression- XFS won't mount on partitioned md array
2008-05-16 18:59 ` Eric Sandeen
@ 2008-05-17 14:46 ` David Greaves
2008-05-17 15:15 ` Eric Sandeen
0 siblings, 1 reply; 7+ messages in thread
From: David Greaves @ 2008-05-17 14:46 UTC (permalink / raw)
To: Eric Sandeen
Cc: David Chinner, LinuxRaid, xfs, 'linux-kernel@vger.kernel.org'

Eric Sandeen wrote:
> David Greaves wrote:
>> I just attempted a kernel upgrade from 2.6.20.7 to 2.6.25.3 and it no longer
>> mounts my xfs filesystem.
>>
>> I bisected it to around
>> a67d7c5f5d25d0b13a4dfb182697135b014fa478
>> [XFS] Move platform specific mount option parse out of core XFS code
>
> around that... not exactly? That commit should have been largely a code
> move, which is not to say that it can't contain a bug... :)

I got to within 4 commits on the bisect, and then the xfs partition holding
the kernel source and the bisect history blew up: it told me that files
were directories and then exploded into a heap of lost+found/ fragments.
Quite, erm, "interesting" really.

At that point I decided I was close enough to ask for advice, looked at the
commits, and took this one as the most likely to cause the bug :)

But, thinking about it, I can decode the kernel extraversion tags in /boot.
From those I think my bounds were:
  40ebd81d1a7635cf92a59c387a599fce4863206b
  [XFS] Use kernel-supplied "roundup_pow_of_two" for simplicity
and:
  3ed6526441053d79b85d206b14d75125e6f51cc2
  [XFS] Implement fallocate.

so those bound:
  [XFS] Remove the BPCSHIFT and NB* based macros from XFS.
  [XFS] Remove bogus assert
  [XFS] optimize XFS_IS_REALTIME_INODE w/o realtime config
  [XFS] Move platform specific mount option parse out of core XFS code

and, just glancing through the patches, I didn't see any changes that
looked likely in the others...

>> I have a RAID5 array with partitions:
>>
>> Partition Table for /dev/md_d0
>>
>>                  First        Last
>>  # Type         Sector      Sector Offset      Length Filesystem Type (ID) Flag
>> -- ------- ----------- ----------- ------ ----------- -------------------- ----
>>  1 Primary           0  2500288279      4  2500288280 Linux (83)           None
>>  2 Primary  2500288280  2500483583      0      195304 Non-FS data (DA)     None
>>
>> when I attempt to mount /media:
>> /dev/md_d0p1 /media xfs rw,nobarrier,noatime,logdev=/dev/md_d0p2,allocsize=512m 0 0
>
> mythbox? :)

Hey - we test some interesting corner cases... :)
My *wife* just told *me* to buy, and I quote, "no more than 10" 1TB Samsung
drives... I decided 5 would be plenty.

> Hm, so it's the external log size that it doesn't much like...

Yep - I noticed that - and ISTR that Neil has been fiddling with the md
partitioning code over the last 6 months or so. I wondered where it got
the larger figure from, and whether md was somehow changing the partition
size...

>> I get:
>> md_d0: p1 p2
>> XFS mounting filesystem md_d0p1
>> attempt to access beyond end of device
>> md_d0p2: rw=0, want=195311, limit=195304
>
> what does /proc/partitions say about md_d0p1 and p2? Is it different
> between the older & newer kernel?

2.6.20.7 (good)
254     0 1250241792 md_d0
254     1 1250144138 md_d0p1
254     2      97652 md_d0p2

2.6.25.3 (bad)
254     0 1250241792 md_d0
254     1 1250144138 md_d0p1
254     2      97652 md_d0p2

2.6.25.4 (bad)
254     0 1250241792 md_d0
254     1 1250144138 md_d0p1
254     2      97652 md_d0p2

So nothing obvious there then...

> What does xfs_info /mount/point say about the filesystem when you mount
> it under the older kernel? Or... if you can't mount it,

teak:~# xfs_info /media/
meta-data=/dev/md_d0p1           isize=256    agcount=32, agsize=9766751 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=312536032, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096
log      =external               bsize=4096   blocks=24413, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=65536  blocks=0, rtextents=0

^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Regression- XFS won't mount on partitioned md array
2008-05-17 14:46 ` David Greaves
@ 2008-05-17 15:15 ` Eric Sandeen
2008-05-17 19:10 ` Mechanism to safely force repair of single md stripe w/o hurting data integrity of file system David Lethe
0 siblings, 1 reply; 7+ messages in thread
From: Eric Sandeen @ 2008-05-17 15:15 UTC (permalink / raw)
To: David Greaves
Cc: David Chinner, LinuxRaid, xfs, 'linux-kernel@vger.kernel.org'

David Greaves wrote:
> Eric Sandeen wrote:
>>> I get:
>>> md_d0: p1 p2
>>> XFS mounting filesystem md_d0p1
>>> attempt to access beyond end of device
>>> md_d0p2: rw=0, want=195311, limit=195304
>> what does /proc/partitions say about md_d0p1 and p2? Is it different
>> between the older & newer kernel?
...
> 2.6.25.4 (bad)
> 254     0 1250241792 md_d0
> 254     1 1250144138 md_d0p1
> 254     2      97652 md_d0p2
>
> So nothing obvious there then...
>
>> What does xfs_info /mount/point say about the filesystem when you mount
>> it under the older kernel? Or... if you can't mount it,
> teak:~# xfs_info /media/
> meta-data=/dev/md_d0p1           isize=256    agcount=32, agsize=9766751 blks
>          =                       sectsz=512   attr=0
> data     =                       bsize=4096   blocks=312536032, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096
> log      =external               bsize=4096   blocks=24413, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=0
> realtime =none                   extsz=65536  blocks=0, rtextents=0

ok, and with:

> Partition Table for /dev/md_d0
>
>                  First        Last
>  # Type         Sector      Sector Offset      Length Filesystem Type (ID) Flag
> -- ------- ----------- ----------- ------ ----------- -------------------- ----
>  1 Primary           0  2500288279      4  2500288280 Linux (83)           None
>  2 Primary  2500288280  2500483583      0      195304 Non-FS data (DA)     None

So, xfs thinks the external log is 24413 4k blocks (from the sb geometry
printed by xfs_info). This is 97652 1k units (matching your
/proc/partitions output) and 195304 512-byte sectors (matching the
partition table output). So that all looks consistent.

So if xfs is doing:

>>> md_d0p2: rw=0, want=195311, limit=195304
>>> XFS: empty log check failed

it surely does seem to be trying to read past the end of what even it
thinks is the end of its log.

And, with your geometry I can reproduce this w/o md, partitioned or not.
So looks like xfs itself is busted:

loop5: rw=0, want=195311, limit=195304

I'll see if I have a little time today to track down the problem.

Thanks,
-Eric

^ permalink raw reply [flat|nested] 7+ messages in thread
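The unit conversion Eric walks through above can be checked from the shell;
every figure comes from the xfs_info, /proc/partitions, and partition-table
output quoted in this thread, and only the arithmetic is new:

$ echo $((24413 * 4096))     # external log size in bytes, per xfs_info (4k blocks)
99995648
$ echo $((97652 * 1024))     # the same size from the 1k units in /proc/partitions
99995648
$ echo $((195304 * 512))     # the same size from the 512-byte sectors in the partition table
99995648
$ echo $((195311 - 195304))  # the failing read wants 7 sectors past that limit
7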
* Mechanism to safely force repair of single md stripe w/o hurting data integrity of file system
2008-05-17 15:15 ` Eric Sandeen
@ 2008-05-17 19:10 ` David Lethe
2008-05-17 19:29 ` Peter Rabbitson
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: David Lethe @ 2008-05-17 19:10 UTC (permalink / raw)
To: LinuxRaid, linux-kernel

I'm trying to figure out a mechanism to safely repair a stripe of data
when I know a particular disk has an unrecoverable read error at a
certain physical block (for 2.6 kernels).

My original plan was to figure out the range of blocks in the md device
that uses the known bad block, force a raw read on the physical device
covering the entire chunk, and let the md driver do all of the work.

Well, this didn't pan out. Problems include: if the bad block maps to
the parity block in a stripe, md won't necessarily read/verify parity;
and if you are running RAID1, load balancing might result in the kernel
reading the block from the good disk instead.

So the degree of difficulty is much higher than I expected. I prefer not
to patch kernels, both for maintenance reasons and because I want the
technique to work across numerous kernels and patch revisions; and
frankly, the odds are I would screw it up. An application-level program
that can be invoked as necessary would be ideal.

As such, is anybody up to the challenge of writing the code? I want it
enough to PayPal somebody $500 who can write it, and will gladly open
source the solution.

(And to clarify why: I know physical block x on disk y is bad before the
O/S reads the block, and just want to rebuild the stripe, not the entire
md device, when this happens. I must not compromise any file system
data, cached or non-cached, that is built on the md device. I have a
system with >100TB, and if I did a rebuild every time I discovered a bad
block somewhere, a full parity repair would never complete before
another physical bad block was discovered.)

Contact me offline for the financial details, but I would certainly
appreciate some thread discussion on an appropriate architecture. At
least in my opinion, such a capability should eventually be native to
Linux, but as long as there is a program that can be run on demand and
doesn't require rebuilding or patching kernels, that is all I need.

David @ santools.com

^ permalink raw reply [flat|nested] 7+ messages in thread
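For illustration of the raw-read plan described above, a rough sketch
follows; the device name, chunk size, and chunk index are hypothetical,
the mapping from a component disk's bad sector to an offset in the md
device depends on the RAID level, layout, and data offset, and the
parity-block and RAID1 load-balancing problems just described are exactly
what this approach cannot handle:

# Hypothetical example only: read the one chunk of /dev/md0 believed to
# cover the bad sector, in the hope that md hits the bad block and
# rewrites it from the remaining disks.
CHUNK_KB=64          # chunk size, e.g. as reported by: mdadm --detail /dev/md0
CHUNK_INDEX=123456   # hypothetical chunk number derived from the bad sector
dd if=/dev/md0 of=/dev/null bs=${CHUNK_KB}k skip=${CHUNK_INDEX} count=1 iflag=direct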
* Re: Mechanism to safely force repair of single md stripe w/o hurting data integrity of file system
2008-05-17 19:10 ` Mechanism to safely force repair of single md stripe w/o hurting data integrity of file system David Lethe
@ 2008-05-17 19:29 ` Peter Rabbitson
2008-05-17 20:26 ` Guy Watkins
2008-05-19 2:54 ` Neil Brown
0 siblings, 0 replies; 7+ messages in thread
From: Peter Rabbitson @ 2008-05-17 19:29 UTC (permalink / raw)
To: David Lethe; +Cc: LinuxRaid, linux-kernel

David Lethe wrote:
> I'm trying to figure out a mechanism to safely repair a stripe of data
> when I know a particular disk has an unrecoverable read error at a
> certain physical block (for 2.6 kernels).
>
> <snip>
>
> As such, is anybody up to the challenge of writing the code? I want it
> enough to PayPal somebody $500 who can write it, and will gladly open
> source the solution.
>

Damn, here goes $500 :) Unfortunately the only thing I can bring to the
table is a thread[1] about a mechanism that would fit your request nicely.
Hopefully someone will pick this stuff up and make it a reality.

Peter

[1] http://marc.info/?l=linux-raid&m=120605458309825

^ permalink raw reply [flat|nested] 7+ messages in thread
* RE: Mechanism to safely force repair of single md stripe w/o hurting data integrity of file system
2008-05-17 19:10 ` Mechanism to safely force repair of single md stripe w/o hurting data integrity of file system David Lethe
2008-05-17 19:29 ` Peter Rabbitson
@ 2008-05-17 20:26 ` Guy Watkins
2008-05-26 11:17 ` Jan Engelhardt
2008-05-19 2:54 ` Neil Brown
2 siblings, 1 reply; 7+ messages in thread
From: Guy Watkins @ 2008-05-17 20:26 UTC (permalink / raw)
To: 'David Lethe', 'LinuxRaid', linux-kernel

} -----Original Message-----
} From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
} owner@vger.kernel.org] On Behalf Of David Lethe
} Sent: Saturday, May 17, 2008 3:10 PM
} To: LinuxRaid; linux-kernel@vger.kernel.org
} Subject: Mechanism to safely force repair of single md stripe w/o hurting
} data integrity of file system
}
} I'm trying to figure out a mechanism to safely repair a stripe of data
} when I know a particular disk has an unrecoverable read error at a
} certain physical block (for 2.6 kernels).
}
} My original plan was to figure out the range of blocks in the md device
} that uses the known bad block, force a raw read on the physical device
} covering the entire chunk, and let the md driver do all of the work.
}
} Well, this didn't pan out. Problems include: if the bad block maps to
} the parity block in a stripe, md won't necessarily read/verify parity;
} and if you are running RAID1, load balancing might result in the kernel
} reading the block from the good disk instead.
}
} So the degree of difficulty is much higher than I expected. I prefer not
} to patch kernels, both for maintenance reasons and because I want the
} technique to work across numerous kernels and patch revisions; and
} frankly, the odds are I would screw it up. An application-level program
} that can be invoked as necessary would be ideal.
}
} As such, is anybody up to the challenge of writing the code? I want it
} enough to PayPal somebody $500 who can write it, and will gladly open
} source the solution.
}
} (And to clarify why: I know physical block x on disk y is bad before the
} O/S reads the block, and just want to rebuild the stripe, not the entire
} md device, when this happens. I must not compromise any file system
} data, cached or non-cached, that is built on the md device. I have a
} system with >100TB, and if I did a rebuild every time I discovered a bad
} block somewhere, a full parity repair would never complete before
} another physical bad block was discovered.)
}
} Contact me offline for the financial details, but I would certainly
} appreciate some thread discussion on an appropriate architecture. At
} least in my opinion, such a capability should eventually be native to
} Linux, but as long as there is a program that can be run on demand and
} doesn't require rebuilding or patching kernels, that is all I need.
}
} David @ santools.com

I thought this would cause md to read all blocks in an array:
echo repair > /sys/block/md0/md/sync_action

And rewrite any blocks that can't be read.

In the old days, md would kick out a disk on a read error. When you added
it back, md would rewrite everything on that disk, which corrected read
errors.

Guy

^ permalink raw reply [flat|nested] 7+ messages in thread
* RE: Mechanism to safely force repair of single md stripe w/o hurting data integrity of file system
2008-05-17 20:26 ` Guy Watkins
@ 2008-05-26 11:17 ` Jan Engelhardt
0 siblings, 0 replies; 7+ messages in thread
From: Jan Engelhardt @ 2008-05-26 11:17 UTC (permalink / raw)
To: Guy Watkins; +Cc: 'David Lethe', 'LinuxRaid', linux-kernel

On Saturday 2008-05-17 22:26, Guy Watkins wrote:
>
>I thought this would cause md to read all blocks in an array:
>echo repair > /sys/block/md0/md/sync_action
>
>And rewrite any blocks that can't be read.
>
>In the old days, md would kick out a disk on a read error. When you added
>it back, md would rewrite everything on that disk, which corrected read
>errors.

With a write-intent bitmap (`mdadm -G /dev/mdX -b internal`, or set up at
create time with -C), it should resync less after an unwarranted kick.

^ permalink raw reply [flat|nested] 7+ messages in thread
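A minimal sketch of that in practice; /dev/md0 and /dev/sdb1 are placeholder
names for an existing array and a previously kicked member:

mdadm --grow /dev/md0 --bitmap=internal   # add a write-intent bitmap to a running array
mdadm /dev/md0 --re-add /dev/sdb1         # re-add the kicked member
cat /proc/mdstat                          # a "bitmap:" line confirms it is active; only
                                          # regions dirtied since the kick are resynced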
* Re: Mechanism to safely force repair of single md stripe w/o hurting data integrity of file system
2008-05-17 19:10 ` Mechanism to safely force repair of single md stripe w/o hurting data integrity of file system David Lethe
2008-05-17 19:29 ` Peter Rabbitson
2008-05-17 20:26 ` Guy Watkins
@ 2008-05-19 2:54 ` Neil Brown
2 siblings, 0 replies; 7+ messages in thread
From: Neil Brown @ 2008-05-19 2:54 UTC (permalink / raw)
To: David Lethe; +Cc: LinuxRaid, linux-kernel

On Saturday May 17, david@santools.com wrote:
> I'm trying to figure out a mechanism to safely repair a stripe of data
> when I know a particular disk has an unrecoverable read error at a
> certain physical block (for 2.6 kernels).
>
> My original plan was to figure out the range of blocks in the md device
> that uses the known bad block, force a raw read on the physical device
> covering the entire chunk, and let the md driver do all of the work.
>
> Well, this didn't pan out. Problems include: if the bad block maps to
> the parity block in a stripe, md won't necessarily read/verify parity;
> and if you are running RAID1, load balancing might result in the kernel
> reading the block from the good disk instead.
>
> So the degree of difficulty is much higher than I expected. I prefer not
> to patch kernels, both for maintenance reasons and because I want the
> technique to work across numerous kernels and patch revisions; and
> frankly, the odds are I would screw it up. An application-level program
> that can be invoked as necessary would be ideal.

This shouldn't be a problem. You write a patch, submit it for review,
it gets reviewed and eventually submitted to mainline. Then it will
work on all new kernels, and any screw-ups that you make will be caught
by someone else (me possibly).

> As such, is anybody up to the challenge of writing the code? I want it
> enough to PayPal somebody $500 who can write it, and will gladly open
> source the solution.

It is largely done.

If you write a number to /sys/block/mdXX/md/sync_max, then recovery will
stop when it gets there.
If you write 'check' to /sys/block/mdXX/md/sync_action, then it will read
all blocks and auto-correct any unrecoverable read errors.

You just need some way to set the start point of the resync. Probably
just create a sync_min attribute - see the lightly tested patch below.

If this fits your needs, I'm sure www.compassion.com would be happy with
your $500.

To use this:
 1/ Write the end address (in sectors) to sync_max
 2/ Write the start address (in sectors) to sync_min
 3/ Write 'check' to sync_action
 4/ Monitor sync_completed until it reaches sync_max
 5/ Write 'idle' to sync_action

NeilBrown

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c           |   46 +++++++++++++++++++++++++++++++++++++++++---
 ./include/linux/raid/md_k.h |    2 +
 2 files changed, 45 insertions(+), 3 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2008-05-19 11:04:11.000000000 +1000
+++ ./drivers/md/md.c	2008-05-19 12:43:29.000000000 +1000
@@ -277,6 +277,7 @@ static mddev_t * mddev_find(dev_t unit)
 	spin_lock_init(&new->write_lock);
 	init_waitqueue_head(&new->sb_wait);
 	new->reshape_position = MaxSector;
+	new->resync_min = 0;
 	new->resync_max = MaxSector;
 	new->level = LEVEL_NONE;
 
@@ -3074,6 +3075,37 @@ sync_completed_show(mddev_t *mddev, char
 static struct md_sysfs_entry md_sync_completed = __ATTR_RO(sync_completed);
 
 static ssize_t
+min_sync_show(mddev_t *mddev, char *page)
+{
+	return sprintf(page, "%llu\n",
+		       (unsigned long long)mddev->resync_min);
+}
+static ssize_t
+min_sync_store(mddev_t *mddev, const char *buf, size_t len)
+{
+	char *ep;
+	unsigned long long min = simple_strtoull(buf, &ep, 10);
+	if (ep == buf || (*ep != 0 && *ep != '\n'))
+		return -EINVAL;
+	if (min > mddev->resync_max)
+		return -EINVAL;
+	if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
+		return -EBUSY;
+
+	/* Must be a multiple of chunk_size */
+	if (mddev->chunk_size) {
+		if (min & (sector_t)((mddev->chunk_size>>9)-1))
+			return -EINVAL;
+	}
+	mddev->resync_min = min;
+
+	return len;
+}
+
+static struct md_sysfs_entry md_min_sync =
+__ATTR(sync_min, S_IRUGO|S_IWUSR, min_sync_show, min_sync_store);
+
+static ssize_t
 max_sync_show(mddev_t *mddev, char *page)
 {
 	if (mddev->resync_max == MaxSector)
@@ -3092,6 +3124,9 @@ max_sync_store(mddev_t *mddev, const cha
 	unsigned long long max = simple_strtoull(buf, &ep, 10);
 	if (ep == buf || (*ep != 0 && *ep != '\n'))
 		return -EINVAL;
+	if (max < mddev->resync_min)
+		return -EINVAL;
+
 	if (max < mddev->resync_max &&
 	    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
 		return -EBUSY;
@@ -3103,7 +3138,8 @@ max_sync_store(mddev_t *mddev, const cha
 		}
 		mddev->resync_max = max;
 	}
-	wake_up(&mddev->recovery_wait);
+	if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
+		wake_up(&mddev->recovery_wait);
 	return len;
 }
 
@@ -3221,6 +3257,7 @@ static struct attribute *md_redundancy_a
 	&md_sync_speed.attr,
 	&md_sync_force_parallel.attr,
 	&md_sync_completed.attr,
+	&md_min_sync.attr,
 	&md_max_sync.attr,
 	&md_suspend_lo.attr,
 	&md_suspend_hi.attr,
@@ -3776,6 +3813,7 @@ static int do_md_stop(mddev_t * mddev, i
 		mddev->size = 0;
 		mddev->raid_disks = 0;
 		mddev->recovery_cp = 0;
+		mddev->resync_min = 0;
 		mddev->resync_max = MaxSector;
 		mddev->reshape_position = MaxSector;
 		mddev->external = 0;
@@ -5622,9 +5660,11 @@ void md_do_sync(mddev_t *mddev)
 		max_sectors = mddev->resync_max_sectors;
 		mddev->resync_mismatches = 0;
 		/* we don't use the checkpoint if there's a bitmap */
-		if (!mddev->bitmap &&
-		    !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
+		if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
+			j = mddev->resync_min;
+		else if (!mddev->bitmap)
 			j = mddev->recovery_cp;
+
 	} else if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
 		max_sectors = mddev->size << 1;
 	else {

diff .prev/include/linux/raid/md_k.h ./include/linux/raid/md_k.h
--- .prev/include/linux/raid/md_k.h	2008-05-19 11:04:11.000000000 +1000
+++ ./include/linux/raid/md_k.h	2008-05-19 12:35:52.000000000 +1000
@@ -227,6 +227,8 @@ struct mddev_s
 	atomic_t			recovery_active; /* blocks scheduled, but not written */
 	wait_queue_head_t		recovery_wait;
 	sector_t			recovery_cp;
+	sector_t			resync_min;	/* user request sync starts
+							 * here */
 	sector_t			resync_max;	/* resync should pause
 							 * when it gets here */

^ permalink raw reply [flat|nested] 7+ messages in thread
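A minimal shell sketch of the five steps above, assuming the patch is
applied; md0 is a placeholder, and START/END are hypothetical sector
addresses bracketing the stripe that holds the known bad block (START
must be a multiple of the chunk size, per the patch):

MD=/sys/block/md0/md
START=2621440                        # hypothetical start sector of the range to check
END=2622464                          # hypothetical end sector (must be >= START)

echo "$END"   > "$MD/sync_max"       # 1/ end address
echo "$START" > "$MD/sync_min"       # 2/ start address
echo check    > "$MD/sync_action"    # 3/ read, and auto-correct, only that range

# 4/ assumes sync_completed reads "<done> / <total>" in sectors; wait until done reaches END
while read -r cur _ < "$MD/sync_completed" && [ "$cur" -lt "$END" ]; do
        sleep 1
done

echo idle > "$MD/sync_action"        # 5/ stop the check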
end of thread, other threads:[~2008-05-26 11:17 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-05-17 21:30 Mechanism to safely force repair of single md stripe w/o hurting data integrity of file system David Lethe
2008-05-17 23:16 ` Roger Heflin
  -- strict thread matches above, loose matches on Subject: below --
2008-05-16 17:11 Regression- XFS won't mount on partitioned md array David Greaves
2008-05-16 18:59 ` Eric Sandeen
2008-05-17 14:46   ` David Greaves
2008-05-17 15:15     ` Eric Sandeen
2008-05-17 19:10       ` Mechanism to safely force repair of single md stripe w/o hurting data integrity of file system David Lethe
2008-05-17 19:29         ` Peter Rabbitson
2008-05-17 20:26         ` Guy Watkins
2008-05-26 11:17           ` Jan Engelhardt
2008-05-19  2:54         ` Neil Brown