* RE: Mechanism to safely force repair of single md stripe w/o hurting data integrity of file system
@ 2008-05-17 21:30 David Lethe
2008-05-17 23:16 ` Roger Heflin
0 siblings, 1 reply; 7+ messages in thread
From: David Lethe @ 2008-05-17 21:30 UTC (permalink / raw)
To: Guy Watkins, 'LinuxRaid', linux-kernel

It will. But that defeats the purpose. I want to limit the repair to only
the RAID stripe that uses the specific disk with the block that I know has
an unrecoverable read error.

-----Original Message-----
From: "Guy Watkins" <linux-raid@watkins-home.com>
Subj: RE: Mechanism to safely force repair of single md stripe w/o hurting data integrity of file system
Date: Sat May 17, 2008 3:28 pm
Size: 2K
To: "'David Lethe'" <david@santools.com>; "'LinuxRaid'" <linux-raid@vger.kernel.org>; "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>

} -----Original Message-----
} From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
} owner@vger.kernel.org] On Behalf Of David Lethe
} Sent: Saturday, May 17, 2008 3:10 PM
} To: LinuxRaid; linux-kernel@vger.kernel.org
} Subject: Mechanism to safely force repair of single md stripe w/o hurting
} data integrity of file system
}
} I'm trying to figure out a mechanism to safely repair a stripe of data
} when I know a particular disk has an unrecoverable read error at a
} certain physical block (for 2.6 kernels).
}
} My original plan was to figure out the range of blocks in the md device
} that uses the known bad block, force a raw read on the physical device
} covering the entire chunk, and let the md driver do all of the work.
}
} Well, this didn't pan out. Problems include: if the bad block maps to
} the parity block in a stripe, md won't necessarily read/verify parity;
} and if you are running RAID1, load balancing might result in the kernel
} reading the block from the good disk instead.
}
} So the degree of difficulty is much higher than I expected. I prefer not
} to patch kernels, both for maintenance reasons and because I want the
} technique to work across numerous kernels and patch revisions; and
} frankly, the odds are I would screw it up. An application-level program
} that can be invoked as necessary would be ideal.
}
} As such, is anybody up to the challenge of writing the code? I want it
} enough to PayPal somebody $500 who can write it, and will gladly open
} source the solution.
}
} (And to clarify why: I know physical block x on disk y is bad before the
} O/S reads the block, and just want to rebuild the stripe, not the entire
} md device, when this happens. I must not compromise any file system
} data, cached or non-cached, that is built on the md device. I have a
} system with >100TB, and if I did a rebuild every time I discovered a bad
} block somewhere, a full parity repair would never complete before
} another physical bad block was discovered.)
}
} Contact me offline for the financial details, but I would certainly
} appreciate some thread discussion on an appropriate architecture. At
} least in my opinion, such a capability should eventually be native to
} Linux, but as long as there is a program that can be run on demand and
} doesn't require rebuilding or patching kernels, that is all I need.
}
} David @ santools.com

I thought this would cause md to read all blocks in an array:
echo repair > /sys/block/md0/md/sync_action

And rewrite any blocks that can't be read.

In the old days, md would kick out a disk on a read error. When you added
it back, md would rewrite everything on that disk, which corrected read
errors.

Guy

^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Mechanism to safely force repair of single md stripe w/o hurting data integrity of file system
2008-05-17 21:30 Mechanism to safely force repair of single md stripe w/o hurting data integrity of file system David Lethe
@ 2008-05-17 23:16 ` Roger Heflin
0 siblings, 0 replies; 7+ messages in thread
From: Roger Heflin @ 2008-05-17 23:16 UTC (permalink / raw)
To: David; +Cc: Guy Watkins, 'LinuxRaid', linux-kernel

David Lethe wrote:
> It will. But that defeats the purpose. I want to limit the repair to only
> the RAID stripe that uses the specific disk with the block that I know has
> an unrecoverable read error.
>
> -----Original Message-----
>
> From: "Guy Watkins" <linux-raid@watkins-home.com>
> Subj: RE: Mechanism to safely force repair of single md stripe w/o hurting data integrity of file system
> Date: Sat May 17, 2008 3:28 pm
> Size: 2K
> To: "'David Lethe'" <david@santools.com>; "'LinuxRaid'" <linux-raid@vger.kernel.org>; "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
>
> } -----Original Message-----
> } From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
> } owner@vger.kernel.org] On Behalf Of David Lethe
> } Sent: Saturday, May 17, 2008 3:10 PM
> } To: LinuxRaid; linux-kernel@vger.kernel.org
> } Subject: Mechanism to safely force repair of single md stripe w/o hurting
> } data integrity of file system
> }
> } I'm trying to figure out a mechanism to safely repair a stripe of data
> } when I know a particular disk has an unrecoverable read error at a
> } certain physical block (for 2.6 kernels).
> }
> } My original plan was to figure out the range of blocks in the md device
> } that uses the known bad block, force a raw read on the physical device
> } covering the entire chunk, and let the md driver do all of the work.
> }
> } Well, this didn't pan out. Problems include: if the bad block maps to
> } the parity block in a stripe, md won't necessarily read/verify parity;
> } and if you are running RAID1, load balancing might result in the kernel
> } reading the block from the good disk instead.
> }
> } So the degree of difficulty is much higher than I expected. I prefer not
> } to patch kernels, both for maintenance reasons and because I want the
> } technique to work across numerous kernels and patch revisions; and
> } frankly, the odds are I would screw it up. An application-level program
> } that can be invoked as necessary would be ideal.
> }
> } As such, is anybody up to the challenge of writing the code? I want it
> } enough to PayPal somebody $500 who can write it, and will gladly open
> } source the solution.
> }
> } (And to clarify why: I know physical block x on disk y is bad before the
> } O/S reads the block, and just want to rebuild the stripe, not the entire
> } md device, when this happens. I must not compromise any file system
> } data, cached or non-cached, that is built on the md device. I have a
> } system with >100TB, and if I did a rebuild every time I discovered a bad
> } block somewhere, a full parity repair would never complete before
> } another physical bad block was discovered.)
> }
> } Contact me offline for the financial details, but I would certainly
> } appreciate some thread discussion on an appropriate architecture. At
> } least in my opinion, such a capability should eventually be native to
> } Linux, but as long as there is a program that can be run on demand and
> } doesn't require rebuilding or patching kernels, that is all I need.
> }
> } David @ santools.com
>
> I thought this would cause md to read all blocks in an array:
> echo repair > /sys/block/md0/md/sync_action
>
> And rewrite any blocks that can't be read.
>
> In the old days, md would kick out a disk on a read error. When you added
> it back, md would rewrite everything on that disk, which corrected read
> errors.
>
> Guy
>

I bet $500 is well below minimum wage in the US for the number of hours it
would take someone to do this.

And if you have >100TB in a single raid5/6, you must have at least 100
disks in that array. Most people get nervous at more than 8-16 disks in a
raid5 or raid6 array: given the statistics of disks going bad, the chance
of a rebuild succeeding before another disk/block goes bad gets smaller
and smaller as the number of disks increases. As you have noted, you are
at the point where it becomes unlikely that a rebuild will ever complete
even with good disks in the array. Most people build a number of smaller
raid5/raid6 arrays and then LVM them together to get around this issue.
On top of that, the more disks there are, the more IO a rebuild requires,
so the slower the rebuild potentially is. And that assumes you don't have
a bad batch of disks with an abnormally high failure rate.

I know of hardware disk arrays that handle the bad-block issue by
allocating (at initial array construction) a set of spare blocks on each
disk. On finding a bad block, they relocate and rebuild just that block
from the rest of the stripe/parity, and note that the block on the bad
disk has been relocated. After some number of bad blocks on a given disk,
they flag that disk as having too many bad blocks, "clone" it, and fail
the original disk over to the cloned disk once the clone is finished.
This sort of thing would seem to be rather non-trivial, but if someone
set up a clone of the bad disk and rebuilt only the bad sector, it would
probably cut down the time/IO required to complete a rebuild. It would
still take several hours, though, and things would get more complicated
if you had another failure during that process.

Roger

^ permalink raw reply [flat|nested] 7+ messages in thread
* Regression- XFS won't mount on partitioned md array
@ 2008-05-16 17:11 David Greaves
2008-05-16 18:59 ` Eric Sandeen
0 siblings, 1 reply; 7+ messages in thread
From: David Greaves @ 2008-05-16 17:11 UTC (permalink / raw)
To: David Chinner; +Cc: LinuxRaid, xfs, 'linux-kernel@vger.kernel.org'
I just attempted a kernel upgrade from 2.6.20.7 to 2.6.25.3 and it no longer
mounts my xfs filesystem.
I bisected it to around
a67d7c5f5d25d0b13a4dfb182697135b014fa478
[XFS] Move platform specific mount option parse out of core XFS code
I have a RAID5 array with partitions:
Partition Table for /dev/md_d0
                 First        Last
 # Type         Sector      Sector Offset      Length Filesystem Type (ID) Flag
-- ------- ----------- ----------- ------ ----------- -------------------- ----
 1 Primary           0  2500288279      4  2500288280 Linux (83)           None
 2 Primary  2500288280  2500483583      0      195304 Non-FS data (DA)     None
when I attempt to mount /media:
/dev/md_d0p1 /media xfs rw,nobarrier,noatime,logdev=/dev/md_d0p2,allocsize=512m 0 0
I get:
md_d0: p1 p2
XFS mounting filesystem md_d0p1
attempt to access beyond end of device
md_d0p2: rw=0, want=195311, limit=195304
I/O error in filesystem ("md_d0p1") meta-data dev md_d0p2 block 0x2fae7
("xlog_bread") error 5 buf count 512
XFS: empty log check failed
XFS: log mount/recovery failed: error 5
XFS: log mount failed
A repair:
xfs_repair /dev/md_d0p1 -l /dev/md_d0p2
gives no errors.
Phase 1 - find and verify superblock...
Phase 2 - using external log on /dev/md_d0p2
- zero log...
- scan filesystem freespace and inode maps...
- found root inode chunk
...
David
^ permalink raw reply [flat|nested] 7+ messages in thread

* Re: Regression- XFS won't mount on partitioned md array
2008-05-16 17:11 Regression- XFS won't mount on partitioned md array David Greaves
@ 2008-05-16 18:59 ` Eric Sandeen
2008-05-17 14:46 ` David Greaves
0 siblings, 1 reply; 7+ messages in thread
From: Eric Sandeen @ 2008-05-16 18:59 UTC (permalink / raw)
To: David Greaves
Cc: David Chinner, LinuxRaid, xfs, 'linux-kernel@vger.kernel.org'

David Greaves wrote:
> I just attempted a kernel upgrade from 2.6.20.7 to 2.6.25.3 and it no longer
> mounts my xfs filesystem.
>
> I bisected it to around
> a67d7c5f5d25d0b13a4dfb182697135b014fa478
> [XFS] Move platform specific mount option parse out of core XFS code

around that... not exactly? That commit should have been largely a code
move, which is not to say that it can't contain a bug... :)

> I have a RAID5 array with partitions:
>
> Partition Table for /dev/md_d0
>
>                  First        Last
>  # Type         Sector      Sector Offset      Length Filesystem Type (ID) Flag
> -- ------- ----------- ----------- ------ ----------- -------------------- ----
>  1 Primary           0  2500288279      4  2500288280 Linux (83)           None
>  2 Primary  2500288280  2500483583      0      195304 Non-FS data (DA)     None
>
> when I attempt to mount /media:
> /dev/md_d0p1 /media xfs rw,nobarrier,noatime,logdev=/dev/md_d0p2,allocsize=512m 0 0

mythbox? :)

Hm, so it's the external log size that it doesn't much like...

> I get:
> md_d0: p1 p2
> XFS mounting filesystem md_d0p1
> attempt to access beyond end of device
> md_d0p2: rw=0, want=195311, limit=195304

what does /proc/partitions say about md_d0p1 and p2? Is it different
between the older & newer kernel?

What does xfs_info /mount/point say about the filesystem when you mount
it under the older kernel? Or... if you can't mount it,

xfs_db -r -c "sb 0" -c p /dev/md_d0p1

-Eric

^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Regression- XFS won't mount on partitioned md array
2008-05-16 18:59 ` Eric Sandeen
@ 2008-05-17 14:46 ` David Greaves
2008-05-17 15:15 ` Eric Sandeen
0 siblings, 1 reply; 7+ messages in thread
From: David Greaves @ 2008-05-17 14:46 UTC (permalink / raw)
To: Eric Sandeen
Cc: David Chinner, LinuxRaid, xfs, 'linux-kernel@vger.kernel.org'

Eric Sandeen wrote:
> David Greaves wrote:
>> I just attempted a kernel upgrade from 2.6.20.7 to 2.6.25.3 and it no longer
>> mounts my xfs filesystem.
>>
>> I bisected it to around
>> a67d7c5f5d25d0b13a4dfb182697135b014fa478
>> [XFS] Move platform specific mount option parse out of core XFS code
>
> around that... not exactly? That commit should have been largely a code
> move, which is not to say that it can't contain a bug... :)

I got to within 4 commits on the bisect, and then the xfs partition holding
the kernel source and the bisect history blew up: it told me that files
were directories and then exploded into a heap of lost+found/ fragments.
Quite, erm, "interesting" really.

At that point I decided I was close enough to ask for advice, looked at the
commits, and took this one as the most likely to cause the bug :)

But, thinking about it, I can decode the kernel extraversion tags in /boot.
From those I think my bounds were:
  40ebd81d1a7635cf92a59c387a599fce4863206b
  [XFS] Use kernel-supplied "roundup_pow_of_two" for simplicity
and:
  3ed6526441053d79b85d206b14d75125e6f51cc2
  [XFS] Implement fallocate.

so those bound:
  [XFS] Remove the BPCSHIFT and NB* based macros from XFS.
  [XFS] Remove bogus assert
  [XFS] optimize XFS_IS_REALTIME_INODE w/o realtime config
  [XFS] Move platform specific mount option parse out of core XFS code

and, just glancing through the patches, I didn't see any changes that
looked likely in the others...

>> I have a RAID5 array with partitions:
>>
>> Partition Table for /dev/md_d0
>>
>>                  First        Last
>>  # Type         Sector      Sector Offset      Length Filesystem Type (ID) Flag
>> -- ------- ----------- ----------- ------ ----------- -------------------- ----
>>  1 Primary           0  2500288279      4  2500288280 Linux (83)           None
>>  2 Primary  2500288280  2500483583      0      195304 Non-FS data (DA)     None
>>
>> when I attempt to mount /media:
>> /dev/md_d0p1 /media xfs rw,nobarrier,noatime,logdev=/dev/md_d0p2,allocsize=512m 0 0
>
> mythbox? :)

Hey - we test some interesting corner cases... :)
My *wife* just told *me* to buy, and I quote, "no more than 10" 1TB Samsung
drives... I decided 5 would be plenty.

> Hm, so it's the external log size that it doesn't much like...

Yep - I noticed that - and ISTR that Neil has been fiddling with the md
partitioning code over the last 6 months or so. I wondered where it got
the larger figure from, and whether md was somehow changing the partition
size...

>> I get:
>> md_d0: p1 p2
>> XFS mounting filesystem md_d0p1
>> attempt to access beyond end of device
>> md_d0p2: rw=0, want=195311, limit=195304
>
> what does /proc/partitions say about md_d0p1 and p2? Is it different
> between the older & newer kernel?

2.6.20.7 (good)
254     0 1250241792 md_d0
254     1 1250144138 md_d0p1
254     2      97652 md_d0p2

2.6.25.3 (bad)
254     0 1250241792 md_d0
254     1 1250144138 md_d0p1
254     2      97652 md_d0p2

2.6.25.4 (bad)
254     0 1250241792 md_d0
254     1 1250144138 md_d0p1
254     2      97652 md_d0p2

So nothing obvious there then...

> What does xfs_info /mount/point say about the filesystem when you mount
> it under the older kernel? Or... if you can't mount it,

teak:~# xfs_info /media/
meta-data=/dev/md_d0p1           isize=256    agcount=32, agsize=9766751 blks
         =                       sectsz=512   attr=0
data     =                       bsize=4096   blocks=312536032, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096
log      =external               bsize=4096   blocks=24413, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=0
realtime =none                   extsz=65536  blocks=0, rtextents=0

^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Regression- XFS won't mount on partitioned md array
2008-05-17 14:46 ` David Greaves
@ 2008-05-17 15:15 ` Eric Sandeen
2008-05-17 19:10 ` Mechanism to safely force repair of single md stripe w/o hurting data integrity of file system David Lethe
0 siblings, 1 reply; 7+ messages in thread
From: Eric Sandeen @ 2008-05-17 15:15 UTC (permalink / raw)
To: David Greaves
Cc: David Chinner, LinuxRaid, xfs, 'linux-kernel@vger.kernel.org'

David Greaves wrote:
> Eric Sandeen wrote:
>>> I get:
>>> md_d0: p1 p2
>>> XFS mounting filesystem md_d0p1
>>> attempt to access beyond end of device
>>> md_d0p2: rw=0, want=195311, limit=195304
>> what does /proc/partitions say about md_d0p1 and p2? Is it different
>> between the older & newer kernel?
...
> 2.6.25.4 (bad)
> 254     0 1250241792 md_d0
> 254     1 1250144138 md_d0p1
> 254     2      97652 md_d0p2
>
> So nothing obvious there then...
>
>> What does xfs_info /mount/point say about the filesystem when you mount
>> it under the older kernel? Or... if you can't mount it,
> teak:~# xfs_info /media/
> meta-data=/dev/md_d0p1           isize=256    agcount=32, agsize=9766751 blks
>          =                       sectsz=512   attr=0
> data     =                       bsize=4096   blocks=312536032, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096
> log      =external               bsize=4096   blocks=24413, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=0
> realtime =none                   extsz=65536  blocks=0, rtextents=0

ok, and with:

> Partition Table for /dev/md_d0
>
>                  First        Last
>  # Type         Sector      Sector Offset      Length Filesystem Type (ID) Flag
> -- ------- ----------- ----------- ------ ----------- -------------------- ----
>  1 Primary           0  2500288279      4  2500288280 Linux (83)           None
>  2 Primary  2500288280  2500483583      0      195304 Non-FS data (DA)     None

So, xfs thinks the external log is 24413 4k blocks (from the sb geometry
printed by xfs_info). This is 97652 1k units (matching your
/proc/partitions output) and 195304 512-byte sectors (matching the
partition table output). So that all looks consistent.

So if xfs is doing:

>>> md_d0p2: rw=0, want=195311, limit=195304
>>> XFS: empty log check failed

it surely does seem to be trying to read past the end of what even it
thinks is the end of its log.

And, with your geometry I can reproduce this w/o md, partitioned or not.
So looks like xfs itself is busted:

loop5: rw=0, want=195311, limit=195304

I'll see if I have a little time today to track down the problem.

Thanks,
-Eric

^ permalink raw reply [flat|nested] 7+ messages in thread
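The unit conversion Eric walks through above can be checked from the shell;
every figure comes from the xfs_info, /proc/partitions, and partition-table
output quoted in this thread, and only the arithmetic is new:

$ echo $((24413 * 4096))     # external log size in bytes, per xfs_info (4k blocks)
99995648
$ echo $((97652 * 1024))     # the same size from the 1k units in /proc/partitions
99995648
$ echo $((195304 * 512))     # the same size from the 512-byte sectors in the partition table
99995648
$ echo $((195311 - 195304))  # the failing read wants 7 sectors past that limit
7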
* Mechanism to safely force repair of single md stripe w/o hurting data integrity of file system
2008-05-17 15:15 ` Eric Sandeen
@ 2008-05-17 19:10 ` David Lethe
2008-05-17 19:29 ` Peter Rabbitson
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: David Lethe @ 2008-05-17 19:10 UTC (permalink / raw)
To: LinuxRaid, linux-kernel

I'm trying to figure out a mechanism to safely repair a stripe of data
when I know a particular disk has an unrecoverable read error at a
certain physical block (for 2.6 kernels).

My original plan was to figure out the range of blocks in the md device
that uses the known bad block, force a raw read on the physical device
covering the entire chunk, and let the md driver do all of the work.

Well, this didn't pan out. Problems include: if the bad block maps to
the parity block in a stripe, md won't necessarily read/verify parity;
and if you are running RAID1, load balancing might result in the kernel
reading the block from the good disk instead.

So the degree of difficulty is much higher than I expected. I prefer not
to patch kernels, both for maintenance reasons and because I want the
technique to work across numerous kernels and patch revisions; and
frankly, the odds are I would screw it up. An application-level program
that can be invoked as necessary would be ideal.

As such, is anybody up to the challenge of writing the code? I want it
enough to PayPal somebody $500 who can write it, and will gladly open
source the solution.

(And to clarify why: I know physical block x on disk y is bad before the
O/S reads the block, and just want to rebuild the stripe, not the entire
md device, when this happens. I must not compromise any file system
data, cached or non-cached, that is built on the md device. I have a
system with >100TB, and if I did a rebuild every time I discovered a bad
block somewhere, a full parity repair would never complete before
another physical bad block was discovered.)

Contact me offline for the financial details, but I would certainly
appreciate some thread discussion on an appropriate architecture. At
least in my opinion, such a capability should eventually be native to
Linux, but as long as there is a program that can be run on demand and
doesn't require rebuilding or patching kernels, that is all I need.

David @ santools.com

^ permalink raw reply [flat|nested] 7+ messages in thread
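For illustration of the raw-read plan described above, a rough sketch
follows; the device name, chunk size, and chunk index are hypothetical,
the mapping from a component disk's bad sector to an offset in the md
device depends on the RAID level, layout, and data offset, and the
parity-block and RAID1 load-balancing problems just described are exactly
what this approach cannot handle:

# Hypothetical example only: read the one chunk of /dev/md0 believed to
# cover the bad sector, in the hope that md hits the bad block and
# rewrites it from the remaining disks.
CHUNK_KB=64          # chunk size, e.g. as reported by: mdadm --detail /dev/md0
CHUNK_INDEX=123456   # hypothetical chunk number derived from the bad sector
dd if=/dev/md0 of=/dev/null bs=${CHUNK_KB}k skip=${CHUNK_INDEX} count=1 iflag=direct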
* Re: Mechanism to safely force repair of single md stripe w/o hurting data integrity of file system
2008-05-17 19:10 ` Mechanism to safely force repair of single md stripe w/o hurting data integrity of file system David Lethe
@ 2008-05-17 19:29 ` Peter Rabbitson
2008-05-17 20:26 ` Guy Watkins
2008-05-19 2:54 ` Neil Brown
0 siblings, 0 replies; 7+ messages in thread
From: Peter Rabbitson @ 2008-05-17 19:29 UTC (permalink / raw)
To: David Lethe; +Cc: LinuxRaid, linux-kernel

David Lethe wrote:
> I'm trying to figure out a mechanism to safely repair a stripe of data
> when I know a particular disk has an unrecoverable read error at a
> certain physical block (for 2.6 kernels).
>
> <snip>
>
> As such, is anybody up to the challenge of writing the code? I want it
> enough to PayPal somebody $500 who can write it, and will gladly open
> source the solution.
>

Damn, here goes $500 :) Unfortunately the only thing I can bring to the
table is a thread[1] about a mechanism that would fit your request nicely.
Hopefully someone will pick this stuff up and make it a reality.

Peter

[1] http://marc.info/?l=linux-raid&m=120605458309825

^ permalink raw reply [flat|nested] 7+ messages in thread
* RE: Mechanism to safely force repair of single md stripe w/o hurting data integrity of file system
2008-05-17 19:10 ` Mechanism to safely force repair of single md stripe w/o hurting data integrity of file system David Lethe
2008-05-17 19:29 ` Peter Rabbitson
@ 2008-05-17 20:26 ` Guy Watkins
2008-05-26 11:17 ` Jan Engelhardt
2008-05-19 2:54 ` Neil Brown
2 siblings, 1 reply; 7+ messages in thread
From: Guy Watkins @ 2008-05-17 20:26 UTC (permalink / raw)
To: 'David Lethe', 'LinuxRaid', linux-kernel

} -----Original Message-----
} From: linux-raid-owner@vger.kernel.org [mailto:linux-raid-
} owner@vger.kernel.org] On Behalf Of David Lethe
} Sent: Saturday, May 17, 2008 3:10 PM
} To: LinuxRaid; linux-kernel@vger.kernel.org
} Subject: Mechanism to safely force repair of single md stripe w/o hurting
} data integrity of file system
}
} I'm trying to figure out a mechanism to safely repair a stripe of data
} when I know a particular disk has an unrecoverable read error at a
} certain physical block (for 2.6 kernels).
}
} My original plan was to figure out the range of blocks in the md device
} that uses the known bad block, force a raw read on the physical device
} covering the entire chunk, and let the md driver do all of the work.
}
} Well, this didn't pan out. Problems include: if the bad block maps to
} the parity block in a stripe, md won't necessarily read/verify parity;
} and if you are running RAID1, load balancing might result in the kernel
} reading the block from the good disk instead.
}
} So the degree of difficulty is much higher than I expected. I prefer not
} to patch kernels, both for maintenance reasons and because I want the
} technique to work across numerous kernels and patch revisions; and
} frankly, the odds are I would screw it up. An application-level program
} that can be invoked as necessary would be ideal.
}
} As such, is anybody up to the challenge of writing the code? I want it
} enough to PayPal somebody $500 who can write it, and will gladly open
} source the solution.
}
} (And to clarify why: I know physical block x on disk y is bad before the
} O/S reads the block, and just want to rebuild the stripe, not the entire
} md device, when this happens. I must not compromise any file system
} data, cached or non-cached, that is built on the md device. I have a
} system with >100TB, and if I did a rebuild every time I discovered a bad
} block somewhere, a full parity repair would never complete before
} another physical bad block was discovered.)
}
} Contact me offline for the financial details, but I would certainly
} appreciate some thread discussion on an appropriate architecture. At
} least in my opinion, such a capability should eventually be native to
} Linux, but as long as there is a program that can be run on demand and
} doesn't require rebuilding or patching kernels, that is all I need.
}
} David @ santools.com

I thought this would cause md to read all blocks in an array:
echo repair > /sys/block/md0/md/sync_action

And rewrite any blocks that can't be read.

In the old days, md would kick out a disk on a read error. When you added
it back, md would rewrite everything on that disk, which corrected read
errors.

Guy

^ permalink raw reply [flat|nested] 7+ messages in thread
* RE: Mechanism to safely force repair of single md stripe w/o hurting data integrity of file system
2008-05-17 20:26 ` Guy Watkins
@ 2008-05-26 11:17 ` Jan Engelhardt
0 siblings, 0 replies; 7+ messages in thread
From: Jan Engelhardt @ 2008-05-26 11:17 UTC (permalink / raw)
To: Guy Watkins; +Cc: 'David Lethe', 'LinuxRaid', linux-kernel

On Saturday 2008-05-17 22:26, Guy Watkins wrote:
>
>I thought this would cause md to read all blocks in an array:
>echo repair > /sys/block/md0/md/sync_action
>
>And rewrite any blocks that can't be read.
>
>In the old days, md would kick out a disk on a read error. When you added
>it back, md would rewrite everything on that disk, which corrected read
>errors.

With a write-intent bitmap (`mdadm -G /dev/mdX -b internal`, or set up at
create time with -C), it should resync less after an unwarranted kick.

^ permalink raw reply [flat|nested] 7+ messages in thread
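A minimal sketch of that in practice; /dev/md0 and /dev/sdb1 are placeholder
names for an existing array and a previously kicked member:

mdadm --grow /dev/md0 --bitmap=internal   # add a write-intent bitmap to a running array
mdadm /dev/md0 --re-add /dev/sdb1         # re-add the kicked member
cat /proc/mdstat                          # a "bitmap:" line confirms it is active; only
                                          # regions dirtied since the kick are resynced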
* Re: Mechanism to safely force repair of single md stripe w/o hurting data integrity of file system
2008-05-17 19:10 ` Mechanism to safely force repair of single md stripe w/o hurting data integrity of file system David Lethe
2008-05-17 19:29 ` Peter Rabbitson
2008-05-17 20:26 ` Guy Watkins
@ 2008-05-19 2:54 ` Neil Brown
2 siblings, 0 replies; 7+ messages in thread
From: Neil Brown @ 2008-05-19 2:54 UTC (permalink / raw)
To: David Lethe; +Cc: LinuxRaid, linux-kernel

On Saturday May 17, david@santools.com wrote:
> I'm trying to figure out a mechanism to safely repair a stripe of data
> when I know a particular disk has an unrecoverable read error at a
> certain physical block (for 2.6 kernels).
>
> My original plan was to figure out the range of blocks in the md device
> that uses the known bad block, force a raw read on the physical device
> covering the entire chunk, and let the md driver do all of the work.
>
> Well, this didn't pan out. Problems include: if the bad block maps to
> the parity block in a stripe, md won't necessarily read/verify parity;
> and if you are running RAID1, load balancing might result in the kernel
> reading the block from the good disk instead.
>
> So the degree of difficulty is much higher than I expected. I prefer not
> to patch kernels, both for maintenance reasons and because I want the
> technique to work across numerous kernels and patch revisions; and
> frankly, the odds are I would screw it up. An application-level program
> that can be invoked as necessary would be ideal.

This shouldn't be a problem. You write a patch, submit it for review,
it gets reviewed and eventually submitted to mainline. Then it will
work on all new kernels, and any screw-ups that you make will be caught
by someone else (me possibly).

> As such, is anybody up to the challenge of writing the code? I want it
> enough to PayPal somebody $500 who can write it, and will gladly open
> source the solution.

It is largely done.

If you write a number to /sys/block/mdXX/md/sync_max, then recovery will
stop when it gets there.
If you write 'check' to /sys/block/mdXX/md/sync_action, then it will read
all blocks and auto-correct any unrecoverable read errors.

You just need some way to set the start point of the resync. Probably
just create a sync_min attribute - see the lightly tested patch below.

If this fits your needs, I'm sure www.compassion.com would be happy with
your $500.

To use this:
 1/ Write the end address (in sectors) to sync_max
 2/ Write the start address (in sectors) to sync_min
 3/ Write 'check' to sync_action
 4/ Monitor sync_completed until it reaches sync_max
 5/ Write 'idle' to sync_action

NeilBrown

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./drivers/md/md.c           |   46 +++++++++++++++++++++++++++++++++++++++++---
 ./include/linux/raid/md_k.h |    2 +
 2 files changed, 45 insertions(+), 3 deletions(-)

diff .prev/drivers/md/md.c ./drivers/md/md.c
--- .prev/drivers/md/md.c	2008-05-19 11:04:11.000000000 +1000
+++ ./drivers/md/md.c	2008-05-19 12:43:29.000000000 +1000
@@ -277,6 +277,7 @@ static mddev_t * mddev_find(dev_t unit)
 	spin_lock_init(&new->write_lock);
 	init_waitqueue_head(&new->sb_wait);
 	new->reshape_position = MaxSector;
+	new->resync_min = 0;
 	new->resync_max = MaxSector;
 	new->level = LEVEL_NONE;
 
@@ -3074,6 +3075,37 @@ sync_completed_show(mddev_t *mddev, char
 static struct md_sysfs_entry md_sync_completed = __ATTR_RO(sync_completed);
 
 static ssize_t
+min_sync_show(mddev_t *mddev, char *page)
+{
+	return sprintf(page, "%llu\n",
+		       (unsigned long long)mddev->resync_min);
+}
+static ssize_t
+min_sync_store(mddev_t *mddev, const char *buf, size_t len)
+{
+	char *ep;
+	unsigned long long min = simple_strtoull(buf, &ep, 10);
+	if (ep == buf || (*ep != 0 && *ep != '\n'))
+		return -EINVAL;
+	if (min > mddev->resync_max)
+		return -EINVAL;
+	if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
+		return -EBUSY;
+
+	/* Must be a multiple of chunk_size */
+	if (mddev->chunk_size) {
+		if (min & (sector_t)((mddev->chunk_size>>9)-1))
+			return -EINVAL;
+	}
+	mddev->resync_min = min;
+
+	return len;
+}
+
+static struct md_sysfs_entry md_min_sync =
+__ATTR(sync_min, S_IRUGO|S_IWUSR, min_sync_show, min_sync_store);
+
+static ssize_t
 max_sync_show(mddev_t *mddev, char *page)
 {
 	if (mddev->resync_max == MaxSector)
@@ -3092,6 +3124,9 @@ max_sync_store(mddev_t *mddev, const cha
 	unsigned long long max = simple_strtoull(buf, &ep, 10);
 	if (ep == buf || (*ep != 0 && *ep != '\n'))
 		return -EINVAL;
+	if (max < mddev->resync_min)
+		return -EINVAL;
+
 	if (max < mddev->resync_max &&
 	    test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
 		return -EBUSY;
@@ -3103,7 +3138,8 @@ max_sync_store(mddev_t *mddev, const cha
 		}
 		mddev->resync_max = max;
 	}
-	wake_up(&mddev->recovery_wait);
+	if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
+		wake_up(&mddev->recovery_wait);
 	return len;
 }
 
@@ -3221,6 +3257,7 @@ static struct attribute *md_redundancy_a
 	&md_sync_speed.attr,
 	&md_sync_force_parallel.attr,
 	&md_sync_completed.attr,
+	&md_min_sync.attr,
 	&md_max_sync.attr,
 	&md_suspend_lo.attr,
 	&md_suspend_hi.attr,
@@ -3776,6 +3813,7 @@ static int do_md_stop(mddev_t * mddev, i
 		mddev->size = 0;
 		mddev->raid_disks = 0;
 		mddev->recovery_cp = 0;
+		mddev->resync_min = 0;
 		mddev->resync_max = MaxSector;
 		mddev->reshape_position = MaxSector;
 		mddev->external = 0;
@@ -5622,9 +5660,11 @@ void md_do_sync(mddev_t *mddev)
 		max_sectors = mddev->resync_max_sectors;
 		mddev->resync_mismatches = 0;
 		/* we don't use the checkpoint if there's a bitmap */
-		if (!mddev->bitmap &&
-		    !test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
+		if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
+			j = mddev->resync_min;
+		else if (!mddev->bitmap)
 			j = mddev->recovery_cp;
+
 	} else if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
 		max_sectors = mddev->size << 1;
 	else {

diff .prev/include/linux/raid/md_k.h ./include/linux/raid/md_k.h
--- .prev/include/linux/raid/md_k.h	2008-05-19 11:04:11.000000000 +1000
+++ ./include/linux/raid/md_k.h	2008-05-19 12:35:52.000000000 +1000
@@ -227,6 +227,8 @@ struct mddev_s
 	atomic_t			recovery_active; /* blocks scheduled, but not written */
 	wait_queue_head_t		recovery_wait;
 	sector_t			recovery_cp;
+	sector_t			resync_min;	/* user request sync starts
+							 * here */
 	sector_t			resync_max;	/* resync should pause
 							 * when it gets here */

^ permalink raw reply [flat|nested] 7+ messages in thread
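A minimal shell sketch of the five steps above, assuming the patch is
applied; md0 is a placeholder, and START/END are hypothetical sector
addresses bracketing the stripe that holds the known bad block (START
must be a multiple of the chunk size, per the patch):

MD=/sys/block/md0/md
START=2621440                        # hypothetical start sector of the range to check
END=2622464                          # hypothetical end sector (must be >= START)

echo "$END"   > "$MD/sync_max"       # 1/ end address
echo "$START" > "$MD/sync_min"       # 2/ start address
echo check    > "$MD/sync_action"    # 3/ read, and auto-correct, only that range

# 4/ assumes sync_completed reads "<done> / <total>" in sectors; wait until done reaches END
while read -r cur _ < "$MD/sync_completed" && [ "$cur" -lt "$END" ]; do
        sleep 1
done

echo idle > "$MD/sync_action"        # 5/ stop the check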
end of thread, other threads:[~2008-05-26 11:17 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-05-17 21:30 Mechanism to safely force repair of single md stripe w/o hurting data integrity of file system David Lethe
2008-05-17 23:16 ` Roger Heflin
  -- strict thread matches above, loose matches on Subject: below --
2008-05-16 17:11 Regression- XFS won't mount on partitioned md array David Greaves
2008-05-16 18:59 ` Eric Sandeen
2008-05-17 14:46   ` David Greaves
2008-05-17 15:15     ` Eric Sandeen
2008-05-17 19:10       ` Mechanism to safely force repair of single md stripe w/o hurting data integrity of file system David Lethe
2008-05-17 19:29         ` Peter Rabbitson
2008-05-17 20:26         ` Guy Watkins
2008-05-26 11:17           ` Jan Engelhardt
2008-05-19  2:54         ` Neil Brown