* 3-way mirrors
From: George Spelvin @ 2010-09-07 14:19 UTC
To: linux-raid; +Cc: linux

After some frustration with RAID-5 finding mismatches and not being able
to figure out which drive has the problem, I'm setting up a rather
intricate 5-way mirrored (x 2-way striped) system.

The intention is that 3 copies will be on line at any time (dropping to
2 in case of disk failure), while copies 4 and 5 will be kept off-site.
Occasionally one will come in, be re-synced, and then removed again.
(The file system can be quiesced briefly to permit a clean split.)

Anyway, one nice property of 2-drive redundancy (a 3+-way mirror or
RAID-6) is error detection: in case of a mismatch, it's possible to
finger the offending drive.

My understanding of the current code is that it just copies one mirror
(the first readable?) to the others.  Does someone have a patch to vote
on the data?  If not, can someone point me at the relevant bit of code
and orient me enough that I can create it?

(The other thing I'd love is a more advanced sync_action that can accept
a block number found by "check" as a parameter to "repair", so I don't
have to wait while the array is re-scanned.  Um... I suppose this
depends on a local patch I have that logs the sector numbers of
mismatches.)

Another thing I'm a bit worried about is the kernel's tendency to add
drives in the lowest-numbered open slot in a RAID.  When used in
multiply-mirrored RAID-10, this tends to fill up the first stripe half
before starting on the second.  I'm worried that someone not paying
attention will --add rather than --re-add the off-site backup drives and
create mirrors 4 and 5 of the first stripe half, thus producing an
incomplete backup.

Any suggestions on how to mitigate this risk?  And if it happens, how do
I recover?
Is there a way to force a drive to be added as 9/10, even if 5/10 is
currently empty?

Thank you very much!
* Re: 3-way mirrors 2010-09-07 14:19 3-way mirrors George Spelvin @ 2010-09-07 16:07 ` Iordan Iordanov 2010-09-07 18:49 ` George Spelvin 2010-09-07 18:31 ` Aryeh Gregor ` (3 subsequent siblings) 4 siblings, 1 reply; 14+ messages in thread From: Iordan Iordanov @ 2010-09-07 16:07 UTC (permalink / raw) To: George Spelvin; +Cc: linux-raid Hi George, Due to the widely reported mismatch problems with RAID5, we also went with a 3-way mirror design. We have not yet developed a good way of dealing with the inevitable mismatches which will occur with some drive in a 3-way mirror, but we have some (crude) ideas. George Spelvin wrote: > Anyway, one nice property of a 2-drive redundancy (3+-way mirror or > RAID-6) is error detection: in case of a mismatch, it's possible to > finger the offending drive. When we see a mismatch_cnt > 0, we would run a dd/cmp script which would detect the drive and sector which is mismatched (i.e. we would craft a script which runs three dd processes in parallel, reading from each drive, and compares the data). When an inconsistency is discovered, we would have the sector which doesn't match, and which drive it's on. However, even at 60MB/s, this would take 5 hours to perform with our 1TB drives. So, it would be much better if we could do this while we are up, somehow. Once we have the drive and sector, we can take the array down, and quickly dd the sector from one of the drives onto the one with the mismatch. > My understanding of the current code is that it just copies one mirror > (the first readable?) to the others. Does someone have a patch to vote > on the data? If not, can someone point me at the relevant bit of code > and orient me enough that I can create it? Resyncing an entire drive is probably not necessary with a mismatch, because you already know the rest of the drive is synced and can simply manually force a particular sector to match. 
> (The other thing I'd love is a more advanced sync_action that can accept
> a block number found by "check" as a parameter to "repair" so I don't
> have to wait while the array is re-scanned.  Um... I suppose this
> depends on a local patch I have that logs the sector numbers of
> mismatches.)

Yes, but don't you run the risk of syncing the "bad" data from the
mismatched drive to the other two drives if you do this automatically?
Don't you also need a parameter to specify which drive to sync from?

At any rate, if the mismatched sector(s) are also logged during the
array check, then resyncing such a sector by hand would be easy and
fast, with minimal downtime.  It would be great to have this
functionality to start with.

Cheers!
Iordan
* Re: 3-way mirrors
From: George Spelvin @ 2010-09-07 18:49 UTC
To: iordan, linux; +Cc: linux-raid

George Spelvin wrote:
>> Anyway, one nice property of a 2-drive redundancy (3+-way mirror or
>> RAID-6) is error detection: in case of a mismatch, it's possible to
>> finger the offending drive.

> When we see a mismatch_cnt > 0, we would run a dd/cmp script which would
> detect the drive and sector which is mismatched (i.e. we would craft a
> script which runs three dd processes in parallel, reading from each
> drive, and compares the data).
>
> When an inconsistency is discovered, we would have the sector which
> doesn't match, and which drive it's on.  However, even at 60MB/s, this
> would take 5 hours to perform with our 1TB drives.  So, it would be much
> better if we could do this while we are up, somehow.

That was my hope: for the md software to do it automatically.

>> My understanding of the current code is that it just copies one mirror
>> (the first readable?) to the others.  Does someone have a patch to vote
>> on the data?  If not, can someone point me at the relevant bit of code
>> and orient me enough that I can create it?

> Resyncing an entire drive is probably not necessary with a mismatch,
> because you already know the rest of the drive is synced and can simply
> manually force a particular sector to match.

Ideally, I'd like ZFS-like checksums on the data, with a mismatch
triggering a read of all mirrors and a reconstruction attempt.  With
that, a silently corrupted sector on RAID-5 can be pinpointed and fixed.
But in the meantime, I'd like check/repair passes to tell me if 2 of the
3 mirrors agree, so I can blame the third.
>> (The other thing I'd love is a more advanced sync_action that can
>> accept a block number found by "check" as a parameter to "repair" so I
>> don't have to wait while the array is re-scanned.  Um... I suppose this
>> depends on a local patch I have that logs the sector numbers of
>> mismatches.)

> Yes, but don't you run the risk of syncing the "bad" data from the
> mismatched drive to the other two drives if you do this automatically?
> Don't you also need a parameter to specify which drive to sync from?

That's why I wanted the voting, so the RAID software could decide
automatically.  I don't see a practical way to identify the correct
block contents in isolation, although mapping up to a logical file may
find a file which can be checked for consistency.  (But debugfs takes
forever to run icheck + ncheck on a large filesystem.)

> At any rate, if the mismatched sector(s) are also logged during the
> array check, then resyncing this sector by hand would be easy and fast
> with minimal downtime.  It would be great to have this functionality to
> start with.

I use the following patch.  Note that it reports the offset in 512-byte
sectors within a single component; multiply by the number of data drives
and divide by sectors per block to get a block offset within the RAID
array.
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index d1d6891..2dcffcd 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1363,6 +1363,8 @@ static void sync_request_write(mddev_t *mddev, r10bio_t *r10_bio)
 				break;
 			if (j == vcnt)
 				continue;
+			printk(KERN_INFO "%s: Mismatch at sector %llu\n",
+				mdname(mddev), (unsigned long long)r10_bio->sector);
 			mddev->resync_mismatches += r10_bio->sectors;
 		}
 		if (test_bit(MD_RECOVERY_CHECK, &mddev->recovery))
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 96c6902..a0a0b08 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2732,6 +2732,8 @@ static void handle_parity_checks5(raid5_conf_t *conf, struct stripe_head *sh,
 			 */
 			set_bit(STRIPE_INSYNC, &sh->state);
 		else {
+printk(KERN_INFO "%s: Mismatch at sector %llu\n", mdname(conf->mddev),
+	(unsigned long long)sh->sector);
 			conf->mddev->resync_mismatches += STRIPE_SECTORS;
 			if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery))
 				/* don't try to repair!! */
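[Editor's note: to make the offset conversion described above concrete, here is the arithmetic spelled out; the sector number is invented for illustration, and the drive/block-size values match the poster's 2-way-striped layout with 4KB filesystem blocks.]

```shell
# Map a per-component sector number (as printed by the patch above) to a
# filesystem block number on the assembled array, per the rule in the text:
# multiply by the number of data drives, divide by sectors per block.
component_sector=1234568        # example value from a mismatch printk
data_drives=2                   # 2-way striped
sectors_per_block=8             # 4096-byte blocks / 512-byte sectors
echo $(( component_sector * data_drives / sectors_per_block ))
```

[For these example values the result printed is 308642.]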
* Re: 3-way mirrors
From: Keld Jørn Simonsen @ 2010-09-07 19:55 UTC
To: George Spelvin; +Cc: iordan, linux-raid

On Tue, Sep 07, 2010 at 02:49:17PM -0400, George Spelvin wrote:
> But in the meantime, I'd like check/repair passes to tell me if 2 of the
> 3 mirrors agree, so I can blame the third.

I would like to check the error logs of the disks to see if one of the
disagreeing blocks has had an anomaly.  This would also work when you
only have 2 copies.  Or some reporting to higher-up levels, and the
ability to then check out manually which copy to keep.

Best regards
keld
* Re: 3-way mirrors
From: Aryeh Gregor @ 2010-09-07 18:31 UTC
To: George Spelvin; +Cc: linux-raid

On Tue, Sep 7, 2010 at 10:19 AM, George Spelvin <linux@horizon.com> wrote:
> Anyway, one nice property of a 2-drive redundancy (3+-way mirror or
> RAID-6) is error detection: in case of a mismatch, it's possible to
> finger the offending drive.
>
> My understanding of the current code is that it just copies one mirror
> (the first readable?) to the others.  Does someone have a patch to vote
> on the data?  If not, can someone point me at the relevant bit of code
> and orient me enough that I can create it?

This might be useful reading:

http://neil.brown.name/blog/20100211050355

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: 3-way mirrors
From: George Spelvin @ 2010-09-07 19:02 UTC
To: linux, Simetrical+list; +Cc: linux-raid

> This might be useful reading:
>
> http://neil.brown.name/blog/20100211050355

An interesting point of view, BUT...

If I am seeing repeated unexplained mismatches (despite being on a good
UPS and having no unclean shutdowns), then some part of my hardware is
failing, and I'd like to know *what part*.  Even if it doesn't help me
get the current data sector back, if I see that drive #2 keeps having
one opinion on the contents of a block while drives #1 and #3 have a
different opinion, then that's a useful piece of diagnostic information.

It certainly is true that, if my file system doesn't change too fast, I
can pull the mismatching sector out of the logs and do a manual compare
using dd.  But it's a lot nicer to avoid race conditions by placing the
code inside md.

As for an option to read the whole stripe and check it: actually, you
only need to read 2 copies.  If they agree, all is well.  If they don't,
recovery is required.

The arguments about blocks magically changing under the file system
don't really hold water as long as RAID-1 distributes reads across the
component drives.  As long as that is the case, a mismatch can result in
a silent change.  A true fix (in the absence of a higher-level checksum
to validate the data) requires multiple reads.

As for unclean shutdowns, I expect that the RAID code holds off barriers
until all copies are written, so I still expect that a majority vote
will produce a consistent file system.

Thank you for the pointer!
* Re: 3-way mirrors
From: Bill Davidsen @ 2010-09-08 22:28 UTC
To: George Spelvin; +Cc: Simetrical+list, linux-raid

George Spelvin wrote:
>> This might be useful reading:
>>
>> http://neil.brown.name/blog/20100211050355
>
> An interesting point of view, BUT...
>
> If I am seeing repeated unexplained mismatches (despite being on a good
> UPS and having no unclean shutdowns), then some part of my hardware is
> failing, and I'd like to know *what part*.

How about your disk enclosure?  I would think that if vibration caused a
silent bit flip *on write* you would get a read error (CRC) reading it
back.  But... I am going on the theory that any factor which causes
measurable errors on read is not doing anything good for writes either.

See: http://www.zdnet.com/blog/storage/bad-bad-bad-vibrations/896

--
Bill Davidsen <davidsen@tmr.com>
"We can't solve today's problems by using the same thinking we
used in creating them." - Einstein
* Re: 3-way mirrors
From: Neil Brown @ 2010-09-07 22:01 UTC
To: George Spelvin; +Cc: linux-raid

On 7 Sep 2010 10:19:04 -0400 "George Spelvin" <linux@horizon.com> wrote:
> After some frustration with RAID-5 finding mismatches and not being
> able to figure out which drive has the problem, I'm setting up a rather
> intricate 5-way mirrored (x 2-way striped) system.
>
> The intention is that 3 copies will be on line at any time (dropping to
> 2 in case of disk failure), while copies 4 and 5 will be kept off-site.
> Occasionally one will come in, be re-synced, and then removed again.
> (The file system can be quiesced briefly to permit a clean split.)
>
> Anyway, one nice property of a 2-drive redundancy (3+-way mirror or
> RAID-6) is error detection: in case of a mismatch, it's possible to
> finger the offending drive.
>
> My understanding of the current code is that it just copies one mirror
> (the first readable?) to the others.  Does someone have a patch to vote
> on the data?  If not, can someone point me at the relevant bit of code
> and orient me enough that I can create it?

The relevant bit of code is in the MD_RECOVERY_REQUESTED branch of
sync_request_write() in drivers/md/raid1.c.  Look for "memcmp".

This code runs when you "echo repair > /sys/block/mdXXX/md/sync_action".
It has already read all blocks and now compares them to see if they are
the same.  If not, it copies the first to any that are different.

You possibly want to factor out that code into a separate function
before trying to add any 'voting' code.
> (The other thing I'd love is a more advanced sync_action that can accept
> a block number found by "check" as a parameter to "repair" so I don't
> have to wait while the array is re-scanned.  Um... I suppose this
> depends on a local patch I have that logs the sector numbers of
> mismatches.)

This is already possible via the sync_min and sync_max sysfs files.
Write a number of sectors to sync_max and a lower number to sync_min.
Then write 'repair' to 'sync_action'.  When sync_completed reaches
sync_max, the repair will pause.  You can then let it continue by
writing a larger number to sync_max, or tell it to finish by writing
'idle' to 'sync_action'.

If you have patches that you think are generally useful, feel free to
submit them to me for consideration for upstream inclusion.

> Another thing I'm a bit worried about is the kernel's tendency to
> add drives in the lowest-numbered open slot in a RAID.  When used in
> multiply-mirrored RAID-10, this tends to fill up the first stripe half
> before starting on the second.

This is controlled by raid10_add_disk in drivers/md/raid10.c.  I would
happily accept a patch which made a more balanced choice about where to
add the new disk.

> I'm worried that someone not paying attention will --add rather than
> --re-add the off-site backup drives and create mirrors 4 and 5 of
> the first stripe half, thus producing an incomplete backup.

It is already on my to-do list for mdadm-3.2 to reject a --add that
looks like it should be a --re-add.  You will need --force to make it a
spare, or --zero it first.

> Any suggestions on how to mitigate this risk?  And if it happens,
> how do I recover?  Is there a way to force a drive to be added
> as 9/10, even if 5/10 is currently empty?

1/ hack at mdadm, or wait for mdadm-3.2, or feed people more coffee :-)
2/ You probably cannot recover with any amount of certainty.
3/ That is entirely a kernel decision - 'fix' the kernel.

NeilBrown

> Thank you very much!
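[Editor's sketch of Neil's sync_min/sync_max recipe above as concrete commands.  The array name, sector numbers, and demo directory are invented; on a real system you would point MD at /sys/block/mdX/md and run as root.]

```shell
# Targeted repair via sync_min/sync_max (a sketch).
# On a real system: MD=/sys/block/md0/md.  Here we default to a scratch
# directory so the command sequence can be exercised harmlessly.
MD=${MD:-/tmp/md-demo}; mkdir -p "$MD"
echo 1234000 > "$MD/sync_min"      # a bit before the logged bad sector
echo 1235000 > "$MD/sync_max"      # repair pauses when it reaches this
echo repair  > "$MD/sync_action"   # kick off the windowed repair
# ...poll "$MD/sync_completed"; once it reaches sync_max, wind down:
echo idle > "$MD/sync_action"      # finish instead of rescanning the rest
echo max  > "$MD/sync_max"         # restore the default upper bound
```

[The 'max' keyword for sync_max and the pause-at-sync_max behaviour are per the md sysfs interface Neil describes; everything else is placeholder.]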
* Re: 3-way mirrors
From: Neil Brown @ 2010-09-08 1:33 UTC
To: George Spelvin; +Cc: linux-raid

On Wed, 8 Sep 2010 08:01:55 +1000 Neil Brown <neilb@suse.de> wrote:
> On 7 Sep 2010 10:19:04 -0400 "George Spelvin" <linux@horizon.com> wrote:
>> I'm worried that someone not paying attention will --add rather than
>> --re-add the off-site backup drives and create mirrors 4 and 5 of
>> the first stripe half, thus producing an incomplete backup.
>
> It is already on my to-do list for mdadm-3.2 to reject a --add that
> looks like it should be a --re-add.  You will need --force to make it a
> spare, or --zero it first.

I just realised I had this slightly wrong.  mdadm will already perform a
--re-add if asked to --add a device that can be re-added.  So you should
be safe from people accidentally using --add when they should have used
--re-add.

The change on my to-do list is that if it looks like a re-add might be
possible but the re-add fails, then don't do a normal --add without
extra encouragement.  The case where this is interesting is if you have
a doubly-degraded RAID5 and the devices just had a temporary failure.
It would seem logical to just add the disks back.  The --re-add attempt
will fail of course, so mdadm will currently make the devices spares,
which isn't what is wanted.  Rather, mdadm should fail and suggest a
'stop' followed by '--assemble --force'.

For raid1 my planned change won't make any difference - you should be
safe as you are.

NeilBrown

>> Any suggestions on how to mitigate this risk?  And if it happens,
>> how do I recover?  Is there a way to force a drive to be added
>> as 9/10, even if 5/10 is currently empty?
>
> 1/ hack at mdadm or wait for mdadm-3.2, or feed people more coffee :-)
> 2/ You probably cannot recover with any amount of certainty.
> 3/ That is entirely a kernel decision - 'fix' the kernel.
>
> NeilBrown
>
>> Thank you very much!
* Re: 3-way mirrors
From: George Spelvin @ 2010-09-08 14:52 UTC
To: linux, neilb; +Cc: linux-raid

> The relevant bit of code is in the MD_RECOVERY_REQUESTED branch of
> sync_request_write() in drivers/md/raid1.c.  Look for "memcmp".

Okay, so the data is in r1_bio->bios[i]->bi_io_vec[j].bv_page, for
0 <= i < mddev->raid_disks, and 0 <= j < vcnt (the number of 4K pages in
the chunk).

Okay, so the first for() loop sets primary to the lowest disk number
that was completely readable (.bi_end_io == end_sync_read &&
test_bit(BIO_UPTODATE)).  Then the second loop compares all the data to
the primary's data and, if it doesn't match, re-initializes the mirror's
sbio to write it back.

I could probably figure this out with a lot of RTFSing, but if you don't
mind me asking:

- What does it mean if r1_bio->bios[i]->bi_end_io != end_sync_read?
  Does that case only avoid testing the primary again, or are there
  other cases where it might be true?  If there are, why not count them
  as a mismatch?

- What does it mean if !test_bit(BIO_UPTODATE, &sbio->bi_flags)?

- How does the need to write back a particular disk get communicated
  from the sbio setup code to the "schedule writes" section?

(On a tangential note, why the heck are bi_flags and bi_rw "unsigned
long" rather than "u32"?  You'd have to change "if test_bit(BIO_UPTODATE"
to "if bio_flagged(sbio, BIO_UPTODATE"... untested patch appended.)

> You possibly want to factor out that code into a separate function
> before trying to add any 'voting' code.

Indeed, the first thing I'd like to do is add some much more detailed
logging.  What part of the chunk is mismatched?  One sector, one page,
or the whole chunk?  Are just a few bits flipped, or is it a gross
mismatch?  Which disks are mismatched?

> This is controlled by raid10_add_disk in drivers/md/raid10.c.  I would
> happily accept a patch which made a more balanced choice about where to
> add the new disk.

Thank you very much for the encouragement!  The tricky cases are when
the number of drives is not a multiple of the number of data copies.  If
I have -n3 and 7 drives, there are many possible subsets of 3 that will
operate.  Suppose I have U__U_U_.  What order should drives 4..7 be
added?

(That's something of a rhetorical question; I expect to figure out the
answer myself, although you're welcome to chime in if you have any
ideas.  I'm thinking of some kind of score where I consider the
n/gcd(n,k) stripe start positions and rank possible solutions based on
the minimum redundancy level and the number of stripes at that level.
The question is, is there ever a case where the locations I'd like to
add *two* disks differ from the location I'd like to add one?  If there
were, it would be nasty.)

Proof-of-concept patch to shrink the bi_flags field on 64-bit:

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7fc5606..8cababe 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -64,8 +64,8 @@ struct bio {
 						   sectors */
 	struct bio		*bi_next;	/* request queue link */
 	struct block_device	*bi_bdev;
-	unsigned long		bi_flags;	/* status, command, etc */
-	unsigned long		bi_rw;		/* bottom bits READ/WRITE,
+	unsigned int		bi_flags;	/* status, command, etc */
+	unsigned int		bi_rw;		/* bottom bits READ/WRITE,
 						 * top bits priority
 						 */
diff --git a/block/blk-barrier.c b/block/blk-barrier.c
index 0d710c9..aed45dd 100644
--- a/block/blk-barrier.c
+++ b/block/blk-barrier.c
@@ -283,8 +283,8 @@ static void bio_end_empty_barrier(struct bio *bio, int err)
 {
 	if (err) {
 		if (err == -EOPNOTSUPP)
-			set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
-		clear_bit(BIO_UPTODATE, &bio->bi_flags);
+			bio->bi_flags |= (1<<BIO_EOPNOTSUPP);
+		bio->bi_flags &= ~(1<<BIO_UPTODATE);
 	}
 	if (bio->bi_private)
 		complete(bio->bi_private);
diff --git a/block/blk-core.c b/block/blk-core.c
index f0640d7..dfca463 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -138,8 +138,8 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
 	if (&q->bar_rq != rq) {
 		if (error)
-			clear_bit(BIO_UPTODATE, &bio->bi_flags);
-		else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
+			bio->bi_flags &= ~(1<<BIO_UPTODATE);
+		else if (!bio_flagged(bio, BIO_UPTODATE))
 			error = -EIO;
 		if (unlikely(nbytes > bio->bi_size)) {
@@ -149,7 +149,7 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
 	}
 	if (unlikely(rq->cmd_flags & REQ_QUIET))
-		set_bit(BIO_QUIET, &bio->bi_flags);
+		bio->bi_flags |= (1<<BIO_QUIET);
 	bio->bi_size -= nbytes;
 	bio->bi_sector += (nbytes >> 9);
@@ -1329,13 +1329,13 @@ static void handle_bad_sector(struct bio *bio)
 	char b[BDEVNAME_SIZE];
 	printk(KERN_INFO "attempt to access beyond end of device\n");
-	printk(KERN_INFO "%s: rw=%ld, want=%Lu, limit=%Lu\n",
+	printk(KERN_INFO "%s: rw=%u, want=%Lu, limit=%Lu\n",
 		bdevname(bio->bi_bdev, b),
 		bio->bi_rw,
 		(unsigned long long)bio->bi_sector + bio_sectors(bio),
 		(long long)(bio->bi_bdev->bd_inode->i_size >> 9));
-	set_bit(BIO_EOF, &bio->bi_flags);
+	bio->bi_flags |= (1<<BIO_EOF);
 }
 #ifdef CONFIG_FAIL_MAKE_REQUEST
diff --git a/block/blk-lib.c b/block/blk-lib.c
index d0216b9..ee1f2d3 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -13,8 +13,8 @@ static void blkdev_discard_end_io(struct bio *bio, int err)
 {
 	if (err) {
 		if (err == -EOPNOTSUPP)
-			set_bit(BIO_EOPNOTSUPP, &bio->bi_flags);
-		clear_bit(BIO_UPTODATE, &bio->bi_flags);
+			bio->bi_flags |= (1<<BIO_EOPNOTSUPP);
+		bio->bi_flags &= ~(1<<BIO_UPTODATE);
 	}
 	if (bio->bi_private)
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 8a549db..ce4a6a0 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -1450,7 +1450,7 @@ static void pkt_finish_packet(struct packet_data *pkt, int uptodate)
 static void pkt_run_state_machine(struct pktcdvd_device *pd, struct packet_data *pkt)
 {
-	int uptodate;
+	bool uptodate;
 	VPRINTK("run_state_machine: pkt %d\n", pkt->id);
@@ -1480,7 +1480,7 @@ static void pkt_run_state_machine(struct pktcdvd_device *pd, struct packet_data
 		if (atomic_read(&pkt->io_wait) > 0)
 			return;
-		if (test_bit(BIO_UPTODATE, &pkt->w_bio->bi_flags)) {
+		if (bio_flagged(pkt->w_bio, BIO_UPTODATE)) {
 			pkt_set_state(pkt, PACKET_FINISHED_STATE);
 		} else {
 			pkt_set_state(pkt, PACKET_RECOVERY_STATE);
@@ -1497,7 +1497,7 @@ static void pkt_run_state_machine(struct pktcdvd_device *pd, struct packet_data
 			break;
 		case PACKET_FINISHED_STATE:
-			uptodate = test_bit(BIO_UPTODATE, &pkt->w_bio->bi_flags);
+			uptodate = bio_flagged(pkt->w_bio, BIO_UPTODATE);
 			pkt_finish_packet(pkt, uptodate);
 			return;
diff --git a/drivers/md/md.c b/drivers/md/md.c
index cb20d0b..58162b1 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -296,7 +296,7 @@ static void md_end_barrier(struct bio *bio, int err)
 	mdk_rdev_t *rdev = bio->bi_private;
 	mddev_t *mddev = rdev->mddev;
 	if (err == -EOPNOTSUPP && mddev->barrier != POST_REQUEST_BARRIER)
-		set_bit(BIO_EOPNOTSUPP, &mddev->barrier->bi_flags);
+		mddev->barrier->bi_flags |= (1<<BIO_EOPNOTSUPP);
 	rdev_dec_pending(rdev, mddev);
@@ -347,7 +347,7 @@ static void md_submit_barrier(struct work_struct *ws)
 	atomic_set(&mddev->flush_pending, 1);
-	if (test_bit(BIO_EOPNOTSUPP, &bio->bi_flags))
+	if (bio_flagged(bio, BIO_EOPNOTSUPP))
 		bio_endio(bio, -EOPNOTSUPP);
 	else if (bio->bi_size == 0)
 		/* an empty barrier - all done */
@@ -629,10 +629,10 @@ static void super_written(struct bio *bio, int error)
 	mdk_rdev_t *rdev = bio->bi_private;
 	mddev_t *mddev = rdev->mddev;
-	if (error || !test_bit(BIO_UPTODATE, &bio->bi_flags)) {
+	if (error || !bio_flagged(bio, BIO_UPTODATE)) {
 		printk("md: super_written gets error=%d, uptodate=%d\n",
-		       error, test_bit(BIO_UPTODATE, &bio->bi_flags));
-		WARN_ON(test_bit(BIO_UPTODATE, &bio->bi_flags));
+		       error, bio_flagged(bio, BIO_UPTODATE));
+		WARN_ON(bio_flagged(bio, BIO_UPTODATE));
 		md_error(mddev, rdev);
 	}
@@ -647,7 +647,7 @@ static void super_written_barrier(struct bio *bio, int error)
 	mdk_rdev_t *rdev = bio2->bi_private;
 	mddev_t *mddev = rdev->mddev;
-	if (!test_bit(BIO_UPTODATE, &bio->bi_flags) &&
+	if (!bio_flagged(bio, BIO_UPTODATE) &&
 	    error == -EOPNOTSUPP) {
 		unsigned long flags;
 		/* barriers don't appear to be supported :-( */
@@ -747,7 +747,7 @@ int sync_page_io(struct block_device *bdev, sector_t sector, int size,
 	submit_bio(rw, bio);
 	wait_for_completion(&event);
-	ret = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	ret = bio_flagged(bio, BIO_UPTODATE);
 	bio_put(bio);
 	return ret;
 }
diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c
index 410fb60..f57fc90 100644
--- a/drivers/md/multipath.c
+++ b/drivers/md/multipath.c
@@ -84,7 +84,7 @@ static void multipath_end_bh_io (struct multipath_bh *mp_bh, int err)
 static void multipath_end_request(struct bio *bio, int error)
 {
-	int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	int uptodate = bio_flagged(bio, BIO_UPTODATE);
 	struct multipath_bh *mp_bh = bio->bi_private;
 	multipath_conf_t *conf = mp_bh->mddev->private;
 	mdk_rdev_t *rdev = conf->multipaths[mp_bh->path].rdev;
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index a948da8..8e43334 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -262,7 +262,7 @@ static inline void update_head_pos(int disk, r1bio_t *r1_bio)
 static void raid1_end_read_request(struct bio *bio, int error)
 {
-	int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	bool uptodate = bio_flagged(bio, BIO_UPTODATE);
 	r1bio_t *r1_bio = bio->bi_private;
 	int mirror;
 	conf_t *conf = r1_bio->mddev->private;
@@ -285,7 +285,7 @@ static void raid1_end_read_request(struct bio *bio, int error)
 		if (r1_bio->mddev->degraded == conf->raid_disks ||
 		    (r1_bio->mddev->degraded == conf->raid_disks-1 &&
 		     !test_bit(Faulty, &conf->mirrors[mirror].rdev->flags)))
-			uptodate = 1;
+			uptodate = true;
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 	}
@@ -308,7 +308,7 @@ static void raid1_end_read_request(struct bio *bio, int error)
 static void raid1_end_write_request(struct bio *bio, int error)
 {
-	int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	bool uptodate = bio_flagged(bio, BIO_UPTODATE);
 	r1bio_t *r1_bio = bio->bi_private;
 	int mirror, behind = test_bit(R1BIO_BehindIO, &r1_bio->state);
 	conf_t *conf = r1_bio->mddev->private;
@@ -1244,7 +1244,7 @@ static void end_sync_read(struct bio *bio, int error)
 	 * or re-read if the read failed.
 	 * We don't do much here, just schedule handling by raid1d
 	 */
-	if (test_bit(BIO_UPTODATE, &bio->bi_flags))
+	if (bio_flagged(bio, BIO_UPTODATE))
 		set_bit(R1BIO_Uptodate, &r1_bio->state);
 	if (atomic_dec_and_test(&r1_bio->remaining))
@@ -1253,7 +1253,7 @@ static void end_sync_read(struct bio *bio, int error)
 static void end_sync_write(struct bio *bio, int error)
 {
-	int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	bool uptodate = bio_flagged(bio, BIO_UPTODATE);
 	r1bio_t *r1_bio = bio->bi_private;
 	mddev_t *mddev = r1_bio->mddev;
 	conf_t *conf = mddev->private;
@@ -1318,7 +1318,7 @@ static void sync_request_write(mddev_t *mddev, r1bio_t *r1_bio)
 	}
 	for (primary=0; primary<mddev->raid_disks; primary++)
 		if (r1_bio->bios[primary]->bi_end_io == end_sync_read &&
-		    test_bit(BIO_UPTODATE, &r1_bio->bios[primary]->bi_flags)) {
+		    bio_flagged(r1_bio->bios[primary], BIO_UPTODATE)) {
 			r1_bio->bios[primary]->bi_end_io = NULL;
 			rdev_dec_pending(conf->mirrors[primary].rdev, mddev);
 			break;
@@ -1331,7 +1331,7 @@ static void sync_request_write(mddev_t *mddev, r1bio_t *r1_bio)
 			struct bio *pbio = r1_bio->bios[primary];
 			struct bio *sbio = r1_bio->bios[i];
-			if (test_bit(BIO_UPTODATE, &sbio->bi_flags)) {
+			if (bio_flagged(sbio, BIO_UPTODATE)) {
 				for (j = vcnt; j-- ; ) {
 					struct page *p, *s;
 					p = pbio->bi_io_vec[j].bv_page;
@@ -1346,7 +1346,7 @@ static void sync_request_write(mddev_t *mddev, r1bio_t *r1_bio)
 			if (j >= 0)
 				mddev->resync_mismatches += r1_bio->sectors;
 			if (j < 0 || (test_bit(MD_RECOVERY_CHECK, &mddev->recovery)
-				      && test_bit(BIO_UPTODATE, &sbio->bi_flags))) {
+				      && bio_flagged(sbio, BIO_UPTODATE))) {
 				sbio->bi_end_io = NULL;
 				rdev_dec_pending(conf->mirrors[i].rdev, mddev);
 			} else {
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 42e64e4..4ae0e20 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -255,7 +255,7 @@ static inline void update_head_pos(int slot, r10bio_t *r10_bio)
 static void raid10_end_read_request(struct bio *bio, int error)
 {
-	int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	bool uptodate = bio_flagged(bio, BIO_UPTODATE);
 	r10bio_t *r10_bio = bio->bi_private;
 	int slot, dev;
 	conf_t *conf = r10_bio->mddev->private;
@@ -297,7 +297,7 @@ static void raid10_end_read_request(struct bio *bio, int error)
 static void raid10_end_write_request(struct bio *bio, int error)
 {
-	int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	bool uptodate = bio_flagged(bio, BIO_UPTODATE);
 	r10bio_t *r10_bio = bio->bi_private;
 	int slot, dev;
 	conf_t *conf = r10_bio->mddev->private;
@@ -1230,7 +1230,7 @@ static void end_sync_read(struct bio *bio, int error)
 	update_head_pos(i, r10_bio);
 	d = r10_bio->devs[i].devnum;
-	if (test_bit(BIO_UPTODATE, &bio->bi_flags))
+	if (bio_flagged(bio, BIO_UPTODATE))
 		set_bit(R10BIO_Uptodate, &r10_bio->state);
 	else {
 		atomic_add(r10_bio->sectors,
@@ -1255,7 +1255,7 @@ static void end_sync_read(struct bio *bio, int error)
 static void end_sync_write(struct bio *bio, int error)
 {
-	int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags);
+	bool uptodate = bio_flagged(bio, BIO_UPTODATE);
 	r10bio_t *r10_bio = bio->bi_private;
 	mddev_t *mddev = r10_bio->mddev;
 	conf_t *conf = mddev->private;
@@ -1313,7 +1313,7 @@ static void sync_request_write(mddev_t *mddev, r10bio_t *r10_bio)
 	/* find the first device with a block */
 	for (i=0; i<conf->copies; i++)
-		if (test_bit(BIO_UPTODATE, &r10_bio->devs[i].bio->bi_flags))
+		if (bio_flagged(r10_bio->devs[i].bio, BIO_UPTODATE))
 			break;
 	if (i == conf->copies)
@@ -1333,7 +1333,7 @@ static void sync_request_write(mddev_t *mddev, r10bio_t *r10_bio)
 			continue;
 		if (i == first)
 			continue;
-		if (test_bit(BIO_UPTODATE, &r10_bio->devs[i].bio->bi_flags)) {
+		if (bio_flagged(r10_bio->devs[i].bio, BIO_UPTODATE)) {
 			/* We know that the bi_io_vec layout is the same for
 			 * both 'first' and 'i', so we just compare them.
 			 * All vec entries are PAGE_SIZE;
@@ -2027,7 +2027,7 @@ static sector_t sync_request(mddev_t *mddev, sector_t sector_nr, int *skipped, i
 		int d = r10_bio->devs[i].devnum;
 		bio = r10_bio->devs[i].bio;
 		bio->bi_end_io = NULL;
-		clear_bit(BIO_UPTODATE, &bio->bi_flags);
+		bio->bi_flags &= ~(1<<BIO_UPTODATE);
 		if (conf->mirrors[d].rdev == NULL ||
 		    test_bit(Faulty, &conf->mirrors[d].rdev->flags))
 			continue;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 96c6902..b92baad 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -537,7 +537,7 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 			set_bit(STRIPE_IO_STARTED, &sh->state);
 			bi->bi_bdev = rdev->bdev;
-			pr_debug("%s: for %llu schedule op %ld on disc %d\n",
+			pr_debug("%s: for %llu schedule op %u on disc %d\n",
 				__func__, (unsigned long long)sh->sector,
 				bi->bi_rw, i);
 			atomic_inc(&sh->count);
@@ -559,7 +559,7 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 		} else {
 			if (rw == WRITE)
 				set_bit(STRIPE_DEGRADED, &sh->state);
-			pr_debug("skip op %ld on disc %d for sector %llu\n",
+			pr_debug("skip op %u on disc %d for sector %llu\n",
 				bi->bi_rw, i, (unsigned long long)sh->sector);
 			clear_bit(R5_LOCKED, &sh->dev[i].flags);
 			set_bit(STRIPE_HANDLE, &sh->state);
@@ -1557,7 +1557,7 @@ static void raid5_end_read_request(struct bio * bi, int error)
 	struct stripe_head *sh = bi->bi_private;
 	raid5_conf_t *conf = sh->raid_conf;
 	int disks = sh->disks, i;
-	int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags);
+	bool uptodate = bio_flagged(bi, BIO_UPTODATE);
 	char b[BDEVNAME_SIZE];
 	mdk_rdev_t *rdev;
@@ -1591,7 +1591,7 @@ static void raid5_end_read_request(struct bio * bi, int error)
 		atomic_set(&conf->disks[i].rdev->read_errors, 0);
 	} else {
 		const char *bdn =
bdevname(conf->disks[i].rdev->bdev, b); - int retry = 0; + bool retry = false; rdev = conf->disks[i].rdev; clear_bit(R5_UPTODATE, &sh->dev[i].flags); @@ -1619,7 +1619,7 @@ static void raid5_end_read_request(struct bio * bi, int error) "md/raid:%s: Too many read errors, failing device %s.\n", mdname(conf->mddev), bdn); else - retry = 1; + retry = true; if (retry) set_bit(R5_ReadError, &sh->dev[i].flags); else { @@ -1639,7 +1639,7 @@ static void raid5_end_write_request(struct bio *bi, int error) struct stripe_head *sh = bi->bi_private; raid5_conf_t *conf = sh->raid_conf; int disks = sh->disks, i; - int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags); + bool uptodate = bio_flagged(bi, BIO_UPTODATE); for (i=0 ; i<disks; i++) if (bi == &sh->dev[i].req) @@ -2251,7 +2251,7 @@ handle_failed_stripe(raid5_conf_t *conf, struct stripe_head *sh, while (bi && bi->bi_sector < sh->dev[i].sector + STRIPE_SECTORS) { struct bio *nextbi = r5_next_bio(bi, sh->dev[i].sector); - clear_bit(BIO_UPTODATE, &bi->bi_flags); + bi->bi_flags &= ~(1<<BIO_UPTODATE); if (!raid5_dec_bi_phys_segments(bi)) { md_write_end(conf->mddev); bi->bi_next = *return_bi; @@ -2266,7 +2266,7 @@ handle_failed_stripe(raid5_conf_t *conf, struct stripe_head *sh, while (bi && bi->bi_sector < sh->dev[i].sector + STRIPE_SECTORS) { struct bio *bi2 = r5_next_bio(bi, sh->dev[i].sector); - clear_bit(BIO_UPTODATE, &bi->bi_flags); + bi->bi_flags &= ~(1<<BIO_UPTODATE); if (!raid5_dec_bi_phys_segments(bi)) { md_write_end(conf->mddev); bi->bi_next = *return_bi; @@ -2290,7 +2290,7 @@ handle_failed_stripe(raid5_conf_t *conf, struct stripe_head *sh, sh->dev[i].sector + STRIPE_SECTORS) { struct bio *nextbi = r5_next_bio(bi, sh->dev[i].sector); - clear_bit(BIO_UPTODATE, &bi->bi_flags); + bi->bi_flags &= ~(1<<BIO_UPTODATE); if (!raid5_dec_bi_phys_segments(bi)) { bi->bi_next = *return_bi; *return_bi = bi; @@ -3787,7 +3787,7 @@ static void raid5_align_endio(struct bio *bi, int error) struct bio* raid_bi = bi->bi_private; mddev_t *mddev; 
raid5_conf_t *conf; - int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags); + bool uptodate = bio_flagged(bi, BIO_UPTODATE); mdk_rdev_t *rdev; bio_put(bi); @@ -4089,7 +4089,7 @@ static int make_request(mddev_t *mddev, struct bio * bi) release_stripe(sh); } else { /* cannot get stripe for read-ahead, just give-up */ - clear_bit(BIO_UPTODATE, &bi->bi_flags); + bi->bi_flags &= ~(1<<BIO_UPTODATE); finish_wait(&conf->wait_for_overlap, &w); break; } diff --git a/fs/bio.c b/fs/bio.c index e7bf6ca..76192ca 100644 --- a/fs/bio.c +++ b/fs/bio.c @@ -1423,8 +1423,8 @@ EXPORT_SYMBOL(bio_flush_dcache_pages); void bio_endio(struct bio *bio, int error) { if (error) - clear_bit(BIO_UPTODATE, &bio->bi_flags); - else if (!test_bit(BIO_UPTODATE, &bio->bi_flags)) + bio->bi_flags &= ~(1<<BIO_UPTODATE); + else if (!bio_flagged(bio, BIO_UPTODATE)) error = -EIO; if (bio->bi_end_io) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index d74e6af..c58bef8 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -1759,17 +1759,17 @@ static void end_bio_extent_writepage(struct bio *bio, int err) */ static void end_bio_extent_readpage(struct bio *bio, int err) { - int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags); + bool uptodate = bio_flagged(bio, BIO_UPTODATE); struct bio_vec *bvec_end = bio->bi_io_vec + bio->bi_vcnt - 1; struct bio_vec *bvec = bio->bi_io_vec; struct extent_io_tree *tree; u64 start; u64 end; - int whole_page; + bool whole_page; int ret; if (err) - uptodate = 0; + uptodate = false; do { struct page *page = bvec->bv_page; @@ -1780,9 +1780,9 @@ static void end_bio_extent_readpage(struct bio *bio, int err) end = start + bvec->bv_len - 1; if (bvec->bv_offset == 0 && bvec->bv_len == PAGE_CACHE_SIZE) - whole_page = 1; + whole_page = true; else - whole_page = 0; + whole_page = false; if (++bvec <= bvec_end) prefetchw(&bvec->bv_page->flags); @@ -1791,17 +1791,16 @@ static void end_bio_extent_readpage(struct bio *bio, int err) ret = 
tree->ops->readpage_end_io_hook(page, start, end, NULL); if (ret) - uptodate = 0; + uptodate = false; } if (!uptodate && tree->ops && tree->ops->readpage_io_failed_hook) { ret = tree->ops->readpage_io_failed_hook(bio, page, start, end, NULL); if (ret == 0) { - uptodate = - test_bit(BIO_UPTODATE, &bio->bi_flags); + uptodate = bio_flagged(bio, BIO_UPTODATE); if (err) - uptodate = 0; + uptodate = false; continue; } } @@ -1841,7 +1840,7 @@ static void end_bio_extent_readpage(struct bio *bio, int err) */ static void end_bio_extent_preparewrite(struct bio *bio, int err) { - const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags); + const bool uptodate = bio_flagged(bio, BIO_UPTODATE); struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1; struct extent_io_tree *tree; u64 start; diff --git a/fs/buffer.c b/fs/buffer.c index d54812b..94af2b9 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -2997,14 +2997,14 @@ static void end_bio_bh_io_sync(struct bio *bio, int err) struct buffer_head *bh = bio->bi_private; if (err == -EOPNOTSUPP) { - set_bit(BIO_EOPNOTSUPP, &bio->bi_flags); + bio->bi_flags |= (1<<BIO_EOPNOTSUPP); set_bit(BH_Eopnotsupp, &bh->b_state); } - if (unlikely (test_bit(BIO_QUIET,&bio->bi_flags))) + if (unlikely (bio_flagged(bio, BIO_QUIET))) set_bit(BH_Quiet, &bh->b_state); - bh->b_end_io(bh, test_bit(BIO_UPTODATE, &bio->bi_flags)); + bh->b_end_io(bh, bio_flagged(bio, BIO_UPTODATE)); bio_put(bio); } diff --git a/fs/direct-io.c b/fs/direct-io.c index 7600aac..c7d3a0f 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -425,7 +425,7 @@ static struct bio *dio_await_one(struct dio *dio) */ static int dio_bio_complete(struct dio *dio, struct bio *bio) { - const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags); + const bool uptodate = bio_flagged(bio, BIO_UPTODATE); struct bio_vec *bvec = bio->bi_io_vec; int page_no; diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 377309c..f6d3216 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -2598,7 +2598,7 
@@ static int ext4_ext_zeroout(struct inode *inode, struct ext4_extent *ex) submit_bio(WRITE, bio); wait_for_completion(&event); - if (!test_bit(BIO_UPTODATE, &bio->bi_flags)) { + if (!bio_flagged(bio, BIO_UPTODATE)) { bio_put(bio); return -EIO; } diff --git a/fs/jfs/jfs_logmgr.c b/fs/jfs/jfs_logmgr.c index c51af2a..b37ee3e 100644 --- a/fs/jfs/jfs_logmgr.c +++ b/fs/jfs/jfs_logmgr.c @@ -2216,7 +2216,7 @@ static void lbmIODone(struct bio *bio, int error) bp->l_flag |= lbmDONE; - if (!test_bit(BIO_UPTODATE, &bio->bi_flags)) { + if (!bio_flagged(bio, BIO_UPTODATE)) { bp->l_flag |= lbmERROR; jfs_err("lbmIODone: I/O error in JFS log"); diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c index 48b44bd..9222e06 100644 --- a/fs/jfs/jfs_metapage.c +++ b/fs/jfs/jfs_metapage.c @@ -287,7 +287,7 @@ static void metapage_read_end_io(struct bio *bio, int err) { struct page *page = bio->bi_private; - if (!test_bit(BIO_UPTODATE, &bio->bi_flags)) { + if (!bio_flagged(bio, BIO_UPTODATE)) { printk(KERN_ERR "metapage_read_end_io: I/O error\n"); SetPageError(page); } @@ -344,7 +344,7 @@ static void metapage_write_end_io(struct bio *bio, int err) BUG_ON(!PagePrivate(page)); - if (! test_bit(BIO_UPTODATE, &bio->bi_flags)) { + if (!bio_flagged(bio, BIO_UPTODATE)) { printk(KERN_ERR "metapage_write_end_io: I/O error\n"); SetPageError(page); } diff --git a/fs/logfs/dev_bdev.c b/fs/logfs/dev_bdev.c index 9bd2ce2..ea48736 100644 --- a/fs/logfs/dev_bdev.c +++ b/fs/logfs/dev_bdev.c @@ -41,7 +41,7 @@ static int sync_request(struct page *page, struct block_device *bdev, int rw) submit_bio(rw, &bio); generic_unplug_device(bdev_get_queue(bdev)); wait_for_completion(&complete); - return test_bit(BIO_UPTODATE, &bio.bi_flags) ? 0 : -EIO; + return bio_flagged(bio, BIO_UPTODATE) ? 
0 : -EIO; } static int bdev_readpage(void *_sb, struct page *page) @@ -66,7 +66,7 @@ static DECLARE_WAIT_QUEUE_HEAD(wq); static void writeseg_end_io(struct bio *bio, int err) { - const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags); + const bool uptodate = bio_flagged(bio, BIO_UPTODATE); struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1; struct super_block *sb = bio->bi_private; struct logfs_super *super = logfs_super(sb); @@ -174,7 +174,7 @@ static void bdev_writeseg(struct super_block *sb, u64 ofs, size_t len) static void erase_end_io(struct bio *bio, int err) { - const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags); + const bool uptodate = bio_flagged(bio, BIO_UPTODATE); struct super_block *sb = bio->bi_private; struct logfs_super *super = logfs_super(sb); diff --git a/fs/mpage.c b/fs/mpage.c index fd56ca2..3be5895 100644 --- a/fs/mpage.c +++ b/fs/mpage.c @@ -42,7 +42,7 @@ */ static void mpage_end_io_read(struct bio *bio, int err) { - const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags); + const bool uptodate = bio_flagged(bio, BIO_UPTODATE); struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1; do { @@ -64,7 +64,7 @@ static void mpage_end_io_read(struct bio *bio, int err) static void mpage_end_io_write(struct bio *bio, int err) { - const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags); + const bool uptodate = bio_flagged(bio, BIO_UPTODATE); struct bio_vec *bvec = bio->bi_io_vec + bio->bi_vcnt - 1; do { diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c index 2e6a272..d3ef05c 100644 --- a/fs/nilfs2/segbuf.c +++ b/fs/nilfs2/segbuf.c @@ -349,11 +349,11 @@ void nilfs_add_checksums_on_logs(struct list_head *logs, u32 seed) */ static void nilfs_end_bio_write(struct bio *bio, int err) { - const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags); + const bool uptodate = bio_flagged(bio, BIO_UPTODATE); struct nilfs_segment_buffer *segbuf = bio->bi_private; if (err == -EOPNOTSUPP) { - set_bit(BIO_EOPNOTSUPP, &bio->bi_flags); + 
bio->bi_flags |= (1<<BIO_EOPNOTSUPP); bio_put(bio); /* to be detected by submit_seg_bio() */ } diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c index 34640d6..055de11 100644 --- a/fs/xfs/linux-2.6/xfs_aops.c +++ b/fs/xfs/linux-2.6/xfs_aops.c @@ -351,7 +351,7 @@ xfs_end_bio( xfs_ioend_t *ioend = bio->bi_private; ASSERT(atomic_read(&bio->bi_cnt) >= 1); - ioend->io_error = test_bit(BIO_UPTODATE, &bio->bi_flags) ? 0 : error; + ioend->io_error = bio_flagged(bio, BIO_UPTODATE) ? 0 : error; /* Toss bio and pass work off to an xfsdatad thread */ bio->bi_private = NULL; diff --git a/mm/bounce.c b/mm/bounce.c index 13b6dad..7a435fd 100644 --- a/mm/bounce.c +++ b/mm/bounce.c @@ -127,8 +127,7 @@ static void bounce_end_io(struct bio *bio, mempool_t *pool, int err) struct bio_vec *bvec, *org_vec; int i; - if (test_bit(BIO_EOPNOTSUPP, &bio->bi_flags)) - set_bit(BIO_EOPNOTSUPP, &bio_orig->bi_flags); + bio->bi_flags |= bio_orig->bi_flags & (1<<BIO_EOPNOTSUPP); /* * free up bounce indirect pages used @@ -161,7 +160,7 @@ static void __bounce_end_io_read(struct bio *bio, mempool_t *pool, int err) { struct bio *bio_orig = bio->bi_private; - if (test_bit(BIO_UPTODATE, &bio->bi_flags)) + if (bio_flagged(bio, BIO_UPTODATE)) copy_to_high_bio_irq(bio_orig, bio); bounce_end_io(bio, pool, err); diff --git a/mm/page_io.c b/mm/page_io.c index 31a3b96..11a16b0 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -42,7 +42,7 @@ static struct bio *get_swap_bio(gfp_t gfp_flags, static void end_swap_bio_write(struct bio *bio, int err) { - const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags); + const bool uptodate = bio_flagged(bio, BIO_UPTODATE); struct page *page = bio->bi_io_vec[0].bv_page; if (!uptodate) { @@ -68,7 +68,7 @@ static void end_swap_bio_write(struct bio *bio, int err) void end_swap_bio_read(struct bio *bio, int err) { - const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags); + const bool uptodate = bio_flagged(bio, BIO_UPTODATE); struct page *page = 
bio->bi_io_vec[0].bv_page; if (!uptodate) { ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: 3-way mirrors
  2010-09-08 14:52       ` George Spelvin
@ 2010-09-08 23:04        ` Neil Brown
  0 siblings, 0 replies; 14+ messages in thread
From: Neil Brown @ 2010-09-08 23:04 UTC (permalink / raw)
  To: George Spelvin; +Cc: linux-raid

On 8 Sep 2010 10:52:32 -0400 "George Spelvin" <linux@horizon.com> wrote:

> > The relevant bit of code is in the MD_RECOVERY_REQUESTED branch of
> > sync_request_write() in drivers/md/raid1.c
> > Look for "memcmp".
>
> Okay, so the data is in r1_bio->bios[i]->bi_io_vec[j].bv_page,
> for 0 <= i < mddev->raid_disks, and 0 <= j < vcnt (the number
> of 4K pages in the chunk).
>
> Okay, so the first for() loop sets primary to the lowest disk
> number that was completely readable (.bi_end_io == end_sync_read
> && test_bit(BIO_UPTODATE)).
>
> Then the second loop compares all the data to the primary's data
> and, if it doesn't match, re-initializes the mirror's sbio to
> write it back.
>
> I could probably figure this out with a lot of RTFSing, but if you
> don't mind me asking:
> - What does it mean if r1_bio->bios[i]->bi_end_io != end_sync_read?
>   Does that case only avoid testing the primary again, or are there
>   other cases where it might be true?  If there are, why not count
>   them as a mismatch?

bi_end_io is set up in sync_request().  A non-NULL value means that a
'nr_pending' reference is held on the device.
If that value is end_sync_read, then a read was attempted.  If it is
end_sync_write, then no read was attempted as we would not expect the
data to be valid (typically during a rebuild).

So:
  NULL -> device is failed or doesn't exist or otherwise should be
     ignored.  e.g. during recovery we read from one device, write
     to one, and ignore the rest.
  end_sync_write -> device is working but is not in-sync.  Probably
     doesn't happen for check/repair cycles.
  end_sync_read -> we read this block so we need to test the content.

> - What does it mean if !test_bit(BIO_UPTODATE, &sbio->bi_flags)?

The read request failed.
> - How does the need to write back a particular disk get communicated
>   from the sbio setup code to the "schedule writes" section?

It is the other way around.  We signal "don't write this block" by
setting bi_end_io to NULL.  The default is to write to every working
disk that isn't the first one we read from and that isn't being ignored
(this reflects the fact that the code originally just did resync and
recovery, and check/repair was added later).

> (On a tangential note, why the heck are bi_flags and bi_rw "unsigned long"
> rather than "u32"?  You'd have to change "if test_bit(BIO_UPTODATE" to
> "if bio_flagged(sbio, BIO_UPTODATE."...  untested patch appended.)

Hysterical Raisins?  You would need to take that up with Jens Axboe.

> > You possibly want to factor out that code into a separate function before
> > trying to add any 'voting' code.
>
> Indeed, the first thing I'd like to do is add some much more detailed
> logging.  What part of the chunk is mismatched?  One sector, one page,
> or the whole chunk?  Are just a few bits flipped, or is it a gross
> mismatch?  Which disks are mismatched?

Sounds good.  Keep it brief and easy to parse.
Probably for each time memcmp fails for a requested pass, print one
line that identifies the 2 devices, the sector/size of the block, the
first and last byte that are different, and the first 16 bytes of the
differing range from each device???

> > This is controlled by raid10_add_disk in drivers/md/raid10.c.  I would
> > happily accept a patch which made a more balanced choice about where to
> > add the new disk.
>
> Thank you very much for the encouragement!  The tricky cases are when
> the number of drives is not a multiple of the number of data copies.
> If I have -n3 and 7 drives, there are many possible subsets of 3 that will
> operate.  Suppose I have U__U_U_.  What order should drives 4..7 be added?

You don't need to make the code perfect, just better.
If you only change the order for adding spares in the simple/common
case, that would be enough improvement to be very worth while.

> (That's something of a rhetorical question; I expect to figure out the
> answer myself, although you're welcome to chime in if you have any ideas.
> I'm thinking of some kind of score where I consider the n/gcd(n,k) stripe
> start positions and rank possible solutions based on the minimum redundancy
> level and the number of stripes at that level.  The question is, is there
> ever a case where the locations I'd like to add *two* disks differ from the
> location I'd like to add one?  If there were, it would be nasty.)
>
> Thanks,

NeilBrown

^ permalink raw reply	[flat|nested] 14+ messages in thread
* RAID mismatches (and reporting thereof)
  2010-09-07 14:19 3-way mirrors George Spelvin
                   ` (2 preceding siblings ...)
  2010-09-07 22:01 ` Neil Brown
@ 2010-09-08  9:40 ` Tim Small
  2010-09-08 12:35   ` George Spelvin
  2010-09-28 16:42 ` 3-way mirrors Tim Small
  4 siblings, 1 reply; 14+ messages in thread
From: Tim Small @ 2010-09-08 9:40 UTC (permalink / raw)
  To: George Spelvin; +Cc: linux-raid

On 07/09/10 15:19, George Spelvin wrote:
> After some frustration with RAID-5 finding mismatches and not being
> able to figure out which drive has the problem, I'm setting up a rather
> intricate 5-way mirrored (x 2-way striped) system.

Out of interest, what systems are you seeing mismatches on?  Most of the
ones I've seen are on LSI1068* SAS controllers (with SATA drives, but
not sure if that counts for anything, don't use many SAS drives)
including the Dell SAS5* and SAS6* series.  I suspect there are some
corner cases where they corrupt data on disk.  Should open a kernel.org
bug really, so that LSI can ignore the issue in public...

Tim.

-- 
South East Open Source Solutions Limited
Registered in England and Wales with company number 06134732.
Registered Office: 2 Powell Gardens, Redhill, Surrey, RH1 1TQ
VAT number: 900 6633 53  http://seoss.co.uk/  +44-(0)1273-808309

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: RAID mismatches (and reporting thereof)
  2010-09-08  9:40 ` RAID mismatches (and reporting thereof) Tim Small
@ 2010-09-08 12:35   ` George Spelvin
  0 siblings, 0 replies; 14+ messages in thread
From: George Spelvin @ 2010-09-08 12:35 UTC (permalink / raw)
  To: linux, tim; +Cc: linux-raid

> Out of interest, what systems are you seeing mismatches on?  Most of the
> ones I've seen are on LSI1068* SAS controllers (with SATA drives, but
> not sure if that counts for anything, don't use many SAS drives)
> including the Dell SAS5* and SAS6* series.  I suspect there are some
> corner cases where they corrupt data on disk.  Should open a kernel.org
> bug really, so that LSI can ignore the issue in public...

MS-7376 ("MSI K9A2 Platinum") motherboard, with 2500 MHz quad-core
Phenom & 8 GiB ECC DDR2.  There are 6 SATA ports, 4 on the SB600 and
2 on a Promise PDC42819:

00:14.1 IDE interface [0101]: ATI Technologies Inc SB600 IDE [1002:438c]
04:00.0 RAID bus controller [0104]: Promise Technology, Inc. PDC42819 [FastTrak TX2650/TX4650] [105a:3f20]

I used to have a different motherboard, with 3x SiI 3132 PCIe adapters:

01:00.0 Mass storage controller [0180]: Silicon Image, Inc. SiI 3132 Serial ATA Raid II Controller [1095:3132] (rev 01)
02:00.0 Mass storage controller [0180]: Silicon Image, Inc. SiI 3132 Serial ATA Raid II Controller [1095:3132] (rev 01)

The drives are all ST3400832AS, installed in a SuperMicro SC833 case's
hot-swap bays.

I have a clone machine (same MB, CPU, and RAM, but different case and
ST3750330AS drives) that's giving me no problems.  Thus the recent
decision to swap drives and rebuild the array.

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: 3-way mirrors
  2010-09-07 14:19 3-way mirrors George Spelvin
                   ` (3 preceding siblings ...)
  2010-09-08  9:40 ` RAID mismatches (and reporting thereof) Tim Small
@ 2010-09-28 16:42 ` Tim Small
  4 siblings, 0 replies; 14+ messages in thread
From: Tim Small @ 2010-09-28 16:42 UTC (permalink / raw)
  To: George Spelvin; +Cc: linux-raid

On 07/09/10 15:19, George Spelvin wrote:
> After some frustration with RAID-5 finding mismatches and not being
> able to figure out which drive has the problem, I'm setting up a rather
> intricate 5-way mirrored (x 2-way striped) system.

I know that this doesn't solve your current problem, but I wondered if
the fact that mismatch_cnt is not a reliable indication of corruption
on RAID1 and RAID10 is a problem with your proposed solution?  I don't
know how difficult it would be to fix that whilst you are at it (add a
data copy in the write path).

Whilst I think about it, perhaps mismatch_cnt should be dropped from
RAID1 / RAID10 entirely, as it doesn't seem to be particularly useful
as-is....  Perhaps the data-copy mode could be a runtime option, and
mismatch_cnt would only appear when it was switched on (and a repair
forced when making the transition from no-copy mode to copy mode?).

Cheers,

Tim.

-- 
South East Open Source Solutions Limited
Registered in England and Wales with company number 06134732.
Registered Office: 2 Powell Gardens, Redhill, Surrey, RH1 1TQ
VAT number: 900 6633 53  http://seoss.co.uk/  +44-(0)1273-808309

^ permalink raw reply	[flat|nested] 14+ messages in thread
end of thread, other threads:[~2010-09-28 16:42 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --
2010-09-07 14:19 3-way mirrors George Spelvin
2010-09-07 16:07 ` Iordan Iordanov
2010-09-07 18:49   ` George Spelvin
2010-09-07 19:55     ` Keld Jørn Simonsen
2010-09-07 18:31 ` Aryeh Gregor
2010-09-07 19:02   ` George Spelvin
2010-09-08 22:28     ` Bill Davidsen
2010-09-07 22:01 ` Neil Brown
2010-09-08  1:33   ` Neil Brown
2010-09-08 14:52     ` George Spelvin
2010-09-08 23:04       ` Neil Brown
2010-09-08  9:40 ` RAID mismatches (and reporting thereof) Tim Small
2010-09-08 12:35   ` George Spelvin
2010-09-28 16:42 ` 3-way mirrors Tim Small