From: Namhyung Kim <namhyung@gmail.com>
To: NeilBrown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Subject: Re: [md PATCH 10/36] md/raid1: avoid writing to known-bad blocks on known-bad drives.
Date: Wed, 27 Jul 2011 13:09:07 +0900
Message-ID: <87vcuogyak.fsf@gmail.com>
In-Reply-To: <20110721025848.8422.25836.stgit@notabene.brown> (NeilBrown's message of "Thu, 21 Jul 2011 12:58:48 +1000")

NeilBrown <neilb@suse.de> writes:
> If we have seen any write error on a drive, then don't write to
> any known-bad blocks on that drive.
> If necessary, we divide the write request up into pieces just
> like we do for reads, so each piece is either all written or
> all not written to any given drive.
>
> Signed-off-by: NeilBrown <neilb@suse.de>

Reviewed-by: Namhyung Kim <namhyung@gmail.com>

One nit below.

> ---
>
> drivers/md/raid1.c | 152 +++++++++++++++++++++++++++++++++++++++-------------
> 1 files changed, 115 insertions(+), 37 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 4d40d9d..44277dc 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -772,6 +772,9 @@ static int make_request(mddev_t *mddev, struct bio * bio)
> const unsigned long do_flush_fua = (bio->bi_rw & (REQ_FLUSH | REQ_FUA));
> mdk_rdev_t *blocked_rdev;
> int plugged;
> + int first_clone;
> + int sectors_handled;
> + int max_sectors;
>
> /*
> * Register the new request and wait if the reconstruction
> @@ -832,7 +835,6 @@ static int make_request(mddev_t *mddev, struct bio * bio)
> /*
> * read balancing logic:
> */
> - int max_sectors;
> int rdisk;
>
> read_again:
> @@ -872,7 +874,6 @@ read_again:
> /* could not read all from this device, so we will
> * need another r1_bio.
> */
> - int sectors_handled;
>
> sectors_handled = (r1_bio->sector + max_sectors
> - bio->bi_sector);
> @@ -906,9 +907,15 @@ read_again:
> /*
> * WRITE:
> */
> - /* first select target devices under spinlock and
> + /* first select target devices under rcu_lock and
> * inc refcount on their rdev. Record them by setting
> * bios[x] to bio
> + * If there are known/acknowledged bad blocks on any device on
> + * which we have seen a write error, we want to avoid writing those
> + * blocks.
> + * This potentially requires several writes to write around
> + * the bad blocks. Each set of writes gets its own r1bio
> + * with a set of bios attached.
> */
> plugged = mddev_check_plugged(mddev);
>
> @@ -916,6 +923,7 @@ read_again:
> retry_write:
> blocked_rdev = NULL;
> rcu_read_lock();
> + max_sectors = r1_bio->sectors;
> for (i = 0; i < disks; i++) {
> mdk_rdev_t *rdev = rcu_dereference(conf->mirrors[i].rdev);
> if (rdev && unlikely(test_bit(Blocked, &rdev->flags))) {
> @@ -923,17 +931,57 @@ read_again:
> blocked_rdev = rdev;
> break;
> }
> - if (rdev && !test_bit(Faulty, &rdev->flags)) {
> - atomic_inc(&rdev->nr_pending);
> - if (test_bit(Faulty, &rdev->flags)) {
> + r1_bio->bios[i] = NULL;
> + if (!rdev || test_bit(Faulty, &rdev->flags)) {
> + set_bit(R1BIO_Degraded, &r1_bio->state);
> + continue;
> + }
> +
> + atomic_inc(&rdev->nr_pending);
> + if (test_bit(WriteErrorSeen, &rdev->flags)) {
> + sector_t first_bad;
> + int bad_sectors;
> + int is_bad;
> +
> + is_bad = is_badblock(rdev, r1_bio->sector,
> + max_sectors,
> + &first_bad, &bad_sectors);
> + if (is_bad < 0) {
> + /* mustn't write here until the bad block is
> + * acknowledged*/
> + set_bit(BlockedBadBlocks, &rdev->flags);
> + blocked_rdev = rdev;
> + break;
> + }
> + if (is_bad && first_bad <= r1_bio->sector) {
> + /* Cannot write here at all */
> + bad_sectors -= (r1_bio->sector - first_bad);
> + if (bad_sectors < max_sectors)
> + /* mustn't write more than bad_sectors
> + * to other devices yet
> + */
> + max_sectors = bad_sectors;
> rdev_dec_pending(rdev, mddev);
> - r1_bio->bios[i] = NULL;
> - } else {
> - r1_bio->bios[i] = bio;
> - targets++;
> + /* We don't set R1BIO_Degraded as that
> + * only applies if the disk is
> + * missing, so it might be re-added,
> + * and we want to know to recover this
> + * chunk.
> + * In this case the device is here,
> + * and the fact that this chunk is not
> + * in-sync is recorded in the bad
> + * block log
> + */
> + continue;
> }
> - } else
> - r1_bio->bios[i] = NULL;
> + if (is_bad) {
> + int good_sectors = first_bad - r1_bio->sector;
> + if (good_sectors < max_sectors)
> + max_sectors = good_sectors;
> + }
> + }
> + r1_bio->bios[i] = bio;
> + targets++;
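
It looks like the variable 'targets' is no longer needed: the
'targets < conf->raid_disks' check that consumed it is removed further
down in this patch.
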
> }
> rcu_read_unlock();
>
> @@ -944,48 +992,56 @@ read_again:
> for (j = 0; j < i; j++)
> if (r1_bio->bios[j])
> rdev_dec_pending(conf->mirrors[j].rdev, mddev);
> -
> + r1_bio->state = 0;
> allow_barrier(conf);
> md_wait_for_blocked_rdev(blocked_rdev, mddev);
> wait_barrier(conf);
> goto retry_write;
> }
>
> - if (targets < conf->raid_disks) {
> - /* array is degraded, we will not clear the bitmap
> - * on I/O completion (see raid1_end_write_request) */
> - set_bit(R1BIO_Degraded, &r1_bio->state);
> + if (max_sectors < r1_bio->sectors) {
> + /* We are splitting this write into multiple parts, so
> + * we need to prepare for allocating another r1_bio.
> + */
> + r1_bio->sectors = max_sectors;
> + spin_lock_irq(&conf->device_lock);
> + if (bio->bi_phys_segments == 0)
> + bio->bi_phys_segments = 2;
> + else
> + bio->bi_phys_segments++;
> + spin_unlock_irq(&conf->device_lock);
> }
> -
> - /* do behind I/O ?
> - * Not if there are too many, or cannot allocate memory,
> - * or a reader on WriteMostly is waiting for behind writes
> - * to flush */
> - if (bitmap &&
> - (atomic_read(&bitmap->behind_writes)
> - < mddev->bitmap_info.max_write_behind) &&
> - !waitqueue_active(&bitmap->behind_wait))
> - alloc_behind_pages(bio, r1_bio);
> + sectors_handled = r1_bio->sector + max_sectors - bio->bi_sector;
>
> atomic_set(&r1_bio->remaining, 1);
> atomic_set(&r1_bio->behind_remaining, 0);
>
> - bitmap_startwrite(bitmap, bio->bi_sector, r1_bio->sectors,
> - test_bit(R1BIO_BehindIO, &r1_bio->state));
> + first_clone = 1;
> for (i = 0; i < disks; i++) {
> struct bio *mbio;
> if (!r1_bio->bios[i])
> continue;
>
> mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
> - r1_bio->bios[i] = mbio;
> -
> - mbio->bi_sector = r1_bio->sector + conf->mirrors[i].rdev->data_offset;
> - mbio->bi_bdev = conf->mirrors[i].rdev->bdev;
> - mbio->bi_end_io = raid1_end_write_request;
> - mbio->bi_rw = WRITE | do_flush_fua | do_sync;
> - mbio->bi_private = r1_bio;
> -
> + md_trim_bio(mbio, r1_bio->sector - bio->bi_sector, max_sectors);
> +
> + if (first_clone) {
> + /* do behind I/O ?
> + * Not if there are too many, or cannot
> + * allocate memory, or a reader on WriteMostly
> + * is waiting for behind writes to flush */
> + if (bitmap &&
> + (atomic_read(&bitmap->behind_writes)
> + < mddev->bitmap_info.max_write_behind) &&
> + !waitqueue_active(&bitmap->behind_wait))
> + alloc_behind_pages(mbio, r1_bio);
> +
> + bitmap_startwrite(bitmap, r1_bio->sector,
> + r1_bio->sectors,
> + test_bit(R1BIO_BehindIO,
> + &r1_bio->state));
> + first_clone = 0;
> + }
> if (r1_bio->behind_pages) {
> struct bio_vec *bvec;
> int j;
> @@ -1003,6 +1059,15 @@ read_again:
> atomic_inc(&r1_bio->behind_remaining);
> }
>
> + r1_bio->bios[i] = mbio;
> +
> + mbio->bi_sector = (r1_bio->sector +
> + conf->mirrors[i].rdev->data_offset);
> + mbio->bi_bdev = conf->mirrors[i].rdev->bdev;
> + mbio->bi_end_io = raid1_end_write_request;
> + mbio->bi_rw = WRITE | do_flush_fua | do_sync;
> + mbio->bi_private = r1_bio;
> +
> atomic_inc(&r1_bio->remaining);
> spin_lock_irqsave(&conf->device_lock, flags);
> bio_list_add(&conf->pending_bio_list, mbio);
> @@ -1013,6 +1078,19 @@ read_again:
> /* In case raid1d snuck in to freeze_array */
> wake_up(&conf->wait_barrier);
>
> + if (sectors_handled < (bio->bi_size >> 9)) {
> + /* We need another r1_bio. It has already been counted
> + * in bio->bi_phys_segments
> + */
> + r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
> + r1_bio->master_bio = bio;
> + r1_bio->sectors = (bio->bi_size >> 9) - sectors_handled;
> + r1_bio->state = 0;
> + r1_bio->mddev = mddev;
> + r1_bio->sector = bio->bi_sector + sectors_handled;
> + goto retry_write;
> + }
> +
> if (do_sync || !bitmap || !plugged)
> md_wakeup_thread(mddev->thread);
>
>
>
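
For anyone following the thread, here is a minimal user-space sketch of the
splitting rule described in the commit message, where each piece is either
fully written or fully skipped on any given drive. It is not the kernel code.
The names struct dev, struct piece, overlaps_bad(), plan_piece() and
split_write() are hypothetical, each device is assumed to have at most one bad
range, and the unacknowledged-bad-block case (is_bad < 0), Faulty devices,
RCU, refcounting and bio cloning are all left out.

```c
#include <assert.h>

typedef unsigned long long sector_t;

#define MAX_DEVS 8

/* Hypothetical model of one mirror: a flag plus at most one bad range. */
struct dev {
	int write_error_seen;	/* stands in for the WriteErrorSeen flag */
	sector_t bad_start;	/* first sector of the bad range */
	int bad_len;		/* length in sectors; 0 means no bad range */
};

/* One piece of the original write and the devices that receive it. */
struct piece {
	sector_t sector;
	int sectors;
	int write_to[MAX_DEVS];
};

/*
 * Simplified stand-in for is_badblock(): returns 1 and reports the whole
 * bad range if [s, s + n) overlaps it, otherwise returns 0.
 */
static int overlaps_bad(const struct dev *d, sector_t s, int n,
			sector_t *first_bad, int *bad_sectors)
{
	if (d->bad_len == 0 || d->bad_start >= s + n ||
	    d->bad_start + d->bad_len <= s)
		return 0;
	*first_bad = d->bad_start;
	*bad_sectors = d->bad_len;
	return 1;
}

/* Choose the length of the piece starting at 'sector' and its targets. */
static int plan_piece(const struct dev *devs, int ndevs, sector_t sector,
		      int sectors, int *write_to)
{
	int max_sectors = sectors;

	for (int i = 0; i < ndevs; i++) {
		sector_t first_bad;
		int bad_sectors;

		write_to[i] = 0;
		if (devs[i].write_error_seen &&
		    overlaps_bad(&devs[i], sector, max_sectors,
				 &first_bad, &bad_sectors)) {
			if (first_bad <= sector) {
				/* Piece starts inside the bad range: skip this
				 * device and end the piece where the bad
				 * range ends. */
				bad_sectors -= (int)(sector - first_bad);
				if (bad_sectors < max_sectors)
					max_sectors = bad_sectors;
				continue;
			}
			/* Bad range starts later: end the piece before it. */
			int good_sectors = (int)(first_bad - sector);
			if (good_sectors < max_sectors)
				max_sectors = good_sectors;
		}
		write_to[i] = 1;
	}
	return max_sectors;
}

/* Split a write of 'sectors' sectors at 'sector'; returns piece count. */
static int split_write(const struct dev *devs, int ndevs, sector_t sector,
		       int sectors, struct piece *out, int max_pieces)
{
	int n = 0;

	assert(ndevs <= MAX_DEVS);
	while (sectors > 0 && n < max_pieces) {
		struct piece *p = &out[n++];

		p->sector = sector;
		p->sectors = plan_piece(devs, ndevs, sector, sectors,
					p->write_to);
		sector += p->sectors;
		sectors -= p->sectors;
	}
	return n;
}
```

As a usage example, take two mirrors where only the second has seen a write
error and has sectors 104..107 marked bad. A 16-sector write at sector 100
then splits into three pieces: 100..103 to both devices, 104..107 to the
first device only, and 108..115 to both again.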