linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	alan@lxorguk.ukuu.org.uk, "George Spelvin" <linux@horizon.com>,
	NeilBrown <neilb@suse.de>
Subject: [ 11/20] md/raid10: close race that lose writes lost when replacement completes.
Date: Thu,  6 Dec 2012 16:54:26 -0800	[thread overview]
Message-ID: <20121207005235.942443092@linuxfoundation.org> (raw)
In-Reply-To: <20121207005232.756641002@linuxfoundation.org>

3.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: NeilBrown <neilb@suse.de>

commit e7c0c3fa29280d62aa5e11101a674bb3064bd791 upstream.

When a replacement operation completes there is a small window
when the original device is marked 'faulty' and the replacement
still looks like a replacement.  The faulty should be removed and
the replacement moved in place very quickly, bit it isn't instant.

So the code write out to the array must handle the possibility that
the only working device for some slot in the replacement - but it
doesn't.  If the primary device is faulty it just gives up.  This
can lead to corruption.

So make the code more robust: if either  the primary or the
replacement is present and working, write to them.  Only when
neither are present do we give up.

This bug has been present since replacement was introduced in
3.3, so it is suitable for any -stable kernel since then.

Reported-by: "George Spelvin" <linux@horizon.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 drivers/md/raid10.c |  104 +++++++++++++++++++++++++++-------------------------
 1 file changed, 55 insertions(+), 49 deletions(-)

--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1182,18 +1182,21 @@ retry_write:
 			blocked_rdev = rrdev;
 			break;
 		}
+		if (rdev && (test_bit(Faulty, &rdev->flags)
+			     || test_bit(Unmerged, &rdev->flags)))
+			rdev = NULL;
 		if (rrdev && (test_bit(Faulty, &rrdev->flags)
 			      || test_bit(Unmerged, &rrdev->flags)))
 			rrdev = NULL;
 
 		r10_bio->devs[i].bio = NULL;
 		r10_bio->devs[i].repl_bio = NULL;
-		if (!rdev || test_bit(Faulty, &rdev->flags) ||
-		    test_bit(Unmerged, &rdev->flags)) {
+
+		if (!rdev && !rrdev) {
 			set_bit(R10BIO_Degraded, &r10_bio->state);
 			continue;
 		}
-		if (test_bit(WriteErrorSeen, &rdev->flags)) {
+		if (rdev && test_bit(WriteErrorSeen, &rdev->flags)) {
 			sector_t first_bad;
 			sector_t dev_sector = r10_bio->devs[i].addr;
 			int bad_sectors;
@@ -1235,8 +1238,10 @@ retry_write:
 					max_sectors = good_sectors;
 			}
 		}
-		r10_bio->devs[i].bio = bio;
-		atomic_inc(&rdev->nr_pending);
+		if (rdev) {
+			r10_bio->devs[i].bio = bio;
+			atomic_inc(&rdev->nr_pending);
+		}
 		if (rrdev) {
 			r10_bio->devs[i].repl_bio = bio;
 			atomic_inc(&rrdev->nr_pending);
@@ -1292,51 +1297,52 @@ retry_write:
 	for (i = 0; i < conf->copies; i++) {
 		struct bio *mbio;
 		int d = r10_bio->devs[i].devnum;
-		if (!r10_bio->devs[i].bio)
-			continue;
-
-		mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
-		md_trim_bio(mbio, r10_bio->sector - bio->bi_sector,
-			    max_sectors);
-		r10_bio->devs[i].bio = mbio;
-
-		mbio->bi_sector	= (r10_bio->devs[i].addr+
-				   conf->mirrors[d].rdev->data_offset);
-		mbio->bi_bdev = conf->mirrors[d].rdev->bdev;
-		mbio->bi_end_io	= raid10_end_write_request;
-		mbio->bi_rw = WRITE | do_sync | do_fua;
-		mbio->bi_private = r10_bio;
-
-		atomic_inc(&r10_bio->remaining);
-		spin_lock_irqsave(&conf->device_lock, flags);
-		bio_list_add(&conf->pending_bio_list, mbio);
-		conf->pending_count++;
-		spin_unlock_irqrestore(&conf->device_lock, flags);
-
-		if (!r10_bio->devs[i].repl_bio)
-			continue;
+		if (r10_bio->devs[i].bio) {
+			struct md_rdev *rdev = conf->mirrors[d].rdev;
+			mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
+			md_trim_bio(mbio, r10_bio->sector - bio->bi_sector,
+				    max_sectors);
+			r10_bio->devs[i].bio = mbio;
+
+			mbio->bi_sector	= (r10_bio->devs[i].addr+
+					   rdev->data_offset);
+			mbio->bi_bdev = rdev->bdev;
+			mbio->bi_end_io	= raid10_end_write_request;
+			mbio->bi_rw = WRITE | do_sync | do_fua;
+			mbio->bi_private = r10_bio;
+
+			atomic_inc(&r10_bio->remaining);
+			spin_lock_irqsave(&conf->device_lock, flags);
+			bio_list_add(&conf->pending_bio_list, mbio);
+			conf->pending_count++;
+			spin_unlock_irqrestore(&conf->device_lock, flags);
+		}
 
-		mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
-		md_trim_bio(mbio, r10_bio->sector - bio->bi_sector,
-			    max_sectors);
-		r10_bio->devs[i].repl_bio = mbio;
-
-		/* We are actively writing to the original device
-		 * so it cannot disappear, so the replacement cannot
-		 * become NULL here
-		 */
-		mbio->bi_sector	= (r10_bio->devs[i].addr+
-				   conf->mirrors[d].replacement->data_offset);
-		mbio->bi_bdev = conf->mirrors[d].replacement->bdev;
-		mbio->bi_end_io	= raid10_end_write_request;
-		mbio->bi_rw = WRITE | do_sync | do_fua;
-		mbio->bi_private = r10_bio;
-
-		atomic_inc(&r10_bio->remaining);
-		spin_lock_irqsave(&conf->device_lock, flags);
-		bio_list_add(&conf->pending_bio_list, mbio);
-		conf->pending_count++;
-		spin_unlock_irqrestore(&conf->device_lock, flags);
+		if (r10_bio->devs[i].repl_bio) {
+			struct md_rdev *rdev = conf->mirrors[d].replacement;
+			if (rdev == NULL) {
+				/* Replacement just got moved to main 'rdev' */
+				smp_mb();
+				rdev = conf->mirrors[d].rdev;
+			}
+			mbio = bio_clone_mddev(bio, GFP_NOIO, mddev);
+			md_trim_bio(mbio, r10_bio->sector - bio->bi_sector,
+				    max_sectors);
+			r10_bio->devs[i].repl_bio = mbio;
+
+			mbio->bi_sector	= (r10_bio->devs[i].addr+
+					   rdev->data_offset);
+			mbio->bi_bdev = rdev->bdev;
+			mbio->bi_end_io	= raid10_end_write_request;
+			mbio->bi_rw = WRITE | do_sync | do_fua;
+			mbio->bi_private = r10_bio;
+
+			atomic_inc(&r10_bio->remaining);
+			spin_lock_irqsave(&conf->device_lock, flags);
+			bio_list_add(&conf->pending_bio_list, mbio);
+			conf->pending_count++;
+			spin_unlock_irqrestore(&conf->device_lock, flags);
+		}
 	}
 
 	/* Don't remove the bias on 'remaining' (one_write_done) until



  parent reply	other threads:[~2012-12-07  1:10 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-12-07  0:54 [ 00/20] 3.4.23-stable review Greg Kroah-Hartman
2012-12-07  0:54 ` [ 01/20] Dove: Attempt to fix PMU/RTC interrupts Greg Kroah-Hartman
2012-12-07  0:54 ` [ 02/20] Dove: Fix irq_to_pmu() Greg Kroah-Hartman
2012-12-07  0:54 ` [ 03/20] drm/radeon/dce4+: dont use radeon_crtc for vblank callback Greg Kroah-Hartman
2012-12-07  0:54 ` [ 04/20] drm/radeon: properly handle mc_stop/mc_resume on evergreen+ (v2) Greg Kroah-Hartman
2012-12-07  0:54 ` [ 05/20] drm/radeon: properly track the crtc not_enabled case evergreen_mc_stop() Greg Kroah-Hartman
2012-12-07  0:54 ` [ 06/20] mm/vmemmap: fix wrong use of virt_to_page Greg Kroah-Hartman
2012-12-07  0:54 ` [ 07/20] mm: soft offline: split thp at the beginning of soft_offline_page() Greg Kroah-Hartman
2012-12-07  0:54 ` [ 08/20] ARM: Kirkwood: Update PCI-E fixup Greg Kroah-Hartman
2012-12-07  0:54 ` [ 09/20] x86, fpu: Avoid FPU lazy restore after suspend Greg Kroah-Hartman
2012-12-07  0:54 ` [ 10/20] workqueue: exit rescuer_thread() as TASK_RUNNING Greg Kroah-Hartman
2012-12-07  0:54 ` Greg Kroah-Hartman [this message]
2012-12-07  0:54 ` [ 12/20] i7300_edac: Fix error flag testing Greg Kroah-Hartman
2012-12-07  0:54 ` [ 13/20] Revert "sched, autogroup: Stop going ahead if autogroup is disabled" Greg Kroah-Hartman
2012-12-07  0:54 ` [ 14/20] bnx2x: remove redundant warning log Greg Kroah-Hartman
2012-12-07  0:54 ` [ 15/20] s390/mm: have 16 byte aligned struct pages Greg Kroah-Hartman
2012-12-07  9:59   ` Heiko Carstens
2012-12-07  0:54 ` [ 16/20] ACPI: missing break Greg Kroah-Hartman
2012-12-07  0:54 ` [ 17/20] i915: Quirk no_lvds on Gigabyte GA-D525TUD ITX motherboard Greg Kroah-Hartman
2012-12-07  0:54 ` [ 18/20] drm/i915: Add no-lvds quirk for Supermicro X7SPA-H Greg Kroah-Hartman
2012-12-07  0:54 ` [ 19/20] pnfsblock: fix partial page buffer wirte Greg Kroah-Hartman
2012-12-07  0:54 ` [ 20/20] kbuild: Do not package /boot and /lib in make tar-pkg Greg Kroah-Hartman
2012-12-08  0:49 ` [ 00/20] 3.4.23-stable review Shuah Khan
2012-12-08  0:52   ` Shuah Khan
2012-12-08  0:59     ` Shuah Khan
2012-12-08 19:46       ` Greg Kroah-Hartman
2012-12-09  1:15         ` Shuah Khan
2012-12-08  5:10 ` satoru takeuchi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20121207005235.942443092@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=alan@lxorguk.ukuu.org.uk \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux@horizon.com \
    --cc=neilb@suse.de \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).