From mboxrd@z Thu Jan  1 00:00:00 1970
From: Eric Mei
Subject: Re: [PATCH] [md] raid5: check faulty flag for array status during recovery.
Date: Thu, 19 Feb 2015 15:49:32 -0700
Message-ID: <54E6687C.6020408@gmail.com>
References: <94BC57C5-6223-435B-96FD-7DA2F6B4E561@gmail.com> <20150220085147.03bb2247@notabene.brown>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Return-path:
In-Reply-To: <20150220085147.03bb2247@notabene.brown>
Sender: linux-raid-owner@vger.kernel.org
To: NeilBrown
Cc: linux-raid@vger.kernel.org, eric.mei@seagate.com
List-Id: linux-raid.ids

Hi Neil,

You are absolutely right that we need the RCU lock for this. Thank you
so much!

Eric

On 2015-02-19 2:51 PM, NeilBrown wrote:
> On Tue, 6 Jan 2015 15:24:24 -0700 Eric Mei wrote:
>
>> Hi Neil,
>>
>> In an MDRAID-derived work we found and fixed a data corruption bug. We
>> think this also affects vanilla MDRAID, but we didn't directly prove
>> that by constructing a test to show the corruption. The following is
>> the theoretical analysis; please review and see whether I missed
>> something.
>>
>> To rebuild a stripe, MD checks whether the array will be optimal after
>> the rebuild completes. If so, we mark the write-intent bitmap (WIB)
>> bits to be cleared, the purpose being to enable "incremental rebuild".
>> The code section looks like this:
>>
>>   /* Need to check if array will still be degraded after recovery/resync
>>    * We don't need to check the 'failed' flag as when that gets set,
>>    * recovery aborts.
>>    */
>>   for (i = 0; i < conf->raid_disks; i++)
>>           if (conf->disks[i].rdev == NULL)
>>                   still_degraded = 1;
>>
>> The problem is that checking only for rdev == NULL might not be
>> enough. Suppose two drives, D0 and D1, both failed and were marked
>> Faulty. We immediately removed D0 from the array, but because of some
>> lingering I/O on D1, it remains in the array with the Faulty flag set.
>> A new drive is pulled in and a rebuild against D0 starts. Now, because
>> no rdev is NULL, MD thinks the array will be optimal. If some writes
>> happen before the rebuild reaches that region, their dirty bits in the
>> WIB will be cleared. When D1 is later added back into the array, we
>> will skip rebuilding those stripes, resulting in data corruption.
>>
>> The attached patch (against 3.18.0-rc6) is supposed to fix this issue.
>>
>> Thanks
>> Eric
>>
> Hi Eric,
>  sorry for the delay, and thanks for the reminder...
>
> The issue you described could only affect RAID6, as it requires the array
> to continue with two failed drives.
>
> However, in the RAID6 case I think you are correct - there is a chance of
> corruption if there is a double failure and a delay in removing one device.
>
> Your patch isn't quite safe, as conf->disks[i].rdev can become NULL at any
> moment, so it could become NULL between testing it and dereferencing it.
> So I've modified it as follows.
>
> Thanks,
> NeilBrown
>
>
> Author: Eric Mei
> Date:   Tue Jan 6 09:35:02 2015 -0800
>
>     raid5: check faulty flag for array status during recovery.
>
>     When we have more than one drive failure, it's possible that we
>     start rebuilding one drive while leaving another faulty drive in
>     the array. To determine whether the array will be optimal after
>     rebuilding, the current code only checks whether a drive is
>     missing, which could potentially lead to data corruption. This
>     patch adds a check of the Faulty flag.
>
>     Signed-off-by: NeilBrown
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index bc6d7595ad76..022a0d99e110 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -5120,12 +5120,17 @@ static inline sector_t sync_request(struct mddev *mddev, sector_t sector_nr, int
>  		schedule_timeout_uninterruptible(1);
>  	}
>  	/* Need to check if array will still be degraded after recovery/resync
> -	 * We don't need to check the 'failed' flag as when that gets set,
> -	 * recovery aborts.
> +	 * Note in case of > 1 drive failures it's possible we're rebuilding
> +	 * one drive while leaving another faulty drive in array.
>  	 */
> -	for (i = 0; i < conf->raid_disks; i++)
> -		if (conf->disks[i].rdev == NULL)
> +	rcu_read_lock();
> +	for (i = 0; i < conf->raid_disks; i++) {
> +		struct md_rdev *rdev = ACCESS_ONCE(conf->disks[i].rdev);
> +
> +		if (rdev == NULL || test_bit(Faulty, &rdev->flags))
>  			still_degraded = 1;
> +	}
> +	rcu_read_unlock();
>
>  	bitmap_start_sync(mddev->bitmap, sector_nr, &sync_blocks, still_degraded);
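
To make the race Neil points out concrete: the danger is reading
conf->disks[i].rdev twice, because the device can be removed (and the
pointer set to NULL) between the NULL test and the dereference; copying
the pointer once into a local variable under rcu_read_lock() closes that
window. Below is a minimal standalone sketch of the two patterns. It is
only an illustration, not md code: fake_rdev, disk_rdev and the
degraded_* helpers are made-up stand-ins, and it assumes the userspace
RCU library (liburcu) so it can be compiled and run outside the kernel.

/* sketch.c - illustrative only; build with:  cc sketch.c -lurcu */
#include <urcu.h>      /* userspace RCU: rcu_read_lock(), rcu_dereference(), ... */
#include <stdio.h>
#include <stdlib.h>

struct fake_rdev {
	int faulty;                       /* stand-in for test_bit(Faulty, &rdev->flags) */
};

static struct fake_rdev *disk_rdev;       /* stand-in for conf->disks[i].rdev */

/* Unsafe pattern: the pointer is read twice, so another thread can set it
 * to NULL (and free the rdev) between the NULL test and the dereference. */
static int degraded_unsafe(void)
{
	return disk_rdev == NULL || disk_rdev->faulty;
}

/* Safe pattern, mirroring Neil's version of the loop: hold the RCU read
 * lock and read the pointer exactly once into a local before using it. */
static int degraded_safe(void)
{
	struct fake_rdev *rdev;
	int degraded = 0;

	rcu_read_lock();
	rdev = rcu_dereference(disk_rdev);
	if (rdev == NULL || rdev->faulty)
		degraded = 1;
	rcu_read_unlock();
	return degraded;
}

int main(void)
{
	struct fake_rdev *r = calloc(1, sizeof(*r));

	rcu_register_thread();

	rcu_assign_pointer(disk_rdev, r);
	printf("healthy rdev : unsafe=%d safe=%d\n",
	       degraded_unsafe(), degraded_safe());

	/* "Remove" the device: publish NULL, wait for readers, then free. */
	rcu_assign_pointer(disk_rdev, NULL);
	synchronize_rcu();
	free(r);
	printf("rdev removed : unsafe=%d safe=%d\n",
	       degraded_unsafe(), degraded_safe());

	rcu_unregister_thread();
	return 0;
}

In the kernel patch the same idea appears as ACCESS_ONCE() inside
rcu_read_lock()/rcu_read_unlock(): the rdev pointer is sampled once, and
the RCU read-side critical section keeps the rdev from being freed while
the Faulty check runs.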